Accepted Manuscript
Scalable regular pattern mining in evolving body sensor data
Syed Khairuzzaman Tanbeer, Mohammad Mehedi Hassan, Ahmad Almogren, Mansour Zuair, Byeong-Soo Jeong
PII: S0167-739X(16)30085-1
DOI: http://dx.doi.org/10.1016/j.future.2016.04.008
Reference: FUTURE 3008
To appear in: Future Generation Computer Systems
Received date: 1 December 2015
Revised date: 24 March 2016
Accepted date: 12 April 2016
Please cite this article as: S.K. Tanbeer, M.M. Hassan, A. Almogren, M. Zuair, B.-S. Jeong, Scalable regular pattern mining in evolving body sensor data, Future Generation Computer Systems (2016), http://dx.doi.org/10.1016/j.future.2016.04.008
Highlights:
* Mining regular patterns from body sensor data
* Devising an incremental and interactive regular pattern mining tree structure
* Mining regular patterns in a single run and one database scan
* Efficiency and scalability of the mining approach are tested using real datasets
Scalable Regular Pattern Mining in Evolving Body Sensor Data
Syed Khairuzzaman Tanbeer1, Mohammad Mehedi Hassan2, Ahmad Almogren2, Mansour Zuair2 and Byeong-Soo Jeong1
1 Department of Computer Engineering, Kyung Hee University, South Korea. Email: {tanbeer, jeong}@khu.ac.kr
2 College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia. Email: {mmhassan, ahalmogren, zuair}@ksu.edu.sa
Abstract. The recent emergence of body sensor networks (BSNs) has made it easy to continuously collect and process various health-oriented data related to temporal, spatial and vital sign monitoring of a patient. As such, discovering or mining interesting knowledge from the BSN data stream is becoming an important issue in promoting and assisting decision making in healthcare. In this paper, we focus on mining the inherent regularity of different parameter readings obtained from different body sensors related to the vital sign data of a patient, for the purpose of following up the patient's health condition to prevent some kinds of chronic diseases. Specifically, we design and develop an efficient and scalable regular pattern mining technique that can mine the complete set of periodically/regularly occurring patterns in a BSN data stream based on a user-specified periodicity/regularity threshold for the data and the subject. Various experiments in centralized and distributed environments were carried out on both real and synthetic data to validate the efficiency of the proposed scalable regular pattern mining technique as compared to state-of-the-art approaches.
Keywords: Body sensor network, Regular pattern mining, Healthcare, Decision support, Parallel and distributed mining
1 Introduction
Advances in intelligent sensors, microelectronics, and wireless communications have enabled the development of body sensor networks (BSNs), which collect and process physiological information of a patient that can be used to extract knowledge about the patient's health condition [1]. Recently, real-time activity recognition [2, 3, 4] has become one of the most active research areas; it mainly uses portable devices and BSN data to conduct ambulatory monitoring of patients. The main goal of activity recognition is to monitor the activities of daily living in order to provide better healthcare, social care and/or proactive assistance to users (e.g., elderly or cognitively impaired people, and/or patients). However, in some scenarios better assistance can be provided if we know the behavior profiles of the parameters sensed by the body sensors. For example, identifying the periodic changes in a patient's blood pressure can help doctors provide proper treatment to that patient. Additionally, predicting changes in the patient's blood pressure can be helpful in proactive healthcare. Thus, discovering patterns with temporal relationships among the readings obtained from the BSN can make a great difference in providing care to the user. In other words, the shape of occurrence of a pattern, i.e., whether it occurs periodically, irregularly, or mostly in a specific time interval, can be an important criterion for analyzing BSN data. Nevertheless, pattern matching [5, 6] or activity recognition [7, 8, 20, 21] algorithms may not be suitable for finding such knowledge in BSN data, mainly because of the large volume and variety of BSN data streams, which include text as well as media data such as images and video with a high data rate.
Corresponding author: Mohammad Mehedi Hassan.
Preprint submitted to Future Generation Computer Systems, November 30, 2015.
Recently, data mining techniques have been utilized to discover interesting knowledge from BSN data [5, 6, 9, 11-14, 25]. Ali et al. [9] have developed a software architecture to find routine behavior based on a patient's activity pattern. It uses a frequent pattern mining [10] technique to obtain frequent activity patterns, which enables the observation of the inherent structure present in a patient's daily activity. Gu et al. [5] exploited the notion of emerging patterns to identify significant changes between classes of data for smooth and efficient recognition of daily living activity. Candás et al. [27] proposed an automatic data mining method to detect abnormal human behavior using physical activity measurements. However, it was limited in terms of detecting periodic changes of behavior. Machado et al. [40] designed a human activity recognition framework using on-body accelerometer sensors. Wang et al. [26] proposed a pattern-based real-time algorithm to recognize complex, high-level human activities. A close look at all of these pattern mining approaches reveals that their ultimate goal is still to identify or classify the subject's activity.
In this paper, our approach is different in the sense that we focus on mining the inherent regularity of different parameter readings obtained from different body sensors, for the purpose of following up the patient's health condition to prevent some kinds of chronic diseases. The pattern appearance behavior in transactional databases has been extensively studied by Tanbeer et al. in [15, 16, 17]. In [15], they introduced regular patterns, a new type of pattern that follows temporal occurrence regularity in a transactional database. This approach uses a regularity measure, determined by the maximum occurrence interval of a pattern in a database, and a regularity threshold to identify such patterns.
This work also proposed a tree-based data structure, called the RP-tree, which captures database information with two database scans and contains information only for the regular items in the database. In a BSN data stream scenario, however, the database is updated with a new block of data at regular time intervals (incremental). Moreover, finding an appropriate regularity threshold (interactive) to mine regular patterns is also a challenging task. A high regularity threshold may result in too many regular patterns, whereas a very low one may result in too few regular patterns and thus miss some important patterns of interest. Thus, the two-scan RP-tree approach is not suitable for finding regularity in incremental and interactive sensor data streams. Hence, to find regularity in BSN data, we propose a novel approach, called the SDR-tree (Sensor Data Regularity-tree), to capture the updated sensor data information in a compact manner. Once the SDR-tree is constructed, we use an efficient pattern growth-based mining technique [10] to mine the inherent regularities in patient readings. Besides, as BSNs generate a huge volume of data, we may need to handle data and/or knowledge generated from multiple BSNs distributed across multiple sites. Traditional data mining methods may no longer be valid in such a data-intensive and distributed environment. Therefore, in this paper, we also focus on how our proposed SDR-tree can cope with a distributed environment in order to handle large databases. Our study on both real and synthetic datasets (in centralized and distributed environments) shows that finding inherent regularities in continuously updated BSN data with the SDR-tree is more efficient than with other state-of-the-art algorithms such as the RP-tree. The remainder of the paper is organized as follows. We first show an example scenario of regular pattern mining in Section 2. In Section 3 we describe the related works.
In Section 4, we introduce the problem of finding pattern regularity in updated BSN data. Section 5 presents the structure and mining technique of our proposed SDR-tree. Section 6 reports the experimental results and, finally, Section 7 concludes the paper.
2 Example Scenario of Regular Pattern Mining in Healthcare
Let us assume we attach a set of body sensors to a patient in order to obtain health-related data from the patient on a regular basis. Each sensor is deployed to sense a particular type of reading: e.g., sensor 1, say S1, obtains the heart rate of the patient; sensor 2, say S2, obtains the diastolic blood pressure of the patient; sensor 3, say S3, reads the body temperature of the patient; and so on. Depending on the range and/or type of reading, the values read by a particular sensor can in turn be divided into several categories based on a pre-defined scale. For example, the heart rate readings (from S1) can be classified along a scale of the following four (4) categories:
1. Very High (S1VH): above 100 beats/min
2. High (S1H): 70-99 beats/min
3. Normal (S1N): 40-69 beats/min
4. Low (S1L): below 40 beats/min
Similarly, the diastolic blood pressure reading (from S2) of the patient can be classified into the following categories:
1. Very High (S2VH): above 110 mmHg
2. High (S2H): 90-109 mmHg
3. High Normal (S2HN): 85-89 mmHg
4. Normal (S2N): 65-84 mmHg
5. Low Normal (S2LN): 60-64 mmHg
6. Low (S2L): 35-59 mmHg
7. Very Low (S2VL): below 35 mmHg
Again, the body temperature reading (from S3) can be classified into categories such as:
1. Very High (S3VH): above 40 degrees Celsius
2. High (S3H): 39-39.9 degrees Celsius
3. Low High (S3LH): 38-38.9 degrees Celsius
4. Moderate (S3M): 37-38 degrees Celsius
5. Low (S3L): 36-36.9 degrees Celsius
6. Very Low (S3VL): below 36 degrees Celsius
Thus, at a particular point of time, the readings generated by all sensors (e.g., S1, S2 and S3) attached to the patient's body can be shown as a combination of readings from all sensors, called a reading list. For example, if the readings from S1, S2, and S3 are S1N, S2HN, and S3LH, respectively, at time tn, the reading list for tn is as follows:
tn: S1N, S2HN, S3LH
Thus, five consecutive readings generated by the sensors for the patient can be represented as
t1: S1H, S2N, S3H
t2: S1N, S2HN, S3M
t3: S1H, S2LN, S3H
t4: S1N, S2HN, S3M
t5: S1H, S2HN, S3H
Once the temporal readings from the set of sensors are recorded in the form of the above reading lists with timestamp information, we can apply our SDR-tree Miner to obtain the inherent regularity among the readings. For example, it can be observed from the above set of readings that the pattern <S1H, S3H> occurs with a regular periodicity (i.e., at t1, t3 and t5), which may indicate that the heart rate and body temperature of the patient rise every two time slots (e.g., every two hours if readings are taken hourly). Identifying such inherent regularities in a patient's health-related readings can be significantly helpful for caregivers in following up the patient's health condition, especially if the patient suffers from a chronic disease.
This knowledge can then be analyzed further to generate high-level decisions about the patient's condition, treatment, and medication.
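The discretization and reading-list construction described in this section can be sketched as follows. This is a minimal illustration rather than part of the proposed method: the function names (`categorize`, `reading_list`, `occurrences`), the handling of boundary values (e.g., exactly 100 beats/min or exactly 38 degrees Celsius), and the raw sensor values are assumptions chosen only to reproduce the reading lists t1 to t5.

```python
from bisect import bisect_right

# Per sensor: ascending lower bounds and category labels.
# labels[i] covers values from bounds[i-1] up to (not including) bounds[i].
# Boundary handling is an assumption; the scales in the text leave small gaps.
SCALES = {
    "S1": ([40, 70, 100], ["L", "N", "H", "VH"]),                 # heart rate
    "S2": ([35, 60, 65, 85, 90, 110],
           ["VL", "L", "LN", "N", "HN", "H", "VH"]),              # diastolic BP
    "S3": ([36, 37, 38, 39, 40], ["VL", "L", "M", "LH", "H", "VH"]),  # temperature
}


def categorize(sensor, value):
    """Map a raw reading to a category item such as 'S1H'."""
    bounds, labels = SCALES[sensor]
    return sensor + labels[bisect_right(bounds, value)]


def reading_list(raw):
    """Turn {'S1': 82, 'S2': 75, ...} into ['S1H', 'S2N', ...]."""
    return [categorize(sensor, value) for sensor, value in raw.items()]


def occurrences(reading_lists, pattern):
    """Time slots whose reading list contains every item of the pattern."""
    wanted = set(pattern)
    return [t for t, items in sorted(reading_lists.items()) if wanted <= set(items)]


# Hypothetical raw values (beats/min, mmHg, degrees Celsius) for t1..t5.
raw_stream = {
    1: {"S1": 82, "S2": 75, "S3": 39.2},
    2: {"S1": 65, "S2": 87, "S3": 37.4},
    3: {"S1": 85, "S2": 62, "S3": 39.5},
    4: {"S1": 60, "S2": 86, "S3": 37.2},
    5: {"S1": 90, "S2": 88, "S3": 39.1},
}
lists = {t: reading_list(raw) for t, raw in raw_stream.items()}
print(occurrences(lists, ["S1H", "S3H"]))
```

Using sorted lower bounds with a binary search keeps each scale in one place, so adding a sensor or changing a threshold only touches the `SCALES` table.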
3 Related Works
Real-time activity recognition from body sensor networks or wearable sensor devices has become an important research area that can help to detect abnormal human behavior. Many active works exist in this area [7, 8, 20, 21, 27, 28, 39, 40]. However, the problem with these methods for detecting changes in human behavior is their use of supervised techniques that require human intervention. Recently, data mining techniques have become popular for measuring human behavior in real time from BSN data [5, 6, 9, 11-14, 26, 27]. Candás et al. [27] proposed an automatic data mining method to detect abnormal human behavior using physical activity measurements. Abnormal human behavior is detected as an increase or decrease of physical activity relative to the historical data. Wang et al. [26] proposed a pattern-based real-time algorithm to recognize complex, high-level human activities. In general, most of the pattern mining algorithms used in the above works focus on mining frequent activity patterns to detect anomalies in behavior. They are limited in terms of detecting periodic changes of human behavior. Our goal is to mine the inherent regularity of different parameter readings obtained from different body sensors, for the purpose of following up the patient's health condition to prevent some kinds of chronic diseases.

Table 1. A SENSOR DATASET (SD)
Id   Epoch
1    S4, S1
2    S3, S2, S1, S5
3    S6, S1, S5, S2
4    S1, S2, S5, S3
5    S1, S5, S2, S6
6    S4, S2, S3
7    S5, S3, S4
8    S4, S5, S6
9    S2, S3, S4

Some research focuses on mining periodic patterns [41], [42] and cyclic patterns [43] in various data sets such as time-series or sequential data sets. The periodic pattern mining problem in time-series data focuses on the cyclic behavior of patterns either in the whole time series or at some point of it. Chanda et al. [41] proposed an algorithm that generates flexible periodic patterns using a suffix tree and can handle variable starting positions for mining periodicity without recalculation. Sridevi et al. [42] describe the STNR (suffix tree-based noise resilient) algorithm to detect symbol and sequence periodicity, as well as segment periodicity, in time-series data sets. Hu et al. [43] presented the mining of cyclically repeated patterns with skewed repetition distribution in sequential data sets. Although mining periodic and cyclic patterns is closely related to our work, these techniques cannot be directly applied to finding regular patterns in BSN data sets, which are similar to transactional databases and are updated regularly.
Recently, Tanbeer et al. [15, 16, 17] proposed the Regular Pattern tree (RP-tree for short) to mine the exact set of regular patterns from transactional databases. This approach requires two database scans and uses a regularity measure, determined by the maximum occurrence interval of a pattern in a database, and a regularity threshold to identify such patterns. In a BSN data stream scenario, however, the database is updated with a new block of data at regular time intervals (incremental). Also, the regularity threshold needs to be set appropriately (interactive) for finding regular patterns. Thus, the previous RP-tree approach is no longer suitable for finding regular patterns in BSN data sets. Although there are some works on the problem of incremental and interactive pattern mining [29-37], all of them focus on frequent pattern mining. Some of them use a tree structure with a single scan over the database, and others use multiple scans.
Again, some of them consider both incremental and interactive mining, while the others handle only one of the two. However, none of the above algorithms can be applied to efficiently mine regular patterns from BSN data sets, which resemble transactional databases. Therefore, there is a need to develop an efficient mining technique to address the problem of incremental and interactive regular pattern mining from body sensor data. Besides, BSNs generate a huge volume of data, primarily due to the high density of sensor nodes and the high data generation rate. Also, on many occasions (e.g., data analysis and prediction) we may need to handle data and/or knowledge generated from multiple BSNs distributed across multiple sites or cloud-based implementations [44-48]. Traditional data mining methods, which assume that data are centralized, memory-resident, and static, may no longer be valid in such data-intensive and distributed environments. In [44, 45], the authors proposed the integration of BSNs and cloud computing for next-generation large-scale BSN applications. In [46], the authors describe a parallel frequent data mining strategy for the cloud computing environment. However, few efforts consider distributed pattern mining for body sensor data on a distributed platform. Therefore, in this paper, we also focus on how our proposed SDR-tree based sensor regular pattern mining approach can cope with a distributed environment in order to handle large databases.
4 Problem Definition
Similar to the problem definition in [17], we present the basic notation and definitions of regular pattern mining in a body sensor database. Let $L = \{s_1, s_2, \ldots, s_n\}$ be the set of body sensors in a particular body sensor network. A set $X = \{s_j, \ldots, s_k\} \subseteq L$, where $j \le k$ and $j, k \in [1, n]$, is called a pattern of sensors. A body sensor database, SD, over L is defined to be a set of epochs $T = \{t_1, \ldots, t_m\}$, where each epoch $t = (tid, Y)$ is a tuple in which tid is the timeslot-id of the sensor event occurrence (we assume that the time space is divided into equal-sized slots) and Y is a pattern of event-detecting sensors that report events within the same time slot. If $X \subseteq Y$, we say that t contains X, or that X occurs in t, and such a timeslot-id is denoted $t_j^X$, $j \in [1, m]$. Therefore, $T^X = \{t_j^X, \ldots, t_k^X\}$, with $j, k \in [1, m]$ and $j \le k$, is the set of all timeslot-ids where X occurs in SD.
Definition 1 (a period of X). Let $t_{j+1}^X$ and $t_j^X$, $j \in [1, (m-1)]$, be two consecutive timeslot-ids in $T^X$. The number of timeslots (or the time difference) between $t_{j+1}^X$ and $t_j^X$ is defined as a period of X, say $p^X$ (i.e., $p^X = t_{j+1}^X - t_j^X$, $j \in [1, (m-1)]$). To simplify the period computation, a 'null' epoch with no sensor data is considered at the beginning of SD, i.e., $t_f$ = null, where $t_f$ represents the first epoch to be considered. Similarly, $t_l$, the last epoch to be considered, is the m-th epoch in SD, i.e., $t_l = t_m$. For instance, in the body sensor database in Table 1, the set of epochs where the pattern "S2,S6" appears is $T^{S2,S6} = \{3, 5\}$. Therefore, the periods for "S2,S6" are 3 (= 3 - $t_f$), 2 (= 5 - 3), and 4 (= $t_l$ - 5), where $t_f = 0$ and $t_l = 9$. The above occurrence periods carry relevant information about the appearance behavior of a pattern. As discussed in [17], a pattern is not regular if, at any stage in the database, it appears after a sufficiently large period. The largest occurrence period of a pattern, therefore, provides the upper limit of its periodic occurrence characteristic. Hence, the measure of a pattern being regular in an SD (i.e., the regularity of that pattern) can be defined as follows.
Definition 2 (regularity of pattern X). For a given $T^X$, let $P^X$ be the set of all periods of X, i.e., $P^X = \{p_1^X, \ldots, p_r^X\}$, where r is the total number of periods of X in SD. Then, the regularity of X is denoted as $reg(X) = \max(p_1^X, \ldots, p_r^X)$. For example, in the database of Table 1, reg(S2,S6) = 4, i.e., max(3, 2, 4). A pattern is called a regular pattern if its regularity is no more than a user-given maximum regularity threshold called max_reg, λ, with 1 ≤ λ ≤ |SD|. The regularity threshold can be set as a percentage of the database size; e.g., max_reg = 10% of |SD| indicates λ = 0.1 × |SD|.
The regular pattern mining problem, given a λ and an SD, is to discover the complete set of regular patterns having regularity no more than λ in the SD. $RP_{SD}$ refers to the complete set of all regular patterns in an SD for a given max_reg. Let SD+ and SD- respectively denote the sets of epochs added to and deleted from SD, written $db^{+}$ and $db^{-}$ for an individual update block. The updated database, denoted USD, is obtained as $SD \cup db^{+}$, $SD - db^{-}$, or $SD \cup db^{+} - db^{-}$. Given SD, $SD_i^{+/-}$ (where i is the number of updates on SD) and a λ, incremental sensor regular pattern mining is to discover $RP_{USD}$. Interactive sensor regular pattern mining is to find $RP_{SD}$ or $RP_{USD}$ as λ changes while the database is kept fixed.
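Definitions 1 and 2 can be checked directly against Table 1 with a short sketch. The code below is an illustrative brute-force computation, not the SDR-tree mining technique proposed in this paper: the names `occurrence_tids`, `periods`, `regularity`, and `regular_patterns` are assumptions, and the exhaustive enumeration of patterns is exponential in the number of sensors, so it is only meant for tiny examples.

```python
from itertools import combinations

# Table 1: timeslot-id -> set of sensors reporting in that epoch.
SD = {
    1: {"S4", "S1"},
    2: {"S3", "S2", "S1", "S5"},
    3: {"S6", "S1", "S5", "S2"},
    4: {"S1", "S2", "S5", "S3"},
    5: {"S1", "S5", "S2", "S6"},
    6: {"S4", "S2", "S3"},
    7: {"S5", "S3", "S4"},
    8: {"S4", "S5", "S6"},
    9: {"S2", "S3", "S4"},
}


def occurrence_tids(sd, pattern):
    """T^X: sorted timeslot-ids of the epochs that contain pattern X."""
    wanted = set(pattern)
    return sorted(tid for tid, sensors in sd.items() if wanted <= sensors)


def periods(tids, last_tid, first_tid=0):
    """P^X: gaps between consecutive occurrences, including the gap from the
    'null' first epoch (t_f = 0) and the gap up to the last epoch t_l."""
    points = [first_tid] + list(tids) + [last_tid]
    return [later - earlier for earlier, later in zip(points, points[1:])]


def regularity(sd, pattern):
    """reg(X): the largest period of X in the database."""
    return max(periods(occurrence_tids(sd, pattern), last_tid=max(sd)))


def regular_patterns(sd, max_reg):
    """Naive enumeration of all patterns whose regularity is at most max_reg."""
    sensors = sorted(set().union(*sd.values()))
    found = {}
    for size in range(1, len(sensors) + 1):
        for pattern in combinations(sensors, size):
            reg = regularity(sd, pattern)
            if reg <= max_reg:
                found[pattern] = reg
    return found


print(periods(occurrence_tids(SD, ["S2", "S6"]), last_tid=9))
print(regularity(SD, ["S2", "S6"]))
print(regular_patterns(SD, max_reg=3))
```

Re-running `regular_patterns` with a different `max_reg` on the same `SD` corresponds to the interactive setting, while adding or removing entries of `SD` before the call corresponds to the incremental one.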
5 SDR-tree: Design, Construction and Mining
The proposed Sensor Data Regularity-tree (SDR-tree) is designed to capture the complete (updated) BSN data with a single scan of the sensor network readings. It captures all information of each sensor epoch in a compact structure, which allows us to avoid repeated scanning of the sensor database.
5.1 SDR-tree Structure
Before discussing the construction process of an SDR-tree in detail, we provide a brief description of the SDR-tree structure. Similar to an RP-tree, the SDR-tree has a root node referred to as "null" and a set of sub-trees (children of the root). It also maintains a header table, called the sensor data table (SD-table), to capture information for each distinct sensor with its relative regularity in the SD. A separate pointer from each sensor in the SD-table points to the first node in the SDR-tree that carries the sensor. Like an RP-tree, an SDR-tree has two types of nodes: ordinary nodes and tail-nodes. While both types maintain parent, children, and node traversal pointers, a tail-node additionally keeps track of all epochs (in a tid-list) where it is the last node. Thus, N[t1, t2, …, tn] represents a tail-node, where N is the name of the sensor node and ti, i ∈ [1, n], is the timeslot-id of an epoch where N is the last sensor (n being the total number of epochs from the root down to the node). It is important to note that neither an ordinary node nor a tail-node in an SDR-tree maintains a support count value.
Algorithm 1: SDR-tree construction
Input: SD: the sensor database
Output: An SDR-tree
1  begin
2    create the root, R, of an SDR-tree, T, and label it as "null";
3    for each epoch ti in SD do
4      if ti ≠ NULL then
6        sort ti according to given order;
7        call Insert_SDR-tree(ti, R);
8      end
9    end
10 end
11 Procedure Insert_SDR-tree(ti, R)
12 begin
13   let the sorted sensor list in ti be [y|Y], where y is the first sensor and Y is the remaining list;
14   set the m-field for y in the SD-table;
15   if R has a child C such that C.sensor = y.sensor then
16     select C as the current node;
17   else
18     create a new node C as child of R;
19   end
20   if y = the tail-sensor of ti then
21     if C = an ordinary node then
22       assign a tid-list to C;
23     end
24     add the tid of ti in C's tid-list;
25   else
26     call Insert_SDR-tree(Y, C);
27   end
28 end
Fig. 1: SDR-tree construction algorithm
Even though the node structures of an RP-tree and our SDR-tree are similar, the structure and the construction and maintenance processes of the SD-table and the SDR-tree differ significantly from those of the R-table and the RP-tree. Unlike the R-table of an RP-tree, the SD-table in an SDR-tree consists of five fields in sequence (i, r, tl, m, p): (i) the sensor name (Si), (ii) the regularity of Si (r), (iii) the most recent tid where Si occurred (tl), (iv) a one-bit flag (m) to indicate any change for Si, and (v) a pointer to the SDR-tree for Si (p). The regularity (r) and tl of each sensor are calculated after constructing the SDR-tree and traversing it once, as explained in the next section. The m field is set only if the sensor data is modified (i.e., the sensor appeared or was deleted) in any epoch in the current database (e.g., either the original SD or SD+). The pointer p facilitates fast traversal of the tree for sensor Si.
[Fig. 2: Construction of an SDR-tree for the SD of Table 1. Panels: (a) SDR-tree after inserting tid = 1; (b) SDR-tree after inserting tid = 9; (c) SDR-tree after refreshing the SD-table.]
5.2 SDR-tree Construction
The construction of an SDR-tree is similar to that of the FP-tree [10] and RP-tree [12]. However, unlike the FP-tree and RP-tree, it (i) uses a single database scan and (ii) captures the complete database information in a compact fashion. Moreover, an SDR-tree can be constructed without prior knowledge of the regularity threshold. The single-pass construction also allows the SDR-tree to arrange sensors according to any canonical order determined by the user prior to the tree construction, such as lexicographic or alphabetical order, or according to some specific order on sensor properties (e.g., weights, values, or some constraints). Once the sensor order is determined (say, for SD), all sensors follow this order in our SDR-tree for subsequent updated databases (e.g., $SD \cup db_1^{+}$, $SD \cup db_1^{+} \cup db_2^{+}$, …). With this setting (i.e., the canonical order), an SDR-tree holds the following property: Property 1. Sensors in an SDR-tree are arranged in a fixed global (canonical) order. The SDR-tree construction algorithm is presented in Algorithm 1 in Fig. 1. Let us walk through the SDR-tree construction for the SD in Table 1 in lexicographic order (Fig. 2) by following the algorithm. The construction of the SDR-tree starts with a "null" root node (line 2). As shown in Fig. 2(a), the first epoch {S4, S1} (i.e., tid = 1) is sorted in lexicographic order (line 6) and inserted into the tree (lines 7, 11-28). The tid information of the epoch is recorded in the tail-node "S4:1" (line 24). This figure also shows the status of the SD-table, which sets the m-field for both sensors (i.e., 'S1' and 'S4'), indicating that these two sensors appeared in the current SD (line 14). To simplify the figures, the node traversal pointers are not shown. After inserting all the epochs in a similar fashion, the final SDR-tree is given in Fig. 2(b).
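The construction walk-through above can be reproduced with the following sketch of Algorithm 1. It is a simplified illustration under stated assumptions: the class and method names (`Node`, `SDRTree`, `insert`, `tail_nodes`) are hypothetical, the recursive Insert_SDR-tree procedure is written as a loop, the SD-table keeps a plain list of nodes per sensor instead of the pointer p with node traversal links, and the r and tl fields are omitted because they are filled in only when the SD-table is refreshed.

```python
class Node:
    """SDR-tree node; tid_list stays None for ordinary nodes."""

    def __init__(self, sensor, parent=None):
        self.sensor = sensor
        self.parent = parent
        self.children = {}     # sensor name -> child Node
        self.tid_list = None   # assigned only when the node becomes a tail-node


class SDRTree:
    def __init__(self):
        self.root = Node(None)  # the "null" root
        self.sd_table = {}      # sensor -> {"m": flag, "nodes": [Node, ...]}

    def insert(self, tid, sensors):
        """Insert one epoch; null (empty) epochs are skipped."""
        if not sensors:
            return
        node = self.root
        # Fixed canonical order; plain string order is enough for S1..S6.
        for sensor in sorted(set(sensors)):
            entry = self.sd_table.setdefault(sensor, {"m": False, "nodes": []})
            entry["m"] = True                 # sensor appeared in this update
            child = node.children.get(sensor)
            if child is None:                 # no shared prefix: extend the path
                child = Node(sensor, node)
                node.children[sensor] = child
                entry["nodes"].append(child)
            node = child
        if node.tid_list is None:             # ordinary node becomes a tail-node
            node.tid_list = []
        node.tid_list.append(tid)

    def tail_nodes(self):
        """Map each root-to-tail path to the tid-list stored at its tail-node."""
        found = {}
        stack = [(self.root, ())]
        while stack:
            node, path = stack.pop()
            if node.tid_list is not None:
                found[path] = list(node.tid_list)
            for child in node.children.values():
                stack.append((child, path + (child.sensor,)))
        return found


# Table 1, inserted in a single scan.
SD = {
    1: ["S4", "S1"], 2: ["S3", "S2", "S1", "S5"], 3: ["S6", "S1", "S5", "S2"],
    4: ["S1", "S2", "S5", "S3"], 5: ["S1", "S5", "S2", "S6"], 6: ["S4", "S2", "S3"],
    7: ["S5", "S3", "S4"], 8: ["S4", "S5", "S6"], 9: ["S2", "S3", "S4"],
}
tree = SDRTree()
for tid in sorted(SD):
    tree.insert(tid, SD[tid])
for path, tids in sorted(tree.tail_nodes().items()):
    print(path, tids)
```

Because the sensor order never depends on counts or thresholds, later epochs can be inserted with the same `insert` call without restructuring the existing paths.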
As mentioned before, once the SDR-tree is constructed, we use the sensor pointers from the SD-table to traverse the tree and calculate the regularity (r) of each sensor in the SD-table. We call this process of updating the SD-table entries refreshing the SD-table. To assist this process, we assign a temporary array to each sensor in the SD-table and accumulate in it the tid(s) of the tail-node(s) by traversing the whole tree once. This process of accumulating tid(s) starts from the bottom-most sensor of the SD-table and ends with the top-most sensor. Continuing with our running example, after visiting all the tail-nodes of the last sensor 'S6' in the SD-table, the contents of the temporary arrays for sensors 'S1', 'S2', 'S4', 'S5', and 'S6' (i.e., sensors from the tail-nodes up to the root) are S1:{3, 5}, S2:{3, 5}, S4:{8}, S5:{3, 5, 8}, and S6:{3, 5, 8}. We repeat the whole process for each sensor in the SD-table. Thus, the temporary array of every sensor contains the complete list of its tids when we finish the tree traversal for the top-most sensor in the SD-table. For example, the set of epochs we get for sensor 'S1' is $T^{S1} = \{1, 2, 3, 4, 5\}$. It is then a trivial calculation to find $P^{S1}$ from $T^{S1}$, which gives reg(S1) = 4 and a tl value of 5 for 'S1'. Similarly, for 'S3', since $T^{S3} = \{2, 4, 6, 7, 9\}$, reg(S3) = 2 and tl = 9. Finally, Fig. 2(c) shows the final status of the SDR-tree and the SD-table with the regularity and the last tid of each sensor.
The SDR-tree update mechanism discussed above is also effective for updating the tree on deletion of epoch(s). To keep the tree updated and ready to mine after the deletion of each epoch (say t), we follow these steps. First, we visit each tail-node, remove the tid of t from its tid-list (if it contains that tid), and decrement by one every tid in the list that is greater than the tid of t. Second, if the tid-list contains only 0 (zero) or no tid, we remove the path from the tail-node up to the root. Third, we refresh the SD-table to reflect the updated information. Since only the tail-nodes in an SDR-tree keep epoch information, adjusting only the tid values in the tid-lists of the tail-nodes guarantees a complete update of the SDR-tree for epoch deletions from the database. The SD-table refreshing process completes the SDR-tree construction and makes the tree readily available for mining and/or for further updates. To reflect the next update for each sensor, all m-fields in the SD-table are reset before the update of the SDR-tree, or as the mining operation completes. Based on the SDR-tree construction technique discussed above, we have the following property and lemmas of an SDR-tree. For each epoch t in an SD, let sensor(t) be the set of all sensors in t; it is called the full sensor projection of t.
[Fig. 3: The SDR-tree on increment of SD. Panels: (a) Increment of SD (the epochs of SD, tids 1-9 as in Table 1, followed by the added blocks db1+ with tids 10-11 and db2+ with tids 12-13); (b) SDR-tree after inserting db1+; (c) SDR-tree after refreshing the SD-table; (d) SDR-tree after db2+.]
Property 2. An SDR-tree maintains sensor(t) for each epoch in an SD only once.
Lemma 1. Given a sensor epoch database SD, sensor(t) of all epochs in SD can be derived from the SDR-tree for the SD.
Proof. Based on the SDR-tree construction mechanism and Property 2, sensor(t) of each epoch t is mapped to only one path in the SDR-tree, and any path from the root to a tail-node maintains the complete projection for exactly n epochs (where n is the total number of entries in the tid-list of the tail-node).
Lemma 2. The size of an SDR-tree (without the root node) on a sensor epoch database SD is bounded by $\sum_{t \in SD} |sensor(t)|$.
Proof. According to the SDR-tree construction technique and Lemma 1, each epoch t contributes at most one path of size |sensor(t)| to an SDR-tree. Therefore, the total size contribution of all epochs is at most $\sum_{t \in SD} |sensor(t)|$. However, since there are usually many common prefix patterns among the epochs, the size of an SDR-tree is normally much smaller than $\sum_{t \in SD} |sensor(t)|$.
It may be tempting to assume that the SDR-tree is memory inefficient, as it explicitly maintains tids. However, we argue that an SDR-tree achieves memory efficiency by (i) keeping tid information only at the tail-nodes and (ii) avoiding the support count field in each node. Moreover, various efficient frequent pattern mining tree structures in the literature were designed to maintain tid information [18]. To a certain extent, some of these studies additionally maintain the support count and/or the tid information [23], [24] in each tree node. Furthermore, with modern technology, main memory space is no longer a big concern. Hence, we make the same realistic assumption as in many studies [15], [22] that we have enough main memory space (in the sense that the trees can fit into memory). Therefore, with one SD scan, the SDR-tree maintains the complete SD information in a compact manner.
Once the SDR-tree is constructed, we use an FP-growth-based pattern growth mining technique to discover the complete set of regular patterns from it for the current database. Before discussing the mining process, we present, in the next subsection, how the SDR-tree can efficiently handle database updates.
5.3 The SDR-tree in Incremental Database
Because of Property 1, sensors in an SDR-tree are arranged according to a fixed global order, which in turn allows easy maintenance of the tree in incremental databases. We use the following example to illustrate the SDR-tree's maintenance procedure upon an increment of the database. In the later part of this subsection, we discuss how the SDR-tree can be efficiently updated upon deletion of epochs from the database.
Suppose the DB in Table 1 is updated by two blocks of epochs (db1 and db2), each consisting of one or more epochs, as shown in Fig. 3(a). Fig. 3 also shows the status of our SDR-tree after inserting the epochs in db1. Since the SDR-tree always maintains a fixed global order, new epochs in db1 (i.e., tids 10 and 11) can be inserted in the same order, following the construction process discussed in the previous subsection. It is important to note that the m-field values for ‘S1’, ‘S2’, ‘S4’, ‘S5’, and ‘S6’ in the SD-table are set again in Fig. 3(b), which indicates that only these sensors appear in db1. Later in this section, we explain how this information (i.e., the status of the m-field) in the SD-table significantly reduces the mining cost during incremental mining.
To obtain the updated regularity of each sensor, we need to refresh the SD-table once again. We save a substantial amount of effort in this process by restricting the refresh to the sensors whose m-field is set (i.e., sensors ‘S1’, ‘S2’, ‘S4’, ‘S5’, and ‘S6’). With the help of the contents of the temporary array and the previous values in the r and tl fields, it is rather trivial to obtain the updated regularity of each such sensor. For the other sensors in the SD-table (e.g., ‘S3’, and ‘S6’), we only compare the tl value of each sensor with the tid of tcur (i.e., tcur = 11), and the updated regularity of the sensor is calculated using its previous r and tl values. After refreshing the SD-table, the updated SDR-tree and corresponding SD-table for db1 are presented in Fig. 3(c). Similar to Fig. 3(c), Fig. 3(d) illustrates the status of the SDR-tree and corresponding SD-table after the update for db2.
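The bookkeeping described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: regularity is taken to be the largest gap between consecutive occurrences of a sensor, counting the gap from tid 0 to the first occurrence and from the last occurrence to tcur, and the SD-table is reduced to a hypothetical dictionary mapping each sensor to its (r, tl) pair. The function names and sample values are illustrative only.

```python
def regularity(tids, t_cur):
    """Largest gap between consecutive occurrences, including the gap from
    tid 0 to the first occurrence and from the last occurrence to t_cur."""
    points = [0] + sorted(tids) + [t_cur]
    return max(b - a for a, b in zip(points, points[1:]))


def apply_increment(table, new_epochs, t_cur):
    """Update the (r, tl) entries of a simplified SD-table for a block of new epochs."""
    touched = set()  # plays the role of the m-field
    for tid, sensors in new_epochs:
        for s in sensors:
            r, tl = table.get(s, (0, 0))
            table[s] = (max(r, tid - tl), tid)
            touched.add(s)
    for s, (r, tl) in table.items():  # close every entry against t_cur
        table[s] = (max(r, t_cur - tl), tl)
    return touched


# Hypothetical state after 9 epochs: sensor A occurred at tids 3, 5, 8 and B at 3, 5.
table = {"A": (regularity([3, 5, 8], 9), 8), "B": (regularity([3, 5], 9), 5)}

# A block of two new epochs (tids 10 and 11) in which only A appears.
touched = apply_increment(table, [(10, {"A"}), (11, {"A"})], t_cur=11)
```

Only the previous r and tl values are needed: a sensor that appears in the new block is updated per occurrence, while an untouched sensor such as B is updated with a single comparison against tcur.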
The SDR-tree update mechanism discussed above is also effective for updating the tree upon deletion of epochs. To keep the tree updated and in a ready-to-mine condition after the deletion of an epoch (say t), we follow three steps. First, we visit each tail-node, remove the tid of t from its tid-list (if it contains that tid), and decrement by one every tid in the list that is greater than the tid of t. Second, if the tid-list contains only 0 (zero) or no tid, we remove the path from the tail-node up to the root. Third, we refresh the SD-table to reflect the updated information. Since only the tail-nodes in an SDR-tree keep epoch information, adjusting only the tid values in the tid-lists of tail-nodes guarantees a complete update of the SDR-tree for epoch deletions from the database. For example, suppose the epochs with tids 6 and 7 in the database of Fig. 3(a) are deleted. The above tree adjustment process updates all tids after these epochs (i.e., tid = 8 becomes tid = 6, tid = 9 becomes tid = 7, and so on) in the tid-lists of the tail-nodes in the tree. Moreover, the SD-table refreshing mechanism updates the regularity of sensors ‘S1’, ‘S2’, ‘S3’, ‘S4’, ‘S5’, and ‘S6’ at the same time. Accordingly, we obtain the updated ready-to-mine SDR-tree.
It may be assumed that the SD-table refreshing mechanism of the SDR-tree needs more computation than scanning the database twice, as in the RP-tree. However, we argue that the cost of refreshing the SD-table by traversing the SDR-tree once is much less than that of scanning the database a second time, since reading epoch information from the memory-resident tree is much faster than scanning it from the database. Also note that we need to traverse the full SDR-tree for the initial tree only; for each update, we significantly reduce the tree traversal cost by traversing the tree for the recently updated sensors only (based upon the m-field value in the SD-table).
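The tid adjustment used for epoch deletion can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the hypothetical function `delete_epochs` receives the tail-node tid-lists as plain lists, handles several deleted epochs at once, and returns an empty list for a tail-node whose path would then be pruned.

```python
from bisect import bisect_left


def delete_epochs(tid_lists, deleted):
    """Remove deleted tids from every tail-node tid-list and renumber the survivors."""
    deleted = sorted(set(deleted))
    updated = []
    for tids in tid_lists:
        # Each surviving tid shifts down by the number of deleted tids below it.
        kept = [t - bisect_left(deleted, t) for t in tids if t not in deleted]
        updated.append(kept)
    return updated


# Hypothetical tail-node tid-lists before deleting the epochs with tids 6 and 7.
before = [[3, 5], [6, 9], [7], [8]]
after = delete_epochs(before, deleted=[6, 7])
```

With tids 6 and 7 deleted, tid 8 becomes 6 and tid 9 becomes 7, matching the example above, and the third tail-node is left with an empty tid-list.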
Unlike the SDR-tree, the RP-tree requires scanning the whole database twice at each update, which results in a high computation cost for calculating the updated regularity of each item and restricts its efficient use in incremental and interactive mining. In the next subsection, we discuss the regular sensor pattern mining process from the updated SDR-tree.
5.4 SDR-growth: Mining an SDR-tree
As mentioned before, once the SDR-tree is constructed, we mine regular patterns from it using a pattern growth-based approach. We call this mining technique SDR-growth. Similar to the FP-growth [10] mining approach, our SDR-growth recursively mines SDR-trees of decreasing size to generate regular sensor patterns by creating conditional pattern-bases and corresponding conditional trees without any additional database scan. Before discussing the SDR-growth mining process, we explore the following important property and lemma of an SDR-tree.
Property 3. Each tail-node in an SDR-tree maintains the occurrence information of all the nodes in the path (from that tail-node to the root) in the epochs of its tid-list.
Algorithm 2: The SDR-growth algorithm
Input: SDR-tree: an SDR-tree constructed on the SD; max_reg: the regularity threshold
Output: RP: set of regular sensor patterns for max_reg
begin
    for each sensor α from the bottom of the SD-table do
        call Build_PB(SDR-tree, α);
        call Mine(PBα, α);
    end
end
Procedure Mine(PBα, α)
begin
    call Build_CT(PBα);
    if CTα ≠ NULL then
        for each sensor β in SD-tableα of CTα do
            generate pattern β = β ∪ α as a regular sensor pattern, add it to RP, and store the last tid;
            call Build_PB(CTα, β);
            call Mine(PBβ, β);
        end
    end
end
Procedure Build_PB(SDR-tree, α)
begin
    for each tail-node Nα of α in SDR-tree do
        push-up the tid-list of Nα to its parent;
        project path Pα from the parent of Nα up to the root in the conditional pattern-base of α, PBα, with the tid-list of Nα;
        map the tid-list of Nα to temporary arrays for all sensors in Pα;
        update for Nα into the SD-table;
    end
    update entries for α from the SD-table;
    construct a SD-table, SD-tableα, for the PBα;
    calculate the regularity of each sensor x in SD-tableα from the contents of the respective temporary array Tx;
    return PBα;
end
Procedure Build_CT(PBα)
begin
    for each sensor β in SD-tableα do
        if reg(β) > max_reg then
            for each node Nβ in PBα do
                if Nβ is a tail-node then
                    push-up the tid-list of Nβ to its parent;
                end
                delete Nβ;
            end
            update SD-tableα for β;
        end
    end
    let CTα be the conditional tree constructed as such from PBα;
    return CTα;
end
Fig 4: SDR-tree mining algorithm
Lemma 3. Let Z = {a1, a2, …, an} be a path in an SDR-tree where node an, being the tail-node, carries the tid-list of the path. If the tid-list is pushed up to node an-1, then node an-1 maintains the occurrence information of path Z′ = {a1, a2, …, an-1} for the same set of epochs in the tid-list without any loss.
Proof: Based on Property 3, the tid-list in node an explicitly maintains the occurrence information of Z′ for the same set of epochs. Therefore, the same tid-list at node an-1 maintains exactly the same information for Z′ without any loss.
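The push-up step of Lemma 3 and the bottom-up recursion of Fig. 4 can be illustrated with a compact Python sketch. It is a simplification rather than the authors' implementation: paths are kept in a dictionary keyed by the sensor sequence instead of a linked prefix tree, the SD-table is replaced by recomputing regularity from the collected tids, and regularity is assumed to be the largest gap between consecutive occurrences (including the gaps from tid 0 and to tcur). The names `regularity`, `sdr_growth` and the toy data are hypothetical.

```python
def regularity(tids, t_cur):
    """Largest gap between consecutive occurrences, with 0 and t_cur as end points."""
    points = [0] + sorted(tids) + [t_cur]
    return max(b - a for a, b in zip(points, points[1:]))


def sdr_growth(paths, order, max_reg, t_cur, suffix=()):
    """Mine regular patterns bottom-up from a {path tuple: tid-list} mapping."""
    paths = {p: list(t) for p, t in paths.items() if p}
    found = {}
    for alpha in reversed(order):
        # After earlier push-ups, every path that contains alpha ends in alpha.
        tail_paths = [p for p in paths if p[-1] == alpha]
        if not tail_paths:
            continue
        base, alpha_tids = {}, []
        for p in tail_paths:
            tids = paths.pop(p)
            alpha_tids += tids
            prefix = p[:-1]
            if prefix:
                base.setdefault(prefix, []).extend(tids)   # conditional pattern-base
                paths.setdefault(prefix, []).extend(tids)  # push-up (Lemma 3)
        reg = regularity(alpha_tids, t_cur)
        if reg <= max_reg:  # non-regular sensors are dropped but still pushed up
            pattern = (alpha,) + suffix
            found[pattern] = reg
            prefix_order = order[:order.index(alpha)]
            found.update(sdr_growth(base, prefix_order, max_reg, t_cur, pattern))
    return found


# Hypothetical tail-node view of six epochs: path -> tids of the epochs on that path.
paths = {
    ("S1", "S2"): [1, 4],
    ("S1", "S2", "S3"): [2, 6],
    ("S2", "S3"): [3],
    ("S1", "S3"): [5],
}
patterns = sdr_growth(paths, order=["S1", "S2", "S3"], max_reg=2, t_cur=6)
```

For this toy input the sketch reports the three single sensors and the pair (S1, S2), each with regularity 2; the pairs containing S3 are pruned because their largest gap is 3.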
Using the features revealed by the above property and lemma, the SDR-growth algorithm, presented in Fig. 4, proceeds to construct the conditional pattern-base PBi for sensor Si, starting from the bottom-most sensor in the SD-table, by projecting only the prefix sub-paths of nodes labeled Si in the SDR-tree. During this projection, the algorithm only includes regular sensors. Whether a sensor is regular can be determined by a simple lookup (an O(1) operation) in the SD-table. There is no risk of omitting or double counting sensors. Since Si is the last sensor in the SD-table, each node labeled Si in the SDR-tree must be a tail-node. Therefore, SDR-growth pushes up the tid-lists of all such tail-nodes to their respective parent nodes in the SDR-tree and in PBi. The parent node is thus converted to a tail-node if it was an ordinary node; otherwise (i.e., if the parent is already a tail-node), the pushed-up tid-list is merged with its existing tid-list. All nodes labeled Si in the SDR-tree and the entry for Si in the SD-table are thereafter updated. Similar to the SD-table refreshing technique, to compute the regularity and the last occurring epoch of each sensor Sj in SD-tablei (i.e., the SD-table for PBi), the SDR-growth algorithm refreshes SD-tablei while constructing PBi. Therefore, computing reg(ij) from Tij by generating Pij becomes a rather trivial calculation.
[Figure 5: three panels. (a) SDR-tree after forming PBS6; (b) PBS6 for λ = 3; (c) CTS6 for λ = 3]
Fig. 5: Conditional pattern-base and conditional tree construction with the SDR-tree of Fig. 2(c) for λ = 3
Fig. 5(a) represents the status of the SDR-tree of Fig. 2(c) after creating PBS6, the conditional pattern-base of ‘S6’ (i.e., the bottom-most sensor in the SD-table), for λ = 3. The entry of ‘S6’ in the SD-table and all nodes representing sensor ‘S6’ in the SDR-tree (i.e., nodes “S6:3,5” and “S6:8” in Fig. 2(c)) are deleted. The tid-list of each such node is pushed up to its parent node, which is an ‘S5’ node in this example. PBS6 is constructed by projecting the prefix sub-paths of nodes “S6:3,5” and “S6:8”. Fig. 5(b) shows the structure of PBS6 after the projection of the prefix sub-paths. Note that only the nodes of regular sensors in each sub-path are accumulated in PBS6. For example, the nodes of sensors ‘S5’ and ‘S2’ for the node “S6:3,5”, and that of sensor ‘S5’ for “S6:8”, together construct PBS6. Fig.
5(b) also shows the status of SD-tableS6, which we obtain by executing the refresh SD-table operation for PBS6.
In the next step, SDR-growth constructs CTi, the conditional tree for Si, from PBi by removing the non-regular sensors from SD-tablei and their nodes from PBi. The tid-list of each deleted node is pushed up to its parent node, as done before. CTS6, the conditional tree for ‘S6’, is created by removing sensor ‘S2’ from SD-tableS6 and node “S2” from PBS6, since the regularity of ‘S2’ in SD-tableS6 is greater than 3, the regularity threshold. CTS6 is shown in Fig. 5(c). From CTi we create and store the set of regular patterns prefixing the sensor Si. Along with creating the regular sensor patterns, the algorithm also stores the last tid related to each mined pattern. From CTS6 the pattern “S5,S6” with reg(S5,S6) = 3 is generated, and the value of tl of ‘S5’ (i.e., 8) is explicitly stored. The whole process of conditional pattern-base and conditional tree construction is repeated until the SD-table becomes empty.
Before the update of the database, the m-field of each sensor in the SD-table is reset. Therefore, while mining after the next database increment (say, after inserting db1), we mine only for the sensors whose m-fields in the SD-table are found set. Since we store all regular patterns generated in the previous mining operation with their respective last tids, it is easy to update the regularity of the other sensors in the SD-table by considering tcur = the last tid in db1.
While the database is fixed, mining regular sensor patterns for multiple max_reg values (say, λ1, λ2, λ3, …) can easily be performed with the already constructed SDR-tree for the database. In such a case, we do not need to rebuild the SDR-tree for each max_reg, since it contains the full database information. Thus, the SDR-tree holds the important ‘build-once-mine-many’ property of interactive mining.
Through the above mining process, the complete set of regular patterns for a given max_reg can be generated from an SDR-tree constructed on a database. SDR-growth is complete because it takes only regular sensors into consideration and performs the mining operation from bottom to top. Moreover, the SDR-tree, with its important feature of using previous mining information, offers an efficient technique for mining regular patterns from incremental and stream databases. The next subsection describes an additional functionality of our proposed SDR-tree.
5.5 SDR-tree Mining in Distributed Environment
Performance and scalability have been considered the bottlenecks of most data mining techniques. Distributed mining typically provides a framework for scalability, while parallel mining may offer better efficiency. Our SDR-tree is capable of effectively addressing these two issues by adapting itself to a parallel and distributed environment. This adaptation requires only minor tuning in the mining phase, while using the same tree construction and tree refreshing strategies. Before discussing the details of this feature of the SDR-tree, let us explain the concept of mining sensor regular patterns in a parallel and distributed environment.
We consider that the underlying sensor database is larger than the volume of data that the available resources (e.g., computation power and disk space) at a local site can process within a reasonable time. Hence, the whole computation is divided into multiple similar parts, and multiple systems located at multiple sites are assigned to solve each part in parallel. The underlying database is also divided into multiple (similar sized) parts and assigned to the multiple systems. We develop an SDR-tree for each local site with the data assigned to it and apply the SDR-tree mining technique to generate sensor regular patterns for the local site. All locally generated sensor regular pattern sets are then accumulated to discover the true global sensor regular patterns from them.
A more formal problem definition is as follows. A homogeneous distributed system of n sites, denoted as Site1, Site2, …, Siten, is considered. The sensor database SD is horizontally divided into n partitions sd1, sd2, …, sdn, assuming that each partition sdp is assigned to a site Sitep, where $1 \le p \le n$. Let regp(X) denote the regularity count of sensor pattern X in sdp. Therefore, for a sensor pattern X, we call reg(X) and regp(X) the global regularity count and the local regularity count in sdp, respectively.
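How a global regularity can be recovered from per-site information under this horizontal partitioning can be illustrated with a short Python sketch. It rests on assumptions that are not taken from the paper: tids are global, each partition is a contiguous block of epochs listed in tid order, sites in which the pattern never occurs are simply omitted, and regularity is the largest gap between consecutive occurrences (including the gaps from tid 0 and to the final tid). The sketch keeps, per site, the first tid, the last tid and the largest gap inside the site; the function names are hypothetical.

```python
def local_summary(tids):
    """First tid, last tid and largest internal gap of one site's occurrences."""
    tids = sorted(tids)
    gaps = [b - a for a, b in zip(tids, tids[1:])]
    return tids[0], tids[-1], max(gaps, default=0)


def global_regularity(summaries, t_end):
    """Combine per-site summaries, given in partition order, into one regularity."""
    reg, prev_last = 0, 0
    for first, last, internal in summaries:
        # Gap across the site boundary, then the largest gap inside the site.
        reg = max(reg, first - prev_last, internal)
        prev_last = last
    return max(reg, t_end - prev_last)


# Hypothetical pattern occurring at tids 1, 2, 9 and 10 in a database of 12 epochs,
# split into two sites holding tids 1..6 and 7..12.
site_tids = [[1, 2], [9, 10]]
summaries = [local_summary(t) for t in site_tids]
reg = global_regularity(summaries, t_end=12)
```

Here both sites see internal gaps of only 1, but the gap between the last tid of the first site and the first tid of the second site is 7, so the combined value is 7. This is why per-site first and last tids are needed in addition to local results.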
For a given max_reg, X is globally regular if reg(X) ≤ max_reg. The algorithm to mine sensor regular patterns using our SDR-tree in a parallel and distributed environment is presented in Fig 6. At each site, we create a local SDR-tree with the locally available data (line 3) and mine all regular sensor patterns from the local SDR-tree (line 4). We apply the mining technique discussed in the previous subsection on all local SDR-trees in parallel, with a minor difference: while mining in a distributed environment, in addition to finding the sensor regular patterns, we keep track of the epoch ids of the first and the last occurrences of each discovered pattern. Recall from Section 5.4 that, in the conditional tree construction phase, we inherently generate (through the tid push-up mechanism) the list of all epoch ids for a pattern and use this list to check whether the pattern is regular. For example, in Fig. 5(c) the conditional tree CTS6 for sensor S6 contains the list of tids (i.e., timeslot ids) related to each of its extensions, which is S5 in this case. Finally, from this CTS6 we generate sensor pattern “S5,S6” as a regular sensor pattern by calculating its regularity from the final tid-list {3, 5, 8} and comparing the regularity with max_reg.
Algorithm 3: SDR-tree mining in parallel and distributed environment
Input: n: the total number of sites; sdp: a sensor database at Sitep; max_reg: the regularity threshold
Output: RP: the set of global sensor regular patterns
1.  begin
2.    for each site Sitep do
3.      create local SDR-treep from sdp using Algorithm 1;
4.      discover RPp, the set of local sensor regular patterns, from SDR-treep using SDR-growth;
5.    end
6.    let the first tid of X in Sitep be the timeslot id of the first epoch containing sensor pattern X in Sitep;
7.    let the last tid of X in Sitep be the timeslot id of the last epoch containing X in Sitep;
8.    let regc(X) be the combined regularity of a sensor pattern X among all sites;
9.    assign …;
10.   assign to RPc the patterns common to all RPp, where RPp is the set of sensor regular patterns in Sitep and RPc is the set of combined sensor regular patterns;
11.   for each sensor pattern X in RPc do
12.     for each occurring Sitep do
13.       update regc(X) with the maximum difference between the first tid of X in Sitep and the last tid of X in the preceding occurring site;
14.     end
15.     assign to reg(X) the maximum of regc(X) and all regp(X), where regp(X) is the regularity of X in RPp in Sitep;
16.     if (reg(X) ≤ max_reg) then
17.       …;
18.       add X in RP;
19.     end
20.   end
21. end
Fig 6: SDR-tree mining in parallel and distributed environment
In order to facilitate parallel and distributed mining, at each site and for each regular sensor pattern, we explicitly record the first and the last tids as additional information, along with the pattern itself (lines 6 and 7). The first and the last tids of a pattern can easily be obtained from the final tid-list of the pattern in the corresponding conditional tree. For example, assume CTS6 in Fig. 5(c) is constructed at a site of the distributed environment; then the first and the last tids of the regular sensor pattern “S5,S6” will be recorded as 3 and 8, respectively. The same algorithm is used at all sites in parallel to record this additional information along with each regular sensor pattern generated at each site.
Lemma 4. If a sensor pattern X is found regular in Sitep but not in Siteq, where $p \ne q$, $1 \le p, q \le n$, and n is the number of sites, then X cannot be a global regular pattern.
Proof: Based on the definition of regularity presented in Section 2, X cannot be a regular sensor pattern if, at any stage in the database, X appears at an interval greater than the regularity threshold. If X is found non-regular in Siteq, it is guaranteed that at some stage in the overall global database X appeared at an interval greater than the regularity threshold. Hence, even if X is found regular in one or more of the other sites, X cannot be a globally regular sensor pattern.
Once the local regular sensor patterns are generated at each site, we accumulate only the patterns common to all sites (line 10) and discard the other patterns from further consideration, according to Lemma 4. For each of these patterns, we then compute the differences between the last tid and the first tid of every pair of consecutive sites in which the pattern occurs (we call such a site an occurring site for that pattern). The maximum of all the calculated differences is then computed (line 13). Finally, the maximum among all the regularity values from all sites and all such tid differences for the sites where the pattern occurs is computed as the global regularity of the pattern (line 15) and used to identify whether the pattern is globally regular or not (lines 16, 17). The algorithm stores all the global regular sensor patterns (line 18) in the discovered regular sensor pattern set, RP. The next section reports our experimental results.
6 EXPERIMENTAL RESULTS
Table 2. Dataset Characteristics
Dataset      Size (MB)  #Trans   #Items  Max TL  Avg TL
T10I4D100K   3.93       100000   870     29      10.10
chess        0.34       3196     75      37      37.00
kosarak      30.50      990002   41270   2498    8.10
6.1 Accuracy of the proposed technique
Before conducting the performance experiments, we tested the mining accuracy of our SDR-growth by comparing the regular sensor patterns it discovers with those obtained directly from the same set of epochs with a sequential code. For this purpose, we developed a sequential algorithm, shown in Fig. 7, that finds all regular sensor patterns for a given max_reg on a database. This sequential algorithm is a naïve approach that does not use an efficient mining technique. Hence, it took an extremely long time to find all regular sensor patterns, especially for large (e.g., kosarak) and dense (e.g., chess) datasets. We do not report this time, as we are interested only in the patterns returned by the algorithm. The sets of regular sensor patterns returned by this sequential algorithm for a specific max_reg on T10I4D100K, chess, and kosarak were found to be the same as the patterns discovered by our SDR-growth for the same max_reg on the same database, respectively. This experiment, therefore, confirms the accuracy of our SDR-growth in discovering regular sensor patterns from the SDR-tree on a database.
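A brute-force reference of this kind can be sketched in Python as follows. It is a hedged illustration rather than the code used in the experiments: the function name `regular_patterns_naive` and the toy epochs are hypothetical, regularity is assumed to be the largest gap between consecutive occurrences (including the gaps from tid 0 and to the last tid), and a pattern is kept when its regularity does not exceed max_reg. Because every non-empty subset of every epoch is enumerated, the cost grows exponentially with epoch length, which is consistent with the long running times mentioned above.

```python
from itertools import combinations


def regular_patterns_naive(epochs, max_reg):
    """Enumerate every non-empty sensor subset of every epoch and keep the regular ones."""
    tid_lists = {}
    t_cur = 0
    for tid, sensors in epochs:
        t_cur = max(t_cur, tid)
        items = sorted(sensors)
        for size in range(1, len(items) + 1):
            for pattern in combinations(items, size):
                tid_lists.setdefault(pattern, []).append(tid)
    result = {}
    for pattern, tids in tid_lists.items():
        points = [0] + sorted(tids) + [t_cur]
        reg = max(b - a for a, b in zip(points, points[1:]))  # maximum interval
        if reg <= max_reg:
            result[pattern] = reg
    return result


# Hypothetical toy database of six epochs.
epochs = [
    (1, {"S1", "S2"}),
    (2, {"S1", "S2", "S3"}),
    (3, {"S2", "S3"}),
    (4, {"S1", "S2"}),
    (5, {"S1", "S3"}),
    (6, {"S1", "S2", "S3"}),
]
reference = regular_patterns_naive(epochs, max_reg=2)
```

The returned dictionary maps each regular pattern to its regularity, so it can be compared directly with the output of any faster miner run on the same epochs and threshold.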
Algorithm 4: Finding sensor regular patterns without a mining technique (sequential approach)
Input: SD: a sensor database; max_reg: the regularity threshold
Output: RP: the set of sensor regular patterns
begin
    RP ← ∅; // RP is the set of patterns
    for each epoch ti in SD do
        if ti ≠ NULL then
            for each non-empty subset p in ti do
                if p ∈ RP then
                    add ti at the end of TIDp; // TIDp is the epoch list for p
                else
                    add p to RP;
                    add ti at the end of TIDp;
                end
            end
        end
    end
    for each p in RP do
        calculate reg(p) from TIDp; // by finding the maximum interval
        if reg(p) > max_reg then
            discard p from RP;
        end
    end
    return RP;
end
Fig. 7. Sequential algorithm to find regular sensor patterns
Now, to evaluate the performance of the proposed SDR-tree, various experiments were conducted on different datasets. We used several real datasets (e.g., chess, kosarak) and a synthetic dataset (T10I4D100K) that is commonly used in frequent pattern mining experiments. T10I4D100K is a synthetic dataset developed by the IBM Almaden Quest research group and obtained from http://cvs.buu.ac.th/ mining/Datasets/synthesis_data/. The other three datasets are real and have been obtained from the UCI Machine Learning Repository (University of California Irvine, CA). We use these datasets because they have characteristics similar to sensor epochs. However, for space considerations, we only report the results for a subset of them. To show the performance of SDR-tree mining in a distributed environment, we consider multiple sites with identical configurations. Each site consists of a 2.66 GHz CPU with 2 GB memory running Windows 7. Communications between sites are established through a message passing interface. All programs are written in Visual C++. We assume that each database is distributed among the sites and that the processor at each site has complete access to its portion of the database. We first compare the performance of the SDR-tree with that of the RP-tree in a traditional centralized environment. Then we show its performance in a distributed environment.
6.2 Compactness of the SDR-tree
In the first experiment, we report the results of the compactness test for our SDR-tree on the chess, T10I4D100K and kosarak datasets in Fig. 8. For each of the datasets, we constructed the SDR-tree and measured the amount of memory it requires to store the whole database content. T10I4D100K and kosarak are reasonably large datasets with a large number of transactions, whereas chess is a small dataset with long transactions. The results show that even though the SDR-tree captures the full database information, its size can easily be handled with currently available memory.
[Figure 8: bar chart; x-axis: Dataset (chess, T10I4D100K, kosarak); y-axis: Memory (KB), logarithmic scale from 1 to 100000]
Fig. 8. Compactness of the SDR-tree
6.3 Experiments on Incremental Mining
We show the results on the effectiveness of the SDR-tree in incremental regular pattern mining on the chess (with 3,196 transactions), kosarak (with around 1M transactions) and T10I4D100K (with 100K transactions) datasets. To obtain an incremental setup, we first divided kosarak into five consecutive slots of 200K transactions each, chess into three consecutive slots of around 1K transactions each, and T10I4D100K into four consecutive slots of 25K transactions each. Then we varied the number of incremental updates and tested the effect on runtime for both the RP-tree and the SDR-tree. We also varied the max_reg values at each update. The results are shown in Figs. 9, 10 and 11 for the kosarak, chess and T10I4D100K datasets, respectively.
[Figures 9 and 10: line charts of Time (sec) against No. of database update (×200K in Fig. 9, ×1K in Fig. 10) for RP-tree and SDR-tree; legend values: λ = 0.06% and λ = 0.02%; λ = 0.3% and λ = 0.5%]
Fig. 9. Incremental mining on kosarak dataset
Fig. 10. Incremental mining on chess dataset
For the RP-tree, the whole process is executed from scratch at each update of the database. In the case of our SDR-tree, however, we only perform the SDR-tree update operation and the corresponding mining at each update. Note that the trends in Fig. 11 are opposite to those of Fig. 9 and Fig. 10. This reflects that the overall runtime depends upon the size of RPDB, as shown in Table 3 for two max_reg values for T10I4D100K, chess, and kosarak. This table also shows that, for a fixed database, the number of regular patterns increases with max_reg in all datasets.
[Figure 11: line chart; x-axis: No. of database update (×25K); y-axis: Time (Sec); series: RP-tree and SDR-tree at λ = 0.2% and λ = 0.4%]
Fig. 11. Incremental mining on T10I4D100K dataset
As shown in the graphs, for lower max_reg values and smaller database sizes, both the RP-tree and the SDR-tree show similar performance. The RP-tree repeats the whole process from scratch at each update, whereas our SDR-tree uses only a single scan to insert the incremented portion of the database into the already constructed SDR-tree and then performs the mining operation. Therefore, the above experiments demonstrate that our SDR-tree significantly outperforms the RP-tree in regular pattern mining on incremental and stream databases.
Table 3. No. of Regular Patterns
chess (λ1 = 0.2%, λ2 = 0.4%)
|UDB|  1K    2K    3K
λ1     767   15    39
λ2     3071  1023  559
T10I4D100K (λ1 = 0.3%, λ2 = 0.6%)
|UDB|  25K   50K   75K   100K
λ1     00    08    28    79
λ2     10    99    217   309
Kosarak (λ1 = 0.1%, λ2 = 0.2%)
|UDB|  100K  500K  1M
λ1     11    104   342
λ2     38    472   1597
6.3 Experiments on Interactive Mining
We performed an experiment on the viability of the SDR-tree for handling an interactive mining process. In this process, the user executes the mining operation on a fixed database with several regularity thresholds. We varied max_reg over the kosarak and T10I4D100K datasets and calculated the overall runtime required for the SDR-tree and the RP-tree. In Fig. 12, we only report the results when the values of max_reg are in increasing order, because in the opposite case (i.e., when the values of max_reg are in decreasing order) the set of regular patterns can trivially be obtained from the result set of the previous run with a higher max_reg value. The x-axes in the graphs indicate the increasing values of max_reg. Unlike the RP-tree, our SDR-tree needs to be constructed only once, and it can be used for all consecutive mining operations. So, we build the trees for the first max_reg value (i.e., 0.02% for kosarak and 0.3% for T10I4D100K) in each case. The values at the subsequent data points report only the mining time with the respective max_reg for the SDR-tree. For the RP-tree, in contrast, we need to consider the tree construction time with two database scans plus the mining time at each max_reg. As a result, the SDR-tree consistently achieves higher efficiency than the RP-tree in interactive mining, as shown in Fig. 12. Therefore, the above experiments demonstrate that our SDR-tree significantly outperforms the RP-tree in incremental and interactive regular pattern mining.
[Figure 12: two line charts comparing RP-tree and SDR-tree; x-axis: max_reg (%); y-axis: Time (Sec); one panel spans max_reg from 0.02 to 0.10 and the other from 0.25 to 0.60]
Fig. 12 Interactive mining with the SDR-tree
6.4 SDR-tree Performance in Distributed Environment
We measured the execution time of the SDR-tree over the T10I4D100K and kosarak datasets by varying the number of processors across multiple sites. Figs. 13 and 14 show the results for T10I4D100K and kosarak, respectively. The results show that there is no significant load imbalance among the processors for the two datasets. The time shown in the graphs is the total time for local SDR-tree construction and mining. For T10I4D100K, we see a linear decrease in the execution time as the number of processors increases. For the kosarak dataset, however, we see a rapid decrease in the execution time. From these results, it is clear that the execution time of the proposed SDR-tree mining decreases as the number of processors increases, and that an improved linear scale-up can be achieved for large datasets due to the reduction in the tree construction cost and the parallelization of the mining operation.
[Figure 13: horizontal bar chart for T10I4D100K (λ = 0.4%); x-axis: Time (Sec), 0 to 50; y-axis: No. of Processors, 2 to 5]
Fig. 13. Execution time of SDR-tree by varying no. of processors on T10I4D100K dataset
[Figure 14: horizontal bar chart for Kosarak (λ = 0.06%); x-axis: Time (S) x 100, 0 to 800; y-axis: No. of Processors, 2 to 5]
Fig. 14. Execution time of SDR-tree by varying no. of processors on kosarak dataset
7 Conclusions
In this paper, we have introduced the concept of incremental and interactive regular pattern mining, a new and interesting pattern mining problem, for body sensor data, and proposed a novel tree structure, called the SDR-tree, that efficiently captures the database content to facilitate a pattern growth-based mining technique. The experimental results demonstrate that the easy-maintenance feature of our SDR-tree provides time and space efficiency in regular pattern mining upon database updates. Moreover, when mining an updated database with the SDR-tree, the use of already mined results significantly reduces the cost of consecutive mining. However, a limitation of the proposed method is that it detects regular patterns by considering only vital parameters. Some factors can influence these parameters; for example, heart rate is influenced by factors such as physical exercise, sitting, standing, or lying. In the future, we will consider these issues while detecting regular patterns from BSN data.
Acknowledgement
This project was funded by the National Plan for Science, Technology and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, Kingdom of Saudi Arabia, Award Number (12-INF2885-02).
References
1. Barroso, A., Benson, J., et al., The DSYS25 sensor platform, In Proceedings of the ACM Sensys 2004, Baltimore, 2004
2. Gaber, M. M., Gama, J., Krishnaswamy, S., Gomes, J. B., & Stahl, F. Data stream mining in ubiquitous environments: state-of-the-art and current directions. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(2), 116-138, 2014
3. Bulling, A., Blanke, U., & Schiele, B., A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys (CSUR), 46(3), 33, 2014
4. D. Minnen, T. Starner, I. Essa, and C. Isbell, Discovering Characteristic Actions from On-Body Sensor Data, Proc. 10th IEEE International Symposium on Wearable Computers, pp. 11-18, 2006
5. T. Gu, L. Wang, Z. Wu, X. Tao, and J. Lu, A Pattern Mining Approach to Sensor-Based Human Activity Recognition, IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 9, pp. 1359-1372, 2011
6. Hemalatha, C. S., & Vaidehi, V. Frequent bit pattern mining over tri-axial accelerometer data streams for recognizing human activities and detecting fall. Procedia Computer Science, 19, 56-63, 2013
7. P. Rashidi, D. J. Cook, Mining Sensor Streams for Discovering Human Activity Patterns over Time, Proc. 2010 IEEE International Conference on Data Mining, pp. 431-440, 2010
8. C. Lombriser, N.B. Bharatula, D. Roggen, and G. Troster, On-Body Activity Recognition in a Dynamic Sensor Network, Proc. International Conference on Body Area Networks (BodyNets), 2007
9. R. Ali, M. ElHelw, L. Atallah, B. Lo and G-Z. Yang, Pattern Mining for Routine Behaviour Discovery in Pervasive Healthcare Environments, Proc. the 5th International Conference on Information Technology and Application in Biomedicine, China, pp. 241-244, May 30-31, 2008
10. J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns without Candidate Generation, Proc. ACM SIGMOD International Conference on Management of Data, pp. 1-12, 2000
11. M. C. Suman, K. Prathyusha, A Body Sensor Network Data Repository with a Different Mining Technique, International Journal of Engineering Science & Advanced Technology, vol. 2, Issue 1, pp. 105-109, 2012
12. Mooney, C. H., & Roddick, J. F. Sequential pattern mining--approaches and algorithms. ACM Computing Surveys (CSUR), 45(2), 19, 2013
13. F. Maqbool, S. Bashir, and A. R. Baig, E-MAP: Efficiently Mining Asynchronous Periodic Patterns, International Journal of Computer Science and Network Security, vol. 6, no. 8A, pp. 174-179, 2006
14. Amphawan, K., Lenca, P., & Surarerks, A. Mining top-k regular-frequent itemsets using database partitioning and support estimation. Expert Systems with Applications, 39(2), 1924-1936, 2012
15. S. K. Tanbeer, C. F. Ahmed, B.-S. Jeong, and Y.-K. Lee, CP-tree: A Tree Structure for Single-Pass Frequent Pattern Mining, Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD ‘08), pp. 1022-1027, 2008
16. S. K. Tanbeer, C. F. Ahmed, B.-S. Jeong, and Y.-K. Lee, Mining Regular Patterns in Transactional Databases, IEICE Trans. on Information & Systems, vol. E91-D, no. 11, pp. 2568-2577, 2008
17. S. K. Tanbeer, C. F. Ahmed, and B.-S. Jeong, Mining Regular Patterns in Incremental Transactional Databases, Proc. 12th Int. Asia-Pacific Web Conf., pp. 375-377, 2010
18. Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz, Catch the Moment: Maintaining Closed Frequent Itemsets Over a Data Stream Sliding Window, Knowledge and Information Systems, vol. 10, no. 3, pp. 265-294, 2006
19. Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed, Byeong-Soo Jeong, Mining Regular Patterns in Data Streams, Database Systems for Advanced Applications, Lecture Notes in Computer Science, Volume 5981, pp. 399-413, 2010
20. Giancarlo Fortino, Roberta Giannantonio, Raffaele Gravina, Philip Kuryloski, Roozbeh Jafari: Enabling Effective Programming and Flexible Management of Efficient Body Sensor Network Applications. IEEE T. Human-Machine Systems 43(1): 115-133 (2013)
21. N. Raveendranathan, S. Galzarano, V. Loseu, R. Gravina, R. Giannantonio, M. Sgroi, R. Jafari, and G. Fortino. From modeling to implementation of Virtual Sensors in Body Sensor Networks. IEEE Sensors Journal, Vol. 12, No. 3, pp. 583-593, Mar. 2012
22. C. K. Leung, Q. I. Khan, Z. Li, and T. Hoque, “CanTree: A Canonical-Order Tree for Incremental Frequent-Pattern Mining,” Knowledge and Information Systems, vol. 11, no. 3, pp. 287-311, 2007
23. X. Zhi-Jun, C. Hong, and C. Li, “An Efficient Algorithm for Frequent Itemset Mining on Data Streams,” Proc. International Conference on Management of Data, pp. 474-491, 2006
24. M. J. Zaki, C.-J. Hsiao, “Efficient Algorithms for Mining Closed Itemsets and Their Lattice Structure,” IEEE Trans. Knowledge and Data Engineering, vol. 17, no. 4, pp. 462-478, April 2005
25. S. K. Tanbeer, M. M. Hassan, M. Alrubaian, & B. S. Jeong. Mining Regularities in Body Sensor Network Data. Proc. International Conference on Internet and Distributed Computing Systems, pp. 88-99, 2015
26. Wang, L., Gu, T., Tao, X., & Lu, J. A hierarchical approach to real-time activity recognition in body sensor networks. Pervasive and Mobile Computing, 8(1), 115-130, 2012
27. Candás, J. L. C., Peláez, V., López, G., Fernández, M. Á., Álvarez, E., & Díaz, G. An automatic data mining method to detect abnormal human behaviour using physical activity measurements. Pervasive and Mobile Computing, 15, 228-241, 2014
28. Hemalatha, C. S., Vaidehi, V., & Lakshmi, R. Minimal infrequent pattern based approach for mining outliers in data streams. Expert Systems with Applications, 42(4), 1998-2012, 2015
29. Yun, U., & Ryang, H. Incremental high utility pattern mining with static and dynamic databases. Applied Intelligence, 42(2), 323-352, 2015
30. Lin C-W, Lan G-C, Hong T-P. An incremental mining algorithm for high utility itemsets. Expert Syst Appl 39(8):7173-7180, 2012
31. Hui, L., Chen, Y. C., Weng, J. T. Y., & Lee, S. Y. Incremental mining of temporal patterns in interval-based database. Knowledge and Information Systems, 1-26, 2015
32. Tseng VS, Shie B-E, Wu C-W, Yu PS. Efficient algorithms for mining high utility itemsets from transactional databases. IEEE Trans Knowl Data Eng 25(8):1772-1786, 2013
33. Song W, Liu Y, Li J. Mining high utility itemsets by dynamically pruning the tree structure. Appl Int 40(1):29–43, 2014 34. . Mallick B, Garg D, Grover PS. Incremental mining of sequential patterns: Progress and challenges. Int Data Anal 17(3):507–530, 2013 35. Caldersa T, Dextersb N, Gillisc JJM, Goethalsb B. Mining frequent itemsets in a stream. Inf Syst 39:233–255, 2014 36. Mallick B, Garg D, Grover PS. Incremental mining of sequential patterns: Progress and challenges. Int Data Anal 17(3):507–530, 2013 37. Yun U, Ryang H, Ryu K. High utility itemset mining with techniques for reducing overestimated utilities and pruning candidates. Expert Syst Appl 41(8):3861–3878, 2014 38. Lin, K. W., & Chung, S. H.. A fast and resource efficient mining algorithm for discovering frequent patterns in distributed computing environments. Future Generation Computer Systems. 2015 39. Y. Lee, W. Chung, Automated abnormal behavior detection for ubiquitous healthcare application in daytime and nighttime, in: IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), 2012, pp. 204–207. 40. Machado, I. P., Gomes, A. L., Gamboa, H., Paixão, V., & Costa, R. M.. Human activity data discovery from triaxial accelerometer sensor: Non-supervised learning sensitivity to feature extraction parametrization. Information Processing & Management, 51(2), 204-214. 2015 41. Chanda, A. K., Saha, S., Nishi, M. A., Samiullah, M., & Ahmed, C. F. An efficient approach to mine flexible periodic patterns in time series databases. Engineering Applications of Artificial Intelligence, 44, 46-63, 2015 42. Sridevi, S., Saranya, P., & Rajaram, S.. Mining Undemanding and Intricate Patterns with Periodicity in Time Series Databases. In Artificial Intelligence and Evolutionary Algorithms in Engineering Systems (pp. 785-792). Springer India. 2015 43. Hu, Y. H., Tsai, C. F., Tai, C. T., & Chiang, I. C. A novel approach for mining cyclically repeated patterns with multiple minimum supports. 
Applied Soft Computing, 28, 90-99, 2015 44. Fortino, G., Parisi, D., Pirrone, V. and Di Fatta, G., BodyCloud: A SaaS approach for community body sensor networks. Future Generation Computer Systems, Vol. 35, pp.62-79. 2014 45. Hassan, M.M., Lin, K., Yue, X. and Wan, J., A multimedia healthcare data sharing approach through cloud-based body area network. Future Generation Computer Systems, Online First, DOI:10.1016/j.future.2015.12.016, January 2016 46. Sheng, X.C., Xue, X.F. and Cheng, Y.P., Research on the Parallel Frequent Data Mining Strategy under the Cloud Computing Environment. In Applied Mechanics and Materials, Vol. 719, pp. 924-928, 2015 47. Hu, L., Zhang, Y., Feng, D., Hassan, M.M., Alelaiwi, A. and Alamri, A., Design of QoS-Aware Multi-Level MAC-Layer for Wireless Body Area Network. Journal of medical systems, 39(12), pp.1-11, 2015 48. Zhang, Y., Qiu, M., Tsai, C.W., Hassan, M.M. and Alamri, A., Health-CPS: Healthcare Cyber-Physical System Assisted by Cloud and Big Data, IEEE Systems Journal, Online First, DOI:10.1109/JSYST.2015.2460747, 2015
Biography of Authors
Syed Khairuzzaman Tanbeer received his B.S. degree in Applied Physics and Electronics and his M.S. degree in Computer Science from the University of Dhaka, Bangladesh, in 1996 and 1998, respectively. He received his Ph.D. degree in Computer Engineering from Kyung Hee University, South Korea, in February 2010. Since 1999, he has been a faculty member in the Department of Computer Science and Information Technology, Islamic University of Technology, Dhaka, Bangladesh. His research interests include data mining and knowledge engineering.
Mohammad Mehedi Hassan is currently an Assistant Professor in the Information Systems Department of the College of Computer and Information Sciences (CCIS), King Saud University (KSU), Riyadh, Kingdom of Saudi Arabia. He received his Ph.D. degree in Computer Engineering from Kyung Hee University, South Korea, in February 2011. He received the Best Paper Award at the CloudComp conference in China in 2014 and the Excellence in Research Award from CCIS, KSU, in 2015. He has published over 100 research papers in journals and conferences of international repute. He has served as chair and Technical Program Committee member of numerous international conferences and workshops, including IEEE HPCC, ACM BodyNets, IEEE ICME, IEEE ScalCom, ACM Multimedia, ICA3PP, IEEE ICC, TPMC, and IDCS. He has also served as guest editor of several international ISI-indexed journals. His research interests include cloud federation, multimedia cloud, sensor-cloud, Internet of Things, big data, mobile cloud, cloud security, IPTV, sensor networks, 5G networks, social networks, publish/subscribe systems and recommender systems. He is a member of IEEE.
Ahmad Almogren received his Ph.D. degree in computer sciences from Southern Methodist University, Dallas, Texas, USA, in 2002. Previously, he worked as an assistant professor of computer science and a member of the scientific council at Riyadh College of Technology. He also served as the dean of the College of Computer and Information Sciences and the head of the council of academic accreditation at Al Yamamah University. Presently, he is an associate professor and the vice dean for development and quality at the College of Computer and Information Sciences at King Saud University in Saudi Arabia. He has served as a guest editor for several computer journals. His research interests include mobile and pervasive computing, computer security, sensor and cognitive networks, and data consistency.
Mansour Zuair is currently an Assistant Professor in the Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. He received his M.S. and Ph.D. degrees in Computer Engineering from Syracuse University and his B.S. degree in Computer Engineering from King Saud University. He served as CEN chairman from 2003 to 2006 and as vice dean from 2009 to 2015, and has been dean since 2016. His research interests are in the areas of computer architecture, computer networks and signal processing.
Byeong-Soo Jeong received his B.S. degree in computer engineering from Seoul National University, Korea, in 1983, his M.S. degree in computer science from the Korea Advanced Institute of Science and Technology, Korea, in 1985, and his Ph.D. degree in computer science from Georgia Institute of Technology, Atlanta, USA, in 1995. In 1996, he joined Kyung Hee University, Korea. He is now an associate professor at the College of Electronics & Information at Kyung Hee University. From 1985 to 1989, he was a research staff member at the Data Communications Corp., Korea. From 2003 to 2004, he was a visiting scholar at Georgia Institute of Technology, Atlanta. His research interests include database systems, data mining, and mobile computing.