Context Based Positive and Negative Spatio-Temporal Association Rule Mining

Context Based Positive and Negative Spatio-Temporal Association Rule Mining

Knowledge-Based Systems 37 (2013) 261–273 Contents lists available at SciVerse ScienceDirect Knowledge-Based Systems journal homepage: www.elsevier...

2MB Sizes 10 Downloads 107 Views

Knowledge-Based Systems 37 (2013) 261–273

Contents lists available at SciVerse ScienceDirect

Knowledge-Based Systems journal homepage: www.elsevier.com/locate/knosys

Context based positive and negative spatio-temporal association rule mining Muhammad Shaheen a,⇑, Muhammad Shahbaz b, Aziz Guergachi c a

Department of Computer Science, FAST, National University Peshawar, Pakistan University of Engineering & Technology, Lahore, Pakistan c Information Technology Management, Ryerson University, Toronto, ON, Canada b

a r t i c l e

i n f o

Article history: Received 31 January 2012 Received in revised form 6 August 2012 Accepted 10 August 2012 Available online 24 August 2012 Keywords: Association rule mining Context Positive and negative associations Spatial associations Spatio temporal associations

a b s t r a c t This paper proposes a new approach to mine context based positive and negative spatial association rules as they might be applied to hydrocarbon prospection. Many researchers are currently using an Apriori algorithm on spatial databases but this algorithm does not utilize the strengths of positive and negative association rules and of time series analysis, hence it misses the discovery of very interesting and useful associations present in the data. In dense spatial databases, the number of negative association rules is much higher compared to the positive rules which need exploitation. Using positive and negative association rule discovery and then pruning out the uninteresting rules consumes resources without much improvement in the overall accuracy of the knowledge discovery process. The associations among different objects and lattices are strongly dependent upon the context, particularly where context is the state of entity, environment or action. We propose an approach to spatial association rule mining from datasets projected at a temporal bar in which the contextual situation is considered while generating positive and negative frequent itemsets. An extended algorithm based on the Apriori approach is developed and compared with existing spatial association rule algorithms. The algorithm for positive and negative association rule mining is based on Apriori algorithm which is further extended to include context variable and simulate temporal series spatial inputs. The numerical evaluation shows that our algorithm is more efficient at generating specific, reliable and robust information than traditional algorithms. Ó 2012 Elsevier B.V. All rights reserved.

1. Introduction The increase in horizontal and vertical sizes of data in purpose built repositories makes databases an interesting laboratory for ascertaining hidden associations among various attributes that have been recorded for one specific reason or another. Scientists and engineers collect and work on terabytes of data received from space borne instruments and other remote sensing systems [18]. It is estimated that about 80% of corporate databases integrate spatial information [26] along with other entities. Data mining is an emerging domain that relies on the analysis of historical data to ensure that the amount of available data is directly proportional to the quality of the knowledge derived from that data. As the amount of data increases, the reliability of patterns extracted from data increases. For example, the pattern extracted from 100 records of customers may not reflect stable customer behavior, while a pattern extracted from thousands of records is considered to be a more reliable gauge of customer ⇑ Corresponding author. Address: Industrial Area, HayatAbad Peshawar, Pakistan. Tel.: +92 331 4525045. E-mail addresses: [email protected], [email protected] (M. Shaheen), [email protected] (M. Shahbaz), [email protected] (A. Guergachi). 0950-7051/$ - see front matter Ó 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.knosys.2012.08.010

behavior. Data mining is a structured and modeled approach comprised of discrete and iterative steps, originating from raw data and concluding with patterns and predictions [14,31]. The use of association rules is one of the most widely applied techniques for data exploitation in different domains. Association rules are popular due to their wider application on different types of data such as numeric, ordinal, spatial and multimedia data. Association rules are obtained from a dataset on the basis of frequent and infrequent itemset analysis. Association rule mining examines the database to determine frequent itemsets. The basic form of the association rule is that if event X occurs, it is more likely for event Y to occur. It is written as X P Y where X is the antecedent and Y is the consequent. Association rule X P Y is normally described with two measures: support and confidence. Support refers to ‘‘how often the transactional records in database contain X and Y together’’ or the frequency of the subsets of a dataset and confidence is a measure of the accuracy of the rule [16,17]. An association rule is said to be a negative association rule if it is referred to as a negative relation between two itemsets. It is written as A P :B, :A P B and :A P :B [28]. Spatial association rule mining can be defined as the extraction of implicit associations, spatial relations or other special patterns not explicitly stored in spatial databases. If these associations are extracted by time series analysis of spatial data over a certain

262

M. Shaheen et al. / Knowledge-Based Systems 37 (2013) 261–273

period of time, the process will be called spatio temporal association rule mining. The existing techniques for this kind of mining are generally based on the Apriori algorithm [17]. Spatial association rule mining is more complex; therefore, the methods for spatial association rule mining are different [6,24]. While most researchers use a popular data mining approach such as Apriori based association rule mining on spatial data, Apriori has limitations including repeated scans of the data when finding frequent itemsets [20]. Another notable fact is that researchers mostly focus on mining positive association rules neglecting that negative association rules can be valuable generally in all cases, but particularly for spatial data [28]. Researchers observed various limitations on the Apriori algorithm for spatial data. As mentioned earlier, multiple scans of a database can cause a large number of iterations particularly when the dataset is as dense as it is in spatial databases. To eliminate multiple scans of a database, P-Tree [1], T-Tree [1] and FP-Trees [29] are used. It has also been observed that positive association rules, generally referred to as association rules, are the subset of total associations among data items within a dataset [25]. Positive rules can well represent positive associations among data items while negative associations are ignored, but negative associations can lead to potentially interesting results. The extraction of positive and negative association rules from datasets produces a very large number of rules. To overcome this, researchers incorporate pruning strategies in frequent-items algorithms; for example, Sharma et al. [11] used an ‘interestingness’ measure. Spatio temporal databases inherit the complexity of spatial data with the additional complexity of its evaluation over different periods of time. The computation of spatial data over time is expensive in computation and real time processing [23]. Spatio temporal datasets are generally evaluated using a combination of Apriori technique/modified Apriori techniques with underlying complexities of spatial association rule mining. In the extant literature, we found that spatio temporal association rules are mined without considering the context of the event under consideration. Context is the variable that represents the state of the entity, environment and action. For example, an association is observed when studying satellite images and how the color of vegetation pales whenever there are hydrocarbons in the vicinity of that vegetation. The association can be false if the pale color of the vegetation is because of abnormal temperatures. In this case, temperature is the context variable which, if ignored, would present false associations. There are certain sets of contextual circumstances that affect the state of the system from which association rules are to be mined. In such a situation, association rule mining does not truly represent the associations among different entities. Context should be a prominent consideration in spatio temporal datasets. Geng and Hamilton [10] extracted rules for all combinations of time and place hierarchy levels for market-basket analysis. These rules are generated according to the context, systematically derived from time and place hierarchies. This approach has not considered context as a variable which affects the accuracy of association rules because of the change in context. Second, the algorithm was designed only for market basket analysis. We propose an approach of spatio temporal association rule mining in which context is considered to be an influential variable to generate positive and negative association rules. An extended algorithm based on the Apriori approach is developed and compared with existing spatial association rule algorithms. The algorithms are applied on spatial topographic data to predict the presence of hydrocarbons beneath earth’s surface before engaging in deep drilling. The process of prediction is called hydrocarbon prospection. The algorithm is also applied on spatial data from the hydrocarbon pipeline industry to predict patterns in network assets.

The paper is organized as follows. Sections 2 and 3 describes the extant literature on spatial/spatio-temporal and positive/negative association rule mining. The problem statement is defined formally in Section 4, In Section 5, an algorithm for extracting spatio temporal positive and negative association rules with existing pruning strategies is proposed. This section also details experimentation and testing, and a conclusion is presented in Section 6.

2. Spatial/spatio-temporal association rule mining Spatial data mining is an expanding domain and it is widely accepted for the extraction of implicit knowledge, spatial relations and other patterns that are not explicitly stored in spatial databases. A spatial association rule is a rule which describes the implications of one set of features by another set of features in spatial databases [8]. In spatial association rules, spatial objects are related to each other by using spatial predicates and connectors [20] such as close to (), nearby (), adjacent_to (), etc. For example, it has been observed that 80% of ignited rocks that are closer to a river and nearer to vegetated fields contain hydrocarbons, as shown in Eq. (1) where close_to, nearby and adjacent_to are spatial predicates.

close toðA; riverÞ ^ nearbyðA; vegetationÞ ^ Is aðA; iginited rockÞ ! adjacent toðA; potential trapÞ Support : 60%; Confidence : 80%

ð1Þ

The number of predicates in consequent and antecedent of the spatial association rule can vary depending upon the rule to be represented.

NearbyðB; ignited rockÞ ! far awayðB; potential trapÞ Support : 50%; Confidence : 80%

ð2Þ

Spatial association rules can have both the spatial and non-spatial predicates whereas a spatial predicate is used to represent a spatial relationship which can have an association with non-spatial predicate as in the following equation:

NearbyðA; ignited rockÞ ! color ofðA; blueÞ Support : 60%; Confidence : 100%

ð3Þ

These rules represent positive associations among different features of objects. Negative association rules can also be of potential interest, particularly in spatial association rule mining. A negative association rule can show, for example, that a potential trap has never been found adjacent to wet sand, as shown in the following equation:

NearbyðA; wet sandÞ ! :adjacent toðA; potential trapÞ Support : 50%; Confidence : 90%

ð4Þ

Spatial association rule mining is an area of potential interest for researchers. Further details of spatial association rule mining can be found in [8,11,19,24,27]. Spatial databases can render the attributes to depict the existence of an object in space. Space and time are considered in tandem to describe the existence of and changing patterns of matter and energy. A spatio temporal process is a multi dimensional space in which spatial states, spatial and nonspatial attributes and spatial patterns vary continually across time [13,30]. Such a process is represented by a set of states at different time intervals. The variation in space across time can reflect patterns in the process of evolution. The number of states in the spatio temporal process depends on the size of the time granularity and time step. A spatio temporal association rule is extracted from multiple instances of spatial databases for which each instance is recorded after a particular time interval. The selection of time

M. Shaheen et al. / Knowledge-Based Systems 37 (2013) 261–273

Fig. 1. Spatio-temporal association rule mining.

granularity and time step is dependent upon the application domain. Fig. 1 describes the process of spatio temporal association rule mining. Spatio temporal association rule mining has wider acceptance in different applications like climate change pattern detection, hydrocarbon reservoir characterization, natural resource accumulation and urban feature mapping, etc. The literature on spatio temporal association rule mining has been reviewed by [2,4,7,15,23].

3. Literature review This paper is about developing a new Apriori-based algorithm which not only extracts positive association rules from spatio temporal databases but also extracts negative association rules. The algorithm is unique in the sense that a context variable is included for the first time as an influencing variable. The proposed algorithm is an extension of an existing algorithm of positive and negative association rule mining developed by Wu et al. [28]. Wu et al. presented a comprehensive definition and mathematical formulation of negative association rules and emphasized its importance in spatial and non-spatial databases. They also offered a method to compute support and confidence for negative association rules. Since negative association rules increase the size of output rules set, they used an interestingness measure for pruning

263

uninteresting rules. On the basis of these measures, an algorithm for extracting and pruning frequent and infrequent itemsets is developed here. The extracted association rules are then divided in four types on the basis of dependence variable. The developed algorithm is based upon the Apriori algorithm developed by Agarwal and Srikant [17]. The Apriori algorithm is based upon a support and confidence measure and is considered to be part of a supportconfidence framework for association rule mining. In the Apriori algorithm, the itemset is scanned in multiple passes and threshold by support value. The rules have specific support and confidence values and are left behind at the end of passes. The interestingness of Apriori-based rules may be based upon chi-square test as proposed by Srikant and Agarwal [21]. There are other interestingness measures for Apriori-based rules as summarized by Shaharanee et al. [5] and Geng and Hamilton [10]. Another interestingness measure which gauges the probability of finding the consequent in any random basket is proposed by Marakas [3]. Tamir and Singer [22] classified the interestingness measure into subjective and objective interestingness measures. The objectives measures are based upon a statistical method and includes generality, peculiarity, diversity and surprisingness. The subjective measures are based upon the domain’s expert understandings. These measures include novelty, utility, applicability, etc. Similar to prune association rules, in order to reduce the number of output rules and improve their utility, interestingness measures (confidence gain) is proposed. Wu et al.’s [28] algorithm was applied on spatial data by Sharma et al. [11] in 2005. They applied multi-level spatial mining methods to extract rules from spatial predicates. Spatial association rules can reflect object/predicate relationships. This study was confined to an Indian district and the relationships based on spatial predicates like close_to and far_away are computed. A modified form of positive and negative association rule mining algorithm is given in the paper which is specifically designed for spatial associations. A few more similar efforts related to the application and enhancement of association rule mining on spatial data is cited from the literature. Bembinek and Rybinski [19] proposed spatial association rule mining based on Delaunay diagrams. Delaunay diagrams determine neighborhood and are used to find collocations in spatial data in the said study. Each site in spatial data is considered to be a Delaunay node/vertex. Two sites are considered to be neighbor sites if the voronoi cells [12] corresponding to those sites share a common boundary. In this way, the spatial data image is converted into a number of voronoi cells joined on the basis of neighborhood value. The frequent itemsets of collocations which satisfied certain threshold and confidence are extracted to form spatial rules. A spatio-temporal association rule is one in which there is a spatio-temporal relationship between antecedent and consequent of the rule. Mennis and Wei Liu [7] applied spatio temporal association rule mining on urban growth data of Denver, USA. Four spatio temporal variables are focused on extracted polygons (e.g., change in landcover, change in percent minority, change in poverty and density of developed land over a certain period of time). The results are then presented in a multi-level hierarchy. Verhein and Chawla [2] offer a germane definition of spatio temporal association rule stating that the rule preserves semantics over space and time. In light of this definition, the concepts of spatial support, temporal coverage and temporal support are given. Sources, sinks and thoroughfares are defined as spatio temporal regions in mobile device networks. A unique rule pruning property is devised for high traffic regions. The algorithm developed on the basis of the above terms is applied to sparse and compact datasets and good precision is observed in the results. Geospatial and temporal data have also been mined by Shu et al. [4] for the sake of environmental change monitoring. In order to assess regional vegetation and

264

M. Shaheen et al. / Knowledge-Based Systems 37 (2013) 261–273

climate change, a framework based upon kringling interpolation, wavelet multi resolution analysis, fuzzy c means clustering and Apriori rules are developed by Shu et al. [4]. In this study, weather observation data, precipitation and air temperature data are collected at multiple time intervals from Chinese sites. The data is interpolated and clustered in the conceptual development phase. The association rules are extracted which are then pruned on the basis of user constraints and specifications. Compeita et al. [15] notes that efforts to develop spatio temporal association rule mining method are based on conventional knowledge discovery techniques. The concepts of ‘localizer’ and ‘miner’ are proposed wherein localizer deals with spatial and temporal dimensions of data and miner processes the data on the basis of these spatio temporal relationships. The study also focused on better visualization of spatio temporal association rules and provided two interactive interfaces. A literature review of this generally overlooked phenomenon reveals that most of the extant studies are related to the application of association rule mining on sparse and dense spatial data. There has been some research on the extraction of both positive and negative association rules from spatial data and then to prune these rules at multiple levels. We can cite very few works related to spatio temporal association rule mining. Even these works are generally related to the applications of spatio temporal association rule mining. A very important aspect affecting the results of spatio temporal association rules has not been considered before. In a study by Tang et al. [9], context is considered for spatio temporal market basket analysis. A set of contexts is derived from time and space hierarchy and those association rules are extracted which meet the support and confidence in all contexts. The decision makers could analyze the market baskets at different hierarchy levels. The proposed study considers context variable to represent changing states of environment, time and situation based actions and different states of entities. Any change in the context variable changes the extracted association rule and this change, beyond a certain limit, invalidates the rule. The existing algorithm of positive and negative association rule mining [11,28] is enhanced by including different cases/possibilities for the context variable and is presented later.

4. Problem definition Consider a spatial database, SD, containing geographical spatial and non-spatial data of different sites hydrocarbon’s pipeline network assets over multiple time periods. Let X = {X1, X2, X3, . . . , Xn} be the set of objects contained in the database which are spatially located on the satellite image. Let S be a subset of items in X. Let TI = {TI1, TI2, TI3, . . . , TIk} are the time intervals on which the data is collected and site = {site1, site2, site3, . . . , sitep} are the sites from which the data is collected. TIi and Sitej represents ith and jth instances of time interval and site. Let contextm[k] be the array structure which stores the values of the influencing parameter, for example, context where m is the name of the context variable and k is the array subscript where k represents time interval. The Apriori algorithm of association rule mining [16] and modified Apriori algorithm for positive and negative association rule mining is based on the following. For an association For an association rule X P Y, support is a measure of proportion of transactions that contain the itemset XUY.  The rule is considered valid only when the value of support satisfies – For a positive association rule:

SuppðXUYÞ > ms ðms ¼ minimum supportÞ

ð5Þ

(Minimum support is the threshold set by the user which should be satisfied for a rule to be selected). – For a negative association rule:

SuppðXÞ > ms; SuppðYÞ > ms and SuppðXUYÞ < ms

ð6Þ

 Interestingness measure for both positive and negative association rule satisfies jSuppðXUYÞ  SuppðXÞSuppðYÞj > mi where ðmi ¼ minimum interestingnessÞ ð7Þ

In a spatio temporal database, if both positive and negative association rules are considered, then the number of rules will become enormous. An interestingness measure is defined to select appropriate rules. The problem is that the rules that are not interesting are removed immediately. There is no mechanism to deal with those rules which are falsely generated at some TIi because of an abnormal value of contextm. The scenario is depicted in Fig. 2. In the example of Fig. 2 a scenario is presented to emphasize the need of context variable. In the given scenario, rules are generated at single site at three different time intervals. Those rules are considered to be strong association rules which lies in the intersection set of all three association rule sets. The rule which is a member of intersection set of TI1, TI2 and TI3 shows its prominence by frequency of its occurrence. The rules reflect certain patterns of vegetation and rocks, etc., which lead to the presence of potential hydrocarbon trap in that vicinity. For example, R1 which is extracted at TI1 shows that if the color of vegetation is pale and a rock is situated near to the vegetation then it is probable to have potential hydrocarbon trap nearby. A similar set of rules is collected from the same site at three different time intervals. A rule (R3) at TI1 and TI3 shows that the color of water is blue while the same rule with different name (R4) at TI2 denies the fact and states that the color of water is muddy. Because of this difference the rule is named as R4 at TI2. It can also be observed that the reason for change of water color was in fact the lowered temperature. At TI2, the temperature is 20 which do not lie in the normal range (30–50). This is evident of the fact that the change in value of context variable eliminated a precious rule from the list of association rules. In the absence of context variable, the situation depicted above will ignore the rule R3 whereas the exclusion of R3 rule from the list was because of abnormal value of temperature. The rule R4 at time interval TI2 is similar to R3 at time interval TI1 and TI3 except for the difference in the color of the water. We can see that the context of TI2 (i.e., temperature at TI2 ) does not lie in within the normal range, hence affecting the color of the water. The accuracy of the association rule is affected by an influencing factor which is named, ‘context’. The rule R4 is not there in the intersection set TI1 ^ TI2 ^ TI3 whereas this would be a valid rule if the context does not exert influence. The same rule lies in the intersection set TI1 ^ TI3. The rule is ignored because it does not meet the criteria of minimum support as highlighted in the figure. This criterion should be redesigned both for positive and negative association rules on the basis of context as an influencing variable. The support-confidence framework is modified to accommodate context as an affecting factor and to mine both positive and negative association rules on the basis of a new support value. 5. A method for mining context based spatio temporal association rules We developed a context based method to mine both positive and negative association rules from spatial databases. The

265

M. Shaheen et al. / Knowledge-Based Systems 37 (2013) 261–273

Fig. 2. Spatio-temporal association rule mining without context.

algorithm for mining positive and negative association rules has already been devised by Wu et al. [28] and implemented on spatial databases by Sharma et al. [11]. The same method is extended here to mine association rules on the basis of context. Positive and negative association rules are incorporated in support-confidence framework on the basis of the following principles [28]. A positive association rule A P B is of potential interest if:

fipiðIÞ ¼ SuppðIÞ > ms ^ 9X; Y : XUY ¼ I ^ fipisðX; YÞ

1. A and B are disjoint itemsets, i.e., A \ B = Ø 2. Supp (A) > ms; Supp (B) > ms; Supp (AUB) > ms; Supp(A) defines support of A and ms represents minimum support. 3. Supp(A P B) = Supp(AUB). 4. Conf(A P B) = Supp(AUB)/Supp(A); Conf(A P B) represents confidence of rule A P B.

Fipi represents frequent itemset of potential interest. ms is the user specified minimum support, mc is the user specified minimum confidence, / represents an empty set and mi is the user specified minimum interest. The interestingness measure is also incorporated in the support-confidence framework for negative association rule mining in the following way below [28]. J is an infrequent itemset of potential interest (iipis) if:

Similarly, the negative association rule A P :B is of potential interest if:

where

fipisðX; YÞ ¼ X \ Y ¼ / ^ f ðX; Y; ms; mc; miÞ ¼ 1

ð8Þ

f ðX; Y; ms; mc; miÞ ¼

SuppðXUYÞ þ Conf ðXUYÞ þ InterestðXUYÞ  ðms þ mc þ miÞ ðSuppðXUYÞ  msÞ þ ðConf ðXUYÞ  mcÞ þ ðInterestðXUYÞ miÞ

iipisðJÞ ¼ SuppðJÞ < ms ^ 9X; Y : XUY ¼ J ^ iipisðX; YÞ 1. 2. 3. 4.

A and B are disjoint itemsets, i.e. A \ B = Ø Supp(A) > ms; Supp(B) > ms; Supp(AUB) < ms Supp(A P :B) = Supp(AU:B) Conf(A P :B) = Supp(AU:B)/Supp(A)

This will result in an exponential number of association rules which can be pruned by using the interestingness measure, as given in (Eq. (7)) [28].

InterestðX; YÞ ¼ jSuppðXUYÞ  SuppðXÞSuppðYÞj The integration of the interestingness measure in a supportconfidence framework for positive association rule mining will result in:

iipisðX; YÞ ¼ X \ Y ¼ / ^ gðX; :Y; ms; miÞ ¼ 2 gðX; :Y; ms; mc; miÞ ¼ f ðX; Y; ms; mc; miÞ þ

ð9Þ

SuppðXÞ þ SuppðYÞ  2ms þ 1 jSuppðXÞ  msj þ jSuppðYÞ  msj þ 1

To incorporate context variable in support-confidence framework, we need to understand that context can be different for different sets of circumstances. The selection of context variable is application dependent with the assumption that the user will select any parameter as a context variable which holds a positive value. Once the context variable is selected, a valid range of values from initial_value_contextmrange to final_value_contextmrange will be specified for contextm. The selection of initial and final values for

266

M. Shaheen et al. / Knowledge-Based Systems 37 (2013) 261–273

defining the range of context variable is also an expert’s job. The initial value of context range is the value which, if decremented, would bring a negative change in the frequency of an item’s occurrence. Similarly, the final value is the one which, if incremented, causes a positive change in the frequency of an item’s occurrence. As for example, temperature is considered to be a context variable in Fig. 2. The valid range for a context variable is from 30 to 50 in this application. If the temperature lies within the range, no major change in the signature of the selected objects is observed. A decrease in the initial value of temperature (i.e., 30) will decrease the frequency of the occurrence of certain spatial objects. If the value of contextm lies within the range of the initial to final values, the support is calculated using the support method mentioned above. If the value of contextm does not lie in the normal range, then there can be four possibilities—these are detailed in Sections 5.1 and 5.2. To summarize the above before going into the details of algorithm, the user will explicitly define: 1. What parameter is considered to be the context variable? 2. What is the normal range of the context variable? 3. The smallest value of a context variable range is the initial value, if any decrement in the value cause a negative change in the frequency of item’s occurrence. 4. The largest value of context variable range is the final value, if any increment in the value causes a positive change in the frequency of item’s occurrence. 5. The scenario demonstrated in points 3 and 4 will be inverted according to the change in frequency of an item’s occurrence.

lated and added to the support value to get a new support value.

positiv e difference contextm ¼ ðinitial

v alue contextm range

 contextm ½kÞ  100=contextm ½k

ð10Þ

SuppðX ¼> YÞ ¼ SuppðXUYÞ þ ðSuppðXUYÞ  Positiv e Difference context m =100Þ

ð11Þ

Case II: The value of contextm[k] is greater than the final_value_ contextmrange. The percentage difference of final_value_contextmrange and contextm[k] are then calculated and the resulting percentage difference of the old support value is calculated. The difference is subtracted from the old support value to get a new support value. The variable negative_difference_contextm is used to store the value of the difference.

negativ e difference contextm ¼ ðcontext m ½k  final v alue contextm rangeÞ  100=contextm ½k

ð12Þ

SuppðX ¼> YÞ ¼ SuppðXUYÞ  ðSuppðXUYÞ  Negativ e Difference contextm =100Þ

ð13Þ

5.2. Negative association rules 5.1. Positive association rules Once the normal range for the context variable is defined, there can be two possibilities for both positive and negative association rules. Either the value of the context variable lies in the normal range or vice versa. If the value lies in the normal range, then association rules are extracted using a simple positive and negative association rule mining algorithm as explained above and proposed in [11]. In the other case, when the values do not lie in the normal defined range, there can be two more possibilities: (1) The value of context variable will be greater than the upper limit of the provided range; or (2) The value of context variable will be less than the lower limit of the provided range. Both values indicate abnormality in context. When the value of context variable is greater than the upper limit of provided range, the co-occurrence of item in itemset increases. This is compensated by subtracting that increase from the support value because the increase in the value of the context variable caused the support value to increase by a small amount. Compensation is made by subtracting the percent increase from support. In the second case, a decrement in the value of the context variable is compensated by adding a percent decrement in the support value. This is illustrated in the following four cases: Case I: The value of contextm[k] is lower than initial_value_contextmrange. In this case support is calculated by finding the percentage difference of contextm[k] and initial_value_contextmrange and storing it in positive_difference_contextm. The positive_difference_contextm percent value of support is calcu-

Case III: For negative association rules, positive_difference_contextm is calculated by using the same equation (i.e., Eq. (10)). The positive_difference_contextm value is then subtracted from the original support because negative association rule ensure nonco-occurrence of the items together with an underlying assumption that the support values for both individual items of itemset are greater than the specified threshold.

SuppðX ¼> YÞ ¼ SuppðXUYÞ  ðSuppðXUYÞ  Positiv e Difference contextm =100Þ

ð14Þ

Case IV: Similarly, the negative_difference_contextm percent value is added to the normal support value to eliminate the effect of contextual change.

SuppðX ¼> YÞ ¼ SuppðXUYÞ þ ðSuppðXUYÞ  Negativ e Difference contextm =100Þ

ð15Þ

In the view of above, we derived that the support of the spatial association rule is influenced by context. Hence, the value of support is compensated by the percentage difference in order to remove the effects of context. It is obvious that the addition and subtraction of a context variable in support value may not provide full compensation but most of the rules which were eliminated because of an abnormal value of context variable are observed to be part of this solution. Similarly, 40% of the rules which were falsely included with previous support are not part of this solution. After calculating the support value, uninteresting rules are pruned using the interestingness measure devised by Wu et al. [28], as in Eq. (7).

M. Shaheen et al. / Knowledge-Based Systems 37 (2013) 261–273

5.3. Algorithm

267

final_value_contextmrange) ⁄ 100/contextm

Name: Context based positive and negative spatio temporal association rule mining Inputs: SD: spatial database; Icontextm: current value of context variable; IVCR: Initial value for context variable range; FVCR: Final value for context variable range; TI: Time intervals; ms: minimum support; mc: minimum confidence; and mi: minimum interestingness. Outputs: FISp: Positive Frequent Itemsets; IFISn: Negative Infrequent Itemsets; (1). Let FISp = 0; IFISn = 0;//Initialize FISpand IFISn (2). Set IVCR = input1; Set FVCR = input2// Initialize the initial and final values of context variable range (3). For (i = 1; i<=n; i++)// Data is extracted at multiple time intervals from 1..n Begin L1 = get_predicates (SD,TIi); //get_predicate() extracts predicates from spatial database at different time intervals TIi FISp = FISpU L1; (4). For(j = 2; (Lj1 – Ø);j + +) do Begin Ak = get_candidate_set (Lj1); //get_candidate_set() extracts the itemsets to extract rules For each object s in S do begin As = get_subsets (A,s); //get_subsets() extracts dataset from candidate set that satisfies s For each itemset A in Asdo A.count = A.count + 1; End //new support value  

(5). Lj = {cjcCAk ^ supp_context (c,IVCR,FVCR, Icontextm >= ms); (6). Nj = Ak  Lj; // Pruning of uninteresting itemsets in Lj (7). For each itemset m in Lj do If NOT (fipi (m)) then Lj = Lj  {m}; FISp = FISpU Lj; // Pruning of uninteresting itemsets in Nj (8). For each itemset n in Nj do If NOT (iipi (n)) then If (supp_context (c, IVCR, FVCR, Icontextm)) final_value_contextmrange, contextm) then begin Negative_Difference_contextm = (contextm 

R = r + (r ⁄ Negative_Difference_contextm/100) End begin For (Negative_rules) positive_difference_contextm = (initial_value_contextmrange  contextm) ⁄ 100/ contextm R = r  (r ⁄ Positive_Difference_contextm/100)// new support value End Elseif (contextm > final_value_contextmrange, contextm) then begin Negative_Difference_contextm = (contextm  final_value_contextmrange) ⁄ 100/contextm R = r  (r ⁄ Negative_Difference_contextm/100) End Return

The procedure discovers all frequent and infrequent itemsets from the spatial database on the basis of a context variable. Users may apply a Context variable at their discretion and this can be changed depending upon the need for rule mining and the given set of circumstances. In the first step of the proposed algorithm, FISp and IFISn are initialized. In step (2), the range for context variable is set which spans from the initial value to the final value. In step (3), predicates from a spatial database are extracted and stored in order of time intervals. In step (4), candidate sets and subsets are obtained as per the Apriori algorithm. In step (5), the function of support value calculation is called to produce frequent itemsets and in step (6), infrequent itemsets are produced. In steps (7) and (8), uninteresting itemsets from both the frequent and infrequent itemsets are eliminated. The output is produced in the final step (9). The procedure is illustrated in the following flowchart (see Fig. 3). Two conditional controls are used in the above flowchart. The value of the context variable can lie in two ranges. If the value lies in the normal range, the conventional procedure of positive and negative association rule mining is used. In the case of an abnormal value, either the value will be greater than the final value of range or it will be less than initial value of the range. Both cases are dealt with according to the procedure given in Sections 5.1 and 5.2. 5.4. Experimentation The algorithm is implemented on multiple thematic maps of different times which have been collected from the hydrocarbon pipeline industry and exploration companies. The pipeline network is mapped onto a geographic information system (GIS). Using cartographic mapping, rectification is performed in ESRI’s ArcGIS 9.2 whereas the spatial database is managed in Microsoft SQL Server. The thematic maps consist of multiple layers including pipeline, sales metering stations, cathodic protection stations, town border stations, reducers, buried and over ground valves, and different types of consumers and parcels. The GIS designed for exploration purposes also contains multiple layers including rocks, vegetation, soil, water, concrete structures, land, shadows, residential areas, roads and topography. All these layers are stored at the backend in the form of individual tables along with spatial tables. The details of layers and pertaining spatial data of both GIS are detailed in Tables 1 and 2. The rational of this experiment is to compare the performance of proposed algorithm with two commonly used algorithms for

268

M. Shaheen et al. / Knowledge-Based Systems 37 (2013) 261–273

Fig. 3. Flowchart for mining context based positive and negative spatio temporal association rules.

association rule mining. The performance of the association rule mining algorithm can be measured on the basis of the number of extracted rules. In the case of spatial data, the number of association rules is sometimes misleading and undesirable hence the performance might be compared on the basis of the rules and

against the quality of the association rules. The confidence measure can help predict the quality of the rule. The execution time of the algorithms is also compared to the rank of its usability. The proposed algorithm is specifically designed for spatio temporal data for which various satellite images collected over a

M. Shaheen et al. / Knowledge-Based Systems 37 (2013) 261–273

ative association rules [27]. All these performances are compared with respect to the following:

Table 1 Spatial and non-spatial data of pipeline network. Sr# Name of layer 1. 2. 3.

4. 5. 6. 7. 8. 9. 10. 11.

No. of No. of nonNo. of records on the attributes spatial records basis of spatial combinations

Sales metering stations Town border stations Cathodic protections stations Pipeline Buried valves

12

3456

19

4850

17

8440

10 14

54,462 9450

Overhead valves Reducers Industrial consumers High pressure commercial Domestic consumers Commercial consumers

14 6 22

4568 2575 12,784

18

112,244

8



10

128,246

Spatial combination operators: adjacent_to, nearby, far_away, close_to, parallel_to

No. of records: 581,504 No. of intervals: (TI[n] = 12)

Table 2 Spatial and non-spatial data of exploration GIS. Name of Layer

No. of attributes

No. of nonspatial records

No. of records on the basis of spatial combinations

1. 2. 3. 4. 5.

Rocks Vegetation Soil Water Concrete structures

10 7 21 12 8

860 820 852 848 810

Spatial combination operators: adjacent_to, nearby, far_away, close_to, parallel_to

6. 7. 8.

Land Shadows Residential area Topography Roads

22 6 10

880 800 850

No. of records: 6124 No. of intervals: (TI[n] = 7)

6 8

842 833

Sr#

9. 10.

269

certain period of time may better reflect the execution performance of the algorithm. The records summarized in Tables 1 and 2 are mapped to a spatial database having spatial attributes such as adjacent_to, close_to, far away, and nearby and non-spatial attributes such as length_of_groundbed_CP, size_of_TBS_manifold, lubrication_condition_valves, and total_no_of_CPs. A total number of 581,504 spatial records are oriented at 12 different time intervals from the period 1990–2010 in pipeline GIS and 6124 records of exploration GIS. After object recognition from a satellite image, the spatial data are converted to show spatial relationships among different objects. The predicate clauses are decided on the basis of distance, geometry and shape of object. Satellite images of a hydrocarbon reserve and its classified image are shown in Figs. 4 and 5. Another image mapped with spatial predicates is shown in Fig. 6. The spatial records are stored in separate schema of the database with respect to the agreed upon predicates. These images are then classified to recognize different objects and features on the satellite image. The schema, having predicate based records, also contain the occurrence of different objects on the image. The performance of algorithm is compared with the Apriori algorithm as well as with the algorithm to mine positive and neg-

a. b. c. d.

No. of spatio temporal association rules. Execution time. Confidence of rules. Relevance of rules for domain experts.

6. Results and conclusion We study the results of our algorithm with respect to the number of rules, average confidence of rules, total time taken to extract association rules (execution time) and the opinion of domain experts from the domain of hydrocarbon prospection. The extraction of association rules on the basis of context is a novel approach and has its own impediments. That is why the comparison may not manifest the significance of the approach. The comparison is in fact made to compare the use of an existing technique on Spatio temporal data to the proposed technique. These results are compared with the results produced by Apriori and positive/negative association rule mining algorithm. In Fig. 7, the three algorithms are compared on the basis of number of rules. Three cases of the proposed algorithm are considered in this plot (1) If the value of the context variable (cv) becomes less than the initial value of the range (ci) (2) If the value of cv is greater than that of the final value of the context range (cf) (3) If the value of the cv lies between the normal range. The algorithm produced a maximum number of rules with assumption (1) up to the 0.5 value of minimum support, whereas it produced comparatively fewer rules with the assumption (2) because, in that case, the value of the context variable exceeds the upper limit of the range. The insight brought by the plot of Fig. 7 can be evaluated against different aspects. The greater the number of association rules, the greater the patterns extracted from the database. Increase in support value causes the algorithm’s results to show a divergence towards a single point because there are very few co-occurrences in real datasets with much larger support. The Apriori algorithm produced the fewest rules because it ignores negative association rules. The results of PNARM lies almost in between two cases of CBPNARM. The average number of rules in PNARM seems to be greater than number of rules extracted from CBPNARM but the significance of the CBPNARM algorithm can be viewed by analyzing both the plots of Figs. 7 and 8. In the plot (Fig. 8), the confidence of rules extracted using our algorithm has the highest projection of all at various support values. The greater the value of confidence, the greater the accuracy of rules with the exception of the dependence of the variables. The confidence of rules gives a true projection up to a certain value of support because the increase in support returns very few cooccurrences from the data, hence limiting a broader capability to evaluate the algorithm. CBPNARM produced rules with greater confidence up to 0.5 support even when compared with the Apriori algorithm. The rules produced by PNARM are at the lowest confidence which demonstrates the significance of considering contextual information, especially for positive and negative spatial association rule mining—as far as PNARM produced the largest number of rules but with a smaller confidence value. The algorithm is an extended form of PNARM inheriting its total capability with an additional capability of context based mining. The execution time of the proposed algorithm is slightly higher as per our expectations. The additional module takes some extra time which seems to be negligible with the increase in minimum support. In Fig. 9, the three algorithms are compared on the basis of execution time. The evaluation of rules from domain experts of industrial sector revealed that most of the rules are relevant. However, there are

270

M. Shaheen et al. / Knowledge-Based Systems 37 (2013) 261–273

Fig. 4. Remotely sensed image of prospect site.

Fig. 5. Classified image of prospect site.

few rules related to surface analysis and geographical analysis which revealed some new patterns which were previously unknown to the domain experts. For example the frequent pattern indicating color of rock, color of vegetation and spread of water from boundaries are indicators for hydrocarbon prospect was previously unknown to the domain experts. The results revealed that

almost 77% of the rules were quite similar for domain experts which they were able to extract after detailed geographical, geochemical and surface analysis. Association rule mining is considered to be one of the most widely used data mining techniques. It is likely to be used for extracting associations among large spatial datasets applied in

M. Shaheen et al. / Knowledge-Based Systems 37 (2013) 261–273

271

Fig. 6. Predicate labels on classified image.

Fig. 7. No of rules plot of Apriori, positive/negative ARM and context based ARM (proposed algo.).

natural resource exploration, network management, geological mapping and crime analysis, etc. The algorithm presented in this paper is an extended form of the algorithm presented for positive and negative association rule mining. A new factor, ‘context’, is introduced in this research which is important to consider in min-

ing association rules. The proposed algorithm offers a method for dealing with context, which explicitly affects the accuracy of association rules. The selection of context variable(s) and its range depends on user input which should be executed carefully. For implementing the algorithm, spatial data is collected at multiple

272

M. Shaheen et al. / Knowledge-Based Systems 37 (2013) 261–273

Fig. 8. Average confidence value plot of Apriori, positive/negative ARM and context based ARM (proposed algo.).

Fig. 9. Execution time plot of Apriori, positive/negative ARM and context based ARM (proposed algo.).

instances of time and structured systematically on the basis of context. The algorithm extracts the rules from a large spatial database at the given value of support and confidence. Our evaluation of the algorithm shows that it is more accurate in terms of its granularity of output rules and confidence. The granularity of output rules is consistent and the rules are produced with greater confidence. Though the execution time of the algorithm is greater than that of previous algorithms, the extracted patterns are decision-oriented, specific and clear. The increase in execution time is due to the inclusion of an external factor (i.e., a context which is not part of the adopted procedural approach and can be considered as an external influencing factor). The paper has several possible extensions. One is to optimize the execution time by introducing any pre-pruning strategy which selects a subset of association rules to be tested against context variables. Another possibility is to define a mechanism for calculating the unique support value of the rules based on context; or, to evaluate confidence value on the basis of context and to extract

rules on the basis of confidence value. Further still, this study could be extended to prune positive and negative association rules on the basis of context instead of using the interestingness measure. Finally, the rules can be classified and each class might be evaluated against different values of context. Acknowledgement We are indebted to Ms. Eva Woyzbun from Ryerson University, Canada for her valuable error/omission rectification of this manuscript. This project is completed with the financial support of Higher Education Commission, Pakistan and partially supported by NSERC, Canada in valuable error/ omission rectification of this manuscript. References [1] F. Coenen, P. Leng, S. Ahmed, Data structure for association rule mining, IEEE Transactions on Knowledge and Data Engineering. 16 (6) (2004) 774–779.

M. Shaheen et al. / Knowledge-Based Systems 37 (2013) 261–273 [2] F. Verhein, S. Chawla, Mining spatio-temporal association rules, sources, sinks, stationary regions and thoroughfares in object mobility databases, LNCS 3882 (2006) 187–201. [3] G. Marakas, Decision Support Systems, second ed., Prentice Hall, NJ, 2003. [4] H. Shu, X. Zhu, S. Dai, Mining association rules in geographical spatio-temporal, data the international archives of photogrammetry, Remote Sensing and Spatial Information Sciences 37 (B2) (2008) 225–228. [5] I.N.M. Shaharanee, F. Hadzic, T.S. Dillon, Interestingness measures for association rules based on statistical validity, Knowledge-Based Systems 24 (3) (2010) 386–392. [6] J. Chen, F. Huang, R. Wang, Y. Jin, A research about spatial association rule mining based on concept lattice, in: International Conference on Wireless Communication, Networking and Mobile Computing, China, 2007, pp. 5979– 5982. [7] J. Mennis, J. Wei Liu, Mining association rules in spatio-temporal data: an analysis of urban socioeconomic and land cover change, Transactions in GIS 9 (1) (2005) 5–17. [8] K. Koperski, J. Han, Discovery of spatial association rules in geographic information databases, Lecture Notes in Computer Science 951 (1995) 47–66. [9] K. Tang, Y.L. Chen, H.W. Hu, Context-based market basket analysis in multiple store environment, Science Direct; Decision Support Systems 45 (2008) 150– 163. [10] L. Geng, H.J. Hamilton, Interestingness measures for data mining: a survey, ACM Computing Surveys 38 (3) (2006) 1–32. [11] L.K. Sharma, O.P. Vyas, U.S. Tiwary, R. Vyas, A Novel Approach of Multilevel Positive and Negative Association Rule Mining for Spatial Databases, SpringerVerlag, Berlin, 2005. pp. 620–629. [12] M. deBert, O. Schwarzkopf, M. Von Kreveld, M. Overmars, Computational Geometry: Algorithms and Applications, Springer, Heidelberg, 2000. [13] M. Shaheen, M. Shahbaz, Z. Rehman, A. Guergachi, Mining Sustainability to predict optimal hydrocarbon exploration rate, in: IASTED AIA 2010 Conference Austria, 2010, pp. 394–400. [14] M. Shaheen, M. Shahbaz, A. Guergachi, Z.ur. Rehman, Mining sustainability indicators to classify hydrocarbon development, Knowledge-Based Systems 24 (8) (2011) 1159–1168. [15] P. Compieta, S.D. Martino, M. Bertolotto, F. Ferrucci, T. Kechadi, Exploratory spatio-temporal data mining and visualization, Journal of Visual Languages and Computing 18 (2007) 255–279. [16] P. Laube, M. Berg, M. Kreveld, Spatial support and spatial confidence for spatial association rules, Lecture Notes in Geoinformation and Cartography (2008) 575–593.

273

[17] R. Agarwal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th VLDB Conference, Chile, 1994, pp. 487–499. [18] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: ACM SIGMOD Conference, 1993, pp. 207–216. [19] R. Bembinek, H. Rybinski, FARICS: a method of mining spatial association rules and collocations using clustering and Delaunay diagrams, Springer Journal of Intelligent Information System 33 (2009) 41–64. [20] R. Vyas, L. Kumar Sharma, U.S. Tiwary, Exploring spatial ARM for geo-decision support system, Journal of Computer Science 3 (11) (2007) 882–886. [21] R. Srikant, R. Agrawal, Mining generalized association rules, Future Generation Computer Systems 13 (1997) 161–180. [22] R. Tamir, Y. Singer, On a confidence gain measure for association rule discovery and scoring, The VLDB Journal 15 (1) (2006) 40–52. [23] S.U. Calargan, A. Yazici, Fuzzy association rule mining from spatio temporal data, LNCS 5072 (2008) 631–646. [24] S. Shehkar, P. Zhang, Y. Huang, R. R Vatsavai, Trends in spatial data mining, in: H. Kargupta, A. Joshi, K. Sivakumar, Y. Yesha (Eds.), Data Mining: Next Generation Challenges and Future Directions, MIT/AAAI Press, 2003. [25] T. Herawan, M.M. Deris, A soft set approach for association rule mining, Knowledge-Based Systems 24 (1) (2009) 186–195. [26] U.M. Fayyad, G.G. Grinstein, Introduction, in: Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, Los Altos, CA, 2001, pp. 1–17. [27] V. Bogorny, B. Kujjpers, L.O. Alvares, Reducing uninteresting spatial association rules in geographic databases using background knowledge: a summary of results, International Journal of Geographical Information Science 22 (4) (2008) 361–386. [28] X. Wu, C. Zhang, S. Zhang, Efficient mining of both positive and negative association rules, ACM Transactions on Information Systems 22 (3) (2004) 381–405. [29] Z. He, S. Deng, X. Xu, An FP tree based approach for mining all strongly correlated item pairs, Lecture Notes in Computer Science (2006) 735–740. [30] Z. Xuewu, S. Fenzhen, D. Yunyan, Association rule mining on spatio-temporal processes, Wireless Communications, Networking and Mobile Computing (2008) 1–4. [31] M. Shaheen, M. Shahbaz, A. Guergachi, Z. Rehman, Data mining applications in hydrocarbon exploration, Artificial Intelligence Review 35 (1) (2010) 1–18.