A hybrid spatial data clustering method for site selection: The data driven approach of GIS mining

Expert Systems with Applications 36 (2009) 3923–3936 Contents lists available at ScienceDirect Expert Systems with Applications journal homepage: ww...

Download PDF

3MB Sizes 16 Downloads 169 Views

Report

Full Text

Expert Systems with Applications 36 (2009) 3923–3936

Contents lists available at ScienceDirect

Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa

A hybrid spatial data clustering method for site selection: The data driven approach of GIS mining Bo Fan School of International and Public Affairs, Shanghai Jiaotong University, Shanghai 200030, China

a r t i c l e

i n f o

Keywords: Spatial clustering k-Means algorithm GIS Site selection

a b s t r a c t This article applies customer service to be the research background. Spatial data mining method is proposed to solve site selection of the service center. Firstly, a new data model for recording all the information of customer management is given, which transforms the traditional model-driven strategy to dataoriented method. Secondly, a hybrid spatial clustering method named OETTC–MEANS–CLASA algorithm is proposed. It has the advantages of applying k-means algorithm to reduce the result space and using simulated annealing method (CLASA) as result-searching strategy to ﬁnd more qualiﬁed solutions. On the basis of GIS functions, we design deeper analytical function to take spatial obstacle factors, spatial environmental factors, spatial terrain factors, spatial trafﬁc factors and cost factors into account. The result of the experiment declares that the algorithm does better at the both aspects of perform efﬁciency and result quality. 2008 Elsevier Ltd. All rights reserved.

1. Introduction

1.1. The application of GIS in location problem

The site selection of the service center is the key factor that decides the efﬁciency of service response. The current research of location problem usually applies the linear programming to reach the optimization of time response and cost of the models (Aboolian, 2007; Berman & Krass, 2002; Jain & Vazirani, 2001; Zhang, 2007), etc. But the above researches have two common shortages: (1) It should involve many spatial restriction factors, such as obstacle factors, environmental factors, trafﬁc factors, terrain factors, etc. which require tremendous amount of variables and restrictions if it is solved by linear programming models. So it is quite difﬁcult to build such a huge linear model for site selection. (2) The essence of the location problem is to ﬁnd the global optimal k-centers or k distributions which is well known to be NP-hard (Domı´nguez et al., 2008; Owen & Daskin, 1998), etc. the linear model is not good enough to get the more qualitative results of k-centers/distributions. In this article, we choose customer management as the application ﬁeld for illustration our method of the site selection of service centers, proposes a data-driven method instead of the model-oriented method, and explore a process for ﬁnding service sites of the enterprise from the huge amount of customer locations which are recorded as two-dimension points in GIS (geographical information system) (Koret & Koret, 1997). The huge amount of spatial points are regarded as service receivers which might be the current and potential customers of an enterprise.

Geographical information systems (GIS) have occupied the attention of many researches involving a number of academic ﬁelds including geography, civil engineering, computer science, land use planning, and environmental sciences (Domı´nguez et al., 2008). GIS can support a wide range of spatial queries that can be used to support location studies. Model application and model development are the major impact of GIS on the ﬁeld of location science. These systems are designed to store, retrieve, manipulate, analyze, and map geographical data. GIS can serve as the source of input data for a location model. The applications of GIS in location problem only limit to the basic functions of visualization, querying, and preliminary analytical functions of overlapping, buffering, network analysis (Cheng, Li, & Yu, 2007; Koret & Koret, 1997; Nikolakaki, 2004). Location–allocation model can also be integrated in GIS software to realize the resulting presentation in a real map (Cheng & Chang, 2001; Nathanail, 1998; Vlachopoulou et al., 2001; Yeh & Chow, 1996). In all, the current researches focus on the basic functions of GIS, and the deeply analytical functions of spatial data have not been sufﬁciently developed.

E-mail address: [email protected] 0957-4174/$ - see front matter 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2008.02.056

1.2. The application of spatial clustering in location problem Clustering is the organization of a data set into homogenous and/or well separated groups with respect to a distance or, equivalently, a similarity measure (Tran et al., 2003). Spatial clustering, which groups data for ﬁnding all distribution patterns and interesting correlation among geographical data set, has numerous appli-

3924

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936

cations in pattern recognition (Ayala, Epifaniob, Simó, & Zapatera, 2006; Domingo, Ayala, & Dı´az, 2002), spatial data analysis (Demir et al., 2007; Hu & Sung, 2006; Lin, 2004), image processing (Chu, Roddick, & Pan, 2001), market research, etc. (Ester & Kriegel, 1998; Han, Kamber, & Tung, 2000; Han & Kamber, 2000). Spatial data clustering is an important component of spatial data mining and further exploration of GIS functions (Ester & Kriegel, 1998). Spatial clustering method can be classiﬁed into four categories: partitioning method, hierarchical method, density-based method and grid-based method (Ester & Kriegel, 1998; Zhang & Rushton, 2008). Since our research is to ensure the minimization of the overall travel distance of all the customers in the city. The clustering algorithms of hierarchical method, density-based method and grid-based method mainly focus on ﬁnding natural clusters which do not guarantee the minimization of distances to cluster centers. The partitioning algorithm is a good choice as a solution, the two typical types of partitioning algorithm are k-means and k-mediod (Han et al., 2000; Han & Kamber, 2000). The k-means algorithm uses the mean value of objects in a cluster as the cluster center. The objective criterion used in algorithm is squared-error function deﬁned as E¼

k X X

jx mi j2 :

i¼1 x2C i

In the above formula, x is the point in space representing the given object, and mi is the mean value of cluster Ci. The method can be described as: Step 1: k-means algorithm arbitrarily choose k-centers/ distributions as initial solutions. Step 2: k-means algorithm assigns each object to its nearest center forming a new set of cluster. Step 3: all the centers of those new clusters are then computed by taking the mean of all the objects in each cluster. The steps 2–3 are repeated until the criterion of function E does not change after an iteration. The k-means algorithm is relatively scalable and efﬁcient in processing large data set because the computational complexity of the algorithm is O(n k t), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally, k n, t n. The method often terminates at local optimum (Han et al., 2000; Han & Kamber, 2000). Except for that, k-means algorithm is also very sensitive to noise and outlier data points since a small of such data can substantially inﬂuence the mean value. k-Mediod method uses the most centrally located object(mediod) in a cluster to be the cluster center instead of taking the mean value of the objects in a cluster (Tung, Hou, & Han, 2001). Because of this, k-mediod is less sensitive to noise and outlier data, but it results in a higher running time. The typical k-mediod algorithm is called CLARANS (clustering large application based on randomized search) (Han et al., 2000; Han & Kamber, 2000; Ng & Han, 2002; Chu, Roddick, & Pan, 2002). When searching for a better center in step 3 of k-means algorithm, CLARANS tries to ﬁnd a better solution by randomly picking one of the k-centers and replacing it with anther randomly chosen object from other (n k) objects, and if no better solution is found after a certain number of attempts, the local optimum is assumed to be reached. Though CLARANS is more effective than other k-mediod algorithm, its computational cost is as high as O(n n), where n is the number of objects. 1.3. Section arrangement On the basis of GIS function and the research of spatial data clustering method, we intent to improve the current method for better solving the problem of site selection. In Section 2, a data model will be proposed to comprehensively organize the information of the management process of customer service which is the key aspect to realize our idea of data-driven method. The section includes the modeling method and operating method of the spatial data cube for introducing the spatial analytical function to the

solution of site selection. In Section 3, the computation method of spatial distance between service center and service receiver is proposed, it takes the obstacle factor and the location weight into account. In Section 4, a hybrid spatial clustering method named OETTC–MEANS–CLASA algorithm is designed to realize the mining oriented method for location selection of customer service center. Obstacle factor, environmental factor, trafﬁc factor, terrain factor and cost factor are taken into consideration in the process of the algorithm. Finally a series of experiments will be carried out to test the efﬁciency and the quality of our algorithm. 2. Data model for spatial clustering 2.1. Spatial data model of customer location The distribution of spatial objects in urban district is vital for the location selection of customer service center, the principle of which is to ensure the minimization of overall travel distance of all the customers. Data model of customer management has the function of spatially enabled analysis and can organize huge amount of customer attributes as multidimensional model which is described in the right part of Fig. 1. Spatial dimensions and non-spatial dimensions of the model are joined by SDE (spatial data engineer) which can integrate the non-spatial attributes of a customer in transaction database and spatial attributes of the same customer in GIS (Nebojsa, 1997; Papadias et al., 2001). The model can not only provides fundamental analytical function of GIS and OLAP(On-line analytical processing), but also presents an interface for developing deeply analytical functions to realize the purpose of scientiﬁc site selection. 2.1.1. The model of spatial data cube Spatial data cube model can be deﬁned as: MD = (D, H, M, q, C).The spatial data cube for recording the customers’ fact is shown as Fig. 1. Deﬁnition 1 (Dimension of spatial data cube). D represents the dimensional attribute set of a customer O. D is made up of M non-spatial dimension named SD and N spatial dimensions named ND, ND # D, SD # D. Where ND = {ND1, ND2, . . . , NDm}, SD = {SD1, SD2, . . . , SDn}. For example, ‘‘district dimension” and ‘‘residential area dimension” belong to spatial dimensions, ‘‘time dimension”, and ‘‘commodity dimension” belong to non-spatial dimensions. The dimensional value Di of O is denoted as O(Di). Di(H) = {H1, H2, . . . , Hn}, where Di(H) is the value set of hierarchies for dimension Di. n is the number of concept hierarchies for Di, each dimension of data cube can be generalized into different concept level. For example, if Di = ‘‘Time dimension”, then H2 H3 (year month day). Di(H) = H1 Di(Hj) = (bij,1, bij,x, . . . , bij,t), where bij,x is the detailed value of Di(Hj), t is the number of the value of Di(Hj). For example Di(Hj) = ”year”, bij,x = ‘‘the year of 1999”. Deﬁnition 2 (Measure of spatial data cube). M = {NM, SM} – M represents the measure of spatial data cube, and includes numerical measure NM and spatial measure SM. NM = (NM1, NM2, . . . , NMn) – NM is the numerical measure of spatial data cube, and is a set of numerical values, e.g. ‘‘the expenditure of a customer”, etc. in Fig. 1. SM = (SM1, SM2, . . . , SMn) – SM is the spatial measure of spatial data cube, and is a huge set of object pointer (Stefanovie, Han, & Koperski, 2000), e.g. ‘‘object pointer points to the location of customer’s residential-area/ofﬁce-area of customer” in Fig. 1. Deﬁnition 3 (Functions of the measures). q = (f1(NM), . . . , fr(NM)) – q is the function operations of numerical measure. For example, ‘‘the total expenditure in a certain month of a customer” can be computed by the functions of fi(NM) = Sum().

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936

3925

Fig. 1. Spatial data cube for recording and operating customer information.

C = (F1(SM), . . . , Fs(SM)) – C is the function operations of spatial measure. For examples, the function of Fi(SM) = Set() is to get the set of the corresponding points to analyze a chosen part of geographical objects. The function of Fk (SM) = Spatial-predicate() is to describe the topological relation between the spatial entity and the surrounding geographical objects. 2.1.2. Operation of spatial data cube Deﬁnition 4 (Operation of numerical measure). Spatial data cube integrates the spatial attributes and non-spatial attributes for customer management. The cube operation can realize agile analysis of numerical value aggregation by ‘‘grouping function” of variable combination of dimensions and hierarchies. The formula is ^"i,j,x{O[Di(Hj )]j[Di(Hj)]hbij,x} ? fr (NM). For example, if the values of Di(i = 1, 2, . . . , n) are given as ‘‘Time dimension and Commodity dimension”. Hj(j = 1, 2, . . . , m) is valued as ‘‘year, commodity category” accordingly. The values of bij,x (x = 1, 2, . . . , t) are ‘‘Year of 1999, Coffee” accordingly, h is ‘‘=”, fr(NM) is the function deﬁned as Sum(). Then ^"i,j,x{O[Di(Hj)]j[Di(Hj)]hbij,x} ? fr(NM) represents (The year of 1999, ‘‘nestle” coffee) ? (expenditure). The question can be described as ‘‘the total expenditure of nestle coffee of the each customer at the year of 1999”. The aggregation value ‘‘expenditure” also be used as decisional attribute for discriminating the proﬁt contribution of a customer to the enterprise. And the functions of rolling up, drilling down, slicing and so on can also be utilized to select any the part of the data cube which the user is interested in through OLAP operation (Han et al., 2000; Han & Kamber, 2000). Deﬁnition 5 (Operation of spatial measure). Spatial data cube integrates the spatial attributes and non-spatial attributes for customer management. The cube operation can also realize agile analysis of spatial value by ‘‘grouping function” of variable combination of dimensions and hierarchies. The formula is ^"i,j,x{O[Di(Hj)]j[Di(Hj)]hbij,x} ? Fs(SM). For example, if the values of Di(i = 1, 2, . . . , n) are given as ‘‘Time dimension and Commodity dimension”. Hj(j = 1, 2, . . . , m) is valued as ‘‘year, commodity category” accordingly. The values of bij,x (x = 1, 2, . . . , t) are ‘‘The year of 1999, nestle coffee” accordingly, h is ‘‘=”, Fs(SM) is the function deﬁned as Set().

Then ^"i,j,x{O[Di(Hj)]j[Di(Hj)]hbij,x} ? Fs(SM) represents (The year of 1999, nestle coffee) ? (Location set of customers). The function ‘‘Set()” can provide the coordinate points which represent the locations of the selected customers. Cube operation facilities interactive exploratory data analysis. User often like to traverse ﬂexibly trough a database, select any portions of relevant data, analyze data at different levels of abstraction, and present results in different forms (any part of task related information of customer). Cube operation provides such a tool for drilling, pivoting, ﬁltering, dicing and slicing any sets of data in data cube, for analyzing data at different levels of abstraction, and for providing ﬂexibly to the task of data mining (Han et al., 2000; Han & Kamber, 2000). 2.2. Spatial topological relation predicate of customer service location Deﬁnition 6 (Spatial-predicate). Spatial-predicate is a kind of function operation of spatial measure (i.e. spatial-predicate() as the C function in Deﬁnition 3) in the data cube. It is the description of spatial topological relationship between two geographical objects (e.g. Touch(x, hill) means that the boundary of spatial object x touches a hill). The general expression of spatial relations can be transformed into facts of the kind hSpatialPredicatei(RefObj, TaskRelevantObj) (Wang & Xie, 2005).Various kinds of spatial-predicates can be involved in spatial data mining method to describe the surrounding information of a reference object. The family of spatial-predicates includes Closeto(), Touch(), Adjacency(), Contain(), it Disjoin(), Overlap(), etc. Spatial-predicates have the ability to describe the surrounding information of a reference object, and are the important factors for the location selection of customer service center. The chosen site must satisfy a series of restrictive conditions which mainly include the spatial environment factors, spatial trafﬁc factors and spatial terrain factors, etc. causing by the distribution of existing spatial facilities. All the spatial factors of restrictive conditions can be represented by the spatial relation between two geographical objects. In other words, huge amount of variables and restrictions of site selection can be transformed into the form of spatialpredicates. For an example, the site of a service center needs to be built by touching a broad road, which can be denoted by the spatial-predicate, Adjacency(x, broad road).

3926

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936

The surrounding information of a reference object should be denoted by a more general concept instead of a detailed value. For example, Overlap(x, high-hillside) shows that the position of reference object x overlaps a high hillside where high is the non-spatial attribute of hillside, such position is not ﬁt for the establishment of customer service center. So it appears to be more important to build a reasonable concept tree of the non-spatial attributes of surrounding objects, which can mapping a detailed value into a certain concept level to get more general knowledge, e.g. a numerical altitude of a hillside is meaningless, but the concepts of high, medium and low of the hillside are meaningful description for the problem of site selection. The scientiﬁc methods of building concept tree should closely associate with its application ﬁeld (Ladner, Petry, & Cobb, 2003; Han et al., 2000; Han & Kamber, 2000). The following algorithm is a general computation method of spatial topological predicates:

Algorithm 1: Computation of spatial-predicates Input: (1) Restriction factors (ResFac) which are represented by spatial-predicates / The restriction factors including environmental factors, trafﬁc factors and terrain factors of site selection, such as Not-close-to(x, railway), Touch(x, wide-road) and Overlap(x, low-hillside)./ (2) Reference objects (RefObj) /Spatial objects (e.g. locations of customers) which can be got by the operation of spatial data cube which is described in Deﬁnition 5./ (3) Task relevant objects (TaskRelevantObj) /Surrounding objects of the reference objects (RefObj) which can be got by the operation of GIS./ (4) Concept trees of the non-spatial attribute of task relevant objects (TaskRelevantObj) /Concept trees of non-spatial attributes of the surrounding objects./ Output: {Spatial-predicates} Method: Step 1: MBR-Spatial-Predicates = Computing-MBR-Predicates(MBR-RefObj, MBR-TaskRelevantObj, ResFac); /Applying MBR method to coarsely estimate the task related predicates, MBR (minimal boundary rectangle) is the index of geographical object and can approximately simulate the real spatial entity of RefObj and TaskRelevantObj. MBR method can estimate the relationship of two spatial objects by the form of hSpatialPredicatei (RefObj, TaskRelevantObj) (Jankowski, 1995; Clementini et al., 2000)./ Step 2: Detailed-Spatial-Predicates = (MBR-Spatial-Predicates, Detailed value of the TaskRelevantObj) /MBR-Spatial-Predicates can denote the relationship between RefObj and TaskRelevantObj, and the detailed value of non-spatial attributes of TaskRelevantObj involved in MBR-SpatialPredicates is also very important in the process of site selection. So Detailed-Spatial-Predicates show the information of MBR-SpatialPredicates and their detailed value of the TaskRelevantObj./ Step 3: Spatial-Predicates = Mapping (Detailed-Spatial-Predicates, Concept trees) /Mapping the detailed spatial-predicates to the form of general spatial-predicates according to the concept tree of non-attributes./ Step 4: Return {Spatial-predicates}.

3. The calculation of spatial distance 3.1. Spatial distance The method of distance calculation in current stage often computes direct Euclidean distance between two coordinate points. While in real world, there are often obstacles which can not be directly traversed between two objects such as buildings, rivers, hills, etc. which is shown in the left part of Fig. 1. In this paper, we ﬁrst take obstacle factors into account in our clustering method. Deﬁnition 7 (Direct Euclidean distance). Given (1) a set P of n objects {p1, p2, . . . , pn}, "pi represents a customer location, and (2) a set O with of m non-intersecting objects {o1, o2, . . . , om} in a two dimension region R, with each

obstacle oi represented by a simple polygon(Tung & Han, 2001). The direct Euclidean distance between two points pi and pj, denotes qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ as dðpi ; pj Þ ¼

ðpi x pj xÞ2 þ ðpi y pj yÞ2 .

Deﬁnition 8 (Obstructed distance (Zaiane & Lee, 2002)). The obstructed distance between two objects, denotes as d0 (pi, pj), is deﬁned as length of the shortest Euclidean path from pi to pj without cutting any obstacles. If oi is topological polygon, denoted as P(V, E), where V is the set of points forms the polygon of obstacle, V = {v1, v2, v3, . . . , vk}, whereas E is set of lines which forms the polygon of obstacle, E = {e1, e2, . . . , ek}, ei is the line which links points vi and vi+1, 1 6 i 6 k. Deﬁnition 9 (Linkage strategy). The obstacle has two typical classes: concave object and convex object. The linkage strategy of concave objects is different from that of convex objects. If two points pi and pj is obstructed by concave objects, the concave point should not be linked. Whereas, if two points pi and pj is obstructed by convex objects, the convex point must be linked by a line. A series of paths will be linked from point P to all the convex points as described in Fig. 2. Deﬁnition 10 (Visibility). ‘‘Visibility” is a kind of relation between two points, if the line between two points is not intersected by an obstacle which is represented by P(V, E), then the two points have the character di, dj 2 D, i – j, of ‘‘visibility”. Given D = {d1, d2, . . . , dn}, i, j 2 [1, . . . , n], l is the line which links di and dj, if "ei 2 E, :$p 2 l \ ei, then di and dj is regarded as ‘‘visibility”. BSP_tree is an effective method to check out ‘‘visibility” of two objects (Tung et al., 2001). 3.2. The calculation algorithm of spatial distance Algorithm 2: Calculation of obstructed distance Input: (1) Location map of the customers /The distribution of customer locations in digital map./ (2) Obstacle map – P(V, E) /GIS themes of obstacles./ Output: Distance Method: Step 1: P :¼ Set() /Operating spatial data cube to get the set of customers’ locations P = {p1, . . . , pi, . . . , pn}, which is deﬁned as Deﬁnition 5./ Step 2: Result :¼ Check-Visibility (pi, pj, P(V, E)) /Applying BSP-tree (Chawla and Shekhar, 1999) to check out ‘‘visibility” between two objects./ Step 3: If Result :¼ true Then Shape :¼ Checkshape (P(V, E)) /Checking out the shape of obstacles./ Step 4: If Shape :¼ Convex Then Constructconvexgraph (pi, P(V, E), pj) /Adopting the linkage strategy of convex obstacles./ Calculate_Dijkstra_distance (pi, pj) /Applying Dijkstra (Marianov and Revelle, 1996) algorithm which is the well known model to calculate the shortest distance of d0 (pi, pj) with obstacles./ Step 5: Else /Means the obstacle shape :¼ Concave./ Constructconcavegraph (pi, P(V, E), pj) /Adopting the linkage strategy of concave obstacles as Deﬁnition 9./ Calculate_Dijkstra_distance (pi, pj) /Applying Dijkstra algorithm to calculate the distance of d0 (pi, pj) with obstacles./ Endif. Endif Step 6: Calculate_Euclidean_distance (pi, pj) /If there is no obstacles between two objects, then calculate direct Euclidean distance of d(pi, pj) as the description of Deﬁnition 7./ Step 7: Return (Distance).

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936

3927

Deﬁnition 11 (Square-error of obstructed distance (Han et al., 2000)). The spatial clustering with the calculation of obstructed distance is to partition the given data set into k-clusters (Cl1, Cl2, . . . , Clk), such that the following square-error function, E, is minimized. E¼

k X X 0 ðd ðp; ci ÞÞ2 i¼1 p2C i

where ci is the center of cluster Cli that is determined by the clustering method.

Fig. 2. Visibility with obstacles.

Deﬁnition 12 (The weight of spatial object). If a point p has the w, then w p = w (x, y) = {(x, y)1, (x, y)2, . . . , (x, y)wj w = 1, 2, . . . , n}, which means each customer location has its own service level according to its property, e.g. an inferior customer and a VIP customer are both service objects of the enterprise, but their respective service level is obviously different. So we propose the weight adding method to represent the service level of a customer. For example, if the weight(service level) of a customer is 5, which is equal to ﬁve customers each with the weight 1, then the four more same points are generated and added to the location of the customer onto digital map in GIS.

4. Hybrid spatial data clustering algorithm for site selection 4.1. OETTC–MEANS–CLASA algorithm

Fig. 3. The clustering result of two typical methods.

Tung, Hou and Han have proposed the method of ‘‘clustering with obstructed distance” and COD–CLARANS algorithm, which combine the algorithm of CLARANS and the method of obstructed

Fig. 4. The correlation of the algorithms.

3928

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936

distance calculation to solve the problem of site selection. The reason why to discard the k-means method is that the mean value of a set of points is not well deﬁned when obstacles are involved. For example, in the right picture of Fig. 3. The mean value (the center) of the points is on the location of an obstacle, so it is impossible for other points to reach the mean value(the center) in the cluster. On the other hand, the k-mediods method chooses an object within the cluster as a center and texts the eligibility by switching the center to another object in the set of candidate points (mediods), which ensures that the problem in the left of Fig. 3 does not occur. But k-mediods method can be time consuming even for a moderate number of objects and a small number of mediods. According to the theory and experimental researches, the optimal mediod always exists near the mean value of k-means (Chu et al., 2001; Chu et al., 2002). Though the mean value lies on the location of an obstacle, we can absorb the advantages of low computational complex and ﬁtness of huge data set of k-means method, and apply buffering function of GIS to get a certain number of candidate points around the mean value which is regarded as the center of a buffering circle. So the optimal result of the clustering is always around the mean value whether there are obstacles or not. The points got by

buffering serve as the possible mediods (centers) of clustering. This sampling method can remarkably reduce the searching space of clustering algorithm. Though the solution of global optimal clusters is NP-hard, Simulated annealing is a random search method that has been proved to be efﬁcient for optimization which can be applied to generate results with high quality (Kirkpatrick, Gelatt, & Vecchi, 1993). On the basis of above researches, we further design a hybrid algorithm named OETTC_MEANS_CLASA which takes all the spatial factors into account to better realize the purpose of site selection in Algorithm 3. Algorithm 3 composes of ﬁve sub-algorithms including Algorithms 4–8. The correlation of the ﬁve sub-algorithms, the correlation of the Algorithm 3, the Algorithms 1 and 2, and the correlation of spatial data cube, Algorithms 1–3 are described in Fig. 4. In Fig. 4, the operations of spatial data cube can provide the data resource as the inputs of Algorithms 1–3. the former algorithm provides one of the inputs to the latter algorithm in sequence from Algorithms 4 to 8. In Algorithm 7, Algorithm 1 should be called to realize the computation of spatial-predicates. In Algorithm 8, Algorithm 2 should be called to realize the calculation of spatial distance.

Fig. 5. Dataset 1 for Experiment 1.

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936

Algorithm 3: OETTC_MEANS_CLASA Input: (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12)

The number of the clusters – k The number of iterations of k-means – m The number of pointers got by GIS buffering in each cluster – r Location map of the customers – Set() The weight set for customers – w() The map of spatial obstacle – P(V, E) Spatial environmental restriction factors Spatial terrain restriction factors Spatial trafﬁc restriction factors Cost restriction factors Initial temperature T0, cooling temperature Ta Parameters d, q, k

Output: k-clusters and k-mediods Method: Step 1: Calling Weight adding Algorithm. /If the weight of point p is w, then generate the other w 1 points of p onto the digital map, which represents the service level of the customer. The method can be described as Algorithm 4./ Step 2: Calling k-means algorithm. /Deﬁning the parameter k and the iterations m before computing the mean value whose direct Euclidean distance is minimum after step 1, ﬁnally produce k-mediods and kclusters. The algorithm is numbered as Algorithm 5./ Step 3: Calling Subset Algorithm. /Using buffering analytical functions of GIS to get r near points from each cluster and to form the result space {Subset1(), Subset2(), . . . , Subsetk()}. The method can be described as Algorithm 6./ Step 4: Calling Elimination Algorithm. /Evaluating the spatial environment factors and spatial terrain factors by using spatial-predicates which are the operations through spatial measurement function, the cost factors are taken into account as well. Then eliminate the impossible mediods from Subset1(), Subset2(),. . . , Subsetk(), respectively, by the above restriction factors, and form the sets of Candidateset1(), Candidateset2(), . . . , Candidatesetk() accordingly. The method can be described as Algorithm 7./ Step 5: Choosing Initial mediods. /Arbitrarily choosing an object in each Candidateset1(), Candidateset2(), . . . , Candidatesetk() as initial medoids C = {c1, c2, . . . , ck}./ Step 6: Calling CLASA Algorithm. /Testing and swapping mediods from some nearest objects in the same group of Candidateset1(), Candidateset2(), . . . , Candidatesetk(), respectively, to ﬁnd the optimal mediods by applying simulate annealing algorithm in a rather small scope. The method can be described as Algorithm 8./ Step 7: Return the k-medoids and k-clusters.

Algorithm 4: Weight adding algorithm Input: (1) Location map of the customers – Set() /As described in Deﬁnition 5./ (2) The weight set for each customer – w() /As described in Deﬁnition 12./ Output: (3) Spatial objects set after being weighted in the digital map – Weightobject() Method: Weight-object() :¼ set() For each coordinate (xi, yi) in Weight-object() do /For each point in Weightobject()./ (xi, yi) wi /Adding the corresponding weight to each point./ For j = 1 to w 1; j++ Sw1 ðxi ; yi Þ/generating the points according to the Weight-object() :¼ j¼1 number of the weight, and incorporating them into the set of Weightobject(). Endfor; Endfor;

3929

Algorithm 5: k-Means algorithm /Running k-means algorithm in the set of Weight-object() which is the result of Algorithm 4 to ﬁnd k-mediods/

Algorithm 6: Subset algorithm Input: (1) k-mediods that are the result of Algorithm 5. (2) The number of r-mediods which are candidates centers.

(3)

Output: {Subset1(), Subset2(), . . . , Subsetk()} Method: For i = 1 to k do Subseti() :¼ GIS buffering function (mediodi, r, clusteri) /Applying buffering function of GIS to take mean values computed by k-means method as the k-centers of a buffering circles, and acquiring r nearest points, respectively, by extending the scope of each buffering circle in each cluster. Then forming the sets of Subset1(), Subset2(), . . . , Subsetk()/ End for;

Alogrithm 7: Elimination algorithm Input: (1) (2) (3) (4)

{Subset1(), Subset2(), . . . , Subsetk()} which is the result of Algorithm 6 Spatial environmental factors (such as Not-close-to(x, railway), etc.) Spatial trafﬁc factors (such as Touch(x, wide-road), etc.) Spatial terrain factors (such as Overlap(x, low-terrain), Overlap(x, enough-area), etc.) (5) Cost factors (such as the cost must be lower than 5000/m m)

Output: (6) {Candidateset1(), Candidateset2(), . . . , Candidatesetk()} Method: For i = 1 to k do Candidateseti() :¼ Subseti() For each mediod in Candidateseti() do Calling algorithm 1 /Computing spatial topological predicates/ If Mediod spatial-predicate :¼ Environmental restriction /Applying spatial-predicates to judge the environmental restriction factors of the possible location./ Then Candidateseti() :¼ Candidateseti() {mediod} /Wiping off this mediod(point)/ Else if Mediod spatial-predicate :¼ Trafﬁc restriction /Applying spatial-predicates to judge the trafﬁc restriction factors of the possible location./ Then Candidateseti() :¼ Candidateseti() {mediod} /Wiping off this mediod(point)/ Else if Mediod spatial-predicate :¼ Terrain restriction /Applying spatial-predicates to judge the terrain restriction factors of the possible location./ Then Candidateseti() :¼ Candidateseti() {mediod} /Wiping off this mediod(point)/ Else if Mediod cost :¼ Cost restriction /Applying the r estriction factors of the cost to ﬁnd suitable locations./Then Candidateseti() :¼ Candidateseti() {mediod} /Wiping off this mediod (point)/ Endif Endfor; Return Candidateseti() Endfor;

4.2. The sub-algorithms of OETTC–MEANS–CLASA Metropolishas proposed simulated annealing method which is a simulation process of physical annealing by computer in the year of 1953. In order to achieve a qualiﬁed crystal, Simulated annealing heats up a solid to a molten temperature, then slowly cool down

3930

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936

the temperature to the freezing point. The algorithm of simulated annealing tries to ﬁnd global optimal solution by simulating the above process of annealing. During the annealing method, energy diversion process submits to the following Boltzmann formula (Kirkpatrick et al., 1993)

PðEÞ ¼ expðE=TÞ P(E) denotes the probability of system at E (energy) stage, T is system temperature. The lower the temperature T is, the smaller the probability of the system at high-energy stage is, which denotes the gradual decline of the system activity. When the system temper-

Fig. 6. Dataset 2 for Experiment 2.

Table 1 The parameter setting of OETTC factors for two algorithm T(Trafﬁc factors)

T (Terrain factors)

C

Experiment 1 P(V, E) Not-Touch (x, hill), Not-close-to (x, cinema)

O

E (Environment factors)

Touch (x, wide road), Not-close-to (x, railway)

Overlap (x, low-terrain), Overlap (x, enough-area)

5000/m * m

Experiment 2 P(V0 , E0 ) Not-Touch (x, school), Not-close-to (x, market)

Touch (x, wide road), Close-to (x, highway)

Overlap (x, plain), Not-Overlap (x, building)

8000/m * m

Table 2 Other parameters of two algorithms Experiment 1 OETTC–MEANS–CLASA OETTC–CLARANS Experiment 2 OETTC–MEANS–CLASA OETTC–CLARANS

k 10 k = 10

m 50

r 200

Ta 0.000025

T0 d q 0.0001 0.9 15 max_try = 400 (maximal times of neighbours searching)

k 50,000

k 6 k=6

m 50

r 300

Ta 0.000025

T0 d q 0.0001 0.9 15 max_try = 400 (maximal times of neighbours searching)

k 50,000

3931

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936

ature is low enough, the probability of energy stage reaches value ‘‘1”, which denotes the system appears stable at low-energy stage. When the Boltzmann function arrives at P(E) = 1, global optimization is obtained. In addition, during the process of cooling down, system energy might somehow ascend, which means P(E) is higher than a given probability value. The algorithm can accept hypo-optimal solution to some extent so as to jump out of the local optimization to enhance the result quality towards global optimization. The process of simulated annealing should be long enough to enable the system to reach a stable stage, especially when the temperature is around the freezing point. If it is too quick to cool down the temperature, some ﬂaws will be produced on the crystal. The key step of simulated annealing algorithm is to choose a suitable ‘‘cooling programming” which includes the set of initial stage, the test of Metropolis stability, cooling strategy, times of iterations and stop criterion. We design simulated annealing algorithm for site selection shown as Algorithm 8. 4.3. Experiments and comparisons We apply 200,000 items of customers information provided by Beijing Carrefour Supermarket and organize those data into SQL

Alogrithm 8: CLASA algorithm Input: (1) Initial mediods C = {c1, c2, . . . , ck} (2) Initial temperature T0, freezing temperature Ta (3) S = {Candidateset1(), Candidateset2 (), . . . , Candidatesetk()} /The result of Algorithm 7/ (4) Parameters t, d, q, k Output: k-medoids and k-clusters Method: Repeat Begin Generate (C0 form S) /Swapping the initial mediods C in S, randomly generate a new mediods C0 / ME = E(C0 ) E(C)/E(C) is total square-error of mediods C which can be calculated by Algorithm 2 and described by Deﬁnition 11./ If ME < 0 /Which means total square-error of mediods C0 is less than C/ Then C :¼ C0 and n :¼ n + 1 /Which denotes mediods C0 is a better solution that can replace mediods C, and adding 1 to the total number n when ME < 0./ Else if eDE=T t Prandom[0, 1) /Given a parameter random[0, 1), if P(E) P random[0, 1)/Then C :¼ C0 /C0 is accepted at the probability value of P(E), which is an important mechanism to jump out of the local optimization to search for more qualiﬁed mediods./ Endif; t :¼ t + 1; Calculate_length (g); /n is the times of ME < 0, Calculate_length (g) here means if g > q (q is a given parameter), then algorithm jumps to Stop criterion./ Calculate_control (Tt); /Calculating control temperature Tt which is the schedule of annealing process. In our case, Tt = T0dt (T0 is Initial temperature, t is times of algorithm iterations, d is a given constant, 0 < d < 1.)/ Until Stop criterion; /There exist two stop criterions in our case, if either of them is met then the algorithm will cease. (1) Give a number k to restrict the number of the combinations of mediods, if t P k(t is the times of the iterations), then the algorithm will cease. (2) If Tt < Ta, (Ta denotes freezing temperature, Tt is the control temperature), then algorithm will cease too. Otherwise with the increase of t, the control temperature will continue to drop down, the algorithm will continue to iterate./ (1) t P k (2) Tt < Ta Return {C} /Output the result of k-mediods/ Return {Cluster} /Output the result of k-clusters/ End;

server 2000 as data warehouse, then we use digital map of Chinese Beijing stored in supermap3.0 (GIS software), and devise automatic programming to generate spatial points as customer locations according to the address of those customers. The matching tool is ArcSDE which fulﬁls the functions of spatial data engineering by integrating spatial data and non-spatial data of customers (ESRI, http://www.esri.com/software/sde/index.html). Visual Basic 6.0 is used as the platform. 4.3.1. Experiments about the running times for testing two algorithms. 4.3.1.1. Experiment data. In order to test the feasibility and efﬁciency of our method, we use two cube operations described in Deﬁnitions 1–4 to select 103,875 customers whose total expenditure in 2005 is above 5000 Yuan, and to select 121,556 customers who bought TV in 2005. The function – Set(SM) described in Definition 5 can get the two-dimensional points in GIS as two datasets for simulating customer locations of the results of the two cube operations, respectively. Two digital maps which are got by spatial data cube operation serve as the background of customer distribution in the city of Beijing. The two datasets will overlap on the two maps shown as Figs. 5 and 6, where the points are represented as (x, y), the obstacle entities can be denoted as (V, G) in GIS, e.g. park, lake, railway, etc. Our task is to ﬁnd the suitable centers and their corresponding clusters, In other words, is to ﬁnd the customer service centers (mediods) and their dominative areas (clusters). 4.3.1.2. Experimental analysis. Based on above datasets, the algorithms of COD–CLARANS and our OETTC–MEANS–CLASA are tested for merits and shortages. Since COD process is limited in the spatial obstacle factors, we adjust COD–CLARANS to be suitable for OETTC conditions, and design OETTC–CLARANS algorithm to compare

Table 3 The results of two algorithms based on the Dataset 1 (Experiment 1) OETTC–CLARANS

OETTC–MEANS–CLASA

The total distance calculated by algorithm (105 m)

The average distance of service response (103 m)

The total distance calculated by algorithm (105 m)

The average distance of service response (103 m)

8.57 8.64 8.70 8.60 8.69 8.68 8.62 8.59 8.61 8.65 8.66 8.65 8.63 8.70 8.68 8.67 8.66 8.63 8.69 8.67 8.59 8.62 8.60 8.66 8.67 8.65 8.61 8.61 8.68 8.64

7.75 7.74 7.78 7.69 7.71 7.78 7.69 7.74 7.75 7.72 7.72 7.74 7.77 7.76 7.75 7.76 7.71 7.72 7.73 7.75 7.74 7.72 7.73 7.75 7.77 7.72 7.69 7.70 7.71 7.73

3.76 3.80 3.75 3.79 3.85 3.82 3.84 3.81 3.79 3.77 3.78 3.77 3.80 3.75 3.82 3.76 3.78 3.79 3.81 3.83 3.80 3.77 3.79 3.84 3.82 3.81 3.79 3.77 3.78 3.84

7.35 7.37 7.38 7.36 7.35 7.38 7.35 7.37 7.35 7.37 7.36 7.36 7.37 7.39 7.35 7.38 7.37 7.38 7.36 7.35 7.37 7.38 7.38 7.36 7.37 7.35 7.39 7.38 7.35 7.36

3932

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936

with our OETTC–MEANS–CLASA algorithm. OETTC factors are deﬁned as Table 1, and many other parameters are set in advance as Table 2: On the basis of Dataset 1 (Fig. 5) we carry out Experiment 1, two algorithms run for 30 times, respectively, according to the parameter sets. The results are shown as Tables 3–5. In the following tables, ‘‘the total distance calculated by algorithm” tests the overall distance needs to be calculated by using this algorithm. The shorter the total distance is calculated, the less computational cost will the algorithm spends, so it can stand for the performance efﬁciency of the algorithm. ‘‘The average distance of service response” tests the average traveling distance of the service centers (mediods) to all the customers’ locations (objects in the cluster) under the charge of the service centers, and can represent the result quality of the algorithm. From the experimental results shown in Tables 3–5, We can draw the conclusion that the OETTC–MEANS–CLASA algorithm is superior to OETTC–CLARANS algorithm in the both aspects of performance efﬁciency (computational cost) and result quality when running the two algorithms for any times (here 30 times for example). On the basis of Dataset 2 (Fig. 6) we carry out Experiment 2, two algorithms run for 30 times, respectively, according to the parameter sets. The results are shown as Tables 6–8. We can also draw the conclusion that there exist gaps of the two algorithms and OETTC–MEANS–CLASA algorithm is superior to OETTC–CLARANS algorithm in the both aspects of computational cost and result quality when running the two algorithms for any times (here 30 times for example). Table 4 The total distance calculated by two algorithms

Total distance calculated by algorithm

OETTC-CLARANS

OETTC-MEANS-CLASA

10 8 6 4 2 0

1

3 5

7 9 11 13 15 17 19 21 23 25 27 29 Running times

OETTC–CLARANS

OETTC–MEANS–CLASA

The total distance calculated by algorithm (105 m)

The average distance of service response (103 m)

The total distance calculated by algorithm (105 m)

The average distance of service response (103 m)

9.66 9.85 9.72 9.87 9.70 9.75 9.74 9.70 9.69 9.68 9.78 9.68 9.85 9.81 9.70 9.75 9.84 9.73 9.75 9.84 9.66 9.68 9.83 9.69 9.77 9.70 9.72 9.80 9.84 9.72

8.15 8.14 8.18 8.17 8.16 8.16 8.17 8.16 8.15 8.18 8.13 8.17 8.15 8.17 8.15 8.16 8.14 8.18 8.14 8.15 8.16 8.13 8.15 8.17 8.15 8.16 8.17 8.16 8.14 8.16

4.16 4.20 4.09 4.18 4.15 4.13 4.17 4.19 4.20 4.12 4.10 4.13 4.20 4.12 4.19 4.17 4.15 4.14 4.10 4.18 4.17 4.20 4.13 4.09 4.18 4.19 4.11 4.18 4.15 4.19

7.86 7.89 7.88 7.89 7.85 7.87 7.87 7.85 7.86 7.88 7.87 7.86 7.86 7.88 7.86 7.87 7.88 7.89 7.85 7.89 7.85 7.86 7.88 7.86 7.88 7.89 7.86 7.88 7.85 7.85

OETTC-CLARANS

OETTC-MEANS-CLASA

Total distance calculated byalgorithm

Average distance of service response

7. 9 7. 8 7. 7 7. 6 7. 5 7. 4 7. 3 7. 2 7. 1

Table 6 The results of two algorithms based on the Dataset 2 (Experiment 2)

Table 7 The total distance calculated by two algorithms

Table 5 The average distance of service response

OETTC-CLARANS

4.3.1.3. Clustering result. The two experiments of our OETTC– MEANS–CLASA can realize the selection of customer service location. We can randomly select the clustering result in above experiments. For examples, Fig. 7 is the clustering result of 29th item in the total 30 items of Experiment 1 (See Table 3) and Fig. 8 is the clustering result of the 5th item in the total 30 items of Experiment 2 (See Table 6). The result maps are Fig. 7 (it’s original map is Fig. 5) and Fig. 8 (it’s original map is Fig. 6) include the location selection of customer service center (mediod) and the response range(domi-

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 Running times

OETTC-MEANS-CLASA

12 10 8 6 4 2 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 Running times

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936 Table 8 The average distance of service response

Average distance of service response

OETTC-CLARANS

OETTC-MEANS-CLASA

8.3 8.2 8.1 8 7.9 7.8 7.7 7.6 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 Running times

native area) of each center (cluster). The service locations are marked as ‘‘j”, the areas circled by black lines are the corresponding clusters. 4.3.2. Experiments about the different scale of datasets for testing two algorithms. Supposing that all the parameters are same as that of Tables 1 and 2 in Experiment 1, we use automatic programming to generate 100,000, 200,000, . . . , 1,000,000 two-dimension points, respec-

3933

tively, to be 10 datasets for testing performance of the two algorithms when the number of pointers increase. Table 9 shows the change of ‘‘the total distance calculated by algorithm” due to change of the number of points, and Table 10 shows the change of ‘‘the average distance of service response” due to the change of the number of points. We can draw a conclusion from Table 9 that the gap of the performance efﬁciency (computational cost) of two algorithms become larger and large when the number of points increases. The reason is that the computational complexity of is different. The computational complexity of OETTC–CLARANS is O(n n), and that of OETTC–MEANS–CLASA algorithm is O(n k t + k r r) which is made up of the computational complexity of k-means as O(n k t) and the computational complexity of CLASA as O(k r r). Normally k n, t n and r n (r is the number of the nearest objects of each cluster), then O(n k t + k r r) O(n n). We can also draw a conclusion from Table 10 that our algorithm is better than OETTC–CLARANS in the aspect of result quality when the number of points increases. The reason is that simulated annealing can search better results (have more qualiﬁed results) than CLARANS.

4.3.3. Experiments about the change of P(V, E) for testing two algorithms. Supposing that parameter O is changeable (P(V, E) is changeable), the other parameters are same as that of Tables 1 and 2 in Experiment 1. Given the number of the points is 120,000, we

Fig. 7. OETTC–MEANS–CLASA results for Dataset 1.

3934

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936

Fig. 8. OETTC–MEANS–CLASA results for Dataset 2.

Table 9 The comparison of performance efﬁciency when the number of points increases

Table 10 The comparison of result quality when the number of points increases

use automatic programming to randomly generate 25 different P(V, E)s, each of which composes the corresponding convex obstacles and concave obstacles. The experiment is to test the

OETTC-MEANS-CALSA

25 20 15 10 5 0 10 00 0 20 0 00 0 30 0 00 0 40 0 00 0 50 0 00 0 60 0 00 0 70 0 00 0 80 0 00 0 90 0 00 10 00 00 00 0

30 00 00 40 00 00 0 50 0 00 0 60 0 00 0 70 0 00 80 00 00 90 00 00 10 00 00 00 0

00

00

00 10

The number of points

Average distance ofservice response

OETTC-MEANS-CLASA

300 250 200 150 100 50 0 20

Total distance calculated by algorithm

OETTC-CLARANS OETTC-CLARANS

The number of points

performance of the two algorithms when they meets different P(V, E)s. The experimental results are shown as Tables 11 and 12.

3935

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936

0 1 2 3 4 5 6

7 8 9 10 11 12 13 14 15

Different P(V,E)

Table 12 The comparison of result quality when it appears different P(V, E)

Average distance of service response

OETTC-CLARANS

OETTC-MEANS-CLASA

n=800000

15

n=700000

10

n=600000 n=500000

5

n=400000

0

n=300000

11 % 13 % 15 %

5

n=900000

20

9%

10

n=100000

25

7%

15

30

5%

20

3%

OETTC-MEANS-CLASA

1%

Total distance calculated by algorithm

OETTC-CLARANS

Table 13 The correlation of parameter r and the result quality of our algorithm

Average distance of service respons

Table 11 The comparison of performance efﬁciency when it appears different P(V, E)

The proportion of sampled points

n=200000 n=100000

respectively) demonstrate that the value of 6% is an ideal proportion of sampling. When the proportion exceeds 6%, it will contribute nothing to get a more qualiﬁed result., while the searching space of the algorithm will became larger than before, which will result in a higher computational cost.

12 10

5. Conclusion

8 6 4 2 0

1 2 3 4 5

6 7 8 9 10 11 12 13 14 15 Different P (V,E)

No matter what kind of P(V, E) is met, OETTC–MEANS–CLASA algorithm is superior to OETTC–CLARANS algorithm in the both aspects of computational cost and results quality. The obstacles denoted by different P(V, E)s are not the factors to affect or change the gap of two algorithms. 4.3.4. Experiments about the parameter r with different values. Supposing that parameter r is changeable, other parameters are same as that of Tables 1 and 2 in Experiment 1. Parameter r (see Algorithm 6) decides the candidate number of the points as the possible centers, and is a key parameter to affect the result quality of OETTC–MEANS–CLASA algorithm. We assign the number of points (n) to 100,000, 200,000, . . . , 1,000,000, respectively, and do an experiment to test the correlation of parameter r and result quality. Thex-axis is the proportion of the number of the sampling points divided by the number of all the points in the dataset (here the proportion can be represented by the formula: k r/n, where k is the number of cluster, k r is the number of the sampling points, and n is the total number of the points in the dataset), the y-axis records the result quality denoted by ‘‘average distance of service response”. The experimental result is shown as Table 13. The OETTC–MEANS–CLASA does not get the optimal result until the proportion (k r/n) reaches the value of 6%. The experiment results which are gotten by running OETTC–MEANS–CLASA on 10 different datasets (n is assigned to 100,000, 200,000, . . . , 1,000,000,

The article proposes a hybrid spatial clustering method for the selection of customer service location. We transform the traditional model-oriented method to data-driven method, establish a data model named spatial data cube for customer management, and explore a process of site selection from the huge amount of objects which are recorded as two-dimension points in GIS. OETTC–MEANS–CLASA is proposed in consideration of spatial obstacle factors, spatial environmental factors, spatial trafﬁc factors, spatial terrain factors and cost factors. It provides a new decision making process for the problem of site selection. Our experiments have proved that OETTC–MEANS–CLASA is more efﬁcient and can achieve more qualiﬁed results than other spatial clustering methods. References Aboolian, Robert (2007). Competitive facility location and design problem. European Journal of Operational Research, 182, 40–62. Ayala, G., Epifaniob, I., Simó, A., & Zapatera, V. (2006). Clustering of spatial point patterns. Computational Statistics & Data Analysis, 50, 1016–1032. Berman, O., & Krass, D. (2002). Expenditures locating multiple competitive facilities: Spatial interaction models with variable expenditures. Annals of Operations Research, 111, 197–225. Chawla, Sanjay, & Shekhar, Shashi (1999). Modeling spatial dependencies for mining geospatial data: An introduction. Geographic Data Mining and Knowledge discovery (GKD), 75(6), 112–120. Cheng, M. Y., & Chang, G. L. (2001). Automating utility route design and planning through GIS. Automation in Construction, 10, 507–516. Cheng, Eddie W. L., Li, Heng, & Yu, Ling (2007). A GIS approach to shopping mall location selection building and environment. Computers & Operations Research, 42, 884–892. Chu, S. C., Roddick, J. F., & Pan, J. S. (2001). A comparative study and extensions to kmedoids algorithms (pp. 1708–1717). In Fifth international conference on optimization: Techniques and application, Hong Kong, China. Chu, S. C., Roddick, J. F., & Pan, J. S., (2002) An efﬁcient k-medoids-based algorithm using previous medoid index, triangular inequality elimination criteria and partial distance search (pp. 63–72). In The 4th international conference on data warehousing and knowledge discovery (DaWaK 2002), Aix-en-Provence, France. Clementini, Eliseo et al. (2000). Mining multiple-level spatial association rules for objects with a broad boundary. Data & Knowledge Engineering, 34, 251–270. Demir, Engin et al. (2007). Clustering spatial networks for aggregate query processing: A hypergraph approach. Information Systems [Available online at 3 April 2007].

3936

B. Fan / Expert Systems with Applications 36 (2009) 3923–3936

Domingo, J., Ayala, G., & Dı´az, M. (2002). Morphometric analysis of human corneal endothelium by means of spatial point patterns. International Journal of Pattern Recognition and Artiﬁcial Intelligence, 16(2), 127–143. ´ nguez, Enrique et al. (2008). A neural model for the p-median problem. Domı Computers & Operations Research, 35, 404–416. ESRI spatial database engine. 2003. Ester, Martin, & Kriegel, Hans-Peter (1998). Clustering for mining in large spatial databases [Special Issue on Data Mining]. KI-Journal, 1, 332–338. Han, J., Kamber, M., & Tung, A. K. H. (2000). Spatial clustering methods in data mining: A survey (pp. 1–28). Technical report. Simon Fraster University, Computer Science. Han, Jiawei, & Kamber, Micheline (2000). Data mining: Concepts and technique (pp. 23–56, 77–106). Morgan Kaufman. Hu, Tianming, & Sung, SamYuan (2006). A hybrid EM approach to spatial clustering. Computational Statistics & Data Analysis, 50, 1188–1205. Jain, K., & Vazirani, V. (2001). Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. Journal of the ACM, 48, 274–296. Jankowski, P. (1995). Integrating geographical information system and multiple criteria decision-making methods. International Journal of Geographical Information Systems, 9, 251–273. Kirkpatrick, S., Gelatt, C. D., Jr., & Vecchi, M. P. (1993). Optimization by simulated annealing. Science, 200(4598), 671–680. Korte, George P. E., & Koret, George P. (1997). The GIS book: Understanding the value and implementation of geographic information systems [pp. 7–28]. Delmar Publishers. Ladner, R., Petry, E., & Cobb, M. A. (2003). Fuzzy set approaches to spatial data mining of association rules. Transactions in GIS, 7(1), 123. Lin, Ge (2004). Comparing spatial clustering tests based on rare to common spatial events. Computers, Environment and Urban Systems, 28, 691–699. Marianov, Vladimir, & Revelle, Charles (1996). The queueing maximal availability location problem: A model for the sitting of emergency vehicles. European Journal of Operational Research, 93, 110–120. Nathanail, C. P. (1998). Spatial management of geotechnical data for site selection. Engineering Geology, 50, 347–356. Nebojsa, Stefanovic (1997). Design and implementation of on-line analytical processing (OLAP) of spatial data (pp. 32–78). M.Sc. Thesis. Canada: Computer Science, Simon Fraser University.

Ng, Raymond T., & Han, Jiawei (2002). CLARANS: A method for clustering objects for spatial data mining. IEEE Transaction on Knowledge and Data Engineering, 14(5), 1003–1016. Nikolakaki, Pantoula (2004). A GIS site-selection process for habitat creation: Estimating connectivity of habitat patches. Landscape and Urban Planning, 68, 77–94. Owen, S. H., & Daskin, M. S. (1998). Strategic facility location: A review. European Journal of Operational Research, 111, 423–447. Papadias, Dimitris, Kalnis, Panos, & Tao, Yufei (2001). Efﬁcient OLAP operations in spatial data warehouses (Vol. 1, pp. 1–4). Technical report HKUST-CS01-01. Stefanovie, N., Han, J., & Koperski, K. (2000). Object-based selective materialization for efﬁcient implementation of spatial data cubes. IEEE Transactions on Knowledge and Data Engineering, 12, 938–958. Tran, Thanh N. et al. (2003). SpaRef: A clustering algorithm for multispectral images. Analytica Chimica Acta, 490, 303–312. Tung, A. K. H., & Han, J., (2001). Constraint-based clustering in large databases. In Proceedings of the 2001 international conference on database theory (pp. 405– 419). Tung, A. K. H., Hou, J., & Han, J. (2001). Spatial clustering in the presence of obstacles. In Proceedings of the 2001 international conference on data engineering (ICDE’01) (pp. 359–367). Vlachopoulou, Maro et al. (2001). Geographic information systems in warehouse site selection decisions. International Journal of Production Economics, 71, 205–212. Wang, Lizhen, & Xie, Kunqing (2005). Efﬁcient discovery of multilevel spatial association rules using partitions. Information and Software Technology, 47(11), 829–840. Yeh, Anthony Gar-On, & Chow, Man Hong (1996). An integrated GIS and locationallocation approach to public facilities planning: An example of open space planning. Computers Environment and Urban Systems, 20, 339–350. Zaiane, Osmar R., & Lee, Chi-Hoon, (2002). Clustering spatial data in the presence of obstacles: A density-based approach. In International database engineering and applications symposium (IDEAS’02), Edmonton, Canada (pp. 214–224). Zhang, Peng (2007). A new approximation algorithm for the k-facility location problem. Theoretical Computer Science, 384, 126–135. Zhang, Lixun, & Rushton, Gerard (2008). Optimizing the size and locations of facilities in competitive multi-site service systems. Computers & Operations Research, 35, 327–338.

A hybrid spatial data clustering method for site selection: The data driven approach of GIS mining

A hybrid spatial data clustering method for site selection: The data driven approach of GIS mining

Recommend Documents