Efficient generation of spatiotemporal relationships from spatial data streams and static data


Information Processing and Management 57 (2020) 102205

Contents lists available at ScienceDirect

Information Processing and Management journal homepage: www.elsevier.com/locate/infoproman


Sungkwang Eom, Xiongnan Jin, Kyong-Ho Lee



Department of Computer Science, Yonsei University, 50 Yonsei-Ro, Seodaemun-gu, Seoul 03722, Republic of Korea

ARTICLE INFO

Keywords: Spatiotemporal relationship; Data stream; Spatiotemporal indexing

ABSTRACT

Recently, massive amounts of position-annotated data have been generated in a streaming fashion. In addition, massive amounts of static data including spatial features are collected and made available. In Internet of Things (IoT) environments, various applications can benefit from utilizing spatial data streams and static data. Therefore, IoT applications typically require stream processing and reasoning capabilities that extract information from low-level data. In particular, sophisticated stream processing and reasoning must be preceded by the generation of spatiotemporal relationships (SRs) from spatial data streams and static data. However, existing techniques mostly focus solely on direct processing of sensing data or on the generation of spatial relationships from static data. In this paper, we first address the importance of SRs between spatial data streams and static data and then propose an efficient approach to deriving SRs in real time. We design a novel R-tree-based index with Representative Rectangles (RRs), called the R3 index, and devise an algorithm that leverages relationships and distances between RRs to generate SRs. To verify the effectiveness and efficiency of the proposed approach, we performed experiments using real-world datasets. The experimental results confirmed the superiority of the proposed approach.

⁎ Corresponding author. E-mail address: [email protected] (K.-H. Lee).

https://doi.org/10.1016/j.ipm.2020.102205 Received 24 March 2019; Received in revised form 16 December 2019; Accepted 11 January 2020. 0306-4573/ © 2020 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years, with the development of location sensing technologies and Global Positioning System (GPS)-enabled mobile devices, massive amounts of position-annotated data streams have been generated (Jung, Zhang, & Winslett, 2017). In addition, massive amounts of static data including spatial features are collected and made available from diverse domains by various agencies. Such spatial data streams together with static data open up new opportunities for IoT applications, including transportation (Paule, Sun, & Moshfeghi, 2019), location-based services (Valverde-Rebaza, Roche, Poncelet, & de Andrade Lopes, 2018; Wu, Kao, Wu, & Huang, 2015), and event detection (Ruocco & Ramampiaro, 2015), to better understand their states, improve their performance, and become more efficient. Therefore, IoT applications typically require stream processing (Wang et al., 2018) and reasoning (Babu & Widom, 2001; Della Valle, Ceri, Van Harmelen, & Fensel, 2009) capabilities that can handle highly dynamic data streams and extract high-level information from low-level data to facilitate decision support.

In general, IoT data is extremely heterogeneous. To cope with this heterogeneity, several studies have applied semantic web technologies to processing data streams. For this purpose, diverse stream processing engines and query languages such as Streaming SPARQL (Bolles, Grawunder, & Jacobi, 2008), C-SPARQL (Barbieri, Braga, Ceri, & Grossniklaus, 2010), EP-SPARQL (Anicic, Fodor, Rudolph, & Stojanovic, 2011), and CQELS (Le-Phuoc, Dao-Tran, Parreira, & Hauswirth, 2011) have been proposed. These engines and languages are extended versions of SPARQL and have semantic models to process continuous queries on resource description framework (RDF) streams. They are also equipped with the operational semantics of the continuous query language (CQL) (Arasu, Babu, & Widom, 2006) to support snapshot captures of recent data by applying temporal window operators to data streams. As a theoretical foundation of RDF stream processing, the LARS framework (Beck, Dao-Tran, Eiter, & Fink, 2015) provides a rule-based formalism and semantics that refer to time and window operators to represent views on data streams.

The approaches mentioned above, however, mainly focus on temporal aspects of data streams. They do not consider generating spatiotemporal relationships (SRs) from spatial data streams and static data for sophisticated stream processing and reasoning. IoT applications can benefit from spatial and temporal aspects of data for context recognition, information dissemination, and decision making (Wang & Li, 2017). When a large amount of geo-annotated data streams containing spatiotemporal features is being generated, these features often play an essential role in IoT applications (Barbosa, Pham, Silva, Vieira, & Freire, 2014). In particular, discovering relationships between data streams and static data containing spatiotemporal features can provide a better understanding and new insights for IoT applications.

Fig. 1. An example of real world traffic condition.

Let us consider a traffic monitoring application as an example of an IoT application. A traffic monitoring application should be able to react appropriately to vehicles, road infrastructures, and traffic events, all of which affect road conditions. Fig. 1 shows an example of a real-world traffic condition. In this example, sensing data of vehicles and information about traffic events enter the application in a streaming fashion. To detect traffic situations, rule-based traffic monitoring applications execute the rules shown in Table 1.
By executing the rules, the applications recognize traffic situations (for example, traffic jams and car accidents) and perform appropriate actions (for example, notifications). Let us assume that the input data at time t is as follows: D = {CarSpeed(car1, 0), CarSmoke(car1, high), within(car1, road1), nearby(car1, police1), CarSpeed(car2, 50), within(car2, road2), ···}. Data elements such as CarSpeed(car1, 0) and CarSmoke(car1, high) are generated from observations of vehicles. Spatiotemporal relationships (SRs) such as within (in R2, R5, R6), overlaps (in R1), and nearby (in R7), however, cannot be obtained directly from vehicles. Obtaining these SRs requires additional computations using data streams and static data. Ultimately, to achieve sophisticated situational awareness, it is necessary to generate SRs between vehicles, road areas, and traffic events. Existing traffic monitoring applications, however, utilize only sensing data of vehicles and do not consider SRs that help to recognize complex traffic conditions (Eiter, Parreira, & Schneider, 2017; Zhang, Wo, Xie, Lin, & Liu, 2017b). Traffic monitoring applications should be able to identify SRs between objects on roads and road areas in order to react appropriately. Namely, they are required to efficiently extract SRs from huge volumes and heterogeneous types of data streams and static data with spatiotemporal features. Therefore, the accurate and fast generation of SRs from spatial data streams and static data is a crucial challenge.

Table 1
An example of rules for detecting road condition.

Id.   Rule
R1    TrafficJam(S) ← Construction(E), overlaps(E, S)
R2    Notification(C) ← Car(C), Construction(E), within(C, E)
R3    CarFire(C) ← Car(C), CarSmoke(C, high)
R4    CarAccident(C) ← CarFire(C), CarSpeed(C, 0)
R5    TrafficJam(S) ← CarAccident(C), within(C, S)
R6    Notification(C) ← Car(C), within(C, S), TrafficJam(S)
R7    Notification(P) ← CarAccident(C), nearby(C, P), Police(P)

In this paper, we define three types of SRs between spatial data streams and static data. Specifically, data streams and static data contain geometric objects such as points, lines, or polygons, and SRs represent topological relationships between these geometric objects. Generating SRs in real time is challenging because performing topological calculations on geometric objects incurs a high computational cost. Moreover, there is little published research on SR generation in stream environments. We design a novel spatiotemporal index, called the R3 index, that supports the efficient generation of SRs. The R3 index is an R-tree-based index (Guttman, 1984) with Representative Rectangles (RRs) designed to effectively search for objects that can have SRs with an input object. We devise an algorithm that prunes the search space and rapidly calculates SRs by using relationships and distances between the RRs. We also apply distance approximation to generate SRs between objects. Our contributions can be summarized as follows:

• We address the problem of generating SRs from spatial data streams and static data in real time. This problem is related to computing topological relationships and distances between spatial objects in a timely manner. Our work is the first attempt to utilize hierarchical tree structures such as R-tree for generating SRs.
• We design a novel index structure based on R-tree with RRs, called the R3 index, to store spatial objects and support SR generation. We devise an efficient and effective algorithm that leverages relationships and distances between the RRs to filter candidate objects and generate SRs.
• We evaluate the performance of the proposed approach using real-world datasets and report on its efficiency and effectiveness.

The remainder of this paper is organized as follows. Related work is reviewed in Section 2. Preliminary definitions are given in Section 3. A novel index structure to store data streams and static data is introduced in Section 4. An efficient algorithm for SR generation is proposed in Section 5. Experimental results are reported and discussed in Section 6. Finally, the paper is concluded, and future directions are presented in the last section.

2. Related work

In this paper, we present an approach to generating SRs for rule-based stream reasoning. This section summarizes and discusses related work on spatiotemporal stream processing and reasoning, spatial relationship generation, and spatiotemporal indexing.

2.1. Spatiotemporal stream processing and reasoning

A number of approaches have been proposed that support spatiotemporal stream processing and reasoning. Eiter et al. propose an ontology-mediated query answering method (Eiter et al., 2017) for DL-LiteA. This method is based on conjunctive queries over DL-LiteA ontologies, which combine spatial relationships and window operators. Eiter et al. present a query rewriting technique that transforms spatial relationships and uses a decomposition method to generate a query execution plan. De Leng and Heintz present a hybrid logic called Metric Spatio-Temporal Logic (MSTL) for spatiotemporal stream reasoning (de Leng & Heintz, 2016). MSTL is a combination of the Metric Temporal Logic (MTL) and the Region Connection Calculus (RCC-8) (Randell, Cui, & Cohn, 1992), and makes it possible to reason over spatiotemporal objects. They define the concept of a landmark as an area that does not change between two time points. The landmark is used to infer relationships between spatial entities at different time points, which in many cases cannot be observed. Eom and Lee present a spatiotemporal query language that integrates temporal and geospatial properties (Eom & Lee, 2017).
They introduce spatiotemporal window operators and the formal semantics of such operators. For efficient access and storage of spatial semantic streams, they propose a spatiotemporal index that helps to process spatiotemporal queries.

The approaches mentioned above identify relationships and perform stream processing and reasoning based on the recognized relationships. They, however, only consider relationships between spatial data streams with spatial objects of a single type (for example, points). They do not take into account SRs between spatial data streams with spatial objects of diverse types (for example, polygons). In this paper, we propose an approach to generating SRs from spatial data streams and static data with spatial objects of diverse types.

2.2. Spatial relationship generation

In recent years, numerous spatial relationship generation strategies have been proposed. RCC-8 (Randell et al., 1992) represents spatial relationships between regions. Similar to RCC-8, Clementini et al. introduce the Dimensionally Extended 9-Intersection Model (DE-9IM) (Clementini, Sharma, & Egenhofer, 1994), a topological model and a standard used to describe spatial relationships between two regions. Perry et al. propose SPARQL-ST (Perry, Jain, & Sheth, 2011) to support spatiotemporal queries on temporal RDF graphs containing spatial objects. stSPARQL (Koubarakis & Kyzirakos, 2010) is proposed for the same purpose as SPARQL-ST. Both query languages target static RDF data that include spatiotemporal information. The Open Geospatial Consortium proposes the GeoSPARQL standard (Perry & Herring, 2012), a query language that supports representing and querying geospatial data and discovering topological relations. Kyzirakos et al. present Strabon (Kyzirakos, Karpathiotakis, & Koubarakis, 2012), an RDF store that supports geospatial query languages such as stSPARQL and GeoSPARQL. Salas and Harth (2011) present a method for finding spatial equivalences between geospatial RDF datasets using the Hausdorff distance distribution (Nutanong, Jacox, & Samet, 2011). Vilches-Blázquez et al. propose a co-reference resolution approach for interlinking geospatial linked data (Vilches-Blázquez, Saquicela, & Corcho, 2012). Orchid (Ngomo, 2013) is introduced as a link discovery method for geospatial data. It uses the Hausdorff and orthodromic metrics to compute point set distances between geospatial entities. Volz, Bizer, Gaedke, and Kobilarov (2009) present Silk, a link discovery framework for finding topological relations between objects within different data sources. Silk uses a link specification language to specify which types of RDF links should be discovered and which conditions entities must fulfill to be interlinked. Sherif et al. introduce RADON (Sherif, Dreßler, Smeros, & Ngomo, 2017) for the discovery of topological relationships between geospatial resources according to the DE-9IM standard. They propose an indexing algorithm based on a space tiling and filtering method.

The studies mentioned above focus on the generation of topological relationships between static geospatial resources. Their limitation, however, is that they do not consider the discovery of topological relationships between spatial data streams and static data. Spatial data streams carry meaningful information for recognizing complex situations. In particular, a combination of spatial data streams and static regions enables more sophisticated situation awareness.
In this paper, we perform SR generation between spatial data streams and static regions.

2.3. Spatial index

Several approaches have been proposed to support storage and query processing of spatiotemporal data. Xie et al. propose Spatial In-Memory Big data Analytics (Simba) (Xie et al., 2016) to offer spatial query processing and analytics for big spatial data. Simba is based on the Spark SQL engine (Armbrust et al., 2015) and introduces indexes over resilient distributed datasets (RDDs) to support spatial operations. Hoang-Vu et al. present an indexing strategy for spatiotemporal keyword queries (Hoang-Vu, Vo, & Freire, 2016). They propose a variant of KD-tree (Bentley, 1975) to handle text, space, and time in a single structure. Guo et al. propose a location-aware pub/sub system (Guo, Zhang, Li, Tan, & Bao, 2015) to monitor moving users subscribing to an event stream. They present an index, BEQ-tree, which is based on Quad-tree (Finkel & Bentley, 1974; Islam, Liu, Rahayu, & Anwar, 2016) to handle spatial Boolean expression matching. Wang et al. introduce a framework called Top-k Spatial Keyword Publish/Subscribe (SKYPE) (Wang, Zhang, Zhang, Lin, & Huang, 2016) to maintain top-k geo-textual messages for subscriptions over a sliding window model. They also use the Quad-tree structure for a subscription index. Qi et al. propose an R-tree parallelization method (Qi, Tao, Chang, & Zhang, 2018) using a space-filling curve and analyze its performance under a parallel communication model.

These studies propose methods to store spatial data and process spatial queries using an index structure such as KD-tree or Quad-tree. They, however, only support spatial data expressed as point coordinates and do not handle spatial objects of a polygon type. Our proposed index structure stores spatial objects of diverse types that arrive in a streaming fashion and supports efficient generation of SRs between them.

3. Preliminary definitions

In this section, we introduce the definitions used throughout this paper. A spatial data stream and three types of SRs (Fig. 2) are formally defined. In addition, frequently used notations are described in Table 2.

Definition 1. A spatial data stream is defined as an infinite sequence of data elements D = {e1, e2, e3, e4, …, ei}. A data element e is defined as e = {id, attr, t, o}, where e.id is an identifier, e.attr is a set of attributes, e.t is a time point (ei.t and ei+1.t are time points in timeline T, ei < ei+1), and e.o is a spatial object.

Data elements may arrive in order or out of order with respect to their time points. The proposed approach processes data elements sequentially in the order in which they arrive. That is, the arrival times of the data elements are assigned as their time points.

Definition 2. within(x, y): Given two spatial objects, o1 and o2, within(o1, o2) denotes that o1 lies in the interior of o2.

Definition 3. overlap(x, y): Given two spatial objects, o1 and o2, overlap(o1, o2) denotes that o1 and o2 have some but not all points in common.

Definition 4. nearby(x, y): Given two spatial objects, o1 and o2, nearby(o1, o2) denotes that the distance between the two objects is less than or equal to θ, where θ is a distance threshold.

Fig. 2. Three types of spatiotemporal relationships: (a) within, (b) overlap, and (c) nearby.
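Definitions 2 to 4 can be illustrated with a small Python sketch. It is not the paper's implementation: it assumes that every spatial object is approximated by an axis-aligned rectangle (a point being a degenerate rectangle), and it reads "some but not all points in common" as "the objects intersect and neither covers the other". The names `Rect`, `point`, and `_covers` are hypothetical helpers introduced only for this illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle used as a stand-in for a spatial object (assumption)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def point(x: float, y: float) -> Rect:
    """A point represented as a degenerate rectangle."""
    return Rect(x, y, x, y)


def within(a: Rect, b: Rect) -> bool:
    """Definition 2: a lies in the interior of b (strictly inside on every side)."""
    return b.xmin < a.xmin and a.xmax < b.xmax and b.ymin < a.ymin and a.ymax < b.ymax


def _covers(a: Rect, b: Rect) -> bool:
    """True if b lies inside a, boundary included."""
    return a.xmin <= b.xmin and b.xmax <= a.xmax and a.ymin <= b.ymin and b.ymax <= a.ymax


def overlap(a: Rect, b: Rect) -> bool:
    """Definition 3 (assumed reading): a and b intersect and neither covers the other."""
    intersects = (a.xmin <= b.xmax and b.xmin <= a.xmax
                  and a.ymin <= b.ymax and b.ymin <= a.ymax)
    return intersects and not _covers(a, b) and not _covers(b, a)


def nearby(a: Rect, b: Rect, theta: float) -> bool:
    """Definition 4: the minimum distance between a and b is at most theta."""
    dx = max(b.xmin - a.xmax, a.xmin - b.xmax, 0.0)
    dy = max(b.ymin - a.ymax, a.ymin - b.ymax, 0.0)
    return hypot(dx, dy) <= theta
```

With a road area `Rect(0, 0, 10, 4)`, a car at `point(2, 2)` is within the road, while a construction site `Rect(8, 2, 12, 6)` overlaps it without being within it.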

Table 2
Notations used in this paper.

Notation            Description
p                   A polygon, which is a set of points p = {q1, q2, …, qn}
qi                  ith point in a polygon p, qi ∈ p
mbr(p) (= r)        MBR of p
E(r)                A set of edges in r
ei                  ith edge in r, ei ∈ r
dist(q1, q2)        Distance from q1 to q2
maxdist(q, r)       $\max_{q_i \in r} \mathrm{dist}(q, q_i)$ for q and r
maxdist(r1, r2)     $\max_{q_i \in r_1,\, q_j \in r_2} \mathrm{dist}(q_i, q_j)$ for r1 and r2
mindist(q, r)       $\min_{q_i \in r} \mathrm{dist}(q, q_i)$ for q and r
mindist(r1, r2)     $\min_{q_i \in r_1,\, q_j \in r_2} \mathrm{dist}(q_i, q_j)$ for r1 and r2
mindist(q, E(r))    $\min_{e_i \in r} \mathrm{dist}(q, e_i)$ for q and edges of r
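The distance functions of Table 2 can be sketched for the simplest case, a point q and axis-aligned MBRs. The sketch below assumes Euclidean distance and treats an MBR as a filled region given by the tuple `(xmin, ymin, xmax, ymax)`; the function names are hypothetical and only mirror the notation of the table.

```python
from math import hypot

# Assumed representations: a point is (x, y); an MBR is (xmin, ymin, xmax, ymax).


def dist(q1, q2):
    """Euclidean distance between two points."""
    return hypot(q1[0] - q2[0], q1[1] - q2[1])


def corners(r):
    """The four corner points of an MBR."""
    return [(r[0], r[1]), (r[0], r[3]), (r[2], r[1]), (r[2], r[3])]


def mindist_point_rect(q, r):
    """mindist(q, r): zero if q is inside r, else the distance to the nearest side or corner."""
    dx = max(r[0] - q[0], 0.0, q[0] - r[2])
    dy = max(r[1] - q[1], 0.0, q[1] - r[3])
    return hypot(dx, dy)


def maxdist_point_rect(q, r):
    """maxdist(q, r): the farthest point of a rectangle from q is always a corner."""
    return max(dist(q, c) for c in corners(r))


def mindist_point_edges(q, r):
    """mindist(q, E(r)): distance from q to the nearest edge, also when q is inside r."""
    inside = r[0] <= q[0] <= r[2] and r[1] <= q[1] <= r[3]
    if inside:
        return min(q[0] - r[0], r[2] - q[0], q[1] - r[1], r[3] - q[1])
    return mindist_point_rect(q, r)


def mindist_rect_rect(r1, r2):
    """mindist(r1, r2): gap between the rectangles, zero if they intersect."""
    dx = max(r2[0] - r1[2], r1[0] - r2[2], 0.0)
    dy = max(r2[1] - r1[3], r1[1] - r2[3], 0.0)
    return hypot(dx, dy)


def maxdist_rect_rect(r1, r2):
    """maxdist(r1, r2): attained between a corner of r1 and a corner of r2."""
    return max(dist(a, b) for a in corners(r1) for b in corners(r2))
```

For general polygons the same quantities would be computed over vertices and edges rather than over rectangle corners; the rectangle case is shown because MBRs are what the index compares first.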

In our approach, generating SRs between two objects requires a number of distance calculations. For these calculations, we employ various distance functions, such as maxdist and mindist (Table 2), according to object type.

4. Spatiotemporal index structure

In this section, we describe the R3 index structure in detail. To store spatial objects and support efficient SR generation, we design the R3 index, which is an R-tree-based index structure with RRs. SR generation is similar to spatial joins (Zhang, You, & Gruenwald, 2017a) and distance queries (Katayama & Satoh, 1997; Yi, Paulet, Bertino, & Varadharajan, 2014; Zheng et al., 2016). R-tree-based spatial indexes can speed up spatial joins against a large number of spatial objects and boost spatial pruning using MBRs (Emrich, Kriegel, Kröger, Renz, & Züfle, 2010; Manolopoulos, Nanopoulos, Papadopoulos, & Theodoridis, 2010). For these reasons, we employ an R-tree structure for the spatiotemporal index. We also employ the concept of pivots used in similarity search methods (Arora, Sinha, Kumar, & Bhattacharya, 2018; Chen, Gao, Li, Jensen, & Chen, 2015; Chen et al., 2017; Mao, Zhang, Li, Liu, & Lu, 2016; Traina Jr, Traina, Vieira, Faloutsos et al., 2007). These methods pick pivot points in a metric space and pre-compute distances from the pivots to all points in a dataset. The RRs play a role similar to that of the pivots. By using RRs, we significantly reduce the number of distance computations. The R-tree is a multidimensional access method for spatial objects. For ease of presentation, however, we consider 2-dimensional spatial objects in the rest of this paper.

4.1. Index structure

Fig. 3 shows the overall structure of the R3 index. The R3 index comprises three parts: an R-tree, a Stream List, and a Representative Rectangle Table (RR table). The Stream List stores pairs of the form <e, p>, where e is an input data element and p is a pointer to a node in the R-tree.
The node pointed to by p contains e in 2-dimensional space. First, a newly arrived data element is added to the head of the Stream List. Then, an appropriate node containing the new data element is found in the R-tree. Next, the node is mapped to the pointer of the new data element. The found node keeps only the location of the new data element in the Stream List. Since new data elements are stored at the head of the Stream List, its data elements are sorted in descending order of arrival time. Outdated data elements are periodically deleted to make room for newly arriving data elements. Upon storage of a new data element in the Stream List, SRs between the new data element and the objects stored in the index are calculated. For the efficient generation of SRs, RRs are utilized. As representatives of R-tree nodes, RRs help to generate SRs quickly. The RRs and the RR table are created during the R-tree initialization process. The RR table contains information about overlaps and distances between the RRs.

We build an initial R-tree index considering the distribution of spatial objects. First, we generate MBRs from sample data of real-world datasets. The generated MBRs are the leaf nodes of the initial R-tree. Next, the R-tree is built from bottom to top by packing the nodes. RRs are selected from inner nodes of the R-tree, and an RR table is then generated from the RRs. Each of these fundamental steps is explained in detail next.

Fig. 3. Overall structure of the proposed index.

4.2. Index initialization

We generate an initial index through several steps. The primary reason for creating an initial index is to create the RRs and the RR table. In stream environments, since new data elements continuously arrive and old data elements are periodically deleted, distances between all data in the index and the representatives cannot be computed in advance. Therefore, representatives for data elements that will arrive in the future should be pre-generated considering their distribution. To create an initial index that reflects the distribution of data, we obtain a sample dataset from real-world datasets and generate the leaf nodes of the initial index from the sample dataset. Since real-world datasets are not evenly distributed and are often skewed, an R-tree created in a top-down fashion cannot cope adequately with the distribution of data. To alleviate this problem, we build the R-tree sequentially from the leaf nodes to the root node, considering the distribution of the real-world datasets.

We utilize the Sort-Tile-Recursive (STR) method (Leutenegger, Lopez, & Edgington, 1997) to generate the leaf nodes of an initial R-tree from sample datasets and to create upper-level nodes by packing lower-level nodes. STR is a rectangle packing algorithm that is simple to implement and that determines partition boundaries. The boundaries generated by the STR algorithm are MBRs of the sample datasets. We divide the sample dataset into small sets with a uniform number of objects and create a leaf node for each set.

We build the initial R-tree from bottom to top. To generate the entire R-tree, we first generate an MBR set from the sample dataset. Assume that r is the number of objects in the sample dataset and m is the number of objects in an MBR. To generate an MBR set, we sort the sample objects according to the x-coordinate of the center of each object. Next, $S = \sqrt{r/m}$ vertical slices are generated. The number of generated MBRs is $n = \lceil r/m \rceil$; hence, $S = \sqrt{n}$. The objects in each slice are sorted according to the y-coordinate of the center of each object, and every m objects in a slice are packed into an MBR (except that the last MBR may hold q ≤ m objects). The generated MBRs become the leaf nodes of the initial R-tree. Next, to generate the entire R-tree, we repeat the node packing until only one node remains. We pack every m nodes and generate the MBR of each group of m nodes. The generated MBRs become the parent nodes of the m nodes. Through the repetition of the node packing, the last remaining node becomes the root node of the R-tree.
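The bottom-up packing described above can be sketched as follows. This is a minimal illustration of STR packing, not the authors' code: rectangles are assumed to be `(xmin, ymin, xmax, ymax)` tuples, the number of slices is rounded up to an integer, each slice is assumed to hold S·m objects as in the standard STR formulation, and the function names are hypothetical.

```python
from math import ceil, sqrt


def mbr(rects):
    """Minimum bounding rectangle of a non-empty group of rectangles."""
    xmins, ymins, xmaxs, ymaxs = zip(*rects)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


def center(r):
    """Center point of a rectangle."""
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)


def str_pack(rects, m):
    """Group rectangles into nodes of at most m entries using Sort-Tile-Recursive."""
    if m < 2 or not rects:
        raise ValueError("need a non-empty input and a node capacity of at least 2")
    n = ceil(len(rects) / m)       # number of nodes to produce
    s = ceil(sqrt(n))              # number of vertical slices (rounded up, assumption)
    slice_size = s * m             # objects per vertical slice
    by_x = sorted(rects, key=lambda r: center(r)[0])
    groups = []
    for i in range(0, len(by_x), slice_size):
        # Within a slice, sort by the y-coordinate of the center and cut runs of m.
        by_y = sorted(by_x[i:i + slice_size], key=lambda r: center(r)[1])
        for j in range(0, len(by_y), m):
            groups.append(by_y[j:j + m])
    return groups


def build_levels(objects, m):
    """Pack repeatedly until one node remains; returns node MBRs per level, leaves first."""
    levels = []
    current = objects
    while True:
        nodes = [mbr(group) for group in str_pack(current, m)]
        levels.append(nodes)
        if len(nodes) == 1:
            return levels
        current = nodes
```

For 16 points on a 4 × 4 grid and m = 4, `build_levels` produces four leaf MBRs and one root. In a deeper tree, `levels[1]` would hold the parents of the leaf nodes, which the text below designates as RRs.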
In the next step, we select RRs and generate an RR table. The purpose of using RRs is to reduce the computations needed to find the leaf nodes containing spatial objects that are potentially related to the input. In an R-tree, calculations are performed at every level to find the proper child nodes, and multiple paths from the root to leaf nodes may have to be followed. An obvious way to reduce the number of calculations is to set up the nodes closest to the leaf nodes as representatives. Therefore, we designate the parent nodes of leaf nodes as RRs (red box, Fig. 3). Then, we calculate overlap relationships and distances between the RRs and store them in an RR table. The RRs stored in the RR table are arranged according to the x-coordinate of their lower left point. Each RR has a list that stores its distances to the other RRs, sorted in ascending order. When a new data element arrives, our approach generates SRs using the information in the RR table. The pre-computed overlap relationships and distances between the RRs are used to reduce the search space and the number of costly distance calculations. Algorithm 1 summarizes the proposed R-tree initialization approach. The algorithm sorts the sample dataset and packs the objects into nodes (lines 4–10). Then, it repeatedly packs the nodes until one remains (lines 11–15). Finally, the initial R-tree is completed by creating the RR table (lines 16 and 17).
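One possible in-memory layout for the RR table is sketched below. It is an assumption for illustration only: RRs are given as a dictionary from a name to an `(xmin, ymin, xmax, ymax)` tuple, the distance between two RRs is taken to be the minimum distance between the rectangles, and `build_rr_table` and `rrs_within` are hypothetical names.

```python
from math import hypot


def rect_mindist(a, b):
    """Minimum distance between two rectangles (zero if they intersect)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0.0)
    dy = max(b[1] - a[3], a[1] - b[3], 0.0)
    return hypot(dx, dy)


def rects_overlap(a, b):
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def build_rr_table(rrs):
    """Pre-compute, for every RR, its distance and overlap flag to every other RR.

    Returns (order, table): `order` lists RR names by the x-coordinate of the
    lower left point; `table[name]` is a list of (distance, other, overlaps)
    tuples sorted by ascending distance.
    """
    order = sorted(rrs, key=lambda name: rrs[name][0])
    table = {}
    for a in order:
        row = [(rect_mindist(rrs[a], rrs[b]), b, rects_overlap(rrs[a], rrs[b]))
               for b in order if b != a]
        row.sort()
        table[a] = row
    return order, table


def rrs_within(table, name, theta):
    """Other RRs whose pre-computed distance from `name` is at most theta."""
    result = []
    for d, other, _ in table[name]:
        if d > theta:
            break  # rows are sorted, so every later entry is farther away
        result.append(other)
    return result
```

Because an input p lies inside its own RR g, the distance from g to another RR never exceeds the distance from p to that RR. Filtering by the pre-computed distances from g therefore keeps every RR that could hold an object within θ of p, at the cost of a few false candidates that later steps remove.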

Algorithm 1. Initial R-tree generation

Input : Q : a sample dataset; m : the capacity of a tree node
Output: T : an initial R-tree
 1  VS ← ∅                // a set of vertical slices
 2  N ← ∅                 // a set of tree nodes
 3  Np ← ∅                // a set of parent nodes
 4  r ← Q.size()
 5  Sort Q in ascending order of x-coordinate
 6  for each r/m objects ∈ Q do
 7      Group r/m objects and add to VS
 8  for s ∈ VS do
 9      Sort s in ascending order of y-coordinate
10      for each m objects ∈ s do
11          Pack m objects in a node and add to N
12  while N.size() > 1 do
13      for each m ∈ N do
14          Pack m nodes in a node and add to Np
15      N ← Np and Np ← ∅
16  T.root point to the last node in N
17  Select RRs and generate RR Table
18  Add RR Table to T
19  return T

Fig. 4. An example of polygon insertion in the proposed index.

4.3. Insertion

How a spatial object is inserted into the R3 index depends on the type of the object. A polygon can be stored in more than one leaf node. Fig. 4 shows an example of a polygon insertion. There are four sibling leaf nodes (a, b, c, and d), two polygons (e, f), and a number of points (p). In Fig. 4(a), the leaf node a includes the leaf nodes b and c, because the polygon e is stored in a and the MBR of a is enlarged accordingly. For the efficiency of an R-tree, it is important to minimize overlap and coverage between the nodes. Minimizing the area covered by nodes minimizes the dead space. Minimizing overlap is more important than minimizing coverage, because reducing the overlap between nodes decreases the number of paths to be traversed. As shown in Fig. 4(b), if the polygon e is divided among the leaf nodes a, b, and c, the overlap and coverage between the nodes are reduced. Allowing polygons to be split in this way enables efficient SR generation.

4.4. Analysis

We first analyze the initialization time of the R3 index, which consists of the R-tree generation time and the RR table generation time. R-tree initialization relies on sorting the sample objects. Assume that n is the number of sample objects and m is the maximum number of entries per tree node. The sample objects are sorted by x-coordinate in $\mathcal{O}(\log n)$ time and then partitioned into $\sqrt{n/m}$ subsets. Since each subset is sorted by y-coordinate in $\mathcal{O}(\log \sqrt{n/m})$ time, all subsets are sorted in $\mathcal{O}(\sqrt{n/m} \log \sqrt{n/m})$ time. Thus, the total R-tree generation requires $\mathcal{O}(\log n + \sqrt{n/m} \log \sqrt{n/m})$ time. The RR table generation time relies on computing distances between RRs. Assume that computing a distance between two RRs takes $\mathcal{O}(t)$ time. Then, computing distances between r RRs requires $\mathcal{O}(t \cdot r(r+1)/2)$ time. The total initialization time of the proposed index is, therefore, $\mathcal{O}(\log n + \sqrt{n/m} \log \sqrt{n/m} + t \cdot r(r+1)/2)$.
We next analyze the space requirement of the proposed index. If each data element requires $\mathcal{O}(v)$ storage, n data elements require $\mathcal{O}(n \cdot v)$ storage. The height of an R-tree storing n data elements is $h = \log_m n$, where m is the capacity of a tree node. The total number of nodes in an R-tree with height h and capacity m is $1 + m^1 + m^2 + \cdots + m^{h-1} = (m^h - 1)/(m - 1)$. If a tree node with an MBR and m pointers to its entries requires $\mathcal{O}(r)$ space, the total R-tree structure requires $\mathcal{O}(r(m^h - 1)/(m - 1))$ space. The RR table, which is part of the proposed index, stores distances between k RRs and requires $\mathcal{O}(k^2)$ space. Hence, the total space of the proposed index is $\mathcal{O}(n \cdot v + r(m^h - 1)/(m - 1) + k^2)$.

5. Spatiotemporal relationship generation

In this section, we introduce an algorithm for spatiotemporal relationship generation between spatial data streams and static data. SR generation is similar to spatial joins and distance computations, which are among the most computationally intensive operations. We devise an efficient and effective algorithm that reduces calculations by using the RR table. The algorithm comprises two steps: pruning and refinement. Fig. 5 illustrates the workflow of SR generation. The pruning step efficiently eliminates spatial objects that are not related to the input object by checking the distance and overlap between MBRs. The comparison using MBRs makes it easy to identify pairs of spatial objects that can potentially be related to each other. In the refinement step, SRs between an input object and the candidate objects produced by the pruning step are calculated to generate the final result.

Fig. 5. Steps of spatiotemporal relationships generation.

5.1. Pruning

We use the RR table that is pre-generated in the index initialization process to prune objects unrelated to an input. Fig. 6 shows an example of pruning using distances between RRs. There are ten tree nodes (a–j) and an input p. The eight nodes (c–j) other than a and b are RRs. Finding objects that have SRs with the input p means finding objects that have relationships with a circle of radius θ (the distance threshold) around p. In this example, to find nodes that overlap the circle, a typical R-tree must find nodes that overlap the circle at all levels, from the root until it reaches a and b. After reaching a and b, distance computations must be performed to determine which of the eight nodes, the children of a and b, are the appropriate nodes (d ≤ θ). When the RR table is used, f and h, which overlap the circle, can be found as soon as g, which contains p, is found in the RR table. Because the distances between the RRs are pre-computed and stored in the RR table, f and h, which are located within θ of g, can be easily identified. Through the use of the RR table, nodes containing objects that can have relationships with the input are found quickly, which significantly reduces the number of computations.

Fig. 6. An example of pruning using distances between RRs.

We use distances between the input and MBRs to speed up the pruning. These distances are used by the SR generation algorithm to filter child nodes or objects in tree nodes efficiently.

Theorem 1. Given an input p, an MBR r containing a set of objects O, and a distance threshold θ, if the minimum distance mindist(p, r) between p and r is greater than or equal to θ, the minimum distance mindist(p, o) between p and every o ∈ O is greater than or equal to θ, i.e. mindist(p, r) ≥ θ ⇒ ∀o ∈ O, mindist(p, o) ≥ θ.

Proof.
Because the mindist(p, r) is equal to the distance from p to any point on the perimeter of r, all x on the perimeter of r satisfy mindist(p, r) ≤ dist(p, x). Assume that the distance from p to a point y on the perimeter of o ∈ O is equal to the minimum distance from p to o, i.e. mindist(p, o) = dist(p, y). Because o is in r, there exists a point x′ where the line from p to y crosses r. The distance of p from x′ on the perimeter of r is greater than or equal to the minimum distance from p to r, i.e. mindist(p, r) ≤ dist(p, x′). Consequentially, because mindist(p, r) ≤ dist(p, x′) + dist(x′, y) = dist(p, y) = mindist(p, y) is satisfied, mindist(p, r) ≤ mindist(p, o) is true. Thus, if the mindist(p, r) is greater than or equal θ, the mindist(p, o) is also greater than or equal θ. □ Theorem 1 is used to determine whether to prune all objects in an MBR. If the minimum distance between an input and an MBR is greater than the distance threshold, all objects in the MBR are pruned. Theorem 2. Given an input p, an MBR r containing a set of objects O, and distance threshold θ, if the maximum distance maxdist(p, r) between p and r is less than or equal θ, the maximum distance maxdist(p, o) between p and o ∈ O is less than or equal θ, maxdist(p, r) ≤ θ ⇒ o ∈ O, maxdist(p, o) ≤ θ. Proof. Because the maxdist(p, r) is equal to the distance from p to any point on the perimeter of r, all x on the perimeter of r satisfy maxdist(p, r) ≥ dist(p, x). Assume that the distance from p to a point y on the perimeter of o is equal to the maximum distance from p to o, i.e. maxdist(p, o) = dist(p, y). Because o is in r, there exists a point x′ where a line extending the line between p and y to the outside of r crosses r. The distance of p from x′ on the perimeter of r is less than or equal to the maximum distance from p to r, i.e. maxdist(p, r) ≥ dist(p, x′). 
Consequentially, because maxdist(p, r) ≥ dist(p, x′) = dist(p, y) + dist(y, x′) ≥ maxdist(p, o) is satisfied, maxdist(p, r) ≥ maxdist(p, o) is true. Thus, if the maxdist(p, r) is less than or equal θ, the maxdist(p, o) is also less than or equal θ. □ 9
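The two bounds in Theorems 1 and 2 can be illustrated with a minimal Python sketch. It assumes axis-aligned MBRs and planar Euclidean distance, which is a simplification (the experiments use kilometre thresholds over geographic coordinates). The names `Rect` and `classify` are hypothetical and not part of the paper. The prune test uses a strict inequality so that it agrees with the `mindist(q, ri) ≤ θ` test in Algorithm 2.

```python
import math
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned MBR given by its lower-left and upper-right corners."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def mindist(p, r):
    """Smallest Euclidean distance from point p to rectangle r (0 if p lies in r)."""
    dx = max(r.xmin - p[0], 0.0, p[0] - r.xmax)
    dy = max(r.ymin - p[1], 0.0, p[1] - r.ymax)
    return math.hypot(dx, dy)


def maxdist(p, r):
    """Largest Euclidean distance from p to r; it is always attained at a corner."""
    dx = max(abs(p[0] - r.xmin), abs(p[0] - r.xmax))
    dy = max(abs(p[1] - r.ymin), abs(p[1] - r.ymax))
    return math.hypot(dx, dy)


def classify(p, r, theta):
    """Decide how an MBR is treated for input p and distance threshold theta."""
    if mindist(p, r) > theta:
        return "prune"        # Theorem 1: no object in r can be within theta
    if maxdist(p, r) <= theta:
        return "accept_all"   # Theorem 2: every object in r is within theta
    return "descend"          # undecided: inspect the children or objects of r
```

Only the "descend" case requires further distance computations, which is where the two theorems save work.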

Algorithm 2. Spatiotemporal relationships generation

Input: q: an input point; RRT: a representative rectangle table; θ: a distance threshold
Output: STR: a set of spatiotemporal relationships

 1  STR = ∅, RR = ∅
 2  Cw = ∅, Cn = ∅
 3  mbrw = ∅, mbrn = ∅
 4  for ri ∈ RRT do
 5      if q within ri then
 6          add ri and ri.overlap to mbrw
 7          RR = ri, break
 8  if mindist(q, E(RR)) ≤ θ then
 9      for ri ∈ RR.distance do
10          if ri.distance ≤ θ then
11              add ri to mbrn
12  while mbrw is not empty do
13      m = mbrw.pop()
14      if m is leaf node then
15          add all objects in m to Cw
16      else
17          for ri ∈ m.childNode do
18              if q within ri then
19                  add ri to mbrw
20              else if maxdist(q, ri) ≤ θ then
21                  add all child nodes of ri to mbrn
22              else if mindist(q, ri) ≤ θ then
23                  add ri to mbrn
24  while mbrn is not empty do
25      m = mbrn.pop()
26      if m is leaf node then
27          add all objects in m to Cn
28      else
29          for ri ∈ m.childNode do
30              if maxdist(q, ri) ≤ θ then
31                  add all child nodes of ri to mbrn
32              else if mindist(q, ri) ≤ θ then
33                  add ri to mbrn
34  refineWithin(q, Cw, STR)
35  refineNearby(q, Cn, STR)
36  return STR
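The first stage of Algorithm 2, in which the RR table seeds the two MBR sets before any tree traversal, might look roughly like the following Python sketch. The table layout (`RREntry`, with `overlap` and `distance` fields keyed by RR name) and the toy rectangles in `example_table` are assumptions made for illustration. The paper does not specify the table's concrete representation.

```python
from dataclasses import dataclass, field


@dataclass
class RREntry:
    """One row of a hypothetical RR table."""
    name: str
    box: tuple                                     # (xmin, ymin, xmax, ymax)
    overlap: list = field(default_factory=list)    # names of overlapping RRs
    distance: dict = field(default_factory=dict)   # name -> precomputed RR-to-RR distance


def contains(box, q):
    xmin, ymin, xmax, ymax = box
    return xmin <= q[0] <= xmax and ymin <= q[1] <= ymax


def dist_to_edges(box, q):
    """Distance from a point inside box to the nearest edge of box."""
    xmin, ymin, xmax, ymax = box
    return min(q[0] - xmin, xmax - q[0], q[1] - ymin, ymax - q[1])


def seed_mbr_sets(q, rr_table, theta):
    """Return (mbr_w, mbr_n): RR names to search for within / nearby candidates."""
    mbr_w, mbr_n = [], []
    for rr in rr_table.values():
        if contains(rr.box, q):
            mbr_w = [rr.name] + rr.overlap
            # The theta-circle around q can reach other RRs only if it crosses an edge.
            if dist_to_edges(rr.box, q) <= theta:
                mbr_n = [n for n, d in rr.distance.items() if d <= theta]
            break
    return mbr_w, mbr_n


# Toy table: only the distances stored for g are needed in this illustration.
example_table = {
    "g": RREntry("g", (0, 0, 10, 10), distance={"f": 1.0, "h": 2.0, "c": 28.3}),
    "f": RREntry("f", (11, 0, 20, 10)),
    "h": RREntry("h", (0, 12, 10, 20)),
    "c": RREntry("c", (30, 30, 40, 40)),
}
```

Because the RR-to-RR distance is a lower bound on the distance from any point in one RR to the other RR, filtering on the stored distances never discards an RR that could hold a nearby object.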


Fig. 7. An example of distance calculation between moving objects.

Theorem 2 is used to determine whether to include all objects in an MBR in a candidate set. If the maximum distance between an input and an MBR is less than the distance threshold, all objects in the MBR are included in the candidate set.

5.2. Algorithm

Algorithm 2 shows how SRs are generated using the RR table and distance computations when an input (a point-type object) arrives. It takes a point object, an RR table, and a distance threshold (θ) as input and returns a set of SRs. The algorithm performs the pruning step to remove objects that are not related to the input (lines 4–33). The pruning step generates the sets of candidate objects (Cw, Cn) that can potentially be related to the input object.

To generate the candidate sets, we first search the RR table for an RR that contains the input. The RR and the rectangles that overlap it are added to an MBR set (mbrw) for within relationships (lines 4–7). Since the overlapping rectangles share a common area with the selected RR, they can potentially contain the input. Then, we calculate the minimum distance from the input to the edges of the RR. If this distance is greater than θ, all other RRs are excluded from the candidate sets. Otherwise, the rectangles in RR.distance with a distance less than or equal to θ are added to an MBR set (mbrn) for nearby relationships (lines 9–11). Since distances and overlaps between RRs are precalculated and stored in the RR table, RRs related to the input can be found easily.

After generating the MBR sets, we create a set of candidate objects (Cw) that can potentially have within relationships with the input (lines 12–23). If an MBR in mbrw is a leaf node of the R-tree, all the objects in the MBR are added to Cw for within relationships. Otherwise, we check the child nodes of the MBR. If a child node includes the input, it is added to mbrw. Otherwise, the child node is added to mbrn according to the distance between the input and the child node. We repeat this process until there are no more MBRs in mbrw. We also generate a set of candidate objects (Cn) that can potentially have nearby relationships, in a manner similar to the generation of Cw (lines 24–33). Finally, we generate SRs between the input and the candidate sets (Cw, Cn). We also annotate the generated SRs with timestamps to output them as a data stream.

If an input is a polygon-type object, SRs between the input and spatial objects are generated in a similar way to Algorithm 2, but the overlap relationship must also be considered. We check overlaps with the input while searching for RRs in the RR table and while pruning using the MBRs, and we generate a set of candidates for the overlap relationship. After the refinement step, we finally obtain results that include the overlap relationship.

We also use a distance approximation to generate SRs in which moving objects participate. Fig. 7 shows an example of distance calculation between moving objects. Assume that t1 is the current time point and that there are five points (p1–p5) that have not expired. At t1, we can obtain the points that are near p1. In this example, three points (p2, p3, p5) are near p1 according to a simple calculation of nearby relationships from the locations stored in the index. However, at t1, the actual location of p3 is different from the location stored in the index. To generate more precise relationships, we use the velocity of an object to set an expected position range of the object. In the case of p3, the expected position range is a circle around p3 with radius p3.velocity·(t1 − t3). We set the distance between p1 and p3 to the maximum distance between p1 and the circle of p3. Consequently, by applying this distance approximation, p1 has a nearby relationship with only p2.

5.3. Analysis

We analyze the time of the SR generation. Assume that the cost of a topological or distance computation is O(v).
If only one RR related to the input is found, all RRs related to the input can be found due to pre-computation. Thus, finding the RRs related to an input in an RR table storing r RRs requires O(r·v) time. After finding the RRs, we find the MBRs related to the input in the subtrees of the RRs. The height of an R-tree storing n data elements is h = log_m n, where m is the capacity of a tree node. Let the levels of the RRs and the leaf nodes be t and 1, respectively. The number of nodes related to the input is O(n/m) + O(n/m^2) + ⋯ + O(n/m^(h−t−1)) = O(n/m). If each node has m entries, the refinement step requires O(v·m·n/m) time. Thus, the total time of the SR generation for the input is O(v·n/m) + O(v·r) = O(v·(r + n/m)).
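The distance approximation for moving objects described at the end of Section 5.2 can be sketched as follows. This is a minimal illustration assuming planar Euclidean distance and a constant speed since the last recorded position. The function and parameter names are hypothetical.

```python
import math


def approx_distance(p, other, velocity, now, last_update):
    """Upper bound on the distance from p to a moving object.

    The object was last recorded at `other` at time `last_update`. By `now` it may
    be anywhere in a circle of radius velocity * (now - last_update) around `other`.
    The farthest point of that circle from p lies at dist(p, other) + radius.
    """
    radius = velocity * (now - last_update)
    return math.hypot(p[0] - other[0], p[1] - other[1]) + radius


def is_nearby(p, other, velocity, now, last_update, theta):
    """Nearby only if even the farthest possible position is within theta."""
    return approx_distance(p, other, velocity, now, last_update) <= theta
```

Using the maximum rather than the stored distance is conservative: an object whose stored location is within θ, such as p3 in Fig. 7, is no longer reported as nearby once its expected position range extends beyond θ.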


Table 3 The datasets for experiments.

Name    Type    Sources                 Size (unit)
Target  Static  NYC polygon, NYC taxi   1000 (k)
Source  Static  NYC taxi                50, 100, 150, 200, 250 (k)
Stream  Stream  NYC taxi                50, 100, 150, 200, 250 (objects/s)

6. Experiments

In this section, we experimentally study the performance of the proposed approach. We performed various experiments using real-world datasets. In the following, we describe the experimental setup and discuss the results of the experiments.

6.1. Experimental settings

Dataset description: We used two real-world datasets for the experiments: New York City taxi data¹ and New York City Atomic Polygons data². The NYC taxi data was obtained through a Freedom of Information Law (FOIL) request from the NYC Taxi and Limousine Commission (NYC T&L). This data covers taxi operations in NYC for two years (2015–2016). Each row in the data represents a single taxi trip and consists of the car ID, the date and time of pickup and drop-off, the GPS coordinates of pickup and drop-off, and so on. We generated spatial data streams and static datasets by randomly extracting records from the NYC taxi data. The NYC atomic polygons data is published by the Department of City Planning of NYC. The polygons data comprises small polygons that together represent the whole of New York. The atomic polygons in the dataset serve as basic building blocks for generating the polygons of many district areas, such as election districts, school districts, and community districts. The NYC atomic polygons data is used as static data.

The datasets for our experiments were obtained from these two real-world datasets (NYC taxi data and NYC polygon data). We used three types of datasets: target, source, and stream. The target dataset is the dataset that is stored in the index in advance. The source datasets are used to generate relationships between their objects and those of the target dataset. The stream dataset is used to generate a stream, which is input to the index in real time. The stream is also used to generate relationships between its objects and those of the target dataset. Table 3 shows the details of the above datasets. We used the minimum bounding box ([(−74.3, 40.4), (−73.8, 41.0)]) of New York as the target space.

Platform: All experiments were performed on a machine equipped with an Intel(R) Core(TM) i5-3570 CPU, 16 GB of RAM, and a 265 GB SSD. The machine ran Windows 10 Enterprise 64-bit. We used Java 8 and the JTS Topology Suite³ to support SR generation.

6.2. Baselines for comparison

To verify the efficiency of the R3 index and algorithm (R3), we compared our approach with the following baselines.

R-tree (RT): A typical R-tree (Guttman, 1984) without any auxiliary structure. The R-tree is generated from the root to the leaf nodes in a top-down manner.

Grid index (GI): A grid index based on the space tiling used in RADON (Sherif et al., 2017) to discover topological relationships between spatial objects. We divided the MBR of New York into 12,384 equal-sized cells (0.00625 granularity) for the grid index.

Quad tree (QT): A Quadtree used in Khan, Kulik, Tanin, Hua, and Hashem (2018) to index POIs. If a cell contains more than k objects, the cell is divided into four equal-sized, non-overlapping square cells.

6.3. Experimental results

We first measured the data insertion time according to the size of the datasets. Fig. 8 shows the data insertion time of the R3 index compared to all baselines, depending on the number of objects. We measured the time it took to read the data from the data files and enter all the data into the index. We set the node capacity of the tree-based indexes to k = 100. In all datasets, the R3 index showed the best performance, and the indexes in the R-tree family performed better than the non-R-tree indexes. The R3 index outperforms the typical R-tree index because the R3 index and its auxiliary structure are generated through the index initialization method (Section 4). The R3 index creates the tree in a bottom-up fashion, considering the distribution of the data. The data insertion time of the proposed approach consists almost entirely of the data sorting time, and there is no overhead of tree node splitting. These advantages are apparent when compared to the typical R-tree: the performance of the R3 index is superior to that of the typical R-tree in all datasets.
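To illustrate why bottom-up construction is dominated by sorting and involves no node splits, the following Python sketch packs sorted points into leaves and then builds each upper level from the one below. It is a generic sort-and-pack illustration under assumed simplifications (point data, a plain x-then-y sort order, a node capacity of at least 2). It is not the index initialization method of Section 4, and the function names are hypothetical.

```python
def pack_level(boxes, capacity):
    """Group consecutive boxes into parent MBRs holding at most `capacity` children."""
    parents = []
    for i in range(0, len(boxes), capacity):
        group = boxes[i:i + capacity]
        parents.append((
            min(b[0] for b in group), min(b[1] for b in group),
            max(b[2] for b in group), max(b[3] for b in group),
        ))
    return parents


def bulk_load(points, capacity):
    """Return the MBRs of every level of a packed tree, leaf level first, root last."""
    # One sort up front; a simple x-then-y order stands in for a distribution-aware one.
    ordered = sorted(points)
    level = pack_level([(x, y, x, y) for x, y in ordered], capacity)
    levels = [level]
    while len(level) > 1:
        # Each upper level is built from the level below, so no node is ever split.
        level = pack_level(level, capacity)
        levels.append(level)
    return levels
```

After the sort, every level is produced in a single linear pass over the level below, whereas top-down insertion must repeatedly choose subtrees and split overflowing nodes.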

¹ https://data.cityofnewyork.us/Transportation/2016-Green-Taxi-Trip-Data/hvrh-b6nb
² https://geo.nyu.edu/catalog/nyu_2451_34563
³ https://locationtech.github.io/jts/


Fig. 8. Data insertion time according to the number of objects.

Fig. 9. Relationship generation time between two static datasets.

The Quadtree index showed the worst data insertion performance in all datasets. Its performance was drastically degraded by the concentration of data in particular areas, which affects node splits and the depth of the tree. The grid index performed worse than the indexes in the R-tree family and better than the Quadtree index.

We prepared five static datasets to measure the execution time of relationship generation between static datasets. We generated relationships between the source and target datasets. Four datasets correspond to the source datasets, and the dataset with one million objects is used as the target dataset. In other words, after inserting the target data into the indexes under the same settings as in the previous experiment, we generated relationships between the source datasets and the target dataset. We also measured the relationship generation time while changing the distance threshold to 1, 2, and 3 km. Fig. 9 shows the relationship generation time according to the size of the datasets for the four indexes. The experimental results show that the R3 index performs best. As the size of the dataset and the distance threshold increase, the advantage of the R3 index becomes more prominent. When the number of objects in the source dataset was 250k and the distance threshold was 3 km, the R3 index was 2 times faster than the typical R-tree, 2.3 times faster than the grid-based index, and 3.1 times faster than the Quadtree index. When the distance threshold was 1 km, the performance of the R3 index was only slightly ahead of the typical R-tree and the Quadtree. In this experiment, the use of the RRs and the RR table was found to be less effective for a narrow target region. When generating spatial relationships between objects in a small area, we found that the number of computations for leaf node selection using the RRs and the number for leaf node selection through top-down search are similar. The benefits of using the pre-calculated RR information appear as the target region for the relationship generation grows to 2 km. The expansion of the target region means an increase in the number of leaf nodes containing objects that are potentially related to the input. In other words, the number of paths from the root to the leaf nodes that must be found increases. Therefore, the typical R-tree and the Quadtree performed many computations to find the leaf nodes, and their performance dropped sharply.


Fig. 10. Execution time to generate spatiotemporal relationships for an object according to input rate.

However, the R3 index effectively reduced the number of paths needed to find the leaf nodes by using the RR information. Also, since the discovered paths start not at the root but at intermediate nodes of the tree, the search proceeds along shorter paths. As a result, the R3 index generated relationships between objects efficiently and adapted well to the expansion of the target region. The grid index performed worse than the indexes in the R-tree family. The simple structure of the grid index does not effectively reduce the search space. After finding the grid cells related to an input, the grid index compares all the objects in those cells with the input without further filtering. Owing to the nature of the real-world datasets, the density of data in particular grid cells is high, so more computations are performed there, which leads to the poor performance of the grid index. The Quadtree has a problem similar to that of the grid index. In particular, when the number of source objects was 200k or 250k and the distance threshold was 3 km, it showed the worst performance among all indexes. As the target region expanded, the Quadtree searched regions with high data density, which resulted in significant performance degradation. In the high-density areas, the height of the Quadtree reaches a maximum of 54, which produces very long search paths; thus it takes longer to reach the leaf nodes.

We measured the execution time to generate SRs for an input while changing the number of objects entered per second. Fig. 10 shows the experimental results when the number of input objects per second is 50, 100, 150, 200, and 250. This experiment showed results similar to those of the previous experiment. The R3 index showed the best performance and the grid index showed the worst performance. The performance of the Quadtree was greatly degraded as the distance threshold increased.
Through these experiments, we confirmed that the R3 index adapts better to an increase in the input rate than the other indexes.

We measured the SR generation time while changing the capacity of a tree node for the R3 index. We performed experiments with k values of 50, 75, 100, and 125. Fig. 11 shows the execution time of SR generation according to the number of child nodes k. The experimental results show that the best performance was obtained when k was 50 for all datasets. In our experimental setup, the SR generation time decreases as k decreases. A small k creates a taller tree with relatively smaller nodes than a larger k does. The taller tree with small nodes can perform MBR comparisons on smaller areas than a shorter tree with larger nodes. This means that more objects can be pruned through MBR comparisons and distance calculations. Also, since the leaf node that is finally found has fewer objects, the number of computations for SR generation is reduced. Therefore, when k is 50, the size of the candidate set is smaller than in the other cases, and the final results can be generated more quickly.

Fig. 11. Spatiotemporal relationship generation time according to the number of child nodes.

We measured the memory usage of the indexes after loading the datasets used in the data insertion experiment. Fig. 12 shows the memory usage of the indexes, depending on the size of the datasets. The Quadtree used the least amount of memory, and the R3 index used more memory than the baselines. The proposed approach uses slightly more memory than the typical R-tree because the R3 index uses an auxiliary structure, the RR table. We found that the additional memory due to the RRs and the RR table is very small. Since the number of entries in the RR table is very small compared to the total number of data objects, there is no significant influence on the total memory usage.

Fig. 12. Memory usage according to the number of objects.

We measured the performance of a parallel implementation of the proposed SR generation algorithm. For load balancing, we used a simple round-robin policy (Shreedhar & Varghese, 1996). We used 1, 2, 4, 6, and 8 threads for the parallel implementations. Fig. 13 shows the result of the parallel execution of the proposed algorithm. In this experiment, we found that performance increased noticeably as the number of threads increased up to six. However, there was almost no additional performance gain between the 6- and 8-thread configurations. When the number of threads was more than six, the overhead of the multi-thread configuration began to outweigh the benefits of load balancing. The current parallel implementation does not guarantee sufficient scalability; thus, a mechanism for the scalability of parallel configurations is needed.

Fig. 13. Spatiotemporal relationship generation time according to the number of threads.

We analyzed the trade-off between processing time and memory requirements. Fig. 14 shows a scatter plot representing the correlation between time and space.
The x-axis of the plot indicates the space and the y-axis shows the processing time. The proposed R3 index shows the best performance in terms of processing time but requires more memory than the others. The typical R-tree (RT) uses less memory than ours but performs poorly in terms of processing time. The Quadtree (QT) has the best memory efficiency but the worst processing time. When processing time is important, such as in a stream environment, the R3 index has a big advantage. On the other hand, in an environment where memory efficiency is important, the Quadtree is advantageous.

Fig. 14. Trade-off of approaches in terms of time and space.

To determine whether there is a statistically significant difference between the proposed method and the other three methods, we performed an analysis of variance (ANOVA) based on the F-distribution. We applied one-way ANOVA to the experimental results measuring the processing time and the space requirement. Twenty experiments were performed for each method. We assume that all experimental results are independent and follow a normal distribution. In the case of the processing time, the one-way ANOVA resulted in a test statistic F-value of 81.518027 and a P-value of 4.073555e−16 (< α = 0.05). For the space requirement, the one-way ANOVA showed an F-value of 72.205624 and a P-value of 2.670546e−15 (< α = 0.05). The high F-values and low P-values indicate that the performance differences between the proposed method and the others are statistically significant.

7. Conclusions

In this paper, we addressed the importance of real-time SR generation and proposed an approach for generating SRs between spatial data streams and static data. We proposed the R3 index, which is a novel spatiotemporal index with RRs, and designed an algorithm that prunes the search space and reduces topological computations efficiently using the RRs. We also annotated the generated SRs with their generation time to output them as a data stream. We intensively evaluated the performance of the proposed method using real-world datasets and confirmed the effectiveness and efficiency of the proposed approach. The proposed approach can also be applied to spatial join and distance queries.

We plan to apply the proposed approach to a distributed environment for big spatial data streams and static data. The R3 index structure can be easily extended for distributed processing.
The RR table and the part of the R-tree from the root to the RRs can be a global index, while a subtree having an RR as its root can be a local index assigned to a node in a distributed cluster. We will study load balancing and optimization techniques to improve performance. We will also try to define and identify more complex SRs and to improve the scalability of the SR generation. Finally, based on the SRs, we plan to support further distributed stream reasoning.

CRediT authorship contribution statement

Sungkwang Eom: Conceptualization, Methodology, Software, Data curation, Writing - original draft. Xiongnan Jin: Investigation, Validation. Kyong-Ho Lee: Writing - review & editing, Supervision.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP; Ministry of Science, ICT & Future Planning) (No. NRF-2019R1A2B5B01070555).


Supplementary material Supplementary material associated with this article can be found, in the online version, at 10.1016/j.ipm.2020.102205 . References Anicic, D., Fodor, P., Rudolph, S., & Stojanovic, N. (2011). EP-SPARQL: a unified language for event processing and stream reasoning. Proceedings of the world wide web conference (WWW). ACM635–644. Arasu, A., Babu, S., & Widom, J. (2006). The CQL continuous query language: semantic foundations and query execution. Proceedings of the VLDB journal Vol. 15. Springer121–142. Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., ... Ghodsi, A., et al. (2015). Spark SQL: Relational data processing in spark. Proceedings of the international conference on management of data (SIGMOD). ACM1383–1394. Arora, A., Sinha, S., Kumar, P., & Bhattacharya, A. (2018). Hd-index: Pushing the scalability-accuracy boundary for approximate kNN search in high-dimensional spaces. Proceedings of the VLDB endowment vol. 11. VLDB Endowment906–919. Babu, S., & Widom, J. (2001). Continuous queries over data streams. Proceedings of the ACM SIGMOD record, Vol. 30,. ACM109–120. Barbieri, D. F., Braga, D., Ceri, S., & Grossniklaus, M. (2010). An execution environment for C-SPARQL queries. Proceedings of the international conference on extending database technology (EDBT). ACM441–452. Barbosa, L., Pham, K., Silva, C., Vieira, M. R., & Freire, J. (2014). Structured open urban data: understanding the landscape. Proceedings of the big data, Vol. 2,144–154. Beck, H., Dao-Tran, M., Eiter, T., & Fink, M. (2015). LARS: A Logic-Based Framework for Analyzing Reasoning over Streams. Proceedings of the AAAI Conference on Artificial Intelligence1431–1438. Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Proceedings of the communications of the ACM, Vol. 18. ACM509–517. Bolles, A., Grawunder, M., & Jacobi, J. (2008). Streaming SPARQL-extending SPARQL to process data streams. 
Proceedings of the extended semantic web conference (ESWC). Springer448–462. Chen, L., Gao, Y., Li, X., Jensen, C. S., & Chen, G. (2015). Efficient metric indexing for similarity search. Proceedings of the international conference on data engineering (ICDE). IEEE591–602. Chen, L., Gao, Y., Zheng, B., Jensen, C. S., Yang, H., & Yang, K. (2017). Pivot-based metric indexing. Proceedings of the VLDB endowment, Vol. 10,. VLDB Endowment1058–1069. Clementini, E., Sharma, J., & Egenhofer, M. J. (1994). Modelling topological spatial relations: Strategies for query processing. Proceedings of the computers & graphics, Vol. 18. Elsevier815–822. Della Valle, E., Ceri, S., Van Harmelen, F., & Fensel, D. (2009). It’s a streaming world! reasoning upon rapidly changing information. Proceedings of the IEEE intelligent systems, Vol. 24. IEEE. Eiter, T., Parreira, J. X., & Schneider, P. (2017). Spatial ontology-mediated query answering over mobility streams. Proceedings of the extended semantic web conference (ESWC). Springer219–237. Emrich, T., Kriegel, H.-P., Kröger, P., Renz, M., & Züfle, A. (2010). Boosting spatial pruning: on optimal pruning of MBRS. Proceedings of the international conference on management of data (SIGMOD). ACM39–50. Eom, S., & Lee, K.-H. (2017). Incorporating spatial queries into semantic sensor streams on the internet of things. Journal of Database Management (JDM), Vol. 28, 24–39 IGI Global Finkel, R. A., & Bentley, J. L. (1974). Quad trees a data structure for retrieval on composite keys. Proceedings of the acta informatica, Vol. 4. Springer1–9. Guo, L., Zhang, D., Li, G., Tan, K.-L., & Bao, Z. (2015). Location-aware pub/sub system: When continuous moving queries meet dynamic event streams. Proceedings of the international conference on management of data (SIGMOD). ACM843–857. Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching, vol. 14. ACM. Hoang-Vu, T.-A., Vo, H. T., & Freire, J. (2016). 
A unified index for spatio-temporal keyword queries. Proceedings of the conference on information and knowledge management (CIKM). ACM135–144. Islam, M. S., Liu, C., Rahayu, W., & Anwar, T. (2016). Q+ tree: An efficient quad tree based data indexing for parallelizing dynamic and reverse skylines. Proceedings of the conference on information and knowledge management (CIKM). ACM1291–1300. Jung, D., Zhang, Z., & Winslett, M. (2017). Vibration analysis for IoT enabled predictive maintenance. Proceedings of the international conference on data engineering (ICDE). IEEE1271–1282. Katayama, N., & Satoh, S. (1997). The SR-tree: an index structure for high-dimensional nearest neighbor queries. Proceedings of the ACM SIGMOD record, Vol. 26. ACM369–380. Khan, A., Kulik, L., Tanin, E., Hua, H., & Hashem, T. (2018). Efficient computation of the optimal accessible location for a group of mobile agents. Proceedings of the ACM transactions on spatial algorithms and systems (TSAS), Vol. 4. ACM10. Koubarakis, M., & Kyzirakos, K. (2010). Modeling and querying metadata in the semantic sensor web: The model stRDF and the query language stSPARQL. Proceedings of the extended semantic web conference (ESWC). Springer425–439. Kyzirakos, K., Karpathiotakis, M., & Koubarakis, M. (2012). Strabon: a semantic geospatial dbms. Proceedings of the international semantic web conference (ISWC). Springer295–311. Le-Phuoc, D., Dao-Tran, M., Parreira, J. X., & Hauswirth, M. (2011). A native and adaptive approach for unified processing of linked streams and linked data. Proceedings of the international semantic web conference (ISWC). Springer370–388. de Leng, D., & Heintz, F. (2016). Qualitative spatio-temporal stream reasoning with unobservable intertemporal spatial relations using landmarks. Proceedings of the AAAI conference on artificial intelligence (AAAI)957–963. Leutenegger, S. T., Lopez, M. A., & Edgington, J. (1997). STR: a simple and efficient algorithm for R-tree packing. 
Proceedings of the international conference on data engineering (ICDE). IEEE, 497–506.
Manolopoulos, Y., Nanopoulos, A., Papadopoulos, A. N., & Theodoridis, Y. (2010). R-trees: Theory and applications. Springer Science & Business Media.
Mao, R., Zhang, P., Li, X., Liu, X., & Lu, M. (2016). Pivot selection for metric-space indexing. International Journal of Machine Learning and Cybernetics, Vol. 7. Springer, 311–323.
Ngomo, A.-C. N. (2013). Orchid – reduction-ratio-optimal computation of geo-spatial distances for link discovery. Proceedings of the international semantic web conference (ISWC). Springer, 395–410.
Nutanong, S., Jacox, E. H., & Samet, H. (2011). An incremental Hausdorff distance calculation algorithm. Proceedings of the VLDB endowment, Vol. 4. VLDB Endowment, 506–517.
Paule, J. D. G., Sun, Y., & Moshfeghi, Y. (2019). On fine-grained geolocalisation of tweets and real-time traffic incident detection. Proceedings of the information processing & management, Vol. 56. Elsevier, 1119–1132.
Perry, M., & Herring, J. (2012). OGC GeoSPARQL – A geographic query language for RDF data. OGC implementation standard.
Perry, M., Jain, P., & Sheth, A. P. (2011). SPARQL-ST: Extending SPARQL to support spatiotemporal queries. Proceedings of the geospatial semantics and the semantic web. Springer, 61–86.
Qi, J., Tao, Y., Chang, Y., & Zhang, R. (2018). Theoretically optimal and empirically efficient R-trees with strong parallelizability. Proceedings of the VLDB endowment, Vol. 11. VLDB Endowment, 621–634.
Randell, D. A., Cui, Z., & Cohn, A. G. (1992). A spatial logic based on regions and connection. Proceedings of the KR, Vol. 92, 165–176.
Ruocco, M., & Ramampiaro, H. (2015). Geo-temporal distribution of tag terms for event-related image retrieval. Proceedings of the information processing & management, Vol. 51. Elsevier, 92–110.
Salas, J., & Harth, A. (2011). Finding spatial equivalences across multiple RDF datasets. Proceedings of the terra cognita workshop on foundations, technologies and applications of the geospatial web. Citeseer, 114–126.
Sherif, M. A., Dreßler, K., Smeros, P., & Ngomo, A.-C. N. (2017). Radon – rapid discovery of topological relations. Proceedings of the AAAI conference on artificial intelligence (AAAI), 175–181.
Shreedhar, M., & Varghese, G. (1996). Efficient fair queuing using deficit round-robin. Proceedings of the IEEE/ACM transactions on networking, Vol. 4. IEEE, 375–385.
Traina, C. Jr, Traina, A. J., Vieira, M. R., Faloutsos, C., et al. (2007). The Omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient. The VLDB Journal, Vol. 16. Springer, 483–505.
Valverde-Rebaza, J. C., Roche, M., Poncelet, P., & de Andrade Lopes, A. (2018). The role of location and social strength for friendship prediction in location-based social networks. Proceedings of the information processing & management, Vol. 54. Elsevier, 475–489.
Vilches-Blázquez, L. M., Saquicela, V., & Corcho, O. (2012). Interlinking geospatial information in the web of data. Bridging the geographic information sciences. Springer, 119–139.
Volz, J., Bizer, C., Gaedke, M., & Kobilarov, G. (2009). Silk – a link discovery framework for the web of data. Proceedings of the LDOW, Vol. 5358. Citeseer.
Wang, H., & Li, Z. (2017). Region representation learning via mobility flow. Proceedings of the conference on information and knowledge management (CIKM). ACM, 237–246.
Wang, L., Cai, R., Fu, T. Z., He, J., Lu, Z., Winslett, M., & Zhang, Z. (2018). Waterwheel: Realtime indexing and temporal range query processing over massive data streams. Proceedings of the international conference on data engineering (ICDE). IEEE.
Wang, X., Zhang, Y., Zhang, W., Lin, X., & Huang, Z. (2016). Skype: top-k spatial-keyword publish/subscribe over sliding window. Proceedings of the VLDB endowment, Vol. 9. VLDB Endowment, 588–599.
Wu, C., Kao, S.-C., Wu, C.-C., & Huang, S. (2015). Location-aware service applied to mobile short message advertising: Design, development, and evaluation. Proceedings of the information processing & management, Vol. 51. Elsevier, 625–642.
Xie, D., Li, F., Yao, B., Li, G., Zhou, L., & Guo, M. (2016). Simba: Efficient in-memory spatial analytics. Proceedings of the international conference on management of data (SIGMOD). ACM, 1071–1085.
Yi, X., Paulet, R., Bertino, E., & Varadharajan, V. (2014). Practical k nearest neighbor queries with location privacy. Proceedings of the international conference on data engineering (ICDE). IEEE, 640–651.
Zhang, J., You, S., & Gruenwald, L. (2017a). Parallel selectivity estimation for optimizing multidimensional spatial join processing on GPUs. Proceedings of the international conference on data engineering (ICDE). IEEE, 1591–1598.
Zhang, M., Wo, T., Xie, T., Lin, X., & Liu, Y. (2017b). Carstream: An industrial system of big data processing for internet-of-vehicles. Proceedings of the VLDB endowment, Vol. 10. VLDB Endowment, 1766–1777.
Zheng, B., Zheng, K., Xiao, X., Su, H., Yin, H., Zhou, X., & Li, G. (2016). Keyword-aware continuous kNN query on road networks. Proceedings of the international conference on data engineering (ICDE). IEEE, 871–882.
