Continuously maintaining approximate quantile summaries over large uncertain datasets

Information Sciences 456 (2018) 174–190
Chunquan Liang a,b,∗, Yang Zhang a, Yanming Nie a, Shaojun Hu a

a College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China
b Key Laboratory of Agricultural Internet of Things, Ministry of Agriculture, China

Article info

Article history: Received 7 September 2016; Revised 3 February 2018; Accepted 21 April 2018; Available online 27 April 2018

Keywords: Uncertain datasets; Quantile summaries; Data reduction; Uncertain data management

Abstract: Quantile summarization is a useful tool for managing massive datasets in a rapidly growing number of applications, and its importance is further enhanced when the data being explored is uncertain. In this paper, we focus on the problem of computing approximate quantile summaries over large uncertain datasets. On the basis of the GK algorithm [14], we propose a novel online algorithm named uGK. Using only a small amount of space, the proposed uGK algorithm maintains a small set of tuples, each of which contains a point value and the "count" of uncertain elements that are not larger than this value, and supports any quantile query within a given error. Experimental evaluation on both synthetic and real-life datasets illustrates the effectiveness of our uGK algorithm.

1. Introduction

In recent years, massive datasets generated in the presence of uncertainty have become increasingly common in numerous applications, e.g., sensor networks, environmental monitoring, moving object management, data cleaning, and data integration. The uncertainty in these applications results from unreliable data transfer, imprecise measurement, repeated sampling, privacy protection, and so forth [18,28]. These applications have created a demand for efficiently processing and managing massive uncertain datasets, which has gradually become a first-class issue in modern database systems [6,8]. To this end, the first and fundamental operation is to use effective data reduction techniques to compress large amounts of uncertain data down to summaries that capture important characteristics of the original data [8,50]. As with their deterministic counterparts, such summaries provide the foundation for query processing, query planning and optimization, and statistical data analysis over uncertain data [6,8,50].

Among the various data reduction techniques [1], quantile summarization is one that can efficiently characterize the distribution of a real-world dataset. Informally, a quantile is the element at a specified position of a sorted data sequence. Recently, most studies have focused on estimating approximate quantiles, in which an error bound is used to improve space and time efficiency. In addition to the fields above, approximate quantiles are also very useful for data mining, database parallelization, and the management of data streams [14]. Hence, it is interesting and important to compute approximate quantile summaries on massive uncertain datasets. An estimate of such summaries over massive datasets with uncertainty can approximate the distribution function induced by those datasets with reasonable precision, supporting the aforementioned uses. Some more concrete examples are as follows.

∗ Corresponding author at: College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China. E-mail address: [email protected] (C. Liang).

https://doi.org/10.1016/j.ins.2018.04.070


• In sensor networks, sensors are commonly deployed in an ad-hoc fashion to monitor various physical quantities, such as temperature, sound, light intensity, and so forth [19,40]; for energy conservation, individual sensor nodes are usually designed to transmit the distribution of the uncertain sensor values to base stations, so that users can pose sophisticated queries or analyses over the sensor data [15,19].
• In data stream classification applications, take credit card fraud detection for example: private customer information such as age, address, and occupation may be masked by imprecise values when published for data mining purposes; the distribution of an uncertain data stream can then serve as sufficient information for constructing classification models, such as very fast decision trees [11].
• In the management of uncertain datasets, an approximation of the distribution of the datasets can be used to implement query optimization or aggregate query processing, such as estimating cardinality on two uncertain attributes [6,41].

Despite the practical importance above, very few works have been proposed to compute summaries of uncertain data (we review related work in the next section). Existing important works include computing histogram-based, wavelet-based, and essential-aggregate-based summaries on probabilistic data [6,8,20,47]. These works only consider categorical or static data; however, uncertain numerical data, which is frequently generated in an incremental or streaming fashion [11,32], is ubiquitous in many application domains. Taking the applications above as examples, the attributes of interest, such as temperature, sound, age, and income, are generated continuously with uncertainty represented by continuous probability distributions, e.g., uniform or Gaussian distributions [10,28]. In our previous work [32], to classify uncertain data streams, we gave a somewhat simplistic solution for incrementally approximating the Gaussian distribution of these streams. The solution, however, does not bound the error between the obtained distribution and the real one.

Motivated by the above, in this paper we consider the problem of incrementally computing approximate quantile summaries over a large dataset with numerical uncertainty. Generally, the definition of a quantile is tied to an ordered sequence; however, data elements with uncertainty cannot be sorted. Thus, instead of defining quantiles over sorted data, we introduce quantiles over uncertain data based on probabilistic cardinality. On the basis of the GK algorithm [14], we propose an algorithm, named uGK, for computing quantile summaries over uncertain data online. The proposed uGK algorithm uses a data structure similar to GK's to store the summary and has a similar space complexity. It makes only a single scan over the uncertain data and requires very little space, yet it obtains a summary that approximates the real observed distribution within a given error. Our main contributions can be summarized as follows:

• We define quantiles over uncertain datasets based on probabilistic cardinality, and state the problem of computing quantiles over such datasets.
• On the basis of the GK algorithm, we develop the uGK algorithm to compute quantile summaries over uncertain datasets incrementally.
• We theoretically analyze the space and time complexity of our uGK algorithm.
• We conduct a comprehensive experimental study on both real and synthetic datasets, illustrating the effectiveness of our proposed uGK algorithm.

In the rest of this paper, we first discuss related work in Section 2. The formal problem definition is given in Section 3. We present the details of our uGK algorithm in Section 4, and analyze its time and space complexity in Section 5. Finally, the experimental study is presented in Section 6, and we conclude our work in Section 7.

2. Related works

2.1. Uncertain data management

The study of uncertain data management began in the late 1980s. Since then there has been growing interest in this domain, and numerous works have been reported, spanning a wide range of issues from data models [13] and preprocessing [5,12] to storing and indexing [46] and query processing [29]. In addition, several institutes have developed prototype systems for managing such data, e.g., Trio [51], MystiQ [3], Orion [44], BayesStore [49] and Avatar [22]. In these works, data uncertainty is generally classified into existential uncertainty and value uncertainty. Existential uncertainty describes whether an object or data tuple exists or not [45], while under value uncertainty, a data item is modeled as a set of alternative values with a probability density function (pdf) over these values [45,48]. Among the wide range of issues in uncertain data management, the most important one is query processing with uncertainty. This field mainly involves answering various queries over uncertain data, such as top-k queries [29,39], inverse ranking queries [30,52], skyline queries [18,42], nearest neighbor queries [26,27], and so on. The existing techniques for managing uncertain data are covered by a nice survey article [36].

2.2. Summarization of uncertain datasets

The area of summarization of uncertain datasets has also seen much work in the database community. The vast majority of it has focused on estimating a single aggregate in the presence of uncertainty. For example, [34] introduced a general framework for cost-efficiently computing the minimum and maximum values of uncertain datasets from distributed sensor networks.


Fig. 1. A certain data stream with arrival time stamps.

In addition to estimating the minimum and maximum values, Tran et al. [47] and Jayram et al. [23] proposed algorithms to compute the sum, average, and count in the context of uncertain data streams; Jayram et al. [24] also presented algorithms to compute statistical aggregates over uncertain data streams in a single pass. These algorithms, however, can only be applied to simple applications [14]. Larusso and Singh [25], and Haider et al. [17], provided algorithms to cluster uncertain data streams in a single scan, thereby obtaining a small set of representative data that preserves important information of the stream. However, since these algorithms used cluster points to represent the whole dataset, their precision was very low, and they could not bound the errors.

Capturing the distribution of a dataset is a fundamental operation in exact data management, e.g., for query analysis or query optimization. This is also true in uncertain data management. Therefore, Cormode and Garofalakis [8], Cormode et al. [6], and Iqbal et al. [20] discussed approaches to construct optimal histograms and wavelets over uncertain datasets, which were used to approximate the probability distributions of those datasets. They also showed how to use the approximated distributions to implement query optimization. Their algorithms, however, could only cope with static datasets; for a dataset of size n, the algorithms run in O(Bn²) time to build a histogram with B buckets, so they are not suited to large datasets or data streams. Besides, they handled only uncertain categorical data. In our previous work [32], to classify uncertain data streams, we gave methods to incrementally approximate the Gaussian distribution of these streams; however, the proposed algorithm did not bound the error between the obtained distribution and the real one. Cormode and Garofalakis [7] proposed an algorithm to compute essential aggregates, thereby answering quantile queries over uncertain datasets. To the best of our knowledge, this is the only work concerning the quantile problem over uncertain data, but it only addresses uncertain categorical data.

2.3. Study of quantiles

Quantile summary problems have been extensively studied over the decades, partially due to the space limitations in the management of massive datasets. Generally, quantile computation can be divided into exact quantiles and approximate quantiles. Munro and Paterson [38] showed that an algorithm answering exact quantile queries over a dataset of size n in p passes requires Ω(n^{1/p}) space; thus, a single-pass algorithm needs Ω(n) space. To reduce the space requirement, approximate quantile summaries were proposed in the literature, and many algorithms have been reported. Many works have identified the crucial properties that algorithms for computing approximate quantiles require in order to efficiently support applications [14,33]: (1) the algorithm should provide a tunable bound on the error of the approximation; (2) the algorithm should work independently of the data distribution; (3) the algorithm should process the data in a single scan; (4) the algorithm should use as little memory as possible. Several such algorithms have been proposed [9,14,33,53]. Among them, the best known was presented by Greenwald and Khanna [14]; given a dataset of size n, their algorithm runs in O(log(εn)/ε) space. Based on Greenwald and Khanna's (GK) algorithm, [33] presented algorithms to compute quantiles for the most recent N elements. Zhang and Wang [53] proposed a fast algorithm for approximate quantiles in high-speed data streams. Many other quantile algorithms have been proposed, but they could not specify an a priori guarantee on the error [21], or only provided probabilistic guarantees [4,37], or required prior knowledge of the dataset size n. Chaudhuri et al. [4] estimated quantiles under a different error metric, but their algorithms require multiple passes over the data. Considering the case where the input streams are randomly ordered, Guha and McGregor [16] presented an algorithm that requires O(log(n)) space and answers quantile queries with a probabilistic guarantee on the error. Ma et al. [35] proposed frugal algorithms using one unit of memory for quantile computation problems; however, their algorithms compute only one quantile, rather than a summary of the input stream that supports arbitrary quantile queries. Wang et al. [50] provided detailed experimental comparisons among existing important algorithms regarding the tradeoffs between space, time, and accuracy for quantile computation.

3. Problem definition

In this section, we begin by introducing the data model. Then we discuss some notions regarding quantiles on exact datasets and extend them to uncertain data streams (we summarize large uncertain datasets in a streaming fashion; hence, we may use "large uncertain datasets" and "uncertain data streams" interchangeably hereafter). Lastly, we formally define the problem of computing quantiles over uncertain data streams.

3.1. Data model

Generally, an exact data stream is a sequence S = ⟨e_1, e_2, …, e_n, …⟩, in which each element e_n contains a deterministic numerical value. Fig. 1 illustrates such a data stream with time stamps.


Fig. 2. An uncertain data stream with arrival time stamps.

Similarly, in this paper we define an uncertain data stream as a sequence S^u = ⟨e^u_1, e^u_2, …, e^u_n, …⟩. However, each element contains not a single value but multiple ones with a probability density function (pdf) [43,48]. In this paper, for simplicity we only consider the case where the multiple values form a continuous interval (our work can easily be extended to the discrete case); an uncertain element e^u_n is thus defined as an interval [a_n, b_n] together with a continuous pdf f_n(x) over this interval such that ∫_{a_n}^{b_n} f_n(x) dx = 1. Besides, each element e^u_n is associated with a weight w_n ∈ [0, 1], which denotes the probability of observing that element. Generally, w_n = 1. Fig. 2 illustrates such an uncertain data stream with time stamps. At time t_4, for example, the observed e^u_4 contains the interval [20.68, 22.66] and the pdf f_4(x) = 1/(22.66 − 20.68).

3.2. Quantile over uncertain data streams

In this paper, we use n to denote both time n and the number of elements observed up to that time. In this section, we first revisit the definition of exact quantiles over a certain dataset, then extend it to uncertain data streams, and finally formally define the problem of computing quantiles over an uncertain data stream. For clarity, we first present the general definition of a quantile from the literature [14,33].

Definition 1 (Exact quantile over certain datasets). Given a sorted data sequence S of size n, a φ-quantile is defined as the element in position r = φn. Given a value of φ, a quantile query returns that element.

A straightforward method to answer a φ-quantile query over a certain dataset is to sort the dataset and return the element in the given position.

Example 1. In Fig. 1, the dataset contains 10 elements, and the permutation in ascending order is 3, 5, 11, 11, 16, 19, 21, 22, 23, 23. Accordingly, the 0.5-quantile is the element in the 5th position (r = 10 × 0.5), i.e., 16; similarly, the 0.75-quantile is 22, in the 8th position; the 0.3-quantile is 11, in the 3rd position; and the 0.4-quantile is also 11, but in the 4th position.

Traditionally, a quantile over a certain dataset is defined according to the ordered permutation. In an uncertain dataset, however, the value of an uncertain element is not a single value but multiple values with a pdf. Therefore, the dataset cannot be sorted, and the traditional definition of quantile is not suitable for uncertain datasets. Here we give another definition of quantile based on cardinality.

Definition 2 (Cardinality-based exact quantile over certain datasets [2]). Let C(S) = Σ_{e_i ∈ S} w_i denote the cardinality of all elements in dataset S (sorted or unsorted), and C(S, v) = Σ_{e_i ∈ S, e_i ≤ v} w_i denote the cardinality of all elements in S whose values are not larger than v, where w_i is the weight of element e_i. For elements in certain data streams, we have w_i = 1 and hence C(S) = n. The φ-quantile over S is defined as the smallest element e ∈ S such that C(S, e) ≥ r, where r = φ·C(S). Given a φ value, a quantile query returns this element.

In essence, the quantile defined by Definition 2 is the same as that of Definition 1; both correspond to an element specified by r = φ·C(S) (note that C(S) = n).

Example 2. The cardinality of the dataset in Fig. 1 is 10. Hence, a 0.5-quantile query returns the smallest element e ∈ S such that C(S, e) ≥ 0.5 × 10, i.e., 16; similarly, a 0.75-quantile query returns 22, which is the smallest element with C(S, e) ≥ 8; and 0.3-quantile and 0.4-quantile queries both return 11, which is the smallest element with C(S, e) ≥ 3 and C(S, e) ≥ 4, respectively.
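To make Definition 2 concrete, the sketch below computes a cardinality-based φ-quantile over a small weighted dataset. It is an illustrative snippet of ours, not code from the paper; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch of Definition 2: the cardinality-based phi-quantile of a
// weighted (certain) dataset is the smallest element e with C(S, e) >= phi * C(S).
public class CardinalityQuantile {

    // values[i] carries weight weights[i]; both arrays must have equal length.
    static double phiQuantile(double[] values, double[] weights, double phi) {
        // Sort indices by value so we can accumulate C(S, v) in one pass.
        Integer[] idx = new Integer[values.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(values[a], values[b]));

        double total = 0;                       // C(S)
        for (double w : weights) total += w;
        double r = phi * total;                 // rank threshold

        double running = 0;                     // C(S, v) accumulated so far
        for (int i : idx) {
            running += weights[i];
            if (running >= r) return values[i]; // smallest e with C(S, e) >= r
        }
        return values[idx[idx.length - 1]];     // phi = 1 falls through to the max
    }

    public static void main(String[] args) {
        // The dataset of Examples 1 and 2 (Fig. 1), all weights equal to 1.
        double[] v = {3, 5, 11, 11, 16, 19, 21, 22, 23, 23};
        double[] w = new double[v.length];
        Arrays.fill(w, 1.0);
        System.out.println(phiQuantile(v, w, 0.5));  // 16.0
        System.out.println(phiQuantile(v, w, 0.75)); // 22.0
    }
}
```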
Definition 3 (Exact quantile over uncertain datasets). Given an uncertain dataset S^u, let PC(S^u) = Σ_{e^u_i ∈ S^u} w_i denote the probabilistic cardinality [31,32] of all elements in S^u, and PC(S^u, v) = Σ_{e^u_i ∈ S^u} w_i ∫_{a_i}^{v} f_i(x) dx denote that of all values in S^u that are not larger than v. The φ-quantile over S^u is defined as the smallest value v ∈ [a_i, b_i] of some e^u_i ∈ S^u such that PC(S^u, v) ≥ r, where r = φ·PC(S^u).

Definition 3 considers a point value, instead of the whole, of an uncertain element as a quantile. Note that an uncertain element contains infinitely many values. Again, the quantiles in the definitions above are all specified by r = φ·C(S), and they are the same in essence.

Example 3. In Fig. 2, the dataset contains 10 elements. Assume that each element has weight 1. The 0.5-quantile query returns the smallest value v such that Σ_{e^u_i ∈ S^u} w_i ∫_{a_i}^{v} f_i(x) dx ≥ φ·PC(S^u), i.e., Σ_{e^u_i ∈ S^u} ∫_{a_i}^{v} f_i(x) dx ≥ 5. Solving this inequality by visiting every element in S^u, we obtain the solution v ≥ 16.48. Thus, the 0.5-quantile is the smallest such point value, v = 16.48. Similarly, a 0.75-quantile query returns the value v = 21.56.
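Definition 3 can be evaluated directly when the pdfs are simple. The following sketch of ours (assuming uniform pdfs, for which the partial integral has the closed form w·(min(v, b) − a)/(b − a)) finds the smallest v with PC(S^u, v) ≥ φ·PC(S^u) by bisection, exploiting the fact that PC(S^u, v) is non-decreasing in v.

```java
// Sketch of Definition 3 for uniform uncertain elements [a_i, b_i] with weight
// w_i: find the smallest v with PC(S^u, v) >= phi * PC(S^u) by bisection.
public class UncertainQuantile {

    // Partial probabilistic cardinality contributed by one uniform element.
    static double partial(double a, double b, double w, double v) {
        if (v <= a) return 0;
        if (v >= b) return w;
        return w * (v - a) / (b - a);   // closed-form integral of the uniform pdf
    }

    static double phiQuantile(double[] a, double[] b, double[] w, double phi) {
        double total = 0, lo = Double.MAX_VALUE, hi = -Double.MAX_VALUE;
        for (int i = 0; i < a.length; i++) {
            total += w[i];
            lo = Math.min(lo, a[i]);
            hi = Math.max(hi, b[i]);
        }
        double r = phi * total;
        // PC(S^u, v) is non-decreasing in v, so bisection converges to the
        // smallest v with PC(S^u, v) >= r.
        for (int iter = 0; iter < 100; iter++) {
            double mid = (lo + hi) / 2, pc = 0;
            for (int i = 0; i < a.length; i++) pc += partial(a[i], b[i], w[i], mid);
            if (pc >= r) hi = mid; else lo = mid;
        }
        return hi;
    }
}
```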


To maintain exact quantiles over an uncertain data stream S^u in a single pass, all elements would have to be stored in memory. In fact, for most applications, approximate quantiles are sufficient.

Definition 4 (Approximate quantile over uncertain datasets). Given ε ∈ [0, 1] and φ ∈ [0, 1], an ε-approximate φ-quantile over an uncertain dataset S^u is a point value v ∈ [a_i, b_i] of some e^u_i ∈ S^u such that (φ − ε)·PC(S^u) ≤ PC(S^u, v) ≤ (φ + ε)·PC(S^u).

Example 4. Continuing with the dataset in Fig. 2, a 0.01-approximate 0.5-quantile is a point value v ∈ [a_i, b_i] such that (0.5 − 0.01) × 10 ≤ PC(S^u, v) ≤ (0.5 + 0.01) × 10. Solving this inequality by visiting every element in S^u, we obtain the solution v ∈ [16.336, 18.031]. Hence, the quantile is any point value v ∈ ∪_{i=1}^{10} ([16.336, 18.031] ∩ [a_i, b_i]), i.e., v ∈ [16.336, 16.480] ∪ [17.860, 18.031]. Similarly, a 0.01-approximate 0.75-quantile query returns any point value v ∈ [21.493, 21.633].

Problem definition. Given an uncertain data stream S^u, the goal of this paper is to maintain some summary data online such that, at any time n, the summary can answer any ε-approximate φ-quantile query over the data stream.

4. Quantile summarization algorithm for uncertain data streams

For each element arriving from an uncertain data stream, the proposed uGK algorithm performs inserting and merging operations over a summary with a dedicated data structure.

4.1. Data structure

From Definition 4 and Example 4, we can see that there may be infinitely many values that qualify as an ε-approximate φ-quantile of an uncertain data stream. To reduce space complexity, however, the uGK algorithm stores only a few of these values and related statistics. In this work, we use a data structure similar to the one described in [14] to maintain such values.

Definition 5 (Data structure for approximate quantile summarization over uncertain datasets). At any time n, the uGK algorithm maintains a subset of point values from the streaming data S^u observed so far, as an ordered sequence of tuples T(n) = ⟨T_0, T_1, T_2, …, T_s⟩. Each tuple T_i ∈ T(n) stores one point value and implicit bounds on the minimum and maximum possible probabilistic cardinality of that value in the current S^u. Specifically, let PC_min(v) and PC_max(v) denote the lower and upper bounds on the probabilistic cardinality PC(S^u, v), respectively. Each tuple T_i = (v_i, g_i, Δ_i) has three parameters: a point value v_i ∈ [a_i, b_i] of some e^u_i ∈ S^u (for simplicity, in the following sections we abuse notation somewhat and use T_i to denote both the tuple and its parameter v_i); g_i = PC_min(v_i) − PC_min(v_{i−1}); and Δ_i = PC_max(v_i) − PC_min(v_i).

It is not hard to see that PC_min(v_i) = Σ_{j≤i} g_j, PC_max(v_i) = Σ_{j≤i} g_j + Δ_i, and PC(S^u) = Σ_{j≤s} g_j, where s is the index of the last tuple.

Proposition 1. At any time n, if the summary T(n) in the form above satisfies max_i (g_i + Δ_i) ≤ 2ε·PC(S^u), then a φ-quantile query over T(n) can return a value v within the given error ε, i.e., (φ − ε)·PC(S^u) ≤ PC(S^u, v) and PC(S^u, v) ≤ (φ + ε)·PC(S^u).

Proof. Let e = max_i (g_i + Δ_i)/2. If we can find a tuple T_i ∈ T(n) that satisfies φ·PC(S^u) − e ≤ PC_min(v_i) and φ·PC(S^u) + e ≥ PC_max(v_i), then we have

(φ − ε)·PC(S^u) ≤ φ·PC(S^u) − e ≤ PC_min(v_i) ≤ PC(S^u, v_i), and

PC(S^u, v_i) ≤ PC_max(v_i) ≤ φ·PC(S^u) + e ≤ (φ + ε)·PC(S^u).

This indicates that v_i is a φ-quantile over S^u within error ε·PC(S^u). We now argue that we can always find such a T_i in T(n). If φ·PC(S^u) > PC(S^u) − e, then, noticing that φ ∈ [0, 1], we have φ·PC(S^u) − e < PC(S^u) = PC_min(v_s) and φ·PC(S^u) + e > PC(S^u) = PC_max(v_s); this means that T_s is the tuple we are searching for. Otherwise φ·PC(S^u) ≤ PC(S^u) − e, and we can find in T(n) the tuple T_j with the greatest index j such that PC_max(v_j) ≤ φ·PC(S^u) + e, and we have

PC_max(v_{j+1}) > φ·PC(S^u) + e ⇒ φ·PC(S^u) + e < PC_min(v_j) + g_{j+1} + Δ_{j+1} ≤ PC_min(v_j) + 2e ⇒ φ·PC(S^u) − e < PC_min(v_j).

This means that T_j is the tuple we are looking for. □

Proposition 2. Inserting a tuple (v, g, Δ) into its correct position in T(n), i.e., between adjacent tuples T_i and T_{i+1} such that v_i ≤ v ≤ v_{i+1}, does not change the parameters of any existing tuple in T(n).

Proof. If neither T_i nor T_{i+1} exists, the proposition is trivially true. Otherwise, for j ≤ i, the insertion of (v, g, Δ) changes neither PC_min(v_j) nor PC_max(v_j), so T_j remains unchanged; for j ≥ i + 1, the insertion increases both PC_min(v_j) and PC_max(v_j) by g, so such T_j remains unchanged as well. □
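The following sketch shows the tuple structure of Definition 5 and the query rule used in the proof of Proposition 1: scan the summary, accumulate PC_min, and return the first value whose bounds bracket φ·PC(S^u) within e. It is our illustration; the class, field, and method names are not from the paper.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Definition 5 and the query of Proposition 1 (names are ours).
public class UgkSummary {
    static class Tuple {
        double v;      // stored point value v_i
        double g;      // PC_min(v_i) - PC_min(v_{i-1})
        double delta;  // PC_max(v_i) - PC_min(v_i)
        Tuple(double v, double g, double delta) { this.v = v; this.g = g; this.delta = delta; }
    }

    final List<Tuple> tuples = new ArrayList<>(); // kept sorted by v

    double pc() {                        // PC(S^u) = sum of all g-values
        double s = 0;
        for (Tuple t : tuples) s += t.g;
        return s;
    }

    // phi-quantile query: find a tuple with phi*PC - e <= PC_min(v_i) and
    // PC_max(v_i) <= phi*PC + e, where e = max_i(g_i + delta_i) / 2.
    // Assumes a non-empty summary.
    double query(double phi) {
        double pc = pc(), e = 0;
        for (Tuple t : tuples) e = Math.max(e, t.g + t.delta);
        e /= 2;
        double r = phi * pc, pcMin = 0;
        for (Tuple t : tuples) {
            pcMin += t.g;                           // running PC_min(v_i)
            double pcMax = pcMin + t.delta;
            if (r - pcMin <= e && pcMax - r <= e) return t.v;
        }
        return tuples.get(tuples.size() - 1).v;     // phi close to 1: last tuple
    }
}
```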


4.2. Inserting operation

It is impossible, and unnecessary, to maintain in the summary T(n) the infinitely many values contained in an uncertain element. Instead, our uGK algorithm selects only a small subset of them. By correctly maintaining the implicit bounds on the possible probabilistic cardinalities of the chosen values, uGK can support any approximate quantile query of a given precision.

To process an arriving element e^u_n, uGK partitions its interval [a_n, b_n] into sub-intervals SI = {[a_n1, b_n1], [a_n2, b_n2], …, [a_nm, b_nm]} using the tuples in the current summary T(n − 1), where b_nj = a_n(j+1). uGK then chooses some values from each sub-interval, creates corresponding tuples, and inserts them into T(n − 1). By Proposition 2, newly created tuples can be trivially inserted between the two adjacent tuples that enclose the sub-interval. The issues lie in the selection of values and the creation of tuples: the goal is to choose as few values as possible while creating tuples that satisfy the condition of Proposition 1. The main idea is as follows. Suppose v_l and v_h are selected; in particular, assume v_l < v_h and that no value is selected from the interval (v_l, v_h). For v_h, we need to create a tuple T_h = (v_h, g_h, Δ_h), in which g_h = PC_min(v_h) − PC_min(v_l) = w_n ∫_{v_l}^{v_h} f_n(x) dx. On one hand, the selection should maximize the difference (v_h − v_l), since we hope to select fewer values; on the other hand, the selection must guarantee w_n ∫_{v_l}^{v_h} f_n(x) dx ≤ 2ε·PC(S^u), since g_h + Δ_h ≤ 2ε·PC(S^u). Besides, because g_h ≤ w_n ≤ 1, Δ_h is at most max(2ε·PC(S^u) − 1, 0).

Specifically, uGK handles a sub-interval [a_nj, b_nj] ∈ SI that lies between two adjacent tuples T_i and T_{i+1} with the following steps (a code sketch follows Example 5 below):

1. Select b_nj.
2. If T_{i+1} exists, select the smallest v_l ∈ [a_nj, b_nj] such that w_n ∫_{v_l}^{b_nj} f_n(x) dx ≤ 2ε·PC(S^u) − g_{i+1} − Δ_{i+1}, and update T_{i+1} by g_{i+1} = g_{i+1} + w_n ∫_{v_l}^{b_nj} f_n(x) dx.
3. Repeat Steps 4–6 until v_l ≤ a_nj.
4. For the previous selection v_h, select the smallest v_l ∈ [a_nj, b_nj] such that w_n ∫_{v_l}^{v_h} f_n(x) dx ≤ 2ε·PC(S^u).
5. Create a tuple T_h = (v_h, w_n ∫_{v_l}^{v_h} f_n(x) dx, max(2ε·PC(S^u) − 1, 0)).
6. Insert T_h after T_i if T_i exists; otherwise insert T_h as the first tuple.

Overall, uGK uses a simple heuristic to select values from right to left within a sub-interval. The endpoint b_nj must be selected (Step 1): otherwise, the largest selected value v would be smaller than b_nj, and the interval [v, b_nj] would be missed. Step 2 implicitly maintains a tuple (b_nj, g_nj, max(2ε·PC(S^u) − 1, 0)), where g_nj = w_n ∫_{v_l}^{b_nj} f_n(x) dx ≤ 2ε·PC(S^u) − g_{i+1} − Δ_{i+1}; in Section 4.3 we will see that this tuple can be merged into T_{i+1}, resulting in g_{i+1} = g_{i+1} + g_nj, so uGK does not actually maintain a tuple for b_nj but updates T_{i+1} instead. The goal, again, is to generate fewer tuples. Steps 3–6 heuristically select values, create the corresponding tuples following the ideas above, and insert them into the current summary. Since the initial g-value of a tuple is the fractional weight of a sub-interval [v_l, v_h), we may say that a tuple summarizes, or covers, that sub-interval in the following discussion. Note that the sum of the fractional weights of all elements, i.e., the sum of the g-values of all tuples, equals the probabilistic cardinality PC(S^u).

Example 5. In the following examples, we set w_n = 1 and f_n(x) = 1/(b_n − a_n) for an uncertain element e^u_n, and ε = 0.1 for our uGK algorithm.

(a) See Fig. 3(a). The uncertain interval of e^u_1 is trivially partitioned into a single sub-interval [10.7, 12.5], since T(0) is empty. At this time n = 1, we have 2ε·PC(S^u) = 0.2 and max(2ε·PC(S^u) − w_n, 0) = 0. From that sub-interval, following Steps 3–6 above, uGK chooses the sequence of points ⟨12.5, 12.14, 11.78, 11.42, 11.06⟩ and maintains the summary T(1) = ⟨(11.06, 0.2, 0), (11.42, 0.2, 0), (11.78, 0.2, 0), (12.14, 0.2, 0), (12.5, 0.2, 0)⟩.

(b) See Fig. 3(b). uGK partitions the uncertain interval [16.5, 18.1] into three sub-intervals [16.5, 16.8], [16.8, 17.6], and [17.6, 18.1] using 16.8 of T_{m−2} and 17.6 of T_{m−1}. At this time n = 10, we have 2ε·PC(S^u) = 2.0 and max(2ε·PC(S^u) − w_n, 0) = 1.0. From the first sub-interval, uGK selects the endpoint 16.8, finds the smallest point v_l = 16.5, and updates T_{m−2} by g_{m−2} = 0.7 + 1·∫_{16.5}^{16.8} 1/(18.1 − 16.5) dx = 0.8875. Similarly, from the second sub-interval uGK selects 17.6 and v_l = 17.28, and updates T_{m−1} by g_{m−1} = 0.9 + 1·∫_{17.28}^{17.6} 1/(18.1 − 16.5) dx = 1.2; besides, uGK creates and inserts a new tuple (17.28, 0.2, 1) after T_{m−2} following Steps 3–6 above. To process the last sub-interval, uGK updates T_m by g_m = 1.4 + 1·∫_{17.78}^{18.1} 1/(18.1 − 16.5) dx = 1.6 and creates a new tuple (17.78, 0.1125, 1). In the end, uGK maintains the summary T(10) = ⟨…, (16.0, 1.2, 0.0), (16.8, 0.8875, 0.0), (17.28, 0.2, 1), (17.6, 1.2, 0.8), (17.78, 0.1125, 1), (18.9, 1.6, 0.4), …⟩.

(c) See Fig. 3(c). The uncertain interval [16.5, 18.1] is split into three sub-intervals [16.5, 16.8], [16.8, 17.6], and [17.6, 18.1]. At this time n = 10, we have 2ε·PC(S^u) = 2.0 and max(2ε·PC(S^u) − w_n, 0) = 1.0. The first two sub-intervals are handled as in example (b). From the last interval [17.6, 18.1], uGK selects 18.1 and 17.6, and creates a new tuple (18.1, 0.3125, 1.0). In conclusion, uGK maintains the summary T(10) = ⟨…, (16.0, 1.2, 0.0), (16.8, 0.8875, 0.0), (17.28, 0.2, 1), (17.6, 1.2, 0.8), (18.1, 0.3125, 1.0)⟩.

(d) See Fig. 3(d). The element is handled as in example (b), which results in T(10) = ⟨(16.8, 0.8875, 0.0), (17.28, 0.2, 1), (17.6, 1.2, 0.8), (17.78, 0.1125, 1), (18.9, 1.6, 0.4), …⟩.

(e) See Fig. 3(e). The uncertain interval of e^u_10 is trivially cut into one sub-interval [17.8, 18.7], for which uGK creates a tuple (18.7, 1.0, 1.0) and maintains T(10) = ⟨(17.6, 0.9, 0.8), (18.7, 1.0, 1.0), …⟩.

(f) See Fig. 3(f). The interval of e^u_10 is trivially cut into only one sub-interval [15.5, 16.6], for which uGK updates T_1 by g_1 = 0.7 + 1.0 = 1.7 and creates no tuples.

Fig. 3. Examples of an arriving element e^u_n, the existing tuples maintained in T(n − 1) by uGK, and their relative positions on the x-axis: (a) e^u_1 arrives and T(0) is empty; (b) e^u_10 arrives with an uncertain interval containing T_{m−2} and T_{m−1}; (c) e^u_10 arrives with an uncertain interval containing the last two tuples T_{s−1} and T_s; (d) e^u_10 arrives with an uncertain interval containing the first two tuples T_1 and T_2; (e) e^u_10 arrives with an uncertain interval lying entirely to the right of the last tuple T_s; (f) e^u_10 arrives with an uncertain interval lying entirely to the left of the first tuple T_1. These examples are independent of one another.
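The following sketch of ours implements Steps 1–6 for one sub-interval, assuming uniform pdfs and the Tuple class from the Section 4.1 sketch. For a uniform element, the integral w_n ∫_{v_l}^{v_h} f_n(x) dx reduces to w_n·(v_h − v_l)/(b_n − a_n), so the "smallest v_l" of Steps 2 and 4 has a closed form.

```java
import java.util.List;

// Sketch of the inserting operation (Section 4.2, Steps 1-6) for one
// sub-interval [aj, bj] of a uniform element e^u_n = ([an, bn], w_n).
// Assumes the Tuple class of the Section 4.1 sketch; i is the index of the
// left neighbor T_i (-1 if none), so T_{i+1} sits at index i + 1.
public class UgkInsert {

    static void handleSubInterval(List<UgkSummary.Tuple> t, int i,
                                  double aj, double bj,
                                  double an, double bn, double wn,
                                  double twoEpsPc) {
        double density = wn / (bn - an);              // mass per unit length
        double deltaNew = Math.max(twoEpsPc - 1, 0);  // initial Delta-value
        double vh = bj;                               // Step 1: select b_nj

        // Step 2: absorb as much right-end mass as T_{i+1} can still take.
        if (i + 1 < t.size()) {
            UgkSummary.Tuple right = t.get(i + 1);
            double budget = Math.max(twoEpsPc - right.g - right.delta, 0);
            double vl = Math.max(aj, bj - budget / density);
            right.g += (bj - vl) * density;           // update g_{i+1}
            vh = vl;                                  // continue left of v_l
        }

        // Steps 3-6: sweep right to left; each new tuple covers at most
        // 2*eps*PC(S^u) of probabilistic mass.
        while (vh > aj) {
            double vl = Math.max(aj, vh - twoEpsPc / density);   // Step 4
            UgkSummary.Tuple th =                                // Step 5
                new UgkSummary.Tuple(vh, (vh - vl) * density, deltaNew);
            t.add(i + 1, th);                                    // Step 6
            vh = vl;
        }
    }
}
```

Running this on Example 5(a) (density = 1/1.8, twoEpsPc = 0.2) produces exactly the five tuples at 12.5, 12.14, 11.78, 11.42, and 11.06, each with g = 0.2 and Δ = 0.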

4.3. Merging operation

The merging operation combines some tuples in the current summary, thereby reducing the space complexity. Before discussing the details, we introduce two notions.

Definition 6 (Tuple capacity). At any time n, the capacity of a tuple T_i is defined as

cap(T_i, n) = 2ε·PC(S^u) − Δ_i,  (1)

where PC(S^u) = Σ_{i=1}^{n} w_i.

We will find it useful to keep the Δ-value of a tuple unchanged in both the inserting operation and the merging operation. Thus, a tuple created earlier has a greater capacity, and as more elements arrive, its capacity grows larger.

Definition 7 (Band of a tuple). According to their capacities, we divide tuples into bands. Let p = 2ε·PC(S^u); if tuple T_i satisfies

2^{α−1} + p mod 2^{α−1} < cap(T_i, n) ≤ 2^α + p mod 2^α,  (2)

then we refer to α as the band of T_i.


Algorithm 1. Pseudo-code for the merging operation in the uGK algorithm.
Input: T(n), a sequence of tuples before merging.
Output: T(n), a sequence of tuples after merging.
1: for i from |T(n)| − 2 to 0 do
2:   if band(T_i) ≤ band(T_{i+1}) and g*_i + g_{i+1} + Δ_{i+1} ≤ 2ε·PC(S^u(n)) then
3:     update T_{i+1} by T_{i+1} = (v_{i+1}, g*_i + g_{i+1}, Δ_{i+1})
4:     delete T_i and the merged sequence
5: return T(n)

Algorithm 2. Pseudo-code of the uGK algorithm.
Input: S^u, an uncertain data stream.
Output: T(n), an ordered sequence of tuples.
1: for each e^u_n ∈ S^u do
2:   insert e^u_n into T(n) following Section 4.2
3:   merge T(n) following Algorithm 1
4: return T(n)
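As a concrete reading of Algorithm 1, the sketch below (ours; it assumes the Tuple class from the Section 4.1 sketch, and reads p of Definition 7 as the integer part of 2ε·PC) walks the summary from right to left and folds each mergeable tuple into its right neighbor. Folding one tuple at a time is equivalent to merging a whole lower-band run, because the absorbed g-mass travels with the neighbor and feeds the g* term of the next comparison.

```java
import java.util.List;

// Sketch of Algorithm 1 (merging pass); twoEpsPc = 2 * eps * PC(S^u(n)).
public class UgkMerge {

    // Band per Definition 7, taking p = floor(2 * eps * PC) (our assumption).
    static int band(double cap, long p) {
        for (int a = 1; a <= 62; a++) {
            long mLo = 1L << (a - 1), mHi = 1L << a;
            if (cap > mLo + p % mLo && cap <= mHi + p % mHi) return a;
        }
        return 0; // cap <= 1: lowest band
    }

    static int bandOf(UgkSummary.Tuple t, double twoEpsPc) {
        double cap = twoEpsPc - t.delta;       // Eq. (1)
        return band(cap, (long) twoEpsPc);
    }

    static void merge(List<UgkSummary.Tuple> tuples, double twoEpsPc) {
        // Right-to-left pass. After a removal, index i holds the survivor, so
        // the accumulated g-mass (g*) is included in the next comparison.
        for (int i = tuples.size() - 2; i >= 0; i--) {
            UgkSummary.Tuple left = tuples.get(i), right = tuples.get(i + 1);
            if (bandOf(left, twoEpsPc) <= bandOf(right, twoEpsPc)
                    && left.g + right.g + right.delta <= twoEpsPc) {
                right.g += left.g;   // survivors keep correct PC bounds
                tuples.remove(i);
            }
        }
    }
}
```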

It is not hard to see that a tuple with larger capacity has a higher band. In the following sections, we write band(T_i, n) = α to say that tuple T_i is in band α at time n, and denote by band_α the set of tuples with band α.

To combine tuples, the main idea is to merge tuples with smaller capacity into ones with larger capacity. Specifically, uGK considers merging the longest sequence of tuples ⟨T_{i−m}, …, T_{i−1}⟩ and T_i (by removing them) into T_{i+1}, such that the band of each tuple in that sequence is less than that of T_i. Let g*_i be the sum of the g-values of tuple T_i and the tuples in the sequence. According to Definition 5, to preserve the correct relations between g-values and Δ-values, tuple T_{i+1} should be updated to (v_{i+1}, g*_i + g_{i+1}, Δ_{i+1}). According to Proposition 1, at any time n, we say two adjacent tuples T_i and T_{i+1} are mergeable if g*_i + g_{i+1} + Δ_{i+1} ≤ 2ε·PC(S^u(n)) and band(T_i, n) ≤ band(T_{i+1}, n).

Algorithm 1 shows the pseudo-code of the merging operation in the uGK algorithm. The algorithm visits each tuple in T(n) from right to left (larger index to smaller, Step 1); once it finds two mergeable adjacent tuples T_i and T_{i+1} (Step 2), it updates the g-value of T_{i+1} and removes the corresponding tuples. In this way, uGK preserves the tuples with larger capacity in T(n). It is worth noticing that, as more elements arrive, 2ε·PC(S^u) increases; two adjacent tuples that are unmergeable at time n may become mergeable at a time m > n. Finally, Algorithm 2 shows the uGK algorithm at a high level.

4.4. Quantile query

At any time n, to pose an ε-approximate φ-quantile query over the stream, we first calculate r = φ·PC(S^u), then search T(n) for a tuple T_i such that r − PC_min(v_i) ≤ ε·PC(S^u) and PC_max(v_i) − r ≤ ε·PC(S^u), and return the v-value of T_i.

5. Complexity analysis

In this section, we analyze the space and time complexity of our proposed uGK algorithm.

5.1. Space complexity

Recall that for each element, uGK may create many tuples and insert them into the summary T(n). In this subsection, however, we argue that at any time n, the number of tuples in T(n) is a function of ε and PC(S^u), namely O(log₂(ε·PC(S^u))/ε), independent of the range of the uncertain intervals and of the number of tuples induced by a new element. The main idea follows [14]: a tree structure is imposed over T(n), by which the number of band-α tuples preserved in T(n) is estimated; the total is obtained by summing over all bands. For completeness, we discuss the details here.

Definition 8 (Tuple tree). Tuples in T(n) can be organized as a tree Tr, which contains a special root node R and some general nodes, each of which, denoted V_i, corresponds to a tuple T_i. The parent of node V_i is the node V_j with the least index j > i such that band(T_j, n) > band(T_i, n); if no such tuple exists for V_i, then R is the parent of V_i.

It is not hard to see that the descendants of a node V_i in the tuple tree correspond exactly to the longest sequence of tuples before T_i whose bands are all smaller than that of T_i. Therefore, the merging operation on T(n) is

essentially to merge a node V_i and its descendants (by deleting them) into V_{i+1}, where g*_i is the sum of the g-values of these nodes. Correspondingly, two nodes V_i and V_{i+1} are mergeable only if the two adjacent tuples T_i and T_{i+1} are mergeable.

For ease of discussion, in this section we use PC(S^u(n)) to denote the probabilistic cardinality of an uncertain data stream S^u at time n, Tr(n) to denote the tuple tree over T(n) at time n, g_i(n) to denote the g-value of tuple T_i at time n, and band_α(n) to denote the set of tuples in band α at time n.

Lemma 1. At any point of time, a tuple T_i from band α can merge only tuples from bands β ≤ α.

Proof. Suppose at some time n a tuple T_i from band α merges a tuple T_j from band β > α. Because the merge operation never merges T_j into T_i if band(T_j, n) > band(T_i, n), this could only occur if at some earlier time m < n we had band(T_j, m) ≤ band(T_i, m) while at time n we have band(T_j, n) > band(T_i, n). However, this cannot happen: if band(T_j, m) ≤ band(T_i, m) at time m, then Δ_j ≥ Δ_i. Since both Δ_i and Δ_j remain unchanged, at any future time we still have band(T_j) ≤ band(T_i) according to Definitions 6 and 7. □

Lemma 2. At any time n, for any α ∈ ℤ, the sum of the g-values of all tuples in bands [0, …, α] is at most 2^α/ε.

Proof. Let T_m and T_k be the tuples in band_β(n) with the largest and smallest capacity, respectively, where m and k denote the times at which they were created. According to Eqs. (1) and (2), within band_β(n), T_m was created the earliest and T_k the latest. At time n, their Δ-values can be estimated as:

Δ_m = 2ε·PC(S^u(n)) − 2^β − p mod 2^β,  Δ_k = 2ε·PC(S^u(n)) − 2^{β−1} − p mod 2^{β−1}.

According to Step 5 in Section 4.2, at the times when T_m and T_k were created, their initial Δ-values were:

Δ_m = 2ε·PC(S^u(m)) − 1,  Δ_k = 2ε·PC(S^u(k)) − 1.

Therefore, PC(S^u(m)) and PC(S^u(k)) are given by:

PC(S^u(m)) = (2ε·PC(S^u(n)) − 2^β − p mod 2^β + 1)/2ε,
PC(S^u(k)) = (2ε·PC(S^u(n)) − 2^{β−1} − p mod 2^{β−1} + 1)/2ε.

Note that the tuples in band_β(n) were all created between times m and k. If they did not merge any other tuples, then the sum of their g-values, i.e., the sum of the fractional weights (see Step 5 in Section 4.2) of the uncertain elements observed during that period, is bounded above by

PC(S^u(m)) − PC(S^u(k)) = (2^{β−1} − (p mod 2^β − p mod 2^{β−1}))/2ε ≤ 2^β/2ε.

Lemma 1 shows that a tuple from band_β can merge only tuples from bands no greater than β. Hence, the sum of the g-values of tuples from bands [0, …, α] is at most Σ_{β=0}^{α} 2^β/2ε ≤ 2^α/ε. □

Lemma 3. At any time n, the tree Tr(n) has at most 3/(2ε) nodes having a child from band_α; i.e., band_α has at most 3/(2ε) parents.

Proof. Let m_min and m_max denote the times at which uGK created the earliest and latest tuples in band_α(n), respectively. Similarly to the proof of Lemma 2 above, we have

PC(S^u(m_min)) = (2ε·PC(S^u(n)) − 2^α − (2ε·PC(S^u(n)) mod 2^α))/2ε,
PC(S^u(m_max)) = (2ε·PC(S^u(n)) − 2^{α−1} − (2ε·PC(S^u(n)) mod 2^{α−1}))/2ε.

Let V_i be a parent node having at least one child from band_α(n), V_j the rightmost such child, and m_j the time when V_j was created. We first argue that the sum of the fractional weights (of uncertain elements) observed after time m_min and associated with the parent-child pair (V_i, V_j) is bounded below by 2ε·(PC(S^u(n)) − PC(S^u(m_max))). Note that the set of nodes lying between V_j and V_i must contain node V_{i−1} and its descendants; since these are all from a band equal to or less than α, we have g*_j(n) + Σ_{k=j+1}^{i−1} g_k(n) ≥ g*_{i−1}(n). Besides, because the failure to merge V_j into V_i is due to g*_{i−1}(n) + g_i(n) + Δ_i ≥ 2ε·PC(S^u(n)), we have

g*_j(n) + Σ_{k=j+1}^{i−1} g_k(n) + g_i(n) + Δ_i ≥ 2ε·PC(S^u(n)).


Moreover, we have g_i(m_j) + Δ_i < 2ε·PC(S^u(m_j)) ≤ 2ε·PC(S^u(m_max)) for any time m_j ≤ m_max. Accordingly, after time m_min, the sum of the fractional weights that must be mapped to the pair (V_i, V_j) is bounded below by

g*_j(n) + Σ_{k=j+1}^{i−1} g_k(n) + (g_i(n) − g_i(m_j)) ≥ 2ε·(PC(S^u(n)) − PC(S^u(m_max))).

Both the inserting operation and the merging operation in our proposed uGK algorithm ensure that the fractional intervals [v_l, v_h) (actually their fractional weights; see Step 4 in Section 4.2) associated with any other parent-child pair differ from those associated with (V_i, V_j). After time m_min, the probabilistic cardinality of the stream increased by PC(S^u(n)) − PC(S^u(m_min)). Thus, the number of parent-child pairs is at most

(PC(S^u(n)) − PC(S^u(m_min))) / (2ε·(PC(S^u(n)) − PC(S^u(m_max))))
  = (2^α + (2ε·PC(S^u(n)) mod 2^α)) / (2ε·(2^{α−1} + (2ε·PC(S^u(n)) mod 2^{α−1})))
  ≤ (2^α + 2^{α−1} + (2ε·PC(S^u(n)) mod 2^{α−1})) / (2ε·(2^{α−1} + (2ε·PC(S^u(n)) mod 2^{α−1})))
  < 3·(2^{α−1} + (2ε·PC(S^u(n)) mod 2^{α−1})) / (2ε·(2^{α−1} + (2ε·PC(S^u(n)) mod 2^{α−1})))
  = 3/(2ε).

In other words, band_α has at most 3/(2ε) parents in total. □

Definition 9 (Full tuple pair). Two adjacent tuples T_i and T_{i+1} are referred to as a full tuple pair if g*_i + g_{i+1} + Δ_{i+1} ≥ 2ε·PC(S^u(n)) and band(T_i, n) ≤ band(T_{i+1}, n). We call T_i and T_{i+1} the left participant and the right participant of this full pair, respectively.

Lemma 4. At any time n, for any band α, band_α(n) has at most 4/ε tuples that are the right participant of a full tuple pair.

Proof. Let X be the subset of tuples in band_α(n), each of which is the right participant of a full tuple pair, and let p = |X|. Suppose the tuples in X form a single continuous sequence T_i, T_{i+1}, …, T_{i+p−1}. Because such tuples still exist in T(n), we must have

g*_{j−1} + g_j + Δ_j > 2ε·PC(S^u(n)),  for i ≤ j < i + p.

Summing this inequality over all such j, we have

Σ_{j=i}^{i+p−1} g*_{j−1} + Σ_{j=i}^{i+p−1} g_j + Σ_{j=i}^{i+p−1} Δ_j > 2pε·PC(S^u(n)).

Then, since g_j ≤ g*_j, we can conclude

2·Σ_{j=i−1}^{i+p−1} g*_j + Σ_{j=i}^{i+p−1} Δ_j > 2pε·PC(S^u(n)).  (3)

According to Lemma 2, Σ_{j=i−1}^{i+p−1} g*_j is at most 2^α/ε; and according to Eq. (1) and inequality (2), we have

2^{α−1} + 2ε·PC(S^u(n)) mod 2^{α−1} < 2ε·PC(S^u(n)) − Δ_j ⇒ Δ_j < 2ε·PC(S^u(n)) − 2^{α−1}.

Substituting these bounds into inequality (3), we have

2^{α+1}/ε + p·(2ε·PC(S^u(n)) − 2^{α−1}) > 2pε·PC(S^u(n)).

Solving this inequality, we obtain p < 4/ε. The discussion above also applies to the case in which the tuples in X form several continuous sequences, by taking the summation above over all such sequences. □

Lemma 5. At any time n, for a given α, at most 11/(2ε) tuples with band α still exist in T(n).

Proof. Lemma 4 shows that in band_α(n) there are at most 4/ε nodes that are the right participant of a full pair; every other node in band_α(n) either belongs to no full pair or participates only as a left participant. Following the idea in [14], it can be verified that a parent of a node from band_α(n) has at most one such node. Lemma 3 shows that there are at most 3/(2ε) parents of band_α(n) nodes. Therefore, the number of tuples in band_α(n) is at most 4/ε + 3/(2ε) = 11/(2ε). □


Theorem 1. At any time n, the number of tuples in T(n) is O((1/ε)·log₂(ε·PC(S^u(n)))).

Proof. At any time n, there are at most 1 + log₂(2ε·PC(S^u(n))) bands. Lemma 5 shows that each band contributes at most 11/(2ε) tuples to T(n). The claim follows. □

5.2. Time complexity

To summarize an uncertain element e^u_n, uGK partitions its uncertain interval into a sequence of sub-intervals using the tuples in summary T(n − 1). Note that after some point in time, we must have 2ε·PC(S^u(n)) ≥ 1 ≥ w_n. To handle each sub-interval [a_nj, b_nj), at Step 4 (in Section 4.2) uGK can then safely select the smallest v_l = a_nj, because w_n ∫_{a_nj}^{v_h} f_n(x) dx ≤ w_n ≤ 2ε·PC(S^u(n)). Therefore, Steps 3–6 run only once, and uGK copes with each sub-interval in constant time. In the worst case, there are O((1/ε)·log(ε·PC(S^u(n)))) sub-intervals. Thus, the time complexity of uGK for summarizing one uncertain element is also O((1/ε)·log(ε·PC(S^u(n)))).

6. Experimental study

In this section we present results from an extensive empirical study over a series of datasets to illustrate the effectiveness of the proposed uGK algorithm. We compare uGK with two naive methods based on the GK algorithm [14], namely SPL-GK and AVG-GK, and with our previous Gaussian approximation (GA) method [32], in terms of the number of tuples and quantile query errors.

• SPL-GK. This method samples one point from each item of the input uncertain data stream and computes quantile summaries over this deterministic sequence of points using the GK algorithm.
• AVG-GK. This method computes the mean of each uncertain element and computes the summaries over this deterministic input using the GK algorithm.
• GA. This approach incrementally computes the mean μ and variance σ of an uncertain data stream to fit a Gaussian distribution, which is used to estimate the distribution of the stream data.

6.1. Experiment setup

All of the algorithms are implemented in Java, and all experiments are conducted on a PC with a 3.4 GHz CPU, Windows 7, and 8 GB of memory.

Datasets. Due to the unavailability of public uncertain datasets, we conduct our evaluation on both synthetic and real-life datasets, converting them into uncertain ones. We use the datasets described in [14,31], including the Hard Dataset, the Sorted Dataset, and the Random Dataset. The first two are synthetic, and the last comes from a real application.

• Hard Dataset. In this dataset, each data value is generated adversarially to the GK algorithm. Specifically, each new value is randomly selected from the interval (v_{i−1}, v_i) such that T_i is the fullest tuple (i.e., the one with the greatest sum of g-value and Δ-value) in the current T(n). With this adversarial method, each value is hard to summarize into T(n).
• Sorted Dataset. This dataset is a data sequence in ascending order.
• Random Dataset. We use the real-life Forest CoverType dataset [31] as the random dataset. The dataset records the actual forest cover type for given observations, determined from US Forest Service Region 2 Resource Information System data. There are 581,012 samples in this dataset; each has 54 attributes, including 10 remotely sensed (numerical) attributes and 44 cartographic (categorical) attributes.

Simulating uncertainty. We simulate uncertainty in the numerical values following the approaches described in [31,48]. Concretely, for each element with original value x_it of attribute X_i, a point value x'_it is randomly picked from the interval (x_it − ω|X_i|, x_it + ω|X_i|), where |X_i| denotes the width of the range of X_i and ω is a user-specified parameter indicating the uncertainty level. To determine the uncertain interval, we let a_it = x'_it − ω|X_i| and b_it = x'_it + ω|X_i|. In the experiments, we simulate two kinds of uncertainty: Gaussian uncertainty with mean (a_it + b_it)/2 and deviation (b_it − a_it)/8, and uniform uncertainty with pdf f_it(x) = (b_it − a_it)^{−1}. We only report the results for Gaussian uncertainty in this paper, since those for uniform uncertainty are very similar.
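The sketch below is our reading of this simulation procedure (the class and method names are hypothetical): perturb each original value, then build an interval of width 2ω|X_i| around the perturbed point.

```java
import java.util.Random;

// Sketch of the uncertainty-simulation procedure of Section 6.1 (our reading
// of [31,48]): perturb each original value, then build an interval of width
// 2 * omega * |X_i| around the perturbed point.
public class SimulateUncertainty {
    static final Random RNG = new Random(42);

    // Returns {a_it, b_it} for an original value x of an attribute whose
    // range has width rangeWidth, at uncertainty level omega.
    static double[] uncertainInterval(double x, double rangeWidth, double omega) {
        double halfWidth = omega * rangeWidth;
        // Perturbed point value x' drawn uniformly from (x - w, x + w).
        double xPrime = x + (2 * RNG.nextDouble() - 1) * halfWidth;
        return new double[] { xPrime - halfWidth, xPrime + halfWidth };
    }

    // Gaussian variant: mean (a + b) / 2 and deviation (b - a) / 8, so that
    // four standard deviations cover each half of the interval [a, b].
    static double gaussianSample(double a, double b) {
        return (a + b) / 2 + RNG.nextGaussian() * (b - a) / 8;
    }
}
```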

6.2. Experimental results on the hard dataset

To generate a hard dataset, we first randomly choose 1000 values from [0, 10000] and input them to the GK algorithm; we then obtain each subsequent data value with the adversarial method above. Three datasets of size n = 10^5, 10^6, 10^7 are generated. We set ω = 0.1, 0.3, 0.5, w_n = 1, ε = 0.1, 0.05, 0.01, 0.005, 0.001, and run SPL-GK and uGK on these datasets.
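The following sketch is our reading of the adversarial generator (names are ours; it assumes the Tuple class from the Section 4.1 sketch and a summary with at least two tuples): each new value lands inside the gap in front of the currently fullest tuple, i.e., the tuple maximizing g_i + Δ_i.

```java
import java.util.List;
import java.util.Random;

// Sketch of the adversarial "Hard Dataset" generator of Section 6.2.
public class HardDatasetGenerator {
    static final Random RNG = new Random(7);

    // Assumes t holds at least two tuples, sorted by value.
    static double nextHardValue(List<UgkSummary.Tuple> t) {
        int full = 1;
        double best = -1;
        for (int i = 1; i < t.size(); i++) {
            double fullness = t.get(i).g + t.get(i).delta; // how "full" T_i is
            if (fullness > best) { best = fullness; full = i; }
        }
        double lo = t.get(full - 1).v, hi = t.get(full).v;
        return lo + RNG.nextDouble() * (hi - lo);  // uniform in (v_{i-1}, v_i)
    }
}
```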


Table 1. Number of tuples preserved by SPL-GK for the hard dataset.

          ω = 0.1                            ω = 0.3                            ω = 0.5
n       ε=0.1  0.05  0.01  0.005  0.001    0.1  0.05  0.01  0.005  0.001     0.1  0.05  0.01  0.005  0.001
10^5     16    26    220    736   1955     15   37    225    504   1984      14   49    203    605   1946
10^6     14    27    288   1061   4935     18   21    212    844   4957      15   29    286    802   4822
10^7     17    31    578   1420   9247     15   32    380   1156   9774      19   26    699   1900   8257

Table 2. Number of tuples preserved by uGK for the hard dataset.

          ω = 0.1                            ω = 0.3                            ω = 0.5
n       ε=0.1  0.05  0.01  0.005  0.001    0.1  0.05  0.01  0.005  0.001     0.1  0.05  0.01  0.005  0.001
10^5      8    14     79    104    513      8   15     77    105    520       8   15     76    108    516
10^6      8    14     54    102    506      8   15     53    102    513       8   15     53    105    510
10^7      8    14     54    103    508      8   15     53    102    512       8   15     53    105    510

Tables 1 and 2 show the results of SPL-GK and uGK, respectively. In these two tables, the parameter ω in the header indicates the uncertainty level with which the datasets are simulated; the second header row gives the setting of parameter ε for SPL-GK and uGK; the remaining rows report the number of tuples preserved by SPL-GK or uGK on Hard Datasets whose uncertainty level and size are given in the header row and the first column, respectively. In Table 1, for example, the entry 1955 indicates that SPL-GK with ε = 0.001 summarizes a Hard Dataset (ω = 0.1) of size n = 10^5 into 1955 tuples, while the corresponding entry 513 in Table 2 indicates that uGK with the same setting summarizes the same dataset into 513 tuples.

We can observe from the two tables that, with the same settings, uGK maintains substantially fewer tuples than SPL-GK, especially when ε is small (note that a summary with smaller ε provides more precise query results). Besides, the number is far smaller than the bound indicated by the theoretical analysis, O((11/2ε)·log₂(2ε·PC(S^u(n)))), and close to the lower bound 1/(2ε). For example, with settings n = 10^6, ε = 0.001, and ω = 0.3, uGK maintains only 513 tuples, while SPL-GK maintains 4957, and the theoretical value is 60312. The reason behind this is twofold: on one hand, with uncertainty, the interval of an uncertain element covers many tuples, which enables uGK to merge the tuples created for a new element easily; on the other hand, adversarially arriving elements make merging on T(n) difficult for SPL-GK. We can also observe that, as the size of the input dataset increases, the number of tuples maintained by uGK remains almost unchanged, while that of SPL-GK grows fast, and that uGK performs very stably as the uncertainty level grows. These results demonstrate that uGK can summarize a dataset using very little space and is robust against uncertainty.

6.3. Experimental results on the sorted dataset

In this group of experiments, we synthesize three sorted datasets of size n = 10^5, 10^6, 10^7, respectively. We set ω = 0.1, 0.3, 0.5, w_n = 1, ε = 0.001 and run the SPL-GK and uGK algorithms on these datasets. On the summary T(n) obtained in each run, we pose φ = q_j/16 quantile queries, where q_j = 1, 2, …, 15. For the returned value v_j corresponding to the φ = q_j/16 query, we compute its real rank r_j on the input dataset, and compute the error as |(q_j/16)·n − r_j|/n (see the sketch at the end of this subsection).

Tables 3 and 4 show the results of SPL-GK and uGK, respectively. In the tables, the parameter ω in the header indicates the same as in the section above; the second header row lists the sizes of the input datasets; the |T(n)| row presents the number of tuples maintained by SPL-GK or uGK. The remaining rows report the query errors divided by 10⁻⁴. For example, in the rows for q_i = 8, the values 36.372 (Table 3) and 3.378 (Table 4) are the errors for the 8/16-quantile query on the sorted dataset (n = 10^6 and ω = 0.3) incurred by SPL-GK and uGK, respectively.

From Tables 3 and 4, it can be observed that the number of tuples stored by uGK is smaller than that stored by SPL-GK, and far smaller than the bound indicated by the theoretical analysis, (11/2ε)·log₂(2ε·PC(S^u(n))). Moreover, the number grows much more slowly than that of SPL-GK as the input size increases by orders of magnitude, and drops quickly as the uncertainty level increases. It can also be observed that all query errors incurred by uGK are less than the given error ε = 0.001 (note that the reported errors in the tables are the real errors divided by 10⁻⁴) and are independent of the uncertainty level of the input datasets. By contrast, most of the query errors incurred by SPL-GK are larger than the given error, and when the uncertainty level of the datasets is high, the errors are much worse. These results demonstrate the effectiveness of uGK in computing quantile summaries over large uncertain datasets, and again reveal the robustness of uGK against uncertainty.
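For reference, the sketch below shows the error computation used in this and the following subsection. It is our reading of the metric (the real rank of a returned value is measured against the original values); names are hypothetical.

```java
// Sketch of the error metric of Section 6.3: for a returned value v of the
// phi = qj/16 query, the real rank r_j counts the dataset values not larger
// than v, and the error is |(qj/16) * n - r_j| / n.
public class QueryError {
    static double error(double[] sortedData, double v, int qj) {
        int rank = 0;
        while (rank < sortedData.length && sortedData[rank] <= v) rank++;
        double n = sortedData.length;
        return Math.abs((qj / 16.0) * n - rank) / n;
    }
}
```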

6.4. Experimental results on the random dataset

In this group of experiments, we run the algorithms SPL-GK, AVG-GK, GA, and uGK on the values of each attribute of the Forest CoverType dataset. We set w_n = 1, ε = 0.001, ω = 0.1, 0.2, 0.3, 0.4, 0.5, and in particular for GA we set M = 1/ε, the number of sub-intervals, which somewhat determines the precision of GA [31]. We sample the dataset for 20 rounds without replacement and input it to the algorithms; in each round, we pose a series of quantile queries and calculate the errors as in the sections above.

Table 3. Number of tuples and query error (×10⁻⁴) of SPL-GK on sorted datasets.

              ω = 0.1                      ω = 0.3                      ω = 0.5
 q_i      10^5     10^6     10^7       10^5     10^6     10^7       10^5     10^6     10^7
|T(n)|    3392     4514     7385       2790     3714     6936       2519     3459     6148
 1        1.371    5.116    0.963      1.598    10.772   6.8        26.385   25.354   21.464
 2        4.767    1.191    2.262      17.65    8.787    2.308      65.875   54.074   47.467
 3        1.13     0.341    6.231      22.075   15.793   17.976     80.775   62.815   67.369
 4        9.365    3.504    0.658      17.875   18.034   16.553     80.08    75.25    78.521
 5        4.81     3.6      4.993      17.987   12.89    24.602     94.777   85.565   85.441
 6        7.237    4.316    0.306      23.315   31.142   25.213     94.834   84.619   87.337
 7        6.397    0.37     2.415      27       31.848   33.353     83.324   84.129   85.144
 8        4.456    2.545    2.827      26.865   36.372   34.787     71.838   68.812   65.019
 9        15.835   1.42     2.206      29.553   32.828   36.934     54.786   47.702   40.277
 10       5.455    0.79     3.403      40.773   36.102   35.913     35.578   26.531   17.542
 11       4.387    1.264    0.131      21.387   23.237   24.198     15.378   2.942    4.798
 12       4.593    0.577    1.989      18.487   5.237    2.313      22.052   14.775   17.661
 13       1.238    1.415    1.206      21.934   18.677   13.078     23.378   36.923   39.095
 14       0.016    1.361    3.367      23.699   32.993   32.003     43.189   52.33    52.139
 15       17.42    14.767   16.023     36.758   42.177   46.374     42.702   56.272   61.883

Table 4. Number of tuples and query error (×10⁻⁴) of uGK on sorted datasets.

              ω = 0.1                      ω = 0.3                      ω = 0.5
 q_i      10^5     10^6     10^7       10^5     10^6     10^7       10^5     10^6     10^7
|T(n)|    2962     3165     3443       2381     2565     2917       2086     2309     2483
 1        8.062    7.808    7.431      6.321    0.187    3.63       0.207    6.687    5.203
 2        1.357    8.042    5.492      1.47     2.416    6.371      6.032    8.108    7.599
 3        7.947    2.377    2.172      3.35     6.96     5.153      5.044    2.583    1.68
 4        4.422    1.682    4.665      8.182    2.262    2.887      2.309    5.518    0.661
 5        0.49     0.524    2.316      4.3      7.903    9.689      6.17     9.244    6.948
 6        9.153    6.015    5.107      9.627    4.34     2.707      8.261    0.938    6.431
 7        4.534    2.153    4.359      0.89     8.969    1.566      0.112    3.834    7.428
 8        6.121    8.621    4.921      7.969    3.378    3.466      0.268    6.859    5.523
 9        4.998    3.265    4.249      6.081    5.833    5.033      7.616    6.904    3.646
 10       6.24     5.739    4.534      8.206    6.22     6.562      8.066    4.94     8.251
 11       9.045    6.551    7.324      8.31     7.7      4.5        3.761    6.438    7.113
 12       8.892    7.71     7.706      5.584    6.328    6.658      2.128    4.018    6.141
 13       7.154    7.164    8.138      7.356    8.253    6.353      5.224    6.41     5.212
 14       9.171    9.218    7.491      7.296    6.181    5.715      7.31     4.691    4.115
 15       8.435    9.756    9.102      8.576    6.985    7.654      6.929    7.591    4.535

Tables 5–9 list the results on the values of the first attribute (i.e., Elevation). In these tables, the |T(n)| row shows the number of tuples stored by each algorithm, while the following rows (q_i = 1 to 15) report query errors divided by 10⁻⁴, each in the form of its range and its mean ± deviation over the 20 rounds of experiments.

It can be observed from Tables 5 to 9 that the number of tuples kept by uGK is always smaller than that kept by SPL-GK and AVG-GK under various ω, and that uGK performs stably as ω grows (across the tables). Besides, all of the query errors caused by uGK in the tables are bounded above by the given error ε = 0.001. By contrast, most of the query errors incurred by SPL-GK, AVG-GK, and GA are substantially worse than those of uGK, and even far exceed the given error ε = 0.001; moreover, with larger ω, SPL-GK and AVG-GK cause more serious errors. We believe the reason is that, by handling uncertainty with sampling or averaging techniques, SPL-GK and AVG-GK lose a great deal of the uncertainty information. Using the same settings, we also computed the quantile summaries over the other attributes of Forest CoverType and observed similar results. This group of experiments again shows the effectiveness of uGK in computing quantile summaries over large uncertain datasets and its robustness against uncertainty.


Table 5
Number of tuples and query error on random dataset (ω = 0.1). Each row gives |T(n)| followed by the query errors (×10^-4) for q1–q15; every entry is the [min–max] range and the mean ± variance over the 20 rounds. The GA row lists the q1–q15 errors only.

SPL–GK: |T(n)|: [3512–4086] 3804.1 ± 178.7; q1–q15: [18.500–30.694] 24.619 ± 3.270 [43.693–56.605] 51.550 ± 4.426 [60.721–76.689] 69.055 ± 4.849 [63.727–79.267] 69.867 ± 5.117 [59.588–77.387] 67.809 ± 5.921 [51.023–74.537] 61.789 ± 6.597 [34.690–62.486] 48.078 ± 7.984 [21.604–40.074] 30.783 ± 6.274 [0.594–25.766] 9.029 ± 7.999 [2.672–23.399] 10.616 ± 6.643 [16.249–43.790] 30.335 ± 7.669 [34.258–59.342] 48.759 ± 7.564 [56.446–69.027] 63.896 ± 4.179 [65.012–78.268] 72.807 ± 3.758 [64.124–78.289] 71.640 ± 4.995

AVG–GK: |T(n)|: [1152–1229] 1193.3 ± 23.1; q1–q15: [7.329–28.061] 16.641 ± 7.921 [30.102–49.674] 36.177 ± 5.755 [60.564–85.994] 72.436 ± 7.582 [75.981–102.294] 92.605 ± 7.524 [130.524–151.974] 141.495 ± 8.394 [152.869–175.111] 161.560 ± 8.847 [91.578–97.630] 94.761 ± 2.440 [14.511–17.676] 16.168 ± 1.276 [23.527–24.349] 23.927 ± 0.331 [3.777–26.565] 12.155 ± 5.604 [9.968–11.211] 10.568 ± 0.501 [3.382–10.597] 7.587 ± 2.586 [64.222–82.470] 68.795 ± 5.374 [158.226–168.105] 163.424 ± 3.982 [178.776–204.579] 193.080 ± 8.647

GA: q1–q15: [102.878–113.796] 108.573 ± 3.952 [10.400–10.967] 10.667 ± 0.229 [159.475–160.188] 159.813 ± 0.288 [317.143–319.281] 318.160 ± 0.862 [447.452–451.802] 449.518 ± 1.753 [533.888–540.514] 537.024 ± 2.671 [563.355–570.706] 566.833 ± 2.963 [536.785–542.885] 539.677 ± 2.459 [468.874–473.160] 470.915 ± 1.727 [376.163–379.864] 377.926 ± 1.492 [263.080–267.497] 265.171 ± 1.781 [119.064–123.281] 121.051 ± 1.700 [65.120–88.326] 76.726 ± 7.480 [406.680–438.921] 422.093 ± 10.821 [893.356–948.875] 921.586 ± 23.858

uGK: |T(n)|: [705–746] 722.9 ± 13.3; q1–q15: [1.197–9.977] 5.558 ± 3.180 [2.275–9.286] 6.167 ± 2.378 [0.374–8.327] 3.933 ± 2.652 [0.860–8.582] 2.552 ± 2.317 [0.622–8.032] 4.703 ± 2.361 [0.036–9.933] 5.499 ± 3.468 [0.180–8.988] 4.312 ± 3.130 [0.252–8.186] 3.904 ± 2.823 [0.719–9.123] 4.353 ± 2.597 [0.933–8.260] 5.352 ± 2.001 [0.039–9.566] 5.423 ± 3.230 [1.993–7.880] 5.033 ± 1.662 [0.835–9.739] 4.862 ± 3.233 [0.432–9.776] 4.411 ± 3.113 [0.235–9.901] 3.779 ± 3.263

Table 6
Number of tuples and query error on random dataset (ω = 0.2). Each row gives |T(n)| followed by the query errors (×10^-4) for q1–q15; every entry is the [min–max] range and the mean ± variance over the 20 rounds. The GA row lists the q1–q15 errors only.

SPL–GK: |T(n)|: [3582–4125] 3872.7 ± 165.1; q1–q15: [96.864–117.116] 108.497 ± 6.743 [132.241–150.098] 143.276 ± 5.987 [128.782–154.345] 144.3 ± 9.417 [107.275–128.674] 118.518 ± 6.265 [81.505–107.511] 91.53 ± 8.81 [60.714–76.92] 66.577 ± 5.558 [24.341–47.312] 37.912 ± 7.221 [2.452–22.662] 12.356 ± 6.239 [12.038–20.802] 17.072 ± 2.773 [34.127–55.991] 44.776 ± 6.472 [62.344–73.407] 67.802 ± 3.565 [86.133–99.25] 93.427 ± 4.721 [107.348–119.211] 112.839 ± 4.542 [120.644–137.959] 128.676 ± 5.759 [114.256–133.923] 122.173 ± 5.991

AVG–GK: |T(n)|: [1152–1229] 1193.3 ± 23.1; q1–q15: [82.496–121.821] 100.529 ± 13.346 [187.761–240.605] 209.890 ± 16.311 [285.486–344.738] 314.391 ± 22.666 [339.879–405.831] 379.629 ± 22.222 [404.215–463.122] 434.778 ± 21.856 [390.717–441.622] 415.114 ± 19.421 [273.264–303.498] 289.120 ± 12.183 [132.436–152.386] 142.920 ± 8.041 [25.839–33.655] 29.981 ± 3.152 [16.010–48.546] 32.363 ± 7.949 [74.425–95.244] 85.255 ± 8.386 [151.446–197.333] 175.156 ± 18.280 [294.259–345.344] 318.806 ± 19.490 [456.592–510.218] 484.614 ± 21.605 [527.252–593.230] 564.085 ± 26.380

GA: q1–q15: [121.313–127.601] 124.248 ± 1.752 [3.033–3.113] 3.063 ± 0.033 [117.743–126.192] 121.800 ± 3.404 [217.714–234.537] 225.751 ± 6.778 [290.237–314.151] 301.634 ± 9.636 [332.879–361.646] 346.572 ± 11.592 [345.556–376.489] 360.271 ± 12.465 [329.760–360.160] 344.217 ± 12.251 [288.098–315.507] 301.127 ± 11.046 [224.043–246.289] 234.603 ± 8.966 [141.965–157.06] 149.111 ± 6.084 [88.953–102.328] 95.040 ± 4.921 [111.477–123.569] 117.749 ± 4.448 [318.727–331.969] 325.483 ± 4.288 [836.269–858.321] 846.106 ± 7.720

uGK: |T(n)|: [704–775] 735.8 ± 25.9; q1–q15: [1.853–8.436] 4.832 ± 2.291 [1.778–8.548] 5.012 ± 2.488 [1.012–6.440] 4.018 ± 2.078 [1.549–9.974] 6.896 ± 2.963 [0.195–9.618] 5.029 ± 3.837 [0.190–5.480] 2.941 ± 2.012 [0.482–8.772] 4.328 ± 3.281 [0.470–7.801] 4.437 ± 2.569 [0.591–7.460] 3.622 ± 2.455 [2.300–9.311] 5.748 ± 2.661 [1.999–8.868] 5.093 ± 2.199 [1.404–9.046] 6.051 ± 2.733 [1.009–9.941] 5.273 ± 3.492 [0.703–8.350] 3.630 ± 2.303 [1.121–9.254] 4.874 ± 2.739

Table 7
Number of tuples and query error on random dataset (ω = 0.3). Each row gives |T(n)| followed by the query errors (×10^-4) for q1–q15; every entry is the [min–max] range and the mean ± variance over the 20 rounds. The GA row lists the q1–q15 errors only.

SPL–GK: |T(n)|: [3374–4070] 3842.2 ± 189.8; q1–q15: [171.276–192.269] 183.679 ± 6.183 [166.813–190.297] 180.871 ± 6.238 [141.953–158.367] 150.549 ± 4.728 [103.831–127.805] 119.958 ± 7.466 [75.318–97.151] 88.291 ± 6.819 [47.021–75.114] 56.905 ± 8.008 [18.335–42.245] 28.738 ± 7.494 [0.754–13.042] 4.781 ± 3.778 [16.862–35.322] 27.448 ± 5.704 [42.995–60.661] 51.222 ± 5.811 [68.142–91.087] 78.891 ± 6.686 [93.826–118.215] 104.534 ± 6.664 [120.17–137.407] 128.902 ± 5.584 [138.375–158.95] 148.302 ± 6.347 [141.744–160.392] 150.858 ± 5.405

AVG–GK: |T(n)|: [1152–1229] 1193.3 ± 23.1; q1–q15: [230.110–331.276] 279.241 ± 36.200 [456.405–589.523] 517.743 ± 48.505 [602.864–733.606] 672.352 ± 53.583 [657.808–782.739] 730.620 ± 47.442 [685.387–790.412] 740.192 ± 40.242 [615.799–695.716] 656.291 ± 31.329 [439.621–488.210] 465.245 ± 19.588 [235.446–261.402] 249.260 ± 10.472 [55.732–56.752] 56.401 ± 0.421 [72.703–121.039] 96.959 ± 14.566 [213.689–269.101] 242.452 ± 22.317 [368.194–458.843] 415.238 ± 36.022 [573.055–674.352] 626.755 ± 41.852 [785.131–903.354] 846.740 ± 47.622 [895.523–1029.462] 966.749 ± 55.304

GA: q1–q15: [69.577–85.876] 77.358 ± 6.567 [7.064–9.530] 8.349 ± 0.995 [50.788–70.742] 60.291 ± 8.042 [98.493–132.56] 114.683 ± 13.731 [131.155–175.304] 152.100 ± 17.795 [148.88–198.847] 172.551 ± 20.141 [152.638–204.155] 177.014 ± 20.767 [143.786–192.758] 166.934 ± 19.742 [123.946–166.593] 144.086 ± 17.193 [97.433–128.018] 111.270 ± 12.518 [87.403–100.436] 93.982 ± 5.078 [84.020–96.241] 90.310 ± 4.611 [105.653–122.398] 112.971 ± 6.196 [230.089–297.198] 261.666 ± 27.275 [620.215–677.929] 646.302 ± 22.951

uGK: |T(n)|: [696–753] 719.8 ± 17.5; q1–q15: [0.941–7.993] 3.558 ± 2.239 [1.314–9.369] 4.797 ± 2.545 [0.609–7.894] 4.877 ± 2.002 [3.619–8.153] 5.544 ± 1.508 [0.412–9.392] 4.241 ± 3.402 [0.281–7.837] 3.377 ± 2.549 [2.116–9.568] 5.026 ± 2.638 [0.748–9.164] 5.696 ± 2.684 [0.383–9.206] 3.723 ± 2.823 [1.041–7.766] 4.258 ± 2.518 [1.037–8.668] 4.625 ± 2.746 [0.487–5.666] 2.884 ± 1.765 [0.013–8.802] 3.897 ± 3.098 [0.735–9.041] 4.981 ± 2.581 [0.314–9.919] 6.074 ± 3.523

Table 8
Number of tuples and query error on random dataset (ω = 0.4). Each row gives |T(n)| followed by the query errors (×10^-4) for q1–q15; every entry is the [min–max] range and the mean ± variance over the 20 rounds. The GA row lists the q1–q15 errors only.

SPL–GK: |T(n)|: [3571–4052] 3863.1 ± 140.2; q1–q15: [202.337–217.466] 208.852 ± 4.983 [170.829–194.496] 184.114 ± 6.464 [138.317–162.912] 151.304 ± 7.716 [109.692–136.222] 123.15 ± 7.017 [81.039–109.45] 92.654 ± 8.097 [47.355–76.259] 62.301 ± 8.157 [25.341–47.863] 33.14 ± 7.487 [1.254–18.842] 7.92 ± 4.546 [20.381–35.142] 26.76 ± 4.788 [41.421–66.373] 56.306 ± 6.866 [73.638–92.665] 82.603 ± 6.048 [102.256–117.204] 109.476 ± 4.44 [119.291–144.443] 132.673 ± 6.912 [144.559–164.803] 156.562 ± 6.431 [156.055–177.396] 165.546 ± 6.497

AVG–GK: |T(n)|: [1152–1229] 1193.3 ± 23.1; q1–q15: [456.567–668.211] 562.021 ± 80.760 [776.97–1008.517] 887.181 ± 89.446 [924.911–1130.220] 1034.435 ± 84.268 [945.056–1122.329] 1044.796 ± 68.577 [915.270–1050.962] 986.457 ± 52.757 [783.754–877.325] 832.225 ± 37.153 [551.277–603.384] 579.008 ± 21.022 [290.798–309.717] 301.187 ± 7.661 [32.582–49.235] 41.010 ± 6.707 [150.443–222.333] 186.421 ± 25.060 [357.183–451.21] 405.929 ± 37.867 [573.609–711.123] 645.007 ± 54.884 [828.817–987.33] 912.077 ± 64.716 [1083.259–1267.641] 1179.187 ± 74.266 [1232.578–1442.778] 1342.494 ± 86.438

GA: q1–q15: [23.286–45.827] 33.988 ± 9.087 [12.699–14.001] 13.505 ± 0.548 [0.448–23.263] 11.069 ± 9.212 [13.023–52.091] 31.286 ± 15.767 [22.032–71.862] 45.346 ± 20.106 [27.235–82.623] 53.153 ± 22.347 [28.892–85.069] 55.177 ± 22.664 [27.459–80.142] 52.110 ± 21.253 [23.488–68.930] 44.756 ± 18.331 [32.221–61.242] 44.926 ± 11.915 [42.742–65.636] 53.240 ± 9.413 [67.274–94.333] 78.892 ± 11.468 [123.728–143.587] 135.419 ± 8.071 [226.144–243.099] 235.779 ± 6.868 [429.780–493.627] 454.599 ± 25.205

uGK: |T(n)|: [671–731] 703.5 ± 22.3; q1–q15: [0.341–9.899] 4.984 ± 3.250 [0.296–9.103] 5.978 ± 2.950 [0.049–8.011] 3.280 ± 2.821 [1.362–8.233] 4.557 ± 2.339 [0.198–8.015] 3.844 ± 2.633 [0.258–9.932] 3.094 ± 3.018 [0.575–7.644] 3.526 ± 2.591 [1.135–9.960] 4.502 ± 2.768 [0.183–8.821] 4.734 ± 2.943 [0.029–8.557] 2.877 ± 2.454 [0.796–8.869] 3.392 ± 3.305 [0.411–8.036] 3.293 ± 2.030 [0.361–8.933] 4.281 ± 3.035 [0.946–9.948] 5.377 ± 2.716 [1.368–9.638] 6.197 ± 2.930


Table 9
Number of tuples and query error on random dataset (ω = 0.5). Each row gives |T(n)| followed by the query errors (×10^-4) for q1–q15; every entry is the [min–max] range and the mean ± variance over the 20 rounds. The GA row lists the q1–q15 errors only.

SPL–GK: |T(n)|: [3682–3952] 3841 ± 86.3; q1–q15: [197.88–214.435] 207.76 ± 5.713 [166.491–193.826] 176.259 ± 7.504 [139.916–165.232] 153.818 ± 7.37 [111.728–134.069] 123.922 ± 6.98 [81.059–97.561] 90.236 ± 5.029 [47.506–76.415] 61.33 ± 9.486 [17.842–45.819] 34.39 ± 8.08 [1.426–14.703] 8.176 ± 4.821 [17.244–37.784] 25.966 ± 6.518 [39.215–64.925] 53.762 ± 7.723 [72.141–102.311] 83.214 ± 8.288 [96.458–121.063] 110.854 ± 7.703 [133.862–148.954] 142.159 ± 4.856 [152.133–168.556] 159.444 ± 5.436 [165.896–187.409] 176.729 ± 6.337

AVG–GK: |T(n)|: [1152–1229] 1193.3 ± 23.1; q1–q15: [728.780–1066.302] 899.433 ± 132.921 [1088.999–1406.145] 1242.421 ± 125.713 [1204.605–1465.4] 1344.206 ± 107.171 [1175.904–1386.611] 1292.782 ± 82.243 [1087.306–1238.2] 1166.861 ± 59.102 [900.103–995.908] 950.393 ± 38.277 [620.279–666.945] 645.451 ± 18.852 [314.288–320.085] 318.030 ± 2.458 [5.989–24.054] 13.700 ± 7.006 [231.026–327.907] 279.419 ± 36.186 [488.098–619.008] 555.911 ± 52.719 [752.679–933.81] 846.705 ± 72.488 [1048.145–1259.136] 1158.487 ± 85.722 [1338.345–1583.559] 1465.787 ± 98.763 [1523.881–1807.041] 1671.488 ± 115.648

GA: q1–q15: [1.890–15.270] 8.430 ± 4.831 [9.955–13.796] 12.040 ± 1.559 [6.574–23.104] 15.789 ± 6.714 [0.317–31.759] 17.224 ± 12.917 [3.724–36.970] 20.266 ± 13.415 [6.804–39.358] 21.875 ± 13.340 [8.130–39.299] 21.961 ± 12.955 [7.951–37.035] 20.688 ± 12.145 [6.604–32.246] 17.923 ± 10.577 [0.034–20.271] 13.326 ± 7.844 [3.313–39.220] 21.889 ± 14.473 [43.893–63.186] 55.976 ± 8.263 [101.828–111.03] 106.931 ± 3.597 [181.348–225.177] 199.535 ± 17.864 [370.693–427.466] 398.218 ± 23.260

uGK: |T(n)|: [653–733] 691.3 ± 26.4; q1–q15: [0.077–9.377] 5.606 ± 3.091 [0.118–8.044] 3.263 ± 2.623 [0.302–6.216] 3.325 ± 1.932 [0.433–9.420] 4.813 ± 2.595 [0.426–7.848] 3.42 ± 2.139 [1.960–9.348] 5.130 ± 2.513 [1.831–9.732] 5.684 ± 2.542 [0.203–9.437] 5.758 ± 3.207 [0.377–7.882] 3.108 ± 2.246 [1.563–9.798] 5.609 ± 2.502 [3.369–9.776] 5.681 ± 2.462 [0.578–9.218] 3.973 ± 3.096 [0.663–9.913] 5.436 ± 3.345 [0.166–9.074] 4.137 ± 3.161 [0.618–8.713] 4.351 ± 2.716

7. Conclusion

We studied the problem of computing ε-approximate φ-quantiles in the presence of numerical uncertainty. We defined quantiles for uncertain datasets based on probabilistic cardinality rather than on an ordered sequence. On the basis of the GK algorithm, we presented a novel algorithm, uGK, for efficiently maintaining such quantile summaries over large uncertain datasets online. The proposed uGK algorithm uses a data structure similar to that of GK and has a comparable space complexity. Making only a single pass over the dataset and using very little space, uGK can answer any ε-approximate φ-quantile query. A series of experiments on synthetic and real-life datasets verified the effectiveness of our uGK algorithm.

Acknowledgments

This research is substantially supported by the National Natural Science Foundation of China (61402375) and the National High-tech R&D Program of China (2013AA10230402). The authors would like to thank the anonymous reviewers of this paper; their valuable and constructive suggestions played a significant role in improving the quality of this work.

References

[1] P.K. Agarwal, G. Cormode, Z. Huang, J.M. Phillips, Z. Wei, Y. Ke, Mergeable summaries, in: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'12), 2012, pp. 23–34.
[2] B.R. Agrawal, A. Swami, A one-pass space-efficient algorithm for finding quantiles, in: Proceedings of International Conference on Management of Data (COMAD'99), 1999.
[3] L. Antova, C. Koch, D. Olteanu, MayBMS: managing incomplete information with probabilistic world-set decompositions, in: Proceedings of the IEEE International Conference on Data Engineering (ICDE'07), 2007, pp. 1479–1480.
[4] S. Chaudhuri, R. Motwani, V. Narasayya, Random sampling for histogram construction: how much is enough? in: Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD'98), 1998, pp. 436–447.
[5] R. Cheng, J. Chen, X. Xie, Cleaning uncertain data with quality guarantees, in: Proceedings of VLDB Endowment (VLDB'08), 2008, pp. 722–735.
[6] G. Cormode, A. Deligiannakis, M. Garofalakis, A. Mcgregor, Probabilistic histograms for probabilistic data, in: Proceedings of the VLDB Endowment (VLDB'09), 2009, pp. 526–537.
[7] G. Cormode, M. Garofalakis, Sketching probabilistic data streams, in: Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD'07), 2007, pp. 281–292.
[8] G. Cormode, M. Garofalakis, Histograms and wavelets on probabilistic data, in: Proceedings of IEEE International Conference on Data Engineering (ICDE'09), 2009, pp. 293–304.


[9] G. Cormode, F. Korn, S. Muthukrishnan, D. Srivastava, Effective computation of biased quantiles over data streams, in: Proceedings of International Conference on Data Engineering (ICDE'05), 2005, pp. 20–31.
[10] A. Deshpande, C. Guestrin, S.R. Madden, J.M. Hellerstein, W. Hong, Model-driven data acquisition in sensor networks, in: Proceedings of VLDB Endowment (VLDB'04), 2004, pp. 588–599.
[11] P. Domingos, G. Hulten, Mining high-speed data streams, in: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'00), 2000, pp. 71–80.
[12] X.L. Dong, A. Halevy, C. Yu, Data integration with uncertainty, VLDB J. (2009) 469–500.
[13] T.J. Green, V. Tannen, Models for Incomplete and Probabilistic Information, Springer Berlin Heidelberg, 2006.
[14] M. Greenwald, S. Khanna, Space-efficient online computation of quantile summaries, in: Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD'01), 2001, pp. 58–66.
[15] M.B. Greenwald, S. Khanna, Power-conserving computation of order-statistics over sensor networks, in: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'04), 2004, pp. 275–285.
[16] S. Guha, A. Mcgregor, Stream order and order statistics: quantile estimation in random-order streams, Society for Industrial and Applied Mathematics, 2009.
[17] R. Haider, F. Mandreoli, R. Martoglia, S. Sassatelli, Fast on-line summarization of RFID probabilistic data streams, Commun. Comput. Inf. Sci. (2012) 211–223.
[18] G. He, L. Chen, C. Zeng, Q. Zheng, G. Zhou, Probabilistic skyline queries on uncertain time series, Neurocomputing (2016) 224–237.
[19] Z. Huang, L. Wang, K. Yi, Y. Liu, Sampling based algorithms for quantile computation in sensor networks, in: Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD'11), 2011, pp. 745–756.
[20] A. Iqbal, H. Wang, Q. Gao, A histogram method for summarizing multi-dimensional probabilistic data, Procedia Comput. Sci. (2013) 971–976.
[21] R. Jain, I. Chlamtac, The P2 algorithm for dynamic calculation of quantiles and histograms without storing observations, Commun. ACM (1985) 1076–1085.
[22] T. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, H. Zhu, Avatar information extraction system, IEEE Data Eng. Bull. (2006) 40–48.
[23] T.S. Jayram, S. Kale, E. Vee, Efficient aggregation algorithms for probabilistic data, in: Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA'07), 2007, pp. 346–355.
[24] T.S. Jayram, A. Mcgregor, S. Muthukrishnan, E. Vee, Estimating statistical aggregates on probabilistic data streams, ACM Trans. Database Syst. (2008) 133–135.
[25] N.D. Larusso, A. Singh, Synopses for probabilistic data over large domains, in: Proceedings of International Conference on Extending Database Technology (EDBT'11), 2011, pp. 389–400.
[26] E.Y. Lee, H.J. Cho, T.S. Chung, K.Y. Ryu, Moving range K nearest neighbor queries with quality guarantee over uncertain moving objects, Inf. Sci. (2015) 324–341.
[27] G. Li, L. Li, J. Li, Y. Li, Network Voronoi diagram on uncertain objects for nearest neighbor queries, Inf. Sci. (2015) 241–261.
[28] J. Li, A. Deshpande, Ranking continuous probabilistic datasets, in: Proceedings of the VLDB Endowment (VLDB'10), 2010, pp. 638–649.
[29] J. Li, H. Wang, Range queries on uncertain data, Theor. Comput. Sci. (2015) 32–48.
[30] X. Lian, L. Chen, Probabilistic inverse ranking queries over uncertain data, in: Proceedings of International Conference on Database Systems for Advanced Applications (DASFAA'09), 2009, pp. 35–50.
[31] C. Liang, Y. Zhang, P. Shi, Z. Hu, Learning very fast decision tree from uncertain data streams with positive and unlabeled samples, Inf. Sci. (2012) 50–67.
[32] C. Liang, Y. Zhang, P. Shi, Z. Hu, Learning accurate very fast decision trees from uncertain data streams, Int. J. Syst. Sci. (2015) 3032–3050.
[33] X. Lin, H. Lu, J. Xu, J.X. Yu, Continuously maintaining quantile summaries of the most recent N elements over a data stream, in: Proceedings of International Conference on Data Engineering (ICDE'04), 2004, pp. 362–373.
[34] Z. Liu, K.C. Sia, J. Cho, Cost-efficient processing of MIN/MAX queries over distributed sensors with uncertainty, in: Proceedings of ACM Symposium on Applied Computing (SAC'05), 2005, pp. 634–641.
[35] Q. Ma, S. Muthukrishnan, M. Sandler, Frugal streaming for estimating quantiles, Springer Berlin Heidelberg, 2013, pp. 77–96.
[36] Z. Ma, L. Yan, Advances in Probabilistic Databases for Uncertain Information Management, Springer Publishing Company, Incorporated, 2013.
[37] G.S. Manku, S. Rajagopalan, B.G. Lindsay, Approximate medians and other quantiles in one pass and with limited memory, in: Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD'98), 1998, pp. 426–435.
[38] J.I. Munro, M.S. Paterson, Selection and sorting with limited storage, Theor. Comput. Sci. (1980) 315–323.
[39] H.T.H. Nguyen, J. Cao, Trustworthy answers for top-k queries on uncertain Big Data in decision making, Inf. Sci. (2015) 73–90.
[40] J. Niedermayer, M.A. Nascimento, M. Renz, P. Kröger, K. Ammar, H.-P. Kriegel, Cost-based quantile query processing in wireless sensor networks, in: Proceedings of IEEE International Conference on Mobile Data Management (ICMDM'2013), 2013, pp. 237–246.
[41] L. Peng, Y. Diao, A. Liu, Optimizing probabilistic query processing on continuous uncertain data, in: Proceedings of the VLDB Endowment (VLDB'11), 2011, pp. 1169–1180.
[42] A.K. Pujari, V.R. Kagita, A. Garg, V. Padmanabhan, Efficient computation for probabilistic skyline over uncertain preferences, Inf. Sci. (2015) 146–162.
[43] P. Sen, A. Deshpande, L. Getoor, Representing tuple and attribute uncertainty in probabilistic databases, in: Proceedings of IEEE International Conference on Data Mining Workshops (ICDEW'07), 2007, pp. 507–512.
[44] S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. Hambrusch, R. Shah, Orion 2.0: native support for uncertain data, in: Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD'08), 2008, pp. 1120–1142.
[45] S. Singh, C. Mayfield, R. Shah, S. Prabhakar, S. Hambrusch, J. Neville, R. Cheng, Database support for probabilistic attributes and tuples, in: Proceedings of IEEE International Conference on Data Engineering (ICDE'08), 2008, pp. 1053–1061.
[46] Y. Tao, R. Cheng, X. Xiao, W.K. Ngai, B. Kao, S. Prabhakar, Indexing multi-dimensional uncertain data with arbitrary probability density functions, in: Proceedings of VLDB Endowment (VLDB'05), 2005, pp. 922–933.
[47] T.T.L. Tran, A. Mcgregor, Y. Diao, L. Peng, A. Liu, Conditioning and aggregating uncertain data streams: going beyond expectations, in: Proceedings of the VLDB Endowment (VLDB'10), 2010, pp. 1302–1313.
[48] S. Tsang, B. Kao, K.Y. Yip, W.-S. Ho, S. Lee, Decision trees for uncertain data, in: Proceedings of International Conference on Data Engineering (ICDE'09), 2009, pp. 441–444.
[49] D.Z. Wang, E. Michelakis, M. Garofalakis, J.M. Hellerstein, BayesStore: managing large, uncertain data repositories with probabilistic graphical models, PVLDB (2010) 340–351.
[50] L. Wang, G. Luo, K. Yi, G. Cormode, Quantiles over data streams: an experimental study, in: Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD'13), 2013, pp. 737–748.
[51] J. Widom, Trio: a system for integrated management of data, accuracy, and lineage, CIDR, 2005, pp. 262–276.
[52] G. Xiao, K. Li, X. Zhou, K. Li, Efficient monochromatic and bichromatic probabilistic reverse top-k query processing for uncertain big data, J. Comput. Syst. Sci. (2017) 92–113.
[53] Q. Zhang, W. Wang, A fast algorithm for approximate quantiles in high speed data streams, in: Proceedings of International Conference on Scientific and Statistical Database Management (SSDBM'07), 2007, pp. 29–40.