
PwAdaBoost: Possible world based AdaBoost algorithm for classifying uncertain data

Han Liu a, Xianchao Zhang b,c, Xiaotong Zhang a

a The Hong Kong Polytechnic University, Hong Kong, China
b Dalian University of Technology, China
c Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, China

Article history: Received 13 March 2019; Received in revised form 7 August 2019; Accepted 9 August 2019; Available online xxxx

Keywords: Uncertain data; Classification; Possible world; AdaBoost

Abstract: Possible world has become one of the most effective tools to deal with various types of data uncertainty in uncertain data management. However, few uncertain data classification algorithms have been proposed based on possible world. Most existing uncertain data classification algorithms are simply extended from traditional classification algorithms for certain data. They deal with data uncertainty under relatively ideal probability distribution and data type assumptions, and are therefore difficult to apply in various application scenarios. In this paper, we propose a novel possible world based AdaBoost algorithm for classifying uncertain data, called PwAdaBoost. In the training procedure, PwAdaBoost uses the possible world set generated from the uncertain training set sampled in each iteration to train the sub-basic classifiers, and employs the possible world set generated from the whole uncertain training set to adjust the weights of the sub-basic classifiers and detect the quality of the basic classifiers. In the prediction procedure, PwAdaBoost utilizes the possible world set generated from the object to be predicted to get the results of the basic classifiers via majority voting and weighted voting. Furthermore, we analyze the stability and give parallelization strategies for the training procedure and the prediction procedure respectively. The proposed PwAdaBoost can deal with various types of data uncertainty, and can use any existing classification algorithm for certain data to serve uncertain data. As far as we know, it is the first ensemble classification algorithm for uncertain data. Extensive experimental results demonstrate the superiority of our proposed algorithm in terms of effectiveness and efficiency. © 2019 Elsevier B.V. All rights reserved.

1. Introduction

Classification plays a crucial role in machine learning and data mining. Traditional classification algorithms focus on certain data. However, data uncertainty arises naturally in many real applications [1–4]. For example, when we track the location of an object with GPS devices, the reported location may have errors of a few meters [5]. For another example, due to the existence of various noisy factors, sensor measurements may be imprecise to some extent [6]. Moreover, in the biomedical research domain, handling probe-level uncertainty of gene expression microarray data is also a key research aspect [7]. Uncertain data has posed a huge challenge to traditional classification algorithms.

Several models have been proposed for dealing with data uncertainty [8], and among them the possible world model has been shown to be effective for handling various types of data uncertainty in uncertain data management [9–12]. However, as far as we know, few uncertain data classification algorithms are designed based on possible world. Existing uncertain data classification algorithms mainly rely on random sampling or probabilistic definitions to extend traditional classification algorithms to uncertain data, such as Bayes-based algorithms [13,14], decision tree based algorithms [15,16], nearest neighbor based algorithms [17,18], SVM-based algorithms [19,20], rule-based algorithms [21,22], FDA-based algorithms [23], neural network based algorithms [24,25] and so on. Each of these algorithms is simply extended from a single traditional classification algorithm for certain data, and thus inevitably inherits the inherent shortcomings of the original algorithm [26]. Moreover, they deal with data uncertainty under relatively ideal probability distribution and data type assumptions, and are therefore difficult to apply in various application scenarios. Recently, a possible world based model has been proposed for classifying uncertain data (PWCLA) [27], which utilizes


the consistency principle to learn a consensus affinity matrix for uncertain data and then employs the spectral analysis to extend the model to classify uncertain data. However, in essence, PWCLA is a transductive semi-supervised classification algorithm, as it needs to know all the test data during the training process, i.e., it is unusable in the case where the test data is completely unseen during training. In this paper, we propose a novel possible world based AdaBoost algorithm for classifying uncertain data, which is called PwAdaBoost. In contrast to the traditional AdaBoost, in the training procedure, PwAdaBoost uses the possible world set generated from the uncertain training set sampled in each iteration to train the sub-basic classifiers, and employs the possible world set generated from the whole uncertain training set to adjust the weights of the sub-basic classifiers and detect the quality of the basic classifiers. In the prediction procedure, PwAdaBoost utilizes the possible world set generated from the predicted object to get the results of the basic classifiers via majority voting and weighted voting. Furthermore, we analyze the stability and give the parallelization strategies for its training procedure and prediction procedure respectively. The main advantages of PwAdaBoost are summarized as follows.

• By utilizing the possible world model to deal with uncertain data, PwAdaBoost can deal with various types of data uncertainty.
• By leveraging the idea of ensemble learning and redesigning the traditional AdaBoost framework via introducing possible world in multiple phases and adding extra majority voting and weighted voting procedures, PwAdaBoost can improve the learning ability and makes it possible to use any existing classification algorithm for certain data to serve uncertain data. As far as we know, it is the first possible world based ensemble learning model for classifying uncertain data.
• Extensive experimental results on real benchmark datasets and real world uncertain datasets show that PwAdaBoost consistently outperforms the existing methods in effectiveness, and performs competitively in efficiency.

The rest of this paper is organized as follows. In Section 2, we review the related work; in Section 3, we provide some preliminary knowledge; in Section 4, we describe the proposed algorithm in detail; in Section 5, we present the experimental results on different types of datasets in terms of effectiveness and efficiency; finally, in Section 6, we conclude the paper and point out future work.

2. Related work

2.1. Uncertain data management

Uncertain data management has been extensively studied in the past two decades [28]. The uncertain database management system, as the foundation of uncertain data management, is a key research topic. We briefly review the existing uncertain database management systems. Trio [29] is an uncertain database management system based on the concept of the uncertain lineage database. It can process data with uncertainty and lineage. MayBMS [30] is an uncertain database management system which can fit seamlessly into modern database systems and maximize space efficiency by using the concept of U-relations. MystiQ [31] is an uncertain database management system which consists of four main components: a data modeling language, a data definition language, a preprocessor and a query translation

engine. It can manage uncertain data from a probabilistic view. UDBMS [32] is an uncertain database management system which extends the database system with uncertainty management functionalities. It models an uncertain object with an interval and a probability distribution function, and aims to provide uncertainty management for constantly-evolving data. Other earlier uncertain database management systems can be found in [33,34]. These systems can accomplish the uncertain data management task to some extent. However, they completely rely on the extended relational model to address uncertainty, which makes it difficult for them to deal with new and unanticipated types of data uncertainty and prevents the uncertainty model from being dynamically parameterized according to the current state of the database [9]. Compared with the above systems, [9] proposes a new database management system for uncertain data (MCDB). The main idea of MCDB is to generate different possible worlds for uncertain data, process the possible worlds independently, and in the end aggregate the independent results of the possible worlds into a final result. As the possible world model can be parameterized dynamically and deal with various types of data uncertainty, MCDB seems more flexible and promising for managing uncertain data [9].

2.2. Uncertain data classification

Several methods have been proposed for classifying uncertain data. Table 1 provides a brief summary of existing uncertain data classification methods.1 In the following, we briefly review these methods.

(1) Bayes-based methods: [13] employs the average-based, sample-based and formula-based methods to extend the naive Bayes method for uncertain data, and shows that the formula-based method performs the best among the three methods. [14] employs a new class conditional probability computation method to extend the naive Bayes algorithm for classifying uncertain data.

(2) Decision tree based methods: [15] introduces the definitions of probability cardinality, probability entropy and probability information gain to extend the traditional decision tree algorithm to classify uncertain data. [16] employs a very similar method to extend the traditional decision tree algorithm to deal with uncertain data, and it provides a series of pruning methods to improve the efficiency.

(3) Nearest neighbor based methods: [17] introduces the concept of the nearest neighbor class and designs a new nearest neighbor based classifier for uncertain data. [18] introduces the concept of a probabilistic distance measure, and then proposes a novel K-nearest neighbor classifier for uncertain data based on the object-to-object probabilistic distances.

(4) SVM-based methods: [19] and [20] respectively extend the traditional SVM algorithm by using the farthest and the nearest points in the distributions of uncertain objects as references to obtain the optimal hyperplane.

(5) Rule-based methods: [21] proposes a rule-based algorithm for classifying uncertain data. It uses the probabilistic information gain for generating, pruning and optimizing rules. [22] presents an algorithm which can mine discriminative patterns directly and effectively from uncertain data as the classification rules to train the classifiers.
1 In Table 1, for the column Applicability, Numerical means that the method is applicable to numerical datasets; Categorical means that the method is applicable to categorical datasets; Both means that the method is applicable to both numerical and categorical datasets; Not mentioned means that the method does not explicitly point out the application scopes. For the column Dist. assumption, No means that the method does not rely on some ideal probability distributions; Yes means that the method relies on some ideal probability distributions.


Table 1
Summary of existing uncertain data classification methods.

Category | Method | Applicability | Dist. assumption
Bayes-based methods | SBC [13] | Numerical | No
Bayes-based methods | FBC [13] | Numerical | Yes
Bayes-based methods | NBU [14] | Both | Yes
Decision tree based methods | DTU [15] | Both | No
Decision tree based methods | UDT [16] | Both | No
Nearest neighbor based methods | UNN [17] | Numerical | No
Nearest neighbor based methods | Uncertain-KNN [18] | Not mentioned | No
SVM-based methods | TSVC [19] | Numerical | Yes
SVM-based methods | USVC [20] | Numerical | Yes
Rule-based methods | uRule [21] | Both | No
Rule-based methods | uHARMONY [22] | Categorical | No
FDA-based methods | UFLDA [23] | Not mentioned | No
FDA-based methods | UKFDA [23] | Not mentioned | No
Neural networks based methods | UNN [24] | Numerical | Yes
Neural networks based methods | UELM [25] | Not mentioned | No
Possible world based methods | PWCLA [27] | Both | No
Possible world based methods | PwAdaBoost (Ours) | Both | No
Other methods | Object-to-group [18] | Not mentioned | No
Other methods | AFC-means [35] | Not mentioned | No

(6) FDA-based methods: [23] first defines the covariance matrix and the within-class and between-class scatter matrices as measures of scatter for uncertain data objects, and then uses the developed measures of scatter to extend Fisher discriminant analysis (FDA) for uncertain data (UFLDA). [23] also develops kernel Fisher discriminant analysis for uncertain data (UKFDA). Experiments show that the decision boundaries obtained by UFLDA and UKFDA are very reasonable.

(7) Neural network based methods: [24] extends the traditional neural network classifier by changing the traditional perceptron through a new activation function, thereby making it applicable to uncertain data classification. [25] uses the grid-based method to extend the special neural network ELM [36] to deal with uncertain data.

(8) Possible world based methods: [27] first proposes a possible world based consistency learning model for classifying uncertain data, which utilizes the consistency principle to learn a consensus affinity matrix for uncertain data and then employs spectral analysis to extend the model to classify uncertain data. As this method needs to know all the test data during the training process, it is in nature a transductive semi-supervised classification algorithm.

(9) Other methods: [18] proposes a new probabilistic distance measure for the distance between an uncertain object and a group of uncertain objects, and then utilizes it for classifying uncertain data. Experiments show that this method can provide excellent classification performance for uncertain data. [35] proposes an automatic soft classifier for uncertain data, which combines the fuzzy c-means method [37] with a fuzzy distance function and an evaluation function to achieve the classification task.

The main idea of existing uncertain data classification methods is to use random sampling or probabilistic definitions to extend traditional classification methods to uncertain data, which causes the existing methods to inevitably inherit the inherent shortcomings of the original algorithms [26]. In addition, some methods deal with data uncertainty under relatively ideal probability distribution and data type assumptions, and are therefore difficult to apply in various application scenarios. The proposed PwAdaBoost is a novel possible world based algorithm for classifying uncertain data. By using the possible world model to deal with uncertain data, PwAdaBoost can handle various types of data uncertainty. By redesigning the AdaBoost framework via introducing possible world in multiple phases and adding extra majority voting and weighted voting procedures, PwAdaBoost can enhance the learning ability and capture the uncertain information better. In addition, PwAdaBoost can choose any classification algorithm for certain data as the basic algorithm, thus overcoming the issue of inheriting the inherent shortcomings of a single traditional algorithm.

Fig. 1. Example of uncertain data and possible world.

3. Preliminary

3.1. Uncertain data and possible world

Uncertain data, where each attribute can be represented as a random variable with a probability density function or a probability mass function in a domain [38], is ubiquitous in many real applications. Possible world is an effective tool to deal with various types of uncertain data. To better understand uncertain data and possible world, we first show a simple example to explain their concepts, and then give the formal definitions for them.

Fig. 1 gives a simple example to illustrate the concepts of uncertain data and possible world. Assume that x_1, x_2 and x_3 are uncertain data, and x_i', x_i'' and x_i''' are the possible instances of x_i, i ∈ {1, 2, 3}. Now generate three possible worlds pw_1, pw_2 and pw_3, whose components are shown in Fig. 1. It is easy to see that each possible world consists of one instance taken from each uncertain object. In the following, we further introduce the formal definitions of uncertain data and possible world respectively.


3.1.1. Uncertain data

Assume that UD is an uncertain dataset in an m-dimensional independent space, where different objects and different dimensions are respectively independent of each other. When UD is an uncertain numerical dataset, each attribute is an uncertain numerical attribute denoted by A_i^{u_n}, where i means the ith attribute of UD. Further, using A_{ij}^{u_n} to denote the ith attribute of the jth object in UD, A_{ij}^{u_n} can be represented as a random variable with a probability density function f(x) in a continuous domain (interval) [A_{ij}^{u_n}.l, A_{ij}^{u_n}.r]. They satisfy the following formula:

f(x) ⩾ 0, ∀x ∈ [A_{ij}^{u_n}.l, A_{ij}^{u_n}.r],   ∫_{A_{ij}^{u_n}.l}^{A_{ij}^{u_n}.r} f(x) dx = 1.   (1)

When UD is an uncertain categorical dataset, each attribute is an uncertain categorical attribute denoted by A_i^{u_c}, where i means the ith attribute of UD. Further, using A_{ij}^{u_c} to denote the ith attribute of the jth object in UD, A_{ij}^{u_c} can be represented as a random variable with a probability mass function p in a discrete domain Dom, where Dom = {v_1, v_2, ..., v_n} means that A_{ij}^{u_c} has n possible values v_1, v_2, ..., v_n, and p = {p_1, p_2, ..., p_n} means the existence probabilities of v_1, v_2, ..., v_n are p_1, p_2, ..., p_n respectively, i.e., P(A_{ij}^{u_c} = v_k) = p_k, 1 ⩽ k ⩽ n. They satisfy the following formula:

∑_{k=1}^{n} p_k = 1.   (2)

3.1.2. Possible world

Possible world has been widely used to represent and query uncertain data [11,12,39–41], and has been shown to be powerful for dealing with various types of data uncertainty in uncertain data management [9,10]. Its definition is as follows [27].

Definition 1 (Possible world). Let UD = {x_1, x_2, ..., x_N} be an uncertain dataset. A possible world pw = {x_1', x_2', ..., x_N'} (x_i' ∈ x_i) is a set of instances such that each instance is taken from each corresponding uncertain object. Let PW be the set of all the possible worlds and P(pw) be the existence probability of pw; then ∑_{pw∈PW} P(pw) = 1.

From this definition, we can see that the possible world set is a number of independent and identically distributed realizations of an uncertain dataset. Possible worlds can be generated according to the probability distribution of the uncertain data [9]. In the following, we briefly review how to generate possible worlds for uncertain numerical datasets and uncertain categorical datasets respectively.

For an uncertain numerical dataset, as the probability density function and the corresponding interval are given, we can generate a possible world directly. Specifically, assume that A_{ij}^{u_n} is the ith numerical attribute of the jth object in UD, which is associated with a probability density function f(x) and a continuous domain [A_{ij}^{u_n}.l, A_{ij}^{u_n}.r]. We first obtain the probability distribution function F(x) of A_{ij}^{u_n}, and then use the inversion method to generate the possible world. More details and proofs about the inversion method can be found in [42].

For an uncertain categorical dataset, we can use the analogous inversion method to generate the possible world, based on the following fact. Assume that A_{ij}^{u_c} is the ith categorical attribute of the jth object in UD, which is associated with a probability mass function p = {p_1, p_2, ..., p_n} and a discrete domain Dom = {v_1, v_2, ..., v_n}. Let U be a random number uniformly distributed in [0, 1]. Set V = v_J, where J is a random variable defined as J = min{j | 1 ⩽ j ⩽ n, U < ∑_{k=1}^{j} p_k}. Then P{J = j} = P{∑_{k=1}^{j−1} p_k ⩽ U < ∑_{k=1}^{j} p_k} = p_j, where 1 ⩽ j ⩽ n. In this case, V is distributed according to the probability mass function p, and its value domain is Dom. Based on this fact, when generating a possible world, we first produce a random number uniformly distributed in [0, 1], and then use the probability mass function p to map the random number to the corresponding value in Dom, thus generating the possible world.
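To make the generation step concrete, the following is a minimal Python sketch of drawing one possible world with the inversion idea described above. The attribute specification format (a dictionary with interval, distribution and probability fields) and the helper names are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def sample_categorical(domain, probs, rng):
    # Inversion method: map a uniform random number to a domain value
    # via the cumulative probabilities, as described in Section 3.1.2.
    u = rng.uniform()
    cumulative = np.cumsum(probs)
    j = np.searchsorted(cumulative, u, side="right")  # first j with u < sum_{k<=j} p_k
    return domain[j]

def sample_numerical(low, high, rng, dist="uniform", mean=None, std=None):
    # Draw from the attribute's distribution restricted to [low, high];
    # rejection sampling stands in for the inverse-CDF step of the paper.
    if dist == "uniform":
        return rng.uniform(low, high)
    while True:  # assumed truncated Gaussian case for illustration
        v = rng.normal(mean, std)
        if low <= v <= high:
            return v

def generate_possible_world(uncertain_objects, rng):
    # One possible world = one sampled instance per uncertain object.
    world = []
    for obj in uncertain_objects:
        instance = []
        for attr in obj:
            if attr["type"] == "categorical":
                instance.append(sample_categorical(attr["domain"], attr["probs"], rng))
            else:
                instance.append(sample_numerical(attr["low"], attr["high"], rng,
                                                 attr.get("dist", "uniform"),
                                                 attr.get("mean"), attr.get("std")))
        world.append(instance)
    return world

rng = np.random.default_rng(0)
# Hypothetical uncertain object with one numerical and one categorical attribute.
obj = [{"type": "numerical", "low": 1.0, "high": 2.0},
       {"type": "categorical", "domain": ["a", "b", "c"], "probs": [0.2, 0.5, 0.3]}]
print(generate_possible_world([obj], rng))
```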

3.2. AdaBoost

AdaBoost [43] is a popular and powerful algorithm for classification. Its core principle is to construct a better classifier by combining multiple simple classifiers. In the following, we review the original AdaBoost algorithm.

The details of AdaBoost are shown in Algorithm 1. D is the whole training set, D = {(x_i, y_i) | i = 1, 2, ..., N_1}, where x_i is an object, y_i is the class label of x_i, y_i ∈ Y = {1, 2, ..., K}, N_1 is the number of objects in D, and K is the number of classes in D. L is a basic algorithm. D^t is the training set sampled from D according to W_i^t in the tth iteration, D^t = {(x_i^t, y_i^t) | i = 1, 2, ..., N_2}, where N_2 is the number of objects in D^t. W_i^t is the weight with which the object x_i is sampled when generating D^t. Z^t is a normalization factor in the tth iteration. For δ[·], if [·] holds, δ[·] equals 1; otherwise it equals 0.

Algorithm 1 AdaBoost
Input: Training set D = {(x_i, y_i) | i = 1, 2, ..., N_1}, the total iteration number T, the number of objects sampled in each iteration N_2, the basic algorithm L
Output: The final classifier H
1: Initialization: W_i^1 = 1/N_1
2: for t = 1 to T do
3:   Generate the training set D^t according to W_i^t
4:   Get the basic classifier H^t from D^t with L
5:   Calculate the weighted error of H^t: ε^t = ∑_{i=1}^{N_1} W_i^t δ[H^t(x_i) ≠ y_i] (if ε^t > 0.5, then set T = t − 1 and abort the loop)
6:   Set α^t = ε^t / (1 − ε^t)
7:   Set the new value of W_i^{t+1} to be W_i^{t+1} = (W_i^t / Z^t) · (α^t)^{(1 − δ[H^t(x_i) ≠ y_i])}
8: end for
9: Obtain H, with H(x) = arg max_{y_i ∈ Y} ∑_{t=1}^{T} log(1/α^t) δ[H^t(x) = y_i]

AdaBoost starts by assigning an equal sampling weight to each object in D and then goes into its main loop. In each iteration, it generates the training set D^t from D according to W_i^t and trains the basic classifier H^t from D^t with L. After this, it uses all the objects in D to detect the quality of H^t and calculates the corresponding weighted error ε^t. The error ε^t is used to update the sampling weight of each object in the next iteration and to calculate the weight of H^t. This procedure is repeated for T rounds, and the final classifier H is obtained from the basic classifiers and their corresponding weights.

The AdaBoost algorithm has many variants for different applications, such as multi-class imbalance learning [44], financial distress prediction [45,46], traffic flow forecasting [47], hyperspectral image classification [48] and so on. For more detailed information about AdaBoost, one can refer to [43,49–51].
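For reference, a minimal Python sketch of Algorithm 1 is given below. It assumes integer class labels and a `base_learner` factory with scikit-learn-style `fit`/`predict` methods; both are assumptions made for illustration, not part of the original paper.

```python
import numpy as np

def adaboost_train(X, y, base_learner, T, n_sample, rng):
    # X: (N1, d) array of objects, y: (N1,) integer class labels.
    N1 = len(X)
    w = np.full(N1, 1.0 / N1)                            # W_i^1 = 1/N1
    classifiers, log_inv_alphas = [], []
    for _ in range(T):
        idx = rng.choice(N1, size=n_sample, replace=True, p=w)  # Line 3: sample D^t by weights
        h = base_learner().fit(X[idx], y[idx])                  # Line 4
        miss = (h.predict(X) != y).astype(float)                # δ[H^t(x_i) ≠ y_i]
        eps = np.dot(w, miss)                                   # Line 5: weighted error
        if eps > 0.5:                                           # abort condition
            break
        eps = max(eps, 1e-12)                                   # guard against a perfect classifier
        alpha = eps / (1.0 - eps)                               # Line 6
        w = w * np.power(alpha, 1.0 - miss)                     # Line 7: shrink weights of correct objects
        w = w / w.sum()                                         # normalization by Z^t
        classifiers.append(h)
        log_inv_alphas.append(np.log(1.0 / alpha))
    return classifiers, log_inv_alphas

def adaboost_predict(classifiers, log_inv_alphas, X, classes):
    # Line 9: weighted vote over the basic classifiers.
    votes = np.zeros((len(X), len(classes)))
    for h, weight in zip(classifiers, log_inv_alphas):
        pred = h.predict(X)
        for c_idx, c in enumerate(classes):
            votes[:, c_idx] += weight * (pred == c)
    return np.asarray(classes)[votes.argmax(axis=1)]
```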

4. Possible world based AdaBoost algorithm

We propose a novel possible world based AdaBoost algorithm for classifying uncertain data (PwAdaBoost), which consists of two procedures: the training procedure and the prediction procedure.

4.1. Training procedure

In the training procedure, PwAdaBoost uses the possible world set generated from the uncertain training set sampled in each iteration to train the sub-basic classifiers, and employs the possible world set generated from the whole uncertain training set to adjust the weights of the sub-basic classifiers and detect the quality of the basic classifiers. Algorithm 2 shows the details of the training procedure of PwAdaBoost.

UD denotes the whole uncertain training set, UD = {(x_i, y_i) | i = 1, 2, ..., N_1}, where x_i is an uncertain object, y_i is the class label of x_i, y_i ∈ Y = {1, 2, ..., K}, N_1 is the number of objects in UD, and K is the number of classes in UD. L denotes a basic classification algorithm for certain data. T denotes the total iteration number of the algorithm. PWUD denotes the possible world set generated from UD, which is used to adjust the weights of the sub-basic classifiers and detect the quality of the basic classifiers, PWUD = {pwud_k | k = 1, 2, ..., M_1}, where M_1 is the number of possible worlds in PWUD and pwud_k = {(x_{i,k}, y_i) | i = 1, 2, ..., N_1}. UD^t denotes the uncertain training set sampled from UD according to W_i^t in the tth iteration, UD^t = {(x_i^t, y_i^t) | i = 1, 2, ..., N_2}, where t is the current iteration number of the algorithm and N_2 is the number of objects in UD^t. W_i^t denotes the weight with which the uncertain object x_i is sampled when generating UD^t. PW^t denotes the possible world set generated from UD^t, which is used to train the sub-basic classifiers, PW^t = {pw_j^t | j = 1, 2, ..., M_2}, where M_2 is the number of possible worlds in PW^t and pw_j^t = {(x_{i,j}^t, y_i^t) | i = 1, 2, ..., N_2}. h_j^t denotes the sub-basic classifier trained from pw_j^t. TR_j^t denotes the result of PWUD obtained by h_j^t via majority voting, tr_j^t(x_i) denotes the result of x_i obtained by h_j^t via majority voting, and TR_j^t = {tr_j^t(x_i) | i = 1, 2, ..., N_1}. Here x_i = {x_{i,k} | k = 1, 2, ..., M_1} and tr_j^t(x_i) = arg max_{y_i ∈ Y} ∑_{k=1}^{M_1} δ[h_j^t(x_{i,k}) = y_i]. For δ[·], if [·] holds, δ[·] equals 1; otherwise it equals 0. When referring to δ[·] hereinafter, it has the same meaning. H^t denotes the basic classifier in the tth iteration, and it can be expressed as:

H^t(x) = arg max_{y_i ∈ Y} ∑_{j=1}^{M_2} u_j^t δ[h_j^t(x) = y_i],   (3)

where u_j^t denotes the weight of the sub-basic classifier h_j^t, and it can be calculated as:

u_j^t = [accuracy(TR_j^t) / (1 − accuracy(TR_j^t))] / ∑_{l=1}^{M_2} [accuracy(TR_l^t) / (1 − accuracy(TR_l^t))],   (4)

where accuracy(TR_j^t) = (1/N_1) ∑_{i=1}^{N_1} δ[tr_j^t(x_i) = y_i]. This parameter computation method can make H^t obtain the best result, and the proof can be found in the Appendix. FR^t and ε^t respectively denote the result and the weighted error of PWUD obtained by H^t via majority voting and weighted voting, and ε^t can be calculated as:

ε^t = ∑_{i=1}^{N_1} W_i^t δ[H^t(x_i) ≠ y_i].   (5)

α^t denotes the weight of the basic classifier H^t, and it can be calculated as:

α^t = (1/2) ln((1 − ε^t) / ε^t).   (6)

In each iteration, W_i^{t+1} is updated by the following formula:

W_i^{t+1} = (W_i^t / Z^t) · e^{−α^t} if H^t(x_i) = y_i, and W_i^{t+1} = (W_i^t / Z^t) · e^{α^t} if H^t(x_i) ≠ y_i,   (7)

where Z^t is the normalization factor in the tth iteration, which is used to ensure ∑_{i=1}^{N_1} W_i^{t+1} = 1. By using Eq. (7), PwAdaBoost increases the sampling weights of the wrongly predicted objects and decreases the sampling weights of the correctly predicted objects. H denotes the final uncertain data classifier, and it can be presented as:

H(x) = arg max_{y_i ∈ Y} ∑_{t=1}^{T} α^t δ[H^t(x) = y_i].   (8)

Algorithm 2 PwAdaBoost (Training procedure)
Input: Uncertain training set UD = {(x_i, y_i) | i = 1, 2, ..., N_1}, the total iteration number T, the number of objects sampled in each iteration N_2, the basic algorithm L, the numbers of possible worlds in PWUD and PW^t: M_1 and M_2
Output: The final uncertain data classifier H
1: Preparation: Initialize W_i^1 = 1/N_1, and generate PWUD from UD
2: for t = 1 to T do
3:   Get UD^t according to W_i^t, UD^t = {(x_i^t, y_i^t) | i = 1, 2, ..., N_2}
4:   Generate PW^t from UD^t
5:   for j = 1 to M_2 do
6:     Get the sub-basic classifier h_j^t from pw_j^t with L
7:     Get TR_j^t via majority voting
8:   end for
9:   Compute u_j^t for each h_j^t by Eq. (4)
10:  Get the basic classifier H^t by Eq. (3)
11:  Compute the weighted error ε^t by Eq. (5)
12:  if ε^t > 0.5 then
13:    W_i^{t+1} = 1/N_1, go to step 2
14:  end if
15:  Compute α^t by Eq. (6)
16:  Update W_i^{t+1} by Eq. (7)
17: end for
18: Obtain the final uncertain data classifier H by Eq. (8)

As shown in Algorithm 2, in the preparation phase, PwAdaBoost initializes the sampling weights uniformly and generates the possible world set PWUD from UD (Line 1), which is used to adjust the weights of the sub-basic classifiers and detect the quality of the basic classifiers, and then enters the main loop (Line 2). It gets the uncertain training set UD^t from UD according to W_i^t, generates the possible world set PW^t from UD^t, uses PW^t to train the sub-basic classifiers and assigns them the corresponding weights according to their accuracy on PWUD via majority voting (Lines 3–9). Then it constructs the basic classifier H^t from the sub-basic classifiers and their corresponding weights, utilizes H^t to test PWUD via majority voting and weighted voting, calculates the weighted error ε^t and the weight of the basic classifier α^t, and updates the sampling weight of each object for the next iteration (Lines 10–16). This procedure is repeated until the main loop is over, and the final uncertain data classifier H is obtained by Eq. (8) (Line 18).
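For readers who prefer code to pseudocode, the following Python sketch mirrors the main loop of Algorithm 2 under simplifying assumptions: integer class labels in {0, ..., K−1}, a `base_learner` factory with scikit-learn-style `fit`/`predict` methods, and a user-supplied `generate_possible_world(objects, rng)` routine such as the one sketched after Section 3.1.2. It is an illustrative sketch, not the authors' Matlab implementation.

```python
import numpy as np

def majority_vote(sub_clf, worlds):
    # tr_j^t(x_i): predict every realization of each object and take the most
    # frequent label per object; labels are assumed to be integers 0..K-1.
    preds = np.stack([np.asarray(sub_clf.predict(world)) for world in worlds])  # (M1, N1)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)

def train_pwadaboost(uncertain_X, y, base_learner, generate_possible_world,
                     T=10, M1=100, M2=10, rng=None):
    rng = rng or np.random.default_rng(0)
    y = np.asarray(y)
    N1 = len(y)
    PWUD = [generate_possible_world(uncertain_X, rng) for _ in range(M1)]   # Line 1
    w = np.full(N1, 1.0 / N1)
    basic_classifiers, alphas = [], []
    for _ in range(T):                                                      # Line 2
        idx = rng.choice(N1, size=N1, replace=True, p=w)                    # Line 3 (N2 = N1)
        UDt = [uncertain_X[i] for i in idx]
        subs, trs = [], []
        for _ in range(M2):                                                 # Lines 4-8
            pw = generate_possible_world(UDt, rng)
            h = base_learner().fit(pw, y[idx])
            subs.append(h)
            trs.append(majority_vote(h, PWUD))                              # TR_j^t on PWUD
        accs = np.clip(np.array([np.mean(tr == y) for tr in trs]), 1e-6, 1 - 1e-6)
        u = (accs / (1 - accs)) / np.sum(accs / (1 - accs))                 # Eq. (4)
        votes = np.zeros((N1, int(y.max()) + 1))
        for tr, uj in zip(trs, u):
            votes[np.arange(N1), tr] += uj                                  # Eq. (3) on PWUD
        Ht_pred = votes.argmax(axis=1)
        eps = np.dot(w, (Ht_pred != y).astype(float))                       # Eq. (5)
        if eps > 0.5:                                                       # Lines 12-14
            w = np.full(N1, 1.0 / N1)
            continue
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))                   # Eq. (6)
        w = w * np.exp(np.where(Ht_pred == y, -alpha, alpha))               # Eq. (7)
        w = w / w.sum()
        basic_classifiers.append((subs, u))
        alphas.append(alpha)
    return basic_classifiers, alphas
```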


4.2. Prediction procedure

In the prediction procedure, PwAdaBoost utilizes the possible world set generated from the object to be predicted to get the results of the basic classifiers via majority voting and weighted voting. Algorithm 3 shows the details of the prediction procedure of PwAdaBoost. PWP denotes the possible world set generated from an uncertain object x_p which is to be predicted, PWP = {pwp_l | l = 1, 2, ..., M_3}, where each pwp_l is a possible world of x_p and M_3 is the number of possible worlds in PWP. Other symbols, which are similar to those in the training procedure, are omitted.

Algorithm 3 PwAdaBoost (Prediction procedure)
Input: Uncertain data classifier H, an uncertain object x_p, the number of possible worlds in PWP: M_3
Output: The class label of x_p
1: Generate PWP for the uncertain object x_p
2: for t = 1 to T do
3:   for j = 1 to M_2 do
4:     Get the result of sub-basic classifier h_j^t via majority voting
5:   end for
6: end for
7: for t = 1 to T do
8:   Obtain the result of basic classifier H^t via weighted voting
9: end for
10: Predict the class label of x_p via weighted voting by Eq. (8)

As shown in Algorithm 3, PwAdaBoost first generates the possible world set PWP for the uncertain object x_p (Line 1). Then it gets the result of each sub-basic classifier h_j^t on PWP via majority voting (Lines 2–6), and further obtains the result of each basic classifier H^t via weighted voting (Lines 7–9). Finally, it predicts the class label of x_p via weighted voting by Eq. (8) (Line 10).

4.3. Discussion

4.3.1. Stability

In most cases, the possible world set can provide appropriate realizations for the uncertain dataset. However, due to randomness, sometimes improper possible worlds may exist. In PwAdaBoost, the possible world is used in three places: PWUD, PW^t and PWP. For PWUD and PWP, we can generate larger possible world sets to counteract the randomness; for PW^t, we add the extra weighted voting procedure to address this issue. In the following we take a simple example to show how it works.

Assume that there are four uncertain objects and their true class labels are 1, 1, 0, 1 respectively. PW^t includes five possible worlds and the corresponding sub-basic classifiers are h_1^t, h_2^t, h_3^t, h_4^t, h_5^t. Their training results on PWUD via majority voting are shown in Table 2. Obviously, some sub-basic classifiers like h_4^t and h_5^t do not perform well, i.e., their corresponding possible worlds are improper. If we directly integrate these sub-basic classifiers into one basic classifier via majority voting, we get the basic classifier H′, whose accuracy is just 0.5. However, by adding the extra weighted voting procedure, we can increase the weights of the sub-basic classifiers obtained from the appropriate possible worlds and decrease the weights of the sub-basic classifiers obtained from the improper possible worlds, thus avoiding the influence of randomness. For the current example, by adding the extra weighted voting procedure, each sub-basic classifier is assigned a weight by Eq. (4), and the accuracy of the basic classifier H^t, which is constructed from the sub-basic classifiers and their corresponding weights, reaches 1.

Table 2
The example for discussion.

Classifier | Result | Accuracy | Weight
h_1^t | 1 1 1 1 | 0.75 | 9/23
h_2^t | 1 1 0 0 | 0.75 | 9/23
h_3^t | 0 0 0 1 | 0.5 | 3/23
h_4^t | 0 1 1 0 | 0.25 | 1/23
h_5^t | 1 0 1 0 | 0.25 | 1/23
H′ | 1 1 1 0 | 0.5 | –
H^t | 1 1 0 1 | 1 | –
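The weights and the corrected result in Table 2 can be checked directly from Eqs. (3) and (4); the short Python snippet below reproduces them (the sub-classifier outputs are copied from the table).

```python
import numpy as np

true_labels = np.array([1, 1, 0, 1])
# Majority-voting results of the five sub-basic classifiers on PWUD (Table 2).
results = np.array([[1, 1, 1, 1],
                    [1, 1, 0, 0],
                    [0, 0, 0, 1],
                    [0, 1, 1, 0],
                    [1, 0, 1, 0]])

acc = (results == true_labels).mean(axis=1)         # [0.75, 0.75, 0.5, 0.25, 0.25]
odds = acc / (1 - acc)
u = odds / odds.sum()                               # Eq. (4): [9/23, 9/23, 3/23, 1/23, 1/23]

# Weighted voting (Eq. (3)) over the two labels {0, 1}.
weight_for_1 = (u[:, None] * (results == 1)).sum(axis=0)
weight_for_0 = (u[:, None] * (results == 0)).sum(axis=0)
Ht = (weight_for_1 > weight_for_0).astype(int)       # -> [1, 1, 0, 1], accuracy 1.0

# Plain majority voting would instead give H' = [1, 1, 1, 0] with accuracy 0.5.
print(u, Ht, (Ht == true_labels).mean())
```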

Fig. 2. The training procedure of PwAdaBoost.

4.3.2. Parallelization

The traditional AdaBoost algorithm is not independent across iterations, so it is difficult to parallelize. For PwAdaBoost, however, some partial parallelization strategies can be applied to its training procedure and prediction procedure respectively, which can improve the efficiency significantly. In order to clearly explain why the parallelization strategies can be used, we provide flow charts for the training procedure and the prediction procedure of PwAdaBoost in Figs. 2 and 3. From Fig. 2, we can see that the steps in the rectangles with thick lines are relatively independent, which indicates that in each iteration the work of training different sub-basic classifiers can proceed simultaneously. From Fig. 3, we can see that the steps in the rectangles with thick lines are also relatively independent, which indicates that when predicting the class label of an uncertain object, both the basic classifiers and the sub-basic classifiers can work simultaneously, i.e., they can be parallelized.
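As a rough illustration of the training-side parallelization, the sub-basic classifiers of one iteration can be fitted concurrently. The sketch below uses Python's standard concurrent.futures module and assumes the same hypothetical `base_learner` and `generate_possible_world` helpers as in the earlier sketches; it is not the authors' Matlab parallelization.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def fit_one_sub_classifier(args):
    # Train one sub-basic classifier on one possible world of UD^t.
    base_learner, possible_world, labels = args
    return base_learner().fit(possible_world, labels)

def train_sub_classifiers_parallel(base_learner, UDt, labels, generate_possible_world,
                                   M2=10, workers=4, seed=0):
    rng = np.random.default_rng(seed)
    worlds = [generate_possible_world(UDt, rng) for _ in range(M2)]
    # The M2 fits are independent of each other, so they can run on separate cores
    # (4 cores are used in the experiments of Section 5.2.2).
    with ProcessPoolExecutor(max_workers=workers) as pool:
        subs = list(pool.map(fit_one_sub_classifier,
                             [(base_learner, pw, labels) for pw in worlds]))
    return subs
```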


Table 3
Real benchmark datasets.

Dataset | #Objects | #Attributes | #Classes
Glass | 214 | 9 | 6
Ecoli | 327 | 7 | 5
Vehicle | 946 | 18 | 4
Abalone | 4124 | 7 | 17
Pendigits | 10 992 | 16 | 10
Zoo | 101 | 16 | 7
Hayes-Roth | 160 | 3 | 3
Dermatology | 358 | 34 | 6
Car | 1728 | 6 | 4
Nursery | 12 960 | 8 | 5

Fig. 3. The prediction procedure of PwAdaBoost.

4.4. Time complexity

Assume that T is the iteration number in the training procedure, N_1 is the number of objects in the uncertain training set, N_2 is the number of objects in UD^t, and N_3 is the number of objects in the uncertain test set. M_1 is the number of possible worlds in PWUD, M_2 is the number of possible worlds in PW^t, and M_3 is the number of possible worlds in PWP. E_g is the time complexity of generating one possible world for an uncertain object, and E_h and E_p are respectively the training and prediction time complexities of the basic algorithm. (i) For the training procedure of PwAdaBoost, the time complexity of the preparation phase is O(M_1 N_1 E_g), and the time complexity of the main loop is O(T M_2 (N_2 E_g + N_2 E_h + M_1 N_1)). So the total time complexity of the training procedure is O(M_1 N_1 E_g + T M_2 (N_2 E_g + N_2 E_h + M_1 N_1)). (ii) For the prediction procedure of PwAdaBoost, the time complexity of generating the possible world set is O(M_3 N_3 E_g), and the time complexity of predicting the class labels for the test set is O(T M_2 M_3 N_3 E_p). So the total time complexity of the prediction procedure is O(M_3 N_3 (E_g + T M_2 E_p)).

5. Experiments

We conduct experiments on real benchmark datasets and real world uncertain datasets, and compare with the state-of-the-art uncertain data classification algorithms. Using real benchmark datasets, we aim to validate that PwAdaBoost can leverage different basic algorithms to deal with various types of datasets and uncertainty. Using real world uncertain datasets, we aim to further demonstrate the superiority of PwAdaBoost in real applications. Our programs are implemented in Matlab R2013a on a computer with an Intel Core i5-3470 3.2 GHz CPU and a 32 GB main memory.

5.1. Datasets

5.1.1. Real benchmark datasets

We perform experiments on 10 public real benchmark datasets, which include 5 numerical datasets and 5 categorical datasets; they can be obtained from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). The detailed dataset statistics are shown in Table 3. In terms of the distribution of classes, some datasets are balanced and some are unbalanced. As these datasets were originally established as collections of data with determinate values, we need to generate the uncertainty.

For numerical datasets, i.e., glass, ecoli, vehicle, abalone and pendigits, we follow the method in [14,16,52,53] to generate the uncertainty. Specifically, for the ith numerical attribute of the jth object x_j of UD, denoted by A_{ij}^{u_n}, where i ∈ [1, 2, ..., m] and m is the total dimensionality of the object, we generate the interval I^{(ij)} and the probability density function f^{(ij)} over the interval. I^{(ij)} is randomly chosen as a subinterval within [min_j(x^i), max_j(x^i)], where min_j(x^i) and max_j(x^i) are respectively the minimum and the maximum determinate values of the ith attribute over all the objects belonging to the same class as x_j, and we ensure that the determinate value of the ith attribute of x_j is within I^{(ij)}. For f^{(ij)}, we generate 3 kinds of distributions: uniform distribution, Gaussian distribution and Laplace distribution. For the uniform distribution, we directly use the interval I^{(ij)} to get the probability density function; for the Gaussian distribution and the Laplace distribution, we set the expected value to the determinate value of the ith attribute of x_j, and set the variance/scale parameter to a random value in [0, (1/3)(max_j(x^i) − min_j(x^i))].
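A minimal sketch of this uncertainty-generation step for a single numerical attribute value is shown below; the helper name, the dictionary format and the concrete random choices are illustrative assumptions rather than the authors' released code.

```python
import numpy as np

def make_uncertain_numerical(value, class_min, class_max, dist, rng):
    # Pick a random subinterval of [class_min, class_max] that contains the
    # determinate value, as described for the benchmark datasets above.
    low = rng.uniform(class_min, value)
    high = rng.uniform(value, class_max)
    spec = {"low": low, "high": high, "dist": dist}
    if dist in ("gaussian", "laplace"):
        # Expected value = determinate value; scale drawn from [0, (max - min)/3].
        spec["mean"] = value
        spec["scale"] = rng.uniform(0.0, (class_max - class_min) / 3.0)
    return spec

rng = np.random.default_rng(0)
print(make_uncertain_numerical(5.2, class_min=4.0, class_max=7.5, dist="gaussian", rng=rng))
```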

For categorical datasets, i.e., zoo, Hayes-Roth, dermatology, car and nursery, we follow the method in [14,16] to generate the uncertainty. Specifically, for the ith categorical attribute of the jth object x_j of UD, denoted by A_{ij}^{u_c}, where i ∈ [1, 2, ..., m] and m is the total dimensionality of the object, we generate the categorical domain Dom and the probability mass function p. For Dom, we set Dom = {v_1, v_2, ..., v_n}, where n is the number of different values of the ith attribute over all the objects belonging to the same class as x_j, and A_{ij}^{u_c} can take any value v_k from Dom, 1 ⩽ k ⩽ n. For p, we randomly generate p = {p_1, p_2, ..., p_n}, where p_k is the probability that A_{ij}^{u_c} equals v_k, and ∑_{k=1}^{n} p_k = 1. To retain some original information, we ensure that A_{ij}^{u_c} always takes the original determinate attribute value with more than 10% probability.

5.1.2. Real world uncertain datasets

We perform experiments on 4 real world uncertain datasets: Japanese vowels (http://archive.ics.uci.edu/ml/), Movement (http://archive.ics.uci.edu/ml/), NBA (http://espn.go.com/nba/) and ADL (http://archive.ics.uci.edu/ml/).

(1) Japanese vowels (Jap): it consists of 640 time series of 12 LPC cepstrum coefficients taken from 9 speakers, which means


this dataset is divided into nine classes. Each time series has 7∼29 records, and each record has 12 dimensions (coefficients). Each time series is treated as an uncertain object and each record of the time series is treated as a possible value of the uncertain object.

(2) Movement: it consists of 13,197 radio signal strength records about 314 temporal sequences. Each record has four dimensions, which correspond to four sensor nodes. According to the user movement path, the dataset is divided into six classes. Each temporal sequence is treated as an uncertain object and each record of the temporal sequence is treated as a possible value of the uncertain object.

(3) NBA: it consists of 2197 records about the top 300 players in the ESPN 2015 rank. Each record has five dimensions: number of points, number of rebounds, number of assists, number of steals and number of blocks. According to their season average performance, the players are divided into four classes: super star player, star player, key player and role player. Each player is treated as an uncertain object and each season average performance of the player is treated as a possible value of the uncertain object.

(4) ADL: it consists of 262,778 records about 503 accelerometer data sequences collected from activities of daily living. Each record has three dimensions: acceleration along the x axis, the y axis and the z axis of the accelerometer. According to the daily living activities, the dataset is divided into five classes: climbing stairs, drinking, getting up, pouring water and walking. Each accelerometer data sequence is treated as an uncertain object and each record of the sequence is treated as a possible value of the uncertain object.

5.2. Baselines and settings

5.2.1. Baselines

(1) For real benchmark datasets, to demonstrate that when using the same basic algorithm, PwAdaBoost can better deal with various types of datasets and uncertainty, we compare PwAdaBoost with the algorithms NBU (Bayes based) [14], UDT (decision tree based) [16], UNN (nearest neighbor based) [17] and EpAdaBoost, where EpAdaBoost denotes the original AdaBoost algorithm that makes use of the expected value to deal with data uncertainty. In order to ensure fairness, when using Bayes as the basic algorithm, we compare PwAdaBoost (using Bayes) with EpAdaBoost (using Bayes) and NBU; when using decision tree as the basic algorithm, we compare PwAdaBoost (using decision tree) with EpAdaBoost (using decision tree) and UDT; when using KNN as the basic algorithm, we compare PwAdaBoost (using KNN) with EpAdaBoost (using KNN) and UNN.

(2) For real world uncertain datasets, to further validate the superiority of PwAdaBoost in real applications, we compare PwAdaBoost with the existing uncertain data classification algorithms NBU [14], UDT [16], UNN [17], USVC [20], UELM [25], PWCLA [27] and EpAdaBoost. For convenience, we consistently use KNN as the basic algorithm for PwAdaBoost and EpAdaBoost.

5.2.2. Settings

In PwAdaBoost, for T and M_2, we provide a detailed parameter investigation; the setting method is described in Section 5.4. In addition, for M_1 and M_3, the investigation results in previous possible world based methods [27,54] show that setting them to 100 is enough to obtain satisfactory results, so we set M_1 = 100 and M_3 = 100.
For N_2, we follow the suggestions in previous boosting methods [55,56] and set N_2 = N_1, where N_1 denotes the number of objects in the training set. For EpAdaBoost, we ensure that it uses the same number of weak classifiers as PwAdaBoost, and to avoid massive calculations, we use the Monte Carlo method to calculate the expected value for

EpAdaBoost. In addition, for PwAdaBoost and EpAdaBoost, when using Bayes as the basic algorithm, we respectively employ the kernel density estimation method and the statistical method to deal with the numerical datasets and the categorical datasets. When using KNN as the basic algorithm, we adjust the parameter K ∈ {1, 3, ..., 9} until the result of each method becomes the best and stable. For the parallelization of PwAdaBoost, we use 4 cores. For the other compared algorithms, we follow the methods introduced in their original papers to adjust the parameters.

For all datasets except Japanese vowels, we randomly take 60% of the objects in the dataset as the training set and the remaining objects as the test set. For Japanese vowels, as this dataset is originally divided into a training set and a test set, we directly use them to perform experiments. For each algorithm, we average the results over 10 different runs so that the classification results are not affected by random chance.

5.3. Evaluation measures

We use accuracy (ACC) and F-measure [57], which are the most commonly used criteria, to evaluate the effectiveness of different algorithms.

Accuracy: Given a classification result, it can be calculated as:

Accuracy = (1/N) ∑_{i=1}^{N} χ(r_i, l_i),   (9)

where N denotes the number of objects in the test set, r_i denotes the class label of object o_i obtained from the classification method, and l_i denotes the true class label of object o_i. χ(x, y) equals 1 if x = y, and equals 0 otherwise.

F-measure: Given a classification result, it can be calculated as:

F-measure = 2PR / (P + R),
P = (1/K) ∑_{i=1}^{K} P_i,   R = (1/K) ∑_{i=1}^{K} R_i,
P_i = TP_i / (TP_i + FP_i),   R_i = TP_i / (TP_i + FN_i),   (10)

where P is the average precision and R is the average recall, K is the total number of classes in the test set, P_i is the precision of class i, R_i is the recall of class i, TP_i is the number of objects correctly labeled as class i, FP_i is the number of objects incorrectly labeled as class i, and FN_i is the number of objects which are not labeled as class i but should have been labeled as class i.
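As a quick check of Eqs. (9) and (10), the following Python snippet computes accuracy and the macro-averaged F-measure on a toy prediction vector; the toy labels are illustrative only.

```python
import numpy as np

def accuracy(pred, true):
    # Eq. (9): fraction of objects whose predicted label matches the true label.
    return np.mean(pred == true)

def macro_f_measure(pred, true):
    # Eq. (10): macro-averaged precision P and recall R over the K classes,
    # combined into F-measure = 2PR / (P + R).
    classes = np.unique(true)
    precisions, recalls = [], []
    for c in classes:
        tp = np.sum((pred == c) & (true == c))
        fp = np.sum((pred == c) & (true != c))
        fn = np.sum((pred != c) & (true == c))
        precisions.append(tp / (tp + fp) if tp + fp > 0 else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn > 0 else 0.0)
    P, R = np.mean(precisions), np.mean(recalls)
    return 2 * P * R / (P + R) if P + R > 0 else 0.0

true = np.array([0, 0, 1, 1, 2, 2])   # toy example
pred = np.array([0, 1, 1, 1, 2, 0])
print(accuracy(pred, true), macro_f_measure(pred, true))
```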

5.4. Parameter investigation

(1) Parameter T: Fig. 4 shows the effectiveness results of PwAdaBoost with different values of parameter T on the real world uncertain datasets. From the results, it can be seen that the performance of PwAdaBoost first increases and then remains stable. Specifically, when the parameter T is set in the range {10, 15, ..., 50}, the classification performance is always satisfactory and stable. As a larger parameter T affects the efficiency, in this paper we set T = 10 consistently.

(2) Parameter M_2: Fig. 5 shows the effectiveness results of PwAdaBoost with different values of parameter M_2 on the real world uncertain datasets. From the results, it can be seen that when parameter M_2 is set in the range {10, 15, ..., 50}, the classification performance is always good and stable. As a larger M_2 does not further improve the effectiveness and reduces the efficiency, in this paper we set M_2 = 10 consistently.


Fig. 4. Effectiveness results of PwAdaBoost with different parameter T values.

Fig. 5. Effectiveness results of PwAdaBoost with different parameter M2 values.

5.5. Effectiveness

5.5.1. Real benchmark datasets

Tables 4–9 show the effectiveness results on numerical and categorical real benchmark datasets by using different basic algorithms (the mean classification results (mean) ± the standard deviations (std)). For NBU, as it is based on the Gaussian distribution assumption, we only report the results of the Gaussian distribution. For UNN, as it is only applicable to numerical datasets, we only report the results on the numerical datasets.

From the results, it can be seen that when using the same basic algorithm, no matter what types of datasets and uncertainty, PwAdaBoost always performs better than the other compared algorithms. In some cases, PwAdaBoost even obtains better performance by a margin of more than 0.1. For example, when using Bayes as the basic algorithm, for car, PwAdaBoost improves upon EpAdaBoost by 0.1291 ACC and 0.228 F-measure, and improves upon NBU by 0.1887 ACC and 0.4 F-measure. The reason is that, compared with EpAdaBoost, PwAdaBoost introduces the possible world model instead of the expected value to deal with data uncertainty, thus avoiding the loss of uncertain information. Compared with NBU, UDT and UNN, PwAdaBoost leverages the idea of ensemble learning and redesigns the traditional AdaBoost framework, thus enhancing the learning ability greatly.


Table 4
Effectiveness on numerical real benchmark datasets (using Bayes).

Dataset | Distribution | Metric | NBU | EpAdaBoost | PwAdaBoost
Glass | Uniform | ACC | – | 0.8012 ± 0.0201 | 0.8442 ± 0.0195
Glass | Uniform | F-measure | – | 0.4559 ± 0.0228 | 0.5178 ± 0.0315
Glass | Gaussian | ACC | 0.3488 ± 0.0690 | 0.7872 ± 0.0257 | 0.8535 ± 0.0133
Glass | Gaussian | F-measure | 0.1162 ± 0.0364 | 0.4433 ± 0.0381 | 0.5119 ± 0.0211
Glass | Laplace | ACC | – | 0.8058 ± 0.0198 | 0.8860 ± 0.0127
Glass | Laplace | F-measure | – | 0.4303 ± 0.0205 | 0.5687 ± 0.0230
Ecoli | Uniform | ACC | – | 0.8908 ± 0.0072 | 0.9405 ± 0.0032
Ecoli | Uniform | F-measure | – | 0.6776 ± 0.0175 | 0.8113 ± 0.0085
Ecoli | Gaussian | ACC | 0.7939 ± 0.0239 | 0.8855 ± 0.0114 | 0.9122 ± 0.0074
Ecoli | Gaussian | F-measure | 0.4744 ± 0.0259 | 0.6855 ± 0.0267 | 0.7514 ± 0.0177
Ecoli | Laplace | ACC | – | 0.8740 ± 0.0103 | 0.9153 ± 0.0043
Ecoli | Laplace | F-measure | – | 0.6675 ± 0.0200 | 0.7609 ± 0.0098
Vehicle | Uniform | ACC | – | 0.8006 ± 0.0090 | 0.8467 ± 0.0110
Vehicle | Uniform | F-measure | – | 0.6650 ± 0.0125 | 0.7284 ± 0.0175
Vehicle | Gaussian | ACC | 0.5947 ± 0.0182 | 0.7938 ± 0.0112 | 0.8047 ± 0.0042
Vehicle | Gaussian | F-measure | 0.4189 ± 0.0159 | 0.6459 ± 0.0171 | 0.6621 ± 0.0066
Vehicle | Laplace | ACC | – | 0.7636 ± 0.0088 | 0.8077 ± 0.0098
Vehicle | Laplace | F-measure | – | 0.6120 ± 0.0119 | 0.6726 ± 0.0144
Abalone | Uniform | ACC | – | 0.3424 ± 0.0060 | 0.3670 ± 0.0013
Abalone | Uniform | F-measure | – | 0.0530 ± 0.0013 | 0.0581 ± 0.0003
Abalone | Gaussian | ACC | 0.3224 ± 0.0078 | 0.3082 ± 0.0053 | 0.3366 ± 0.0023
Abalone | Gaussian | F-measure | 0.0494 ± 0.0014 | 0.0463 ± 0.0011 | 0.0519 ± 0.0005
Abalone | Laplace | ACC | – | 0.2973 ± 0.0062 | 0.3130 ± 0.0059
Abalone | Laplace | F-measure | – | 0.0442 ± 0.0012 | 0.0472 ± 0.0011
Pendigits | Uniform | ACC | – | 0.9584 ± 0.0022 | 0.9756 ± 0.0020
Pendigits | Uniform | F-measure | – | 0.8206 ± 0.0083 | 0.8881 ± 0.0083
Pendigits | Gaussian | ACC | 0.6602 ± 0.0086 | 0.8947 ± 0.0025 | 0.9104 ± 0.0016
Pendigits | Gaussian | F-measure | 0.2696 ± 0.0075 | 0.6276 ± 0.0062 | 0.6678 ± 0.0044
Pendigits | Laplace | ACC | – | 0.8943 ± 0.0044 | 0.9124 ± 0.0032
Pendigits | Laplace | F-measure | – | 0.6266 ± 0.0109 | 0.6730 ± 0.0089

5.5.2. Real world uncertain datasets

Table 10 shows the effectiveness results on real world uncertain datasets (the mean classification results (mean) ± the standard deviations (std)). From the overall perspective, it can be seen that the ACC and F-measure scores of PwAdaBoost are always higher than those of the compared algorithms. This is because PwAdaBoost employs the possible world model to deal with uncertain data, which reduces the loss of uncertain information; moreover, PwAdaBoost utilizes the idea of ensemble learning, which improves the classification performance significantly. NBU performs worse than PwAdaBoost because it relies on the ideal Gaussian distribution assumption, so it is difficult for it to perform well on some real world uncertain datasets with complex distributions. UDT does not perform as well as PwAdaBoost because it introduces definitions such as probabilistic entropy and probabilistic information gain to construct the decision tree, but these definitions lack explicit statistical meanings. UNN and USVC do not perform as well as PwAdaBoost because, although they integrate the probability distribution information into traditional classification algorithms to deal with uncertain data, they still do not leverage the idea of ensemble learning. UELM and EpAdaBoost also perform worse than PwAdaBoost because they respectively rely on the grid-based method and the expected value to deal with uncertain data, which may cause the loss of uncertain information. PWCLA usually performs better than the other compared algorithms but worse than PwAdaBoost, because PWCLA is a transductive semi-supervised classification method that uses the feature information of the test set during the training process; however, it does not use the idea of ensemble learning to achieve the classification. On average, PwAdaBoost performs the best among all the algorithms.

From the score on each dataset, it can be seen that PwAdaBoost performs more stably than the compared algorithms. In particular, for Japanese vowels (Jap), the dataset is relatively simple and most algorithms can obtain satisfactory classification results, and PwAdaBoost performs better than the other algorithms with more than 0.97 ACC and 0.86 F-measure; for ADL, the dataset is relatively complex and the other algorithms cannot work well, while PwAdaBoost still obtains more than 0.81 ACC and 0.64 F-measure. This is because, compared with the other algorithms, PwAdaBoost employs the possible world model to deal with uncertain data and utilizes the idea of ensemble learning to improve the classification performance. Overall, in terms of effectiveness, our proposed algorithm PwAdaBoost performs excellently compared with the baseline algorithms.

5.6. Efficiency

Fig. 6 shows the efficiency results of different algorithms on the real world uncertain datasets (in milliseconds). From the results, it can be seen that UNN is the slowest among all the algorithms, because the computation of the nearest neighbor class probability in UNN is very complex and time consuming. USVC and UELM are the two fastest algorithms, as their optimization methods are simple, so their efficiency is good. NBU and UDT are slower than USVC and UELM because both NBU and UDT need a lot of integral calculations, which seriously affects their running speed. PWCLA is slower than USVC and UELM, but faster than the other algorithms,


Table 5
Effectiveness on numerical real benchmark datasets (using decision tree).

Dataset | Distribution | Metric | UDT | EpAdaBoost | PwAdaBoost
Glass | Uniform | ACC | 0.8605 ± 0.0362 | 0.8186 ± 0.0275 | 0.8953 ± 0.0116
Glass | Uniform | F-measure | 0.5746 ± 0.0828 | 0.4928 ± 0.0300 | 0.6373 ± 0.0284
Glass | Gaussian | ACC | 0.8488 ± 0.0382 | 0.8733 ± 0.0177 | 0.8953 ± 0.0087
Glass | Gaussian | F-measure | 0.5582 ± 0.0510 | 0.6108 ± 0.0347 | 0.6642 ± 0.0152
Glass | Laplace | ACC | 0.8488 ± 0.0475 | 0.8453 ± 0.0078 | 0.8814 ± 0.0052
Glass | Laplace | F-measure | 0.5197 ± 0.0551 | 0.5308 ± 0.0136 | 0.5902 ± 0.0106
Ecoli | Uniform | ACC | 0.8779 ± 0.0268 | 0.8977 ± 0.0064 | 0.9160 ± 0.0054
Ecoli | Uniform | F-measure | 0.6536 ± 0.0440 | 0.6968 ± 0.0148 | 0.7354 ± 0.0132
Ecoli | Gaussian | ACC | 0.8855 ± 0.0238 | 0.8947 ± 0.0118 | 0.9145 ± 0.0034
Ecoli | Gaussian | F-measure | 0.6755 ± 0.0517 | 0.6943 ± 0.0242 | 0.7292 ± 0.0074
Ecoli | Laplace | ACC | 0.8626 ± 0.0199 | 0.8878 ± 0.0052 | 0.9191 ± 0.0042
Ecoli | Laplace | F-measure | 0.5966 ± 0.0337 | 0.6610 ± 0.0120 | 0.7318 ± 0.0097
Vehicle | Uniform | ACC | 0.8225 ± 0.0252 | 0.8349 ± 0.0072 | 0.8604 ± 0.0173
Vehicle | Uniform | F-measure | 0.6854 ± 0.0467 | 0.7121 ± 0.0113 | 0.7524 ± 0.0275
Vehicle | Gaussian | ACC | 0.7929 ± 0.0615 | 0.7902 ± 0.0111 | 0.8142 ± 0.0032
Vehicle | Gaussian | F-measure | 0.6531 ± 0.0723 | 0.6480 ± 0.0159 | 0.6833 ± 0.0047
Vehicle | Laplace | ACC | 0.7396 ± 0.0214 | 0.7991 ± 0.0126 | 0.8130 ± 0.0082
Vehicle | Laplace | F-measure | 0.5713 ± 0.0289 | 0.6615 ± 0.0184 | 0.6826 ± 0.0121
Abalone | Uniform | ACC | 0.3673 ± 0.0218 | 0.3808 ± 0.0059 | 0.4143 ± 0.0021
Abalone | Uniform | F-measure | 0.0551 ± 0.0060 | 0.0603 ± 0.0013 | 0.0669 ± 0.0005
Abalone | Gaussian | ACC | 0.3370 ± 0.0116 | 0.4421 ± 0.0094 | 0.4698 ± 0.0016
Abalone | Gaussian | F-measure | 0.0484 ± 0.0025 | 0.0747 ± 0.0023 | 0.0817 ± 0.0005
Abalone | Laplace | ACC | 0.3673 ± 0.0241 | 0.4004 ± 0.0060 | 0.4275 ± 0.0034
Abalone | Laplace | F-measure | 0.0553 ± 0.0063 | 0.0652 ± 0.0014 | 0.0710 ± 0.0008
Pendigits | Uniform | ACC | 0.9481 ± 0.0027 | 0.9853 ± 0.0014 | 0.9906 ± 0.0006
Pendigits | Uniform | F-measure | 0.7848 ± 0.0073 | 0.9304 ± 0.0062 | 0.9547 ± 0.0029
Pendigits | Gaussian | ACC | 0.9123 ± 0.0068 | 0.9676 ± 0.0016 | 0.9915 ± 0.0005
Pendigits | Gaussian | F-measure | 0.6751 ± 0.0186 | 0.8560 ± 0.0063 | 0.9588 ± 0.0026
Pendigits | Laplace | ACC | 0.9736 ± 0.0031 | 0.9703 ± 0.0012 | 0.9929 ± 0.0005
Pendigits | Laplace | F-measure | 0.8804 ± 0.0084 | 0.8669 ± 0.0048 | 0.9654 ± 0.0023

Fig. 6. Efficiency on real world uncertain datasets.

5.6. Efficiency

Fig. 6 shows the running times of the different algorithms on the real world uncertain datasets (in milliseconds). UNN is the slowest among all the algorithms, because the computation of its nearest neighbor class probability is complex and time consuming. USVC and UELM are the two fastest algorithms; their optimization procedures are simple, so their efficiency is good. NBU and UDT are slower than USVC and UELM, because both need a large number of integral calculations, which seriously affects their running speed. PWCLA is slower than USVC and UELM but faster than the other compared algorithms, because its convergence speed is very fast. Compared with the other algorithms, PwAdaBoost performs competitively: it introduces the possible world model to deal with uncertain data, which avoids the massive integral calculations, and its parallelization strategies further improve its efficiency greatly. In summary, our proposed algorithm PwAdaBoost is competitive with the other algorithms in terms of efficiency.
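The prediction-side parallelization mentioned above can be sketched as follows. This is our illustration under the assumption that each possible world sampled from an uncertain test object is classified independently and the per-world predictions are then combined by majority voting; the function names and the use of a decision tree as the certain-data classifier are hypothetical.

```python
# Minimal sketch (not the authors' implementation) of parallelizing prediction:
# classify every sampled possible world in a separate process, then majority-vote.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stands in for any certain-data classifier


def classify_world(args):
    clf, world = args
    return clf.predict([world])[0]


def parallel_majority_vote(clf, possible_worlds, n_workers=4):
    """Classify each possible world in its own process and majority-vote the results."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        votes = list(pool.map(classify_world, [(clf, w) for w in possible_worlds]))
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 2)), rng.integers(0, 2, size=100)
    clf = DecisionTreeClassifier().fit(X, y)
    # five possible worlds sampled around one uncertain test object
    worlds = rng.normal(loc=[0.5, 0.5], scale=0.1, size=(5, 2))
    print(parallel_majority_vote(clf, worlds))
```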


Table 6
Effectiveness on numerical real benchmark datasets (using KNN); each entry is mean ± std.

Dataset     Distribution  Metric     UNN              EpAdaBoost       PwAdaBoost
Glass       Uniform       ACC        0.6279 ± 0.0242  0.7174 ± 0.0182  0.7907 ± 0.0273
Glass       Uniform       F-measure  0.2899 ± 0.0323  0.3984 ± 0.0172  0.4799 ± 0.0329
Glass       Gaussian      ACC        0.6977 ± 0.0355  0.7070 ± 0.0294  0.7605 ± 0.0133
Glass       Gaussian      F-measure  0.3313 ± 0.0156  0.3817 ± 0.0274  0.4367 ± 0.0172
Glass       Laplace       ACC        0.7326 ± 0.0440  0.7349 ± 0.0244  0.7698 ± 0.0052
Glass       Laplace       F-measure  0.3633 ± 0.0226  0.3923 ± 0.0252  0.4162 ± 0.0078
Ecoli       Uniform       ACC        0.8931 ± 0.0245  0.8634 ± 0.0150  0.9130 ± 0.0087
Ecoli       Uniform       F-measure  0.6900 ± 0.0570  0.6385 ± 0.0273  0.7234 ± 0.0178
Ecoli       Gaussian      ACC        0.8855 ± 0.0233  0.8450 ± 0.0157  0.9053 ± 0.0042
Ecoli       Gaussian      F-measure  0.6799 ± 0.0520  0.6035 ± 0.0295  0.7209 ± 0.0087
Ecoli       Laplace       ACC        0.8931 ± 0.0245  0.8977 ± 0.0145  0.9160 ± 0.0132
Ecoli       Laplace       F-measure  0.6785 ± 0.0687  0.6950 ± 0.0304  0.7272 ± 0.0125
Vehicle     Uniform       ACC        0.7012 ± 0.0090  0.6429 ± 0.0160  0.7639 ± 0.0099
Vehicle     Uniform       F-measure  0.5275 ± 0.0102  0.4666 ± 0.0188  0.6147 ± 0.0133
Vehicle     Gaussian      ACC        0.7071 ± 0.0146  0.7006 ± 0.0130  0.7592 ± 0.0045
Vehicle     Gaussian      F-measure  0.5271 ± 0.0157  0.5351 ± 0.0154  0.6074 ± 0.0054
Vehicle     Laplace       ACC        0.6982 ± 0.0275  0.6382 ± 0.0125  0.7485 ± 0.0151
Vehicle     Laplace       F-measure  0.5209 ± 0.0348  0.4601 ± 0.0148  0.5913 ± 0.0201
Abalone     Uniform       ACC        0.4345 ± 0.0070  0.3376 ± 0.0065  0.4720 ± 0.0019
Abalone     Uniform       F-measure  0.0726 ± 0.0011  0.0524 ± 0.0013  0.0808 ± 0.0005
Abalone     Gaussian      ACC        0.4224 ± 0.0078  0.3277 ± 0.0081  0.4602 ± 0.0030
Abalone     Gaussian      F-measure  0.0704 ± 0.0012  0.0507 ± 0.0016  0.0797 ± 0.0007
Abalone     Laplace       ACC        0.4224 ± 0.0064  0.3081 ± 0.0110  0.4610 ± 0.0044
Abalone     Laplace       F-measure  0.0702 ± 0.0011  0.0468 ± 0.0022  0.0800 ± 0.0011
Pendigits   Uniform       ACC        0.9182 ± 0.0014  0.8632 ± 0.0030  0.9448 ± 0.0010
Pendigits   Uniform       F-measure  0.6915 ± 0.0040  0.5565 ± 0.0063  0.7729 ± 0.0034
Pendigits   Gaussian      ACC        0.8569 ± 0.0034  0.8614 ± 0.0024  0.9031 ± 0.0021
Pendigits   Gaussian      F-measure  0.5442 ± 0.0071  0.5530 ± 0.0050  0.6500 ± 0.0055
Pendigits   Laplace       ACC        0.8571 ± 0.0059  0.8562 ± 0.0045  0.9129 ± 0.0013
Pendigits   Laplace       F-measure  0.5446 ± 0.0121  0.5423 ± 0.0092  0.6764 ± 0.0035

Table 7
Effectiveness on categorical real benchmark datasets (using Bayes); each entry is mean ± std.

Dataset       Metric     NBU              EpAdaBoost       PwAdaBoost
Zoo           ACC        0.9000 ± 0.0411  0.9225 ± 0.0275  0.9700 ± 0.0197
Zoo           F-measure  0.6440 ± 0.1404  0.6997 ± 0.0730  0.8481 ± 0.0964
Hayes-Roth    ACC        0.5156 ± 0.0621  0.5453 ± 0.0708  0.8266 ± 0.0115
Hayes-Roth    F-measure  0.3575 ± 0.0562  0.4222 ± 0.0626  0.7329 ± 0.0168
Dermatology   ACC        0.9790 ± 0.0111  0.9490 ± 0.0171  0.9993 ± 0.0003
Dermatology   F-measure  0.9270 ± 0.0334  0.8310 ± 0.0505  0.9974 ± 0.0014
Car           ACC        0.7164 ± 0.0179  0.7760 ± 0.0102  0.9051 ± 0.0096
Car           F-measure  0.2087 ± 0.0053  0.3807 ± 0.0157  0.6087 ± 0.0211
Nursery       ACC        0.8374 ± 0.0040  0.7979 ± 0.0061  0.9010 ± 0.0015
Nursery       F-measure  0.4628 ± 0.0528  0.4393 ± 0.0055  0.5326 ± 0.0022

Table 8
Effectiveness on categorical real benchmark datasets (using decision tree); each entry is mean ± std.

Dataset       Metric     UDT              EpAdaBoost       PwAdaBoost
Zoo           ACC        0.8500 ± 0.0335  0.9025 ± 0.0142  0.9500 ± 0.0112
Zoo           F-measure  0.4853 ± 0.0481  0.6096 ± 0.0430  0.7681 ± 0.0213
Hayes-Roth    ACC        0.7500 ± 0.0461  0.6328 ± 0.0355  0.8594 ± 0.0215
Hayes-Roth    F-measure  0.6322 ± 0.0540  0.5010 ± 0.0376  0.7728 ± 0.0224
Dermatology   ACC        0.9231 ± 0.0152  0.9650 ± 0.0074  0.9790 ± 0.0067
Dermatology   F-measure  0.7698 ± 0.0451  0.8890 ± 0.0215  0.9296 ± 0.0104
Car           ACC        0.6932 ± 0.0170  0.7813 ± 0.0171  0.8933 ± 0.0049
Car           F-measure  0.2047 ± 0.0029  0.4312 ± 0.0185  0.5984 ± 0.0132
Nursery       ACC        0.6584 ± 0.0039  0.7931 ± 0.0032  0.9283 ± 0.0033
Nursery       F-measure  0.2633 ± 0.0280  0.4345 ± 0.0025  0.5601 ± 0.0054


Table 9
Effectiveness on categorical real benchmark datasets (using KNN); each entry is mean ± std ("–" indicates no result).

Dataset       Metric     UNN  EpAdaBoost       PwAdaBoost
Zoo           ACC        –    0.9200 ± 0.0197  0.9850 ± 0.0129
Zoo           F-measure  –    0.6873 ± 0.0523  0.9239 ± 0.0655
Hayes-Roth    ACC        –    0.7359 ± 0.0238  0.8125 ± 0.0134
Hayes-Roth    F-measure  –    0.6182 ± 0.0310  0.6947 ± 0.0301
Dermatology   ACC        –    0.8958 ± 0.0166  0.9301 ± 0.0074
Dermatology   F-measure  –    0.7117 ± 0.0353  0.7857 ± 0.0179
Car           ACC        –    0.7893 ± 0.0151  0.8781 ± 0.0048
Car           F-measure  –    0.4156 ± 0.0188  0.5174 ± 0.0153
Nursery       ACC        –    0.7864 ± 0.0040  0.9155 ± 0.0025
Nursery       F-measure  –    0.4276 ± 0.0032  0.5300 ± 0.0029


6. Conclusion

In this paper, we propose a novel possible world based AdaBoost algorithm for classifying uncertain data (PwAdaBoost). By introducing the possible world model, PwAdaBoost can deal with various types of datasets and uncertainty. By leveraging the idea of ensemble learning and redesigning the traditional AdaBoost framework, PwAdaBoost improves the learning ability and makes it possible to use any existing classification algorithm for certain data to serve uncertain data. Extensive experiment results on different types of datasets show that the proposed algorithm PwAdaBoost outperforms the other algorithms in effectiveness and performs competitively in efficiency. For future work, we will try to extend the possible world model and the idea of ensemble learning to uncertain data stream classification and uncertain data clustering.

Acknowledgments

The authors are grateful to the editor in chief, the associate editor and the reviewers for their valuable comments and suggestions. This work is supported by National Science Foundation of China (No. 61876028).

Appendix

This appendix explains why Eq. (4) is the best choice for setting the weights of the sub-basic classifiers.

For brevity, we formalize the concerned problem as follows. Let H(x) denote an ensemble classifier consisting of a set of individual classifiers, h_j(x) denote the j-th individual classifier, w_j denote the weight of h_j(x), and Y denote the class label set with y_i \in Y. Then we have:

H(x) = \arg\max_{y_i \in Y} \sum_{j=1}^{M} w_j \, \delta[h_j(x) = y_i],   (11)

where M is the number of individual classifiers and j \in \{1, 2, \ldots, M\}. The weights must satisfy w_j \geq 0 and \sum_{j=1}^{M} w_j = 1, and \delta[\cdot] equals 1 if the condition in brackets holds and 0 otherwise. Our goal is to find out how to set the weights w_j so that H(x) achieves the best result.

A.1. The proof procedure

Assume that the individual classifier set is h = \{h_1, h_2, \ldots, h_M\}, the accuracies of h are p = \{p_1, p_2, \ldots, p_M\}, and the predicted results for an object x given by h are r = \{r_1, r_2, \ldots, r_M\}. From the Bayesian viewpoint, obtaining the best H(x) is equivalent to finding the best way to compute P(y_i | r). As stated in [58], the Bayes optimal discriminant function for a class label y_i is:

P(y_i | r) = \frac{P(y_i) P(r | y_i)}{P(r)}, \qquad P(r) = \sum_{i=1}^{K} P(y_i) P(r | y_i),   (12)

where K is the number of classes. As P(r) is a constant, the optimal discriminant function can be written as:

\ln P(y_i | r) = \ln P(y_i) P(r | y_i) = \ln P(y_i) + \ln P(r | y_i).   (13)

As the individual classifiers are assumed to be relatively independent, for a given class label y_i we have P(r | y_i) = \prod_{j=1}^{M} P(r_j | y_i), and therefore:

\ln P(y_i | r) = \ln P(y_i) + \ln \prod_{j=1}^{M} P(r_j | y_i) = \ln P(y_i) + \sum_{j=1}^{M} \ln P(r_j | y_i)
             = \ln P(y_i) + \sum_{j=1, r_j = y_i}^{M} \ln P(r_j | y_i) + \sum_{j=1, r_j \neq y_i}^{M} \ln P(r_j | y_i).   (14)

Table 10
Effectiveness on real world uncertain datasets; each entry is mean ± std.

Method       Metric     Jap              Movement         NBA              ADL
NBU          ACC        0.6243 ± 0.0000  0.4048 ± 0.0092  0.6083 ± 0.0363  0.4187 ± 0.0057
NBU          F-measure  0.2236 ± 0.0000  0.1549 ± 0.0019  0.3185 ± 0.0393  0.2009 ± 0.0027
UDT          ACC        0.8919 ± 0.0000  0.6111 ± 0.0795  0.3833 ± 0.1001  0.6059 ± 0.0114
UDT          F-measure  0.6198 ± 0.0000  0.3010 ± 0.0601  0.1921 ± 0.0366  0.3449 ± 0.0186
UNN          ACC        0.9027 ± 0.0192  0.6349 ± 0.0399  0.5750 ± 0.0167  0.6404 ± 0.0261
UNN          F-measure  0.6521 ± 0.0683  0.3422 ± 0.0349  0.2612 ± 0.0111  0.3748 ± 0.0291
USVC         ACC        0.8973 ± 0.0103  0.3651 ± 0.0183  0.6167 ± 0.0255  0.6453 ± 0.0099
USVC         F-measure  0.6397 ± 0.0212  0.1239 ± 0.0141  0.3104 ± 0.0211  0.3870 ± 0.0125
UELM         ACC        0.8432 ± 0.0081  0.5635 ± 0.0046  0.5667 ± 0.0337  0.5862 ± 0.0421
UELM         F-measure  0.5027 ± 0.0263  0.2731 ± 0.0066  0.2720 ± 0.0261  0.3162 ± 0.0255
PWCLA        ACC        0.9595 ± 0.0109  0.6825 ± 0.0483  0.5083 ± 0.0268  0.7685 ± 0.0344
PWCLA        F-measure  0.8239 ± 0.0365  0.3790 ± 0.0338  0.2810 ± 0.0150  0.5060 ± 0.0380
EpAdaBoost   ACC        0.9054 ± 0.0106  0.7817 ± 0.0086  0.4717 ± 0.0090  0.7463 ± 0.0110
EpAdaBoost   F-measure  0.6585 ± 0.0258  0.5214 ± 0.0108  0.2453 ± 0.0060  0.5378 ± 0.0148
PwAdaBoost   ACC        0.9703 ± 0.0048  0.8079 ± 0.0242  0.6333 ± 0.0153  0.8192 ± 0.0198
PwAdaBoost   F-measure  0.8661 ± 0.0193  0.5466 ± 0.0383  0.3427 ± 0.0177  0.6412 ± 0.0313


Since P(r_j | y_i) = p_j if r_j = y_i and P(r_j | y_i) = 1 - p_j if r_j \neq y_i, Eq. (14) can be written as:

\ln P(y_i | r) = \ln P(y_i) + \sum_{j=1, r_j = y_i}^{M} \ln p_j + \sum_{j=1, r_j \neq y_i}^{M} \ln(1 - p_j)
             = \ln P(y_i) + \sum_{j=1, r_j = y_i}^{M} \ln p_j - \sum_{j=1, r_j = y_i}^{M} \ln(1 - p_j) + \sum_{j=1, r_j = y_i}^{M} \ln(1 - p_j) + \sum_{j=1, r_j \neq y_i}^{M} \ln(1 - p_j)
             = \ln P(y_i) + \sum_{j=1, r_j = y_i}^{M} \ln \frac{p_j}{1 - p_j} + \sum_{j=1}^{M} \ln(1 - p_j),   (15)

where \sum_{j=1, r_j = y_i}^{M} \ln \frac{p_j}{1 - p_j} can be represented as \sum_{j=1}^{M} \ln \frac{p_j}{1 - p_j} \, \delta[h_j(x) = y_i]. Then we have:

\ln P(y_i | r) = \ln P(y_i) + \sum_{j=1}^{M} \ln(1 - p_j) + \sum_{j=1}^{M} \ln \frac{p_j}{1 - p_j} \, \delta[h_j(x) = y_i].   (16)

In Eq. (16), \ln P(y_i) depends only on the class label y_i, and \sum_{j=1}^{M} \ln(1 - p_j) is a constant. Hence \ln \frac{p_j}{1 - p_j} reveals the optimal weight of each individual classifier: to obtain the best H(x), w_j should satisfy w_j \propto \ln \frac{p_j}{1 - p_j}, which can be simplified to w_j \propto \frac{p_j}{1 - p_j}.

In order to satisfy the conditions w_j \geq 0 and \sum_{j=1}^{M} w_j = 1 while keeping w_j \propto \frac{p_j}{1 - p_j}, we set w_j = \frac{p_j / (1 - p_j)}{\sum_{l=1}^{M} p_l / (1 - p_l)}. From this formula it is easy to verify that w_j \geq 0 and \sum_{j=1}^{M} w_j = 1. For the proportionality, let X = \frac{p_j}{1 - p_j}; then w_j = \frac{X}{X + C}, where C = \sum_{l \neq j} \frac{p_l}{1 - p_l} is a positive number. The first derivative of w_j with respect to X is \frac{C}{(X + C)^2} \geq 0, so w_j increases monotonically with X, i.e., w_j \propto \frac{p_j}{1 - p_j} \propto \ln \frac{p_j}{1 - p_j}.

Therefore, setting w_j = \frac{p_j / (1 - p_j)}{\sum_{l=1}^{M} p_l / (1 - p_l)} can make H(x) achieve the best result.
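The weight setting this derivation arrives at, together with the voting rule of Eq. (11), can be illustrated with the following minimal sketch; the code and function names are ours, for illustration only, and are not part of the paper.

```python
import numpy as np


def optimal_weights(accuracies):
    """w_j = (p_j / (1 - p_j)) / sum_l (p_l / (1 - p_l)), as derived in the appendix.
    accuracies: individual-classifier accuracies p_j, assumed to lie strictly in (0, 1)."""
    p = np.asarray(accuracies, dtype=float)
    odds = p / (1.0 - p)          # p_j / (1 - p_j)
    return odds / odds.sum()      # normalize so the weights sum to 1


def weighted_vote(predictions, weights, labels):
    """Eq. (11): H(x) = argmax_y sum_j w_j * delta[h_j(x) == y]."""
    scores = {y: sum(w for pred, w in zip(predictions, weights) if pred == y)
              for y in labels}
    return max(scores, key=scores.get)


# Toy usage: three classifiers with accuracies 0.6, 0.7 and 0.9.
w = optimal_weights([0.6, 0.7, 0.9])
# The third classifier's weight (~0.70) outweighs the other two combined (~0.30),
# so its vote for 'b' wins even though the majority predicts 'a'.
print(weighted_vote(['a', 'a', 'b'], w, labels=['a', 'b']))
```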

References

[1] C.C. Aggarwal, P.S. Yu, A survey of uncertain data algorithms and applications, IEEE Trans. Knowl. Data Eng. 21 (5) (2009) 609–623.
[2] C.C. Aggarwal, Data Classification: Algorithms and Applications, CRC Press, 2014.
[3] X. Zhang, H. Liu, X. Zhang, Novel density-based and hierarchical density-based clustering algorithms for uncertain data, Neural Netw. 93 (2017) 240–255.
[4] H. Liu, X. Zhang, X. Zhang, Y. Cui, Self-adapted mixture distance measure for clustering uncertain data, Knowl.-Based Syst. 126 (2017) 33–47.
[5] G. Trajcevski, O. Wolfson, K. Hinrichs, S. Chamberlain, Managing uncertainty in moving objects databases, ACM Trans. Database Syst. 29 (3) (2004) 463–507.
[6] A. Deshpande, C. Guestrin, S. Madden, J.M. Hellerstein, W. Hong, Model-based approximate querying in sensor networks, VLDB J. 14 (4) (2005) 417–443.
[7] X. Liu, M. Milo, N.D. Lawrence, M. Rattray, A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips, Bioinformatics 21 (18) (2005) 3637–3644.
[8] A.D. Sarma, O. Benjelloun, A.Y. Halevy, S.U. Nabar, J. Widom, Representing uncertain data: Models, properties, and algorithms, VLDB J. 18 (5) (2009) 989–1019.
[9] R. Jampani, F. Xu, M. Wu, L.L. Perez, C. Jermaine, P.J. Haas, MCDB: A Monte Carlo approach to managing uncertain data, in: Proceedings of SIGMOD, 2008, pp. 687–700.

[10] W. Zhang, X. Lin, J. Pei, Y. Zhang, Managing uncertain data: Probabilistic approaches, in: Proceedings of WAIM, 2008, pp. 405–412.
[11] M. Hua, J. Pei, Ranking Queries on Uncertain Data, in: Advances in Database Systems, Springer Press, 2011.
[12] M. Dallachiesa, T. Palpanas, I.F. Ilyas, Top-k nearest neighbor search in uncertain data series, Proc. VLDB (2014) 13–24.
[13] J. Ren, S.D. Lee, X. Chen, B. Kao, R. Cheng, D. Cheung, Naive Bayes classification of uncertain data, in: Proceedings of ICDM, 2009, pp. 944–949.
[14] B. Qin, Y. Xia, S. Wang, X. Du, A novel Bayesian classification for uncertain data, Knowl.-Based Syst. 24 (8) (2011) 1151–1158.
[15] B. Qin, Y. Xia, F. Li, DTU: A decision tree for uncertain data, in: Proceedings of PAKDD, 2009, pp. 4–15.
[16] S. Tsang, B. Kao, K.Y. Yip, W.-S. Ho, S.D. Lee, Decision trees for uncertain data, IEEE Trans. Knowl. Data Eng. 23 (1) (2011) 64–78.
[17] F. Angiulli, F. Fassetti, Nearest neighbor-based classification of uncertain data, ACM Trans. Knowl. Discov. Data 7 (1) (2013) 1–34.
[18] B. Tavakkol, M.K. Jeong, S.L. Albin, Object-to-group probabilistic distance measure for uncertain data classification, Neurocomputing 230 (2017) 143–151.
[19] J. Bi, T. Zhang, Support vector classification with input data uncertainty, in: Proceedings of NIPS, 2004, pp. 161–168.
[20] J. Yang, S.R. Gunn, Exploiting uncertain data in support vector classification, in: Proceedings of KES, 2007, pp. 148–155.
[21] B. Qin, Y. Xia, S. Prabhakar, Y. Tu, A rule-based classification algorithm for uncertain data, in: Proceedings of ICDE, 2009, pp. 1633–1640.
[22] C. Gao, J. Wang, Direct mining of discriminative patterns for classifying uncertain data, in: Proceedings of SIGKDD, 2010, pp. 861–870.
[23] B. Tavakkol, M.K. Jeong, S.L. Albin, Measures of Scatter and Fisher discriminant analysis for uncertain data, IEEE Trans. Syst. Man Cybern. (2019) 1–14.
[24] J. Ge, Y. Xia, C.H. Nadungodage, UNN: A neural network for uncertain data classification, in: Proceedings of PAKDD, 2010, pp. 449–460.
[25] K. Cao, G. Wang, D. Han, M. Bai, S. Li, An algorithm for classification over uncertain data based on extreme learning machine, Neurocomputing 174 (2016) 194–202.
[26] S.B. Kotsiantis, Supervised machine learning: A review of classification techniques, Informatica (Slovenia) 31 (3) (2007) 249–268.
[27] H. Liu, X. Zhang, X. Zhang, Possible world based consistency learning model for clustering and classifying uncertain data, Neural Netw. 102 (2018) 48–66.
[28] C.C. Aggarwal, Managing and Mining Uncertain Data, Springer Press, 2010.
[29] P. Agrawal, O. Benjelloun, A.D. Sarma, C. Hayworth, S.U. Nabar, T. Sugihara, J. Widom, Trio: A system for data, uncertainty, and lineage, Proc. VLDB (2006) 1151–1154.
[30] L. Antova, C. Koch, D. Olteanu, MayBMS: Managing incomplete information with probabilistic world-set decompositions, in: Proceedings of ICDE, 2007, pp. 1479–1480.
[31] J. Boulos, N. Dalvi, B. Mandhani, S. Mathur, C. Re, D. Suciu, MYSTIQ: A system for finding more answers by using probabilities, in: Proceedings of SIGMOD, 2005, pp. 891–893.
[32] R. Cheng, S. Singh, S. Prabhakar, U-DBMS: A database system for managing constantly-evolving data, Proc. VLDB (2005) 1271–1274.
[33] A. Motro, VAGUE: A user interface to relational databases that permits vague queries, ACM Trans. Inf. Syst. 6 (3) (1988) 187–214.
[34] L.V. Lakshmanan, N. Leone, R. Ross, V.S. Subrahmanian, ProbView: A flexible probabilistic database system, ACM Trans. Database Syst. 22 (3) (1997) 419–469.
[35] L. Li, Z. Yu, Z. Feng, X. Zhang, Automatic classification of uncertain data by soft classifier, in: Proceedings of ICMLC, 2011, pp. 679–684.
[36] G. Huang, Q. Zhu, C.K. Siew, Extreme learning machine: Theory and applications, Neurocomputing 70 (2006) 489–501.
[37] J.C. Bezdek, R. Ehrlich, W. Full, FCM: The fuzzy c-means clustering algorithm, Comput. Geosci. 10 (1984) 191–203.
[38] B. Jiang, J. Pei, Y. Tao, X. Lin, Clustering uncertain data based on probability distribution similarity, IEEE Trans. Knowl. Data Eng. 25 (4) (2013) 751–763.
[39] S. Abiteboul, P.C. Kanellakis, G. Grahne, On the representation and querying of sets of possible worlds, in: Proceedings of SIGMOD, 1987, pp. 34–48.
[40] A.D. Sarma, O. Benjelloun, A.Y. Halevy, J. Widom, Working models for uncertain data, in: Proceedings of ICDE, 2006.
[41] N.N. Dalvi, D. Suciu, Management of probabilistic data: Foundations and challenges, in: Proceedings of PODS, 2007, pp. 1–12.
[42] L. Devroye, Non-uniform Random Variate Generation, Springer Press, 1986.
[43] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. System Sci. 55 (1) (1997) 119–139.
[44] C. Zhang, J. Bi, S. Xu, E. Ramentol, G. Fan, B. Qiao, H. Fujita, MultiImbalance: An open-source software for multi-class imbalance learning, Knowl.-Based Syst. 174 (2019) 137–143.

[45] J. Sun, H. Fujita, P. Chen, H. Li, Dynamic financial distress prediction with concept drift based on time weighting combined with Adaboost support vector machine ensemble, Knowl.-Based Syst. 120 (2017) 4–14.
[46] J. Sun, H. Li, H. Fujita, B. Fu, W. Ai, Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting, Inf. Fusion 54 (2020) 128–144.
[47] T. Zhou, G. Han, X. Xu, Z. Lin, C. Han, Y. Huang, J. Qin, δ-agree Adaboost stacked autoencoder for short-term traffic flow forecasting, Neurocomputing 247 (2017) 31–38.
[48] L. Li, C. Wang, W. Li, J. Chen, Hyperspectral image classification by Adaboost weighted composite kernel extreme learning machines, Neurocomputing 275 (2018) 1725–1733.
[49] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of ICML, 1996, pp. 148–156.
[50] R.E. Schapire, Y. Singer, Improved boosting algorithms using confidence-rated predictions, Mach. Learn. 37 (3) (1999) 297–336.
[51] R.E. Schapire, The boosting approach to machine learning: An overview, in: Nonlinear Estimation and Classification, Springer Press, 2003, pp. 149–171.


[52] F. Gullo, A. Tagarelli, Uncertain centroid based partitional clustering of uncertain data, Proc. VLDB (2012) 610–621.
[53] X. Zhang, H. Liu, X. Zhang, X. Liu, Novel density-based clustering algorithms for uncertain data, in: Proceedings of AAAI, 2014, pp. 2191–2197.
[54] A. Züfle, T. Emrich, K.A. Schmid, N. Mamoulis, A. Zimek, M. Renz, Representative clustering of uncertain data, in: Proceedings of KDD, 2014, pp. 243–252.
[55] E. Bauer, R. Kohavi, An empirical comparison of voting classification algorithms: Bagging, boosting, and variants, Mach. Learn. 36 (1–2) (1999) 105–139.
[56] P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison-Wesley, 2005.
[57] C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
[58] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms, CRC Press, 2012.
