Journal Pre-proof

Evidence Reasoning Rule-based Classifier with Uncertainty Quantification
Xiaobin Xu, Deqing Zhang, Yu Bai, Leilei Chang, Jianning Li

PII: S0020-0255(19)31148-X
DOI: https://doi.org/10.1016/j.ins.2019.12.037
Reference: INS 15080
To appear in: Information Sciences
Received date: 26 February 2019
Revised date: 6 December 2019
Accepted date: 20 December 2019

Please cite this article as: Xiaobin Xu, Deqing Zhang, Yu Bai, Leilei Chang, Jianning Li, Evidence Reasoning Rule-based Classifier with Uncertainty Quantification, Information Sciences (2019), doi: https://doi.org/10.1016/j.ins.2019.12.037
This is a PDF file of an article that has undergone enhancements after acceptance but is not yet the definitive version of record. © 2019 Published by Elsevier Inc.
Evidence Reasoning Rule-based Classifier with Uncertainty Quantification

Xiaobin Xu1,*, Deqing Zhang1, Yu Bai2,*, Leilei Chang1, Jianning Li1
1 School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
2 Tongde Hospital of Zhejiang Province, Hangzhou 310012, Zhejiang, China

Abstract- In Dempster-Shafer evidence theory (DST)-based classifier design, the newly proposed evidence reasoning (ER) rule can be used as a multi-attribute classifier to combine multiple pieces of evidence generated from quantitative samples or qualitative knowledge of many attributes. Different from the classical Dempster's combination (DC) rule and its improved forms, the ER rule explicitly distinguishes the reliability and the importance weight of evidence. The former reflects the ability of a single attribute, or the corresponding evidence, to give correct classification results, whereas the latter clarifies the relative importance of evidence when it is combined with other pieces of evidence. How to determine the reliability factor is a key problem, because it connects the preceding evidence acquisition with the subsequent evidence combination using the importance weights. Therefore, the main aim of this paper is to present a universal method for obtaining the reliability factor by quantifying the uncertainties of the samples and the generated evidence. Experimental results on five popular benchmark databases taken from the University of California Irvine (UCI) machine learning repository show that the improved classifier gives higher classification accuracy than the original ER-based classifier without uncertainty quantification, as well as other classical or mainstream classifiers.

Keywords- Data classification, Dempster-Shafer evidence theory (DST), Evidential reasoning (ER) rule, Measure of uncertainty, Rough set.
1. Introduction
Data classification is one of the most basic issues in data mining and knowledge discovery [1]. Its essential objective is to design an available classifier for modelling the linear, or more complicated non-linear, relationship between the samples of attributes and their possible classification labels. Hence, various classifiers can provide information support for real decision problems such as object identification [2-3], industrial alarm and fault diagnosis [4-6], medical diagnosis [7-8], and image recognition [9]. Classifier design methods have been widely studied, including naive Bayes [10], Bayes net [11], k nearest neighbors (k-NN) [12], support vector machine [13], decision tree [14], random forest [15], and other recent techniques such as deep learning networks [16], the belief rule base classifier [17], and the learning automata (LA)-based classifier [18]. It is beyond argument that the various classification methods have their respective advantages but also encounter corresponding constraints. In practice, one has to choose the most suitable one for a specific classification problem according to the application background. However, in essence almost all of these methods must face classification uncertainty, e.g., the boundaries of attributes for different classes are commonly imprecise or even overlapping, so that the values of attributes hardly point precisely to a certain class [19-21]. Comparatively,
Dempster-Shafer evidence theory (DST) can provide a belief distribution-based strategy to handle such uncertainty or imprecision in two steps. (1) A set including all class memberships (class labels) is defined as a frame of discernment (FoD), and a belief distribution (BD) function on the FoD is constructed to measure the belief degrees that the sample (value) of an attribute belongs to each class and to subsets of the classes; such a BD function is also called a basic belief assignment (BBA), or a piece of evidence. Available quantitative and qualitative evidence acquisition methods have been studied, mainly involving core samples [19], neural networks [20], k-NN [21], expert systems [17], and so on. (2) Dempster's combination (DC) rule or its improved forms can be used to fuse the BBAs derived from different attributes, and a classification decision can then be made from the fused results [19-21]. Such a fusion process over multi-source attribute information can effectively reduce the classification uncertainty. In most DC rule-based classifiers, the above two steps, namely evidence acquisition and evidence combination, are considered separately, although essentially they should be closely linked. The ER rule-based classifier presented in [22-23] provides such an interdependent mechanism by introducing the reliability factor and the importance weight of evidence. The former reflects the ability of a single attribute, or the corresponding evidence, to give correct classification results; the latter clarifies the importance of evidence relative to other pieces of evidence. Meanwhile, when training samples are available, the latter can be optimized, with its initial value set to the former. Here the reliability factor makes a significant contribution to connecting evidence acquisition and evidence combination.
Recently, researchers have further used the ER rule to deal with classification and decision problems based on quantitative data or qualitative knowledge in various fields. Reference [24] determined the levels of customer complaints using an ER-based classification strategy. References [25] and [26] presented an ER combination algorithm and an entropy-based weight assignment method, respectively, to solve multiple criteria decision making (MCDM) problems. Reference [27] proposed a Gini coefficient-based ER approach for decision making in business negotiation. Reference [28] studied a data-driven inference methodology for track condition monitoring using the ER rule. These studies focus on applications of the ER rule but rarely involve a general reliability analysis of the information source. Actually, the information to be combined by the ER rule takes two forms: one is the original sample data of the attributes, and the other is the evidence generated from these sample data. Whenever classification issues are discussed under the framework of DST, the reliabilities of both information forms, which can be reflected or quantified by their classification uncertainties, should be considered before the ER combination operation. Following the early work in [22-23], this paper presents a novel universal method for obtaining the reliability factor by quantifying the uncertainties of these information forms. Firstly, the referential evidence matrix (REM) of each attribute is constructed by statistical analysis of the training samples, and the ambiguity measure (AM) is introduced to evaluate the total uncertainty of the multiple pieces of evidence in the matrix. Secondly, the training samples of all attributes are projected into their respective evidence matrices and transformed into a decision information system using rough set theory, and the accuracy and quality of approximation are calculated to measure the uncertainty of the sample data of each attribute. Thirdly, the reliability factor of each attribute is obtained by synthesizing the values of AM, accuracy and quality. Finally, the ER rule with reliability and weight is used to combine the multiple pieces of evidence provided by the testing samples of all attributes, and a classification decision is made according to the fused results.
The rest of the paper is organized as follows. Section 2 briefly introduces the concepts and properties of the ER rule, rough set theory, and the uncertainty measures in both theories. Section 3 details the proposed universal method for calculating the reliability factor. In Section 4, an experiment on the well-known Seeds database shows the specific procedure of the proposed method and ER-based evidence combination; the method is then compared with the original ER rule-based classifier in [22-23] and six other classical classifiers on five popular benchmark databases taken from the University of California Irvine (UCI) machine learning repository to demonstrate its superiority. Concluding remarks are given in Section 5.
2. Theoretical background
This section introduces the necessary concepts of evidence theory and rough set theory that will be used in the proposed approach. More detailed explanations and background information can be found in [29-33].
2.1 ER rule and ambiguity measure of evidence
Let D = {d1,...,dn,...,dN} be a set of collectively exhaustive class memberships (N class labels), called a frame of discernment. P(D) or 2^D denotes the power set of D, including all of its subsets. A piece of evidence is defined and constructed as a belief distribution (BD) or a basic belief assignment (BBA) as follows [29-30]:
e_k = {(θ, p_θ,k) | θ ⊆ D, Σ_{θ⊆D} p_θ,k = 1}   (1)
where (θ, p_θ,k) is a dual element of e_k, denoting that e_k supports the single class θ, or the subset θ of classes in D, with the degree p_θ,k, referred to as the probability or degree of belief in general, and p_θ,k = 0 when θ = ∅. Thus (θ, p_θ,k) is a focal element of e_k if p_θ,k > 0. Obviously, when all focal elements are singletons (θ = d ∈ D), such a belief distribution reduces to a probability distribution, so the former is the generalized form of the latter.
In the ER rule, the reliability factor r_k and importance weight w_k are introduced to evaluate the performance of the evidence e_k in two different ways. Commonly, in classification problems, e_k is generated from the selected samples of attributes by an available evidence acquisition method; hence the reliability r_k represents the ability of the information source to provide a correct classification result. The reliability is an inherent property of the original sample data and the adopted evidence acquisition method (or the generated evidence). On the other hand, the weight w_k of e_k can be used to reflect its relative importance in comparison with other evidence, and is determined by the evidence users. This means that the weight w_k may be subjective and different from the reliability r_k in situations where different pieces of evidence are generated from different sources and measured in different ways. As a result, a weighted belief distribution with reliability can be given as [29-30]

m_k = {(θ, m_θ,k) | θ ⊆ D; (P(D), m_P(D),k)}   (2)
where m_θ,k measures the support for θ from e_k considering both the weight and the reliability, specifically

m_θ,k = 0,                     θ = ∅
m_θ,k = c_rw,k · m̃_θ,k,       θ ⊆ D, θ ≠ ∅          (3)
m_P(D),k = c_rw,k · (1 − r_k),  θ = P(D)
where m̃_θ,k = w_k · p_θ,k, and c_rw,k = 1/(1 + w_k − r_k) is a normalization factor, uniquely determined so that Σ_{θ⊆D} m_θ,k + m_P(D),k = 1 given that Σ_{θ⊆D} p_θ,k = 1. The residual
support degree (1 − r_k), defined as the unreliability of the evidence, is earmarked for the power set P(D) for redistribution, instead of being specifically assigned to the frame of discernment as done in traditional Shafer discounting [33]. That is because p_D,k is an inner characteristic of e_k, so it should be discounted by c_rw,k in the same way as a general element θ; in this way, e_k and m_k keep the same probabilistic characteristics.
Definition 1: For two BBAs e_1 and e_2 under the mutual-independence constraint, the ER rule combines e_1 and e_2 to generate the combined degree of belief p_θ,e(2) to which e_1 and e_2 jointly support proposition θ:

p_θ,e(2) = 0,  θ = ∅
p_θ,e(2) = m̂_θ,e(2) / Σ_{h⊆D} m̂_h,e(2),  θ ⊆ D, θ ≠ ∅          (4)

with
m̂_θ,e(2) = [(1 − r_2) m_θ,1 + (1 − r_1) m_θ,2] + Σ_{B∩E=θ} m_B,1 m_E,2
m̂_P(D),e(2) = (1 − r_2)(1 − r_1)
For L pieces of independent evidence e_i (i = 1,2,…,L), the combined degree of belief p_θ,e(L) to which they jointly support proposition θ is given by recursively applying (4) as follows:

p_θ,e(L) = 0,  θ = ∅
p_θ,e(L) = m̂_θ,e(L) / Σ_{h⊆D} m̂_h,e(L),  θ ⊆ D, θ ≠ ∅          (5)

with
m_θ,e(i) = m̂_θ,e(i) / (Σ_{H⊆D} m̂_H,e(i) + m̂_P(D),e(i))
m̂_θ,e(i) = [(1 − r_i) m_θ,e(i−1) + m_P(D),e(i−1) m_θ,i] + Σ_{B∩E=θ} m_B,e(i−1) m_E,i
m̂_P(D),e(i) = (1 − r_i) m_P(D),e(i−1)
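The pairwise combination in (3)-(4) can be sketched in a few lines of Python. This is a minimal illustration written as the formulas read (propositions as frozensets; the helper names are ours, not from [29-30]); with w = r = 1 it reduces numerically to Dempster's rule, matching the special case noted in the text.

```python
def weighted_masses(p, w, r):
    """Eq.(3): weighted belief distribution with reliability.
    p maps frozenset propositions to belief degrees; returns (m, m_PD)."""
    crw = 1.0 / (1.0 + w - r)                 # normalization factor c_rw,k
    m = {theta: crw * w * pt for theta, pt in p.items()}
    return m, crw * (1.0 - r)                 # residual support assigned to P(D)

def er_combine(p1, w1, r1, p2, w2, r2):
    """Eq.(4): combined belief p_{theta,e(2)} for two pieces of evidence."""
    m1, _ = weighted_masses(p1, w1, r1)
    m2, _ = weighted_masses(p2, w2, r2)
    mhat = {}
    for theta, v in m1.items():               # bounded-sum terms (1 - r) * m
        mhat[theta] = mhat.get(theta, 0.0) + (1 - r2) * v
    for theta, v in m2.items():
        mhat[theta] = mhat.get(theta, 0.0) + (1 - r1) * v
    for b, vb in m1.items():                  # orthogonal-sum (intersection) terms
        for e, ve in m2.items():
            if b & e:
                mhat[b & e] = mhat.get(b & e, 0.0) + vb * ve
    total = sum(mhat.values())                # normalize over nonempty theta in D
    return {theta: v / total for theta, v in mhat.items()}
```

Combining the two BBAs of Example 1 with full reliability and weight reproduces Dempster's orthogonal sum; with partial reliability, part of the support migrates toward P(D) before normalization.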
It is proved that Dempster's rule is a special case of the ER rule when every piece of evidence to be combined is fully reliable, namely its reliability factor is 1 [29]. On the other hand, to some extent, the reliability of evidence can be quantified by measures of the aggregated uncertainty in DST. Here, the ambiguity measure defined in [34] is introduced for this purpose in Section 3.
Definition 2: For a piece of evidence e_k on D = {d1,...,dn,...,dN}, its ambiguity measure (AM) is defined as

AM(e_k) = − Σ_{d∈D} BetP_{e_k}(d) · log2(BetP_{e_k}(d))   (6)

where BetP_{e_k} is the pignistic probability distribution of e_k:

BetP_{e_k}(d) = Σ_{θ⊆D, d∈θ} p_θ,k / |θ|   (7)
where |θ| denotes the number of singleton elements d in θ. It is proved that AM satisfies the five requirements for general measures defined by Klir [35], including probability consistency, set consistency, range, subadditivity and additivity; meanwhile, it has relatively low computational complexity compared with the traditional AU measures. An example illustrates how to calculate the AM of evidence.
Example 1: Let D = {d1, d2, d3}, and let two attributes c1 and c2 respectively provide two samples, both of which point to the class d2. If e1 = {({d1}, 0.3), ({d2}, 0.4), ({d1,d2,d3}, 0.3)} and e2 = {({d1}, 0.2), ({d2}, 0.7), ({d3}, 0.1)} are two BBAs generated from the samples by a certain evidence acquisition method, then Eq. (6) gives AM(e1) = 1.361 and AM(e2) = 1.157. AM(e2) < AM(e1), which indicates that e2 is less ambiguous than e1: its belief is more sharply focused on the class d2.
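Eqs. (6)-(7) can be checked numerically. The short Python sketch below (helper names are ours) reproduces the values of Example 1.

```python
import math

def betp(bba):
    """Eq.(7): pignistic transformation of a BBA whose propositions are frozensets."""
    out = {}
    for theta, p in bba.items():
        for d in theta:                       # spread p equally over |theta| singletons
            out[d] = out.get(d, 0.0) + p / len(theta)
    return out

def ambiguity(bba):
    """Eq.(6): AM(e_k) = -sum_d BetP(d) * log2(BetP(d))."""
    return -sum(q * math.log2(q) for q in betp(bba).values() if q > 0)
```

Applying `ambiguity` to e1 and e2 of Example 1 returns approximately 1.361 and 1.157, respectively.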
2.2 Rough set and its uncertainty measures
Let S = {U, C∪D, V} be a decision information system, where U is a finite universe of objects, C is the conditional attribute set, D is the decision attribute, and f(c, x) gives the value of object x on attribute c. For any attribute subset R ⊆ C, an indiscernibility relation IND(R) is defined as

IND(R) = {(x, y) ∈ U × U | ∀c ∈ R, f(c, x) = f(c, y)}   (8)
The family of all equivalence classes of IND(R) is denoted by U/IND(R), or simply U/R; [x]_R is the equivalence class of IND(R) containing x.
Definition 3: Let X be a subset of U. The R-lower and R-upper approximations of X are defined as

R̲(X) = {x ∈ U | [x]_R ⊆ X}
R̄(X) = {x ∈ U | [x]_R ∩ X ≠ ∅}   (9)
The lower approximation of X with respect to IND(R) is the set of objects that certainly belong to X with respect to IND(R); the upper approximation of X is the set of objects that possibly belong to X with respect to IND(R). Furthermore, in rough set theory, the accuracy measure and the quality measure are defined to evaluate the uncertainty of a rough classification [36].
Definition 4: Let U/D = {X1,...,Xn,...,XN} be a classification of the universe U, and let R ⊆ C be a conditional attribute set. The approximation accuracy (AA) and approximation quality (AQ) of U/D by R are defined as

α_R(U/D) = Σ_{i=1}^{N} |R̲(X_i)| / Σ_{i=1}^{N} |R̄(X_i)|,   γ_R(U/D) = Σ_{i=1}^{N} |R̲(X_i)| / |U|   (10)
The accuracy of classification α_R expresses the percentage of possible correct decisions when the attribute set R is used to classify objects. The quality of classification γ_R gives the percentage of objects that can be assigned to the correct class labels by R. An example illustrates how α_R and γ_R measure the classification ability of different attributes.
Example 2: Table 1 shows a decision system S = {U, C∪D, V}, where U = {x1, x2, x3, x4, x5, x6, x7, x8}, C = {c1, c2}, D = {d1, d2, d3}. For R = c1 and R = c2 respectively, (10) gives
α_c1(U/D) = (2+0+0)/(8+6+6) = 0.1,  γ_c1(U/D) = (2+0+0)/8 = 0.25,
α_c2(U/D) = (2+0+2)/(6+4+2) = 0.33,  γ_c2(U/D) = (2+0+2)/8 = 0.5.
Since α_c1(U/D) < α_c2(U/D) and γ_c1(U/D) < γ_c2(U/D), the classification ability of c2 is better than that of c1.
Table 1 A decision information system

U     c1   c2   D
x1    2    3    d1
x2    1    2    d1
x3    2    3    d1
x4    1    2    d2
x5    1    2    d2
x6    1    1    d3
x7    1    2    d2
x8    1    1    d3
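Example 2 can be verified with a short Python sketch of Eqs. (9)-(10); the helper names below are ours, not from the paper.

```python
from collections import defaultdict

def equivalence_classes(values):
    """Partition object indices by their value on one attribute (eq. (8))."""
    groups = defaultdict(set)
    for i, v in enumerate(values):
        groups[v].add(i)
    return list(groups.values())

def lower_upper(eq_cls, X):
    """Eq.(9): R-lower and R-upper approximations of a set X of object indices."""
    lower = {x for c in eq_cls if c <= X for x in c}
    upper = {x for c in eq_cls if c & X for x in c}
    return lower, upper

def accuracy_quality(attr_values, labels):
    """Eq.(10): approximation accuracy (AA) and quality (AQ) of U/D by one attribute."""
    eq_cls = equivalence_classes(attr_values)
    decision = equivalence_classes(labels)          # decision classes X1..XN
    low = sum(len(lower_upper(eq_cls, X)[0]) for X in decision)
    up = sum(len(lower_upper(eq_cls, X)[1]) for X in decision)
    return low / up, low / len(attr_values)

# Table 1 data
c1 = [2, 1, 2, 1, 1, 1, 1, 1]
c2 = [3, 2, 3, 2, 2, 1, 2, 1]
d = ['d1', 'd1', 'd1', 'd2', 'd2', 'd3', 'd2', 'd3']
```

Running `accuracy_quality(c1, d)` and `accuracy_quality(c2, d)` reproduces the values of Example 2: (0.1, 0.25) for c1 and (0.33, 0.5) for c2.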
3. ER rule-based classifier considering the uncertainties of sample and evidence
The ER rule-based classifier deals with an N-class classification problem with the class label set D = {d1,...,dn,…,dN}. The training sample set U = {x1,…,xi,…,xI} is identified by the attribute set C = {c1,...,cj,...,cJ}; hence a sample xi ∈ U can be described as a vector xi = [c1(i),…,cj(i),…,cJ(i), di], di ∈ D, where cj(i) is the value of the jth attribute cj. Thus, the inputs of the ER classifier are c1(i),…,cJ(i), and the output is the estimated class membership. The general design procedure has been given in [22-23]. Based on the ambiguity measure (AM) in Definition 2 and the approximation accuracy α_R and approximation quality γ_R in Definition 4, this section presents a universal method for obtaining the reliability factor by quantifying the uncertainties of the attribute cj and its evidence form. The detailed modelling and optimization procedure is described as follows.
3.1 Construction of the initial referential evidence matrix via training samples
This step acquires the available evidence from the training samples of each attribute. The sample vector xi = [c1(i),…,cJ(i), di] is decomposed into J sample pairs (cj(i), di) about the attribute cj. Thus, for all I samples, the sample set U_cj = {(cj(i), di) | i = 1,2,…,I} can be obtained to generate the initial referential evidence matrix (REM) by the likelihood function normalization method in [30]. Firstly, a casting matrix is constructed to describe the relationship between cj(i) and di, as shown in Table 2. Here, the column names are the referential values of the attribute cj, denoted as a set Aj = {A_j^k | k = 1,…,Kj}, and the row names are the class labels D = {d1,...,dN}. Any sample pair (cj(i), di) is cast into the matrix according to the similarity S(cj(i)) between cj(i) and A_j^k [37]:

S(cj(i)) = {(A_j^k, α_{i,k}^j) | i = 1,…,I, k = 1,…,Kj}   (11a)

α_{i,k}^j = (A_j^{k+1} − cj(i)) / (A_j^{k+1} − A_j^k),  α_{i,k+1}^j = 1 − α_{i,k}^j,  for cj(i) ∈ [A_j^k, A_j^{k+1}];
α_{i,k}^j = 0 for cj(i) ∉ [A_j^k, A_j^{k+1}]   (11b)
Based on (11), any sample pair (cj(i), di) in U_cj can be uniquely represented as a similarity distribution (α_{i,k}^j, α_{i,k+1}^j) about the class di. The similarity distributions of all sample pairs are cast into the corresponding cells and accumulated to obtain Table 2, where a_{n,k}^j is the accumulated similarity in the cell (dn, A_j^k), δ_n = Σ_{k=1}^{Kj} a_{n,k}^j, η_k = Σ_{n=1}^{N} a_{n,k}^j, and Σ_{k=1}^{Kj} η_k = I.
Table 2 The casting matrix of the sample pairs (cj(i), di) of the attribute cj

cj      A_j^1       …   A_j^k       …   A_j^{Kj}      Total
d1      a_{1,1}^j   …   a_{1,k}^j   …   a_{1,Kj}^j    δ_1
⋮       ⋮               ⋮               ⋮             ⋮
dn      a_{n,1}^j   …   a_{n,k}^j   …   a_{n,Kj}^j    δ_n
⋮       ⋮               ⋮               ⋮             ⋮
dN      a_{N,1}^j   …   a_{N,k}^j   …   a_{N,Kj}^j    δ_N
Total   η_1         …   η_k         …   η_{Kj}        I
The likelihood b_{n,k}^j to which cj points to A_j^k given the known class dn is

b_{n,k}^j = p(A_j^k | dn) = a_{n,k}^j / δ_n   (12)

By normalizing the likelihoods b_{1,k}^j,…,b_{n,k}^j,…,b_{N,k}^j, a piece of evidence e_j^k about the referential value A_j^k can be generated, as given in Table 3, with the belief degree

β_{n,k}^j = b_{n,k}^j / Σ_{n=1}^{N} b_{n,k}^j   (13)

where β_{n,k}^j is the probability that a sample is believed to belong to the class dn given that the attribute cj = A_j^k.

Table 3 The referential evidence matrix (REM) of the input attribute cj

cj      e_j^1 (A_j^1)   …   e_j^k (A_j^k)   …   e_j^{Kj} (A_j^{Kj})
d1      β_{1,1}^j       …   β_{1,k}^j       …   β_{1,Kj}^j
⋮       ⋮                   ⋮                   ⋮
dn      β_{n,1}^j       …   β_{n,k}^j       …   β_{n,Kj}^j
⋮       ⋮                   ⋮                   ⋮
dN      β_{N,1}^j       …   β_{N,k}^j       …   β_{N,Kj}^j
Total   1               …   1               …   1
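The pipeline (11)-(13) — casting, likelihoods, column normalization — can be sketched as a simplified pure-Python routine. The names and the toy data below are ours, and samples outside the referential range are clamped to the end columns as an assumption.

```python
def similarity(x, refs):
    """Eq.(11): distribute x over the two neighbouring referential values."""
    K = len(refs)
    s = [0.0] * K
    if x <= refs[0]:                     # clamp below the first referential value
        s[0] = 1.0
        return s
    if x >= refs[-1]:                    # clamp above the last referential value
        s[-1] = 1.0
        return s
    for k in range(K - 1):
        if refs[k] <= x <= refs[k + 1]:
            s[k] = (refs[k + 1] - x) / (refs[k + 1] - refs[k])
            s[k + 1] = 1.0 - s[k]
            return s
    return s

def build_rem(pairs, refs, classes):
    """Eqs.(11)-(13): casting matrix -> likelihoods -> REM belief degrees."""
    K, N = len(refs), len(classes)
    a = [[0.0] * K for _ in range(N)]            # casting matrix a[n][k]
    for x, d in pairs:
        s = similarity(x, refs)
        n = classes.index(d)
        for k in range(K):
            a[n][k] += s[k]
    delta = [sum(row) for row in a]              # row totals delta_n
    # eq (12): likelihoods b[n][k] = a[n][k] / delta_n
    b = [[a[n][k] / delta[n] if delta[n] else 0.0 for k in range(K)] for n in range(N)]
    # eq (13): normalize each column to get belief degrees beta[n][k]
    beta = [[0.0] * K for _ in range(N)]
    for k in range(K):
        col = sum(b[n][k] for n in range(N))
        for n in range(N):
            beta[n][k] = b[n][k] / col if col else 0.0
    return beta
```

For example, `build_rem([(0.0, 'd1'), (2.0, 'd2'), (1.0, 'd1')], [0, 1, 2], ['d1', 'd2'])` yields columns that each sum to one, i.e., the REM of Table 3 for this toy attribute.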
3.2 AM-based parameter optimization of the REM
Through the above transformation from Table 2 to Table 3, the REM summarizes all classification information of the training sample set U_cj in the form of Kj pieces of referential evidence. The constructed REM is determined by the initial values of Aj; however, such an REM is commonly too rough to precisely reflect the classification ability of each piece of referential evidence. Hence, it is necessary to optimize the parameters Aj with the training sample set U_cj so as to improve the classification performance of the referential evidence. As shown in Example 1, if the classification ability of a piece of evidence is good, its belief degrees focus on a certain class and its ambiguity measure (AM) approaches zero. Therefore, according to (6), the overall uncertainty degree of the REM can be calculated as the objective function

ξ(Aj) = AM_j(Aj) = − Σ_{k=1}^{Kj} Σ_{d∈D} BetP_{e_j^k}(d) · log2(BetP_{e_j^k}(d))   (14)
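Given a REM whose columns hold the belief degrees β, the objective (14) simply sums the AM of every column; since these columns are singleton belief distributions, BetP coincides with the beliefs themselves. A minimal sketch (helper name is ours):

```python
import math

def rem_ambiguity(beta):
    """Eq.(14): total ambiguity of a REM whose columns beta[n][k] are
    singleton belief distributions over N classes (BetP equals the beliefs)."""
    N, K = len(beta), len(beta[0])
    total = 0.0
    for k in range(K):
        col = [beta[n][k] for n in range(N)]
        total += -sum(b * math.log2(b) for b in col if b > 0)
    return total
```

For instance, the optimized REM of attribute c1 reported later in Table 5 gives a total ambiguity of roughly 2.31 over its four columns.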
The corresponding optimization model is given in (15) and (16):

min_{Aj'} ξ(Aj')   (15a)
s.t. A_j^{k−1} ≤ A_j^k ≤ A_j^{k+1}, k = 2,…,Kj−1   (15b)

where Aj' = {A_j^k | k = 2,3,…,Kj−1} denotes the set of parameters to be optimized besides the fixed endpoints A_j^1 = min_{cj(i)∈U_cj} cj(i) and A_j^{Kj} = max_{cj(i)∈U_cj} cj(i). Eq. (15b) represents the
inequality constraint on the optimized parameters. Here, the genetic algorithm (GA) in [38] is introduced to solve the optimization model. Since expert knowledge for determining the initial parameter values should be considered, the GA initial population consists of two parts: individuals provided by experts and individuals randomly generated. In order to synthesize these two kinds of individuals, upper and lower bound constraints are added for the parameters in Aj':

UB(A_j^k) = A_j^k + ε · (max_{cj(i)∈U_cj} cj(i) − min_{cj(i)∈U_cj} cj(i)),
LB(A_j^k) = A_j^k − ε · (max_{cj(i)∈U_cj} cj(i) − min_{cj(i)∈U_cj} cj(i))   (16)

where the perturbation factor ε is generally controlled within 10%. The optimization can be carried out with the global optimization toolbox in Matlab. Then, as the optimized parameter set Aj' changes, the belief distributions in Table 3 change accordingly and gradually achieve the optimum. The following classification experiments on popular benchmark databases illustrate the proposed parameter optimization process in detail.
3.3 A universal method for acquiring the reliability factor via AM, AA and AQ
The reliability factor seems to be only a performance indicator of the evidence about a certain attribute, but it should actually be an integrated indicator reflecting the classification abilities of all involved information forms, including the original sample data and the corresponding evidence. As the indicator of evidence, the ambiguity measure (AM) is introduced to evaluate the performance of the REM. The optimal value of AM, denoted AM_{j,o}, is obtained by solving the optimization model in (15). Thus, the indicator of the optimal REM can be defined as
r_REMj = (Kj − AM_{j,o}) / max_{t∈{1,2,…,J}} (Kt − AM_{t,o})   (17)

The larger r_REMj
is, the smaller the overall uncertainty degree of the REM generated by the attribute cj, and the greater the reliability of its referential evidence. As the indicator of the original sample data, the approximation accuracy (AA) α_R and approximation quality (AQ) γ_R in Definition 4 are introduced under the framework of rough set theory, and the two indicators are then combined to produce the integrated reliability factor. In detail, the approximation accuracy α_R (R = cj) indicates the percentage of possible correct classification decisions using the attribute cj: the higher the percentage, the less likely the samples are misclassified, so α_R measures the roughness of the boundaries of cj for different classes. The approximation quality γ_R (R = cj) is the proportion of samples that can be correctly sorted among all samples: the higher the proportion, the better the classification effect of cj, and the more feature information cj carries for classification. Hence it is necessary to use α_R and γ_R together to evaluate the classification ability of the training samples of the attribute cj.
In rough set theory, the conditional attribute cj must take discrete values, so cj(i) in the sample set U_cj should be discretized. Such discretization can be realized with the optimal parameters A_{j,o} = {A_{j,o}^k | k = 1,…,Kj} of the optimal REM. The mark number k is used as the discretized value corresponding to A_{j,o}^k: for each cj(i), the distances between cj(i) and each parameter point are calculated to find the closest point A_{j,o}^k, and k is assigned to cj(i) as its discretized value, denoted t_{i,j} ∈ {1,…,Kj}. As a result, the training sample set U = {x1,…,xI} consisting of U_cj = {(cj(i), di) | i = 1,…,I} can be transformed into a decision information system, or decision table, as shown in Table 4.

Table 4 The decision table of the training sample set U
U     c1       …   cj       …   cJ       D
x1    t_{1,1}  …   t_{1,j}  …   t_{1,J}  d1
⋮     ⋮            ⋮            ⋮        ⋮
xi    t_{i,1}  …   t_{i,j}  …   t_{i,J}  di
⋮     ⋮            ⋮            ⋮        ⋮
xI    t_{I,1}  …   t_{I,j}  …   t_{I,J}  dI
Let U/D = {X1,...,Xn,...,XN} be the set of decision classes and R = cj, where Xn is the subset of U with the class label dn. Eq. (9) in Definition 3 gives the cj-lower and cj-upper approximations of the set Xn, denoted c̲j(Xn) and c̄j(Xn) respectively, and Eq. (10) in Definition 4 gives the approximation accuracy α_cj(U/D) and the approximation quality γ_cj(U/D). Finally, the performance indicator of the training samples U_cj can be defined as

r_cj = (α_cj(U/D) + γ_cj(U/D)) / max_{t∈{1,2,…,J}} (α_ct(U/D) + γ_ct(U/D))   (18)
Based on (18), it is obvious that the larger r_cj is, the smaller the uncertainty degree of the training samples of the attribute cj. Finally, considering that the classification ability of the samples is as important as the performance of the REM, the reliability factor rj of the evidence ej is defined by synthesizing r_REMj and r_cj:

rj = (r_REMj + r_cj) / 2   (19)
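Eqs. (17)-(19) chain together as a few lines of arithmetic. The sketch below follows our reading of (17), in which each attribute's evidence score is Kj − AM_{j,o}, normalized by the best attribute (the function name is ours):

```python
def reliability_factors(K, am_opt, alpha, gamma):
    """Eqs.(17)-(19): fuse the evidence-level indicator (from the optimized
    REM ambiguity) with the sample-level indicator (from AA and AQ)."""
    rem_scores = [k - a for k, a in zip(K, am_opt)]      # eq (17) numerators
    r_rem = [s / max(rem_scores) for s in rem_scores]
    samp_scores = [a + g for a, g in zip(alpha, gamma)]  # eq (18) numerators
    r_c = [s / max(samp_scores) for s in samp_scores]
    return [(x + y) / 2 for x, y in zip(r_rem, r_c)]     # eq (19): average
```

The best attribute on both indicators receives reliability 1; an attribute half as good on both receives 0.5.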
3.4 Evidence fusion based on the ER rule and classification decision
For a sample vector with J attributes c(i) = [c1(i),…,cj(i),…,cJ(i)], cj(i) must fall into a certain referential interval [A_{j,o}^k, A_{j,o}^{k+1}] of the optimal REM obtained by the optimization model in Section 3.2, so it activates the corresponding two pieces of referential evidence e_j^k and e_j^{k+1} in Table 3. As a result, cj(i) can be transformed into the activated evidence ej as

ej = {(dn, p_{n,j}) | n = 1,…,N}   (20a)
p_{n,j} = α_{i,k}^j β_{n,k}^j + α_{i,k+1}^j β_{n,k+1}^j   (20b)

where the belief degree p_{n,j} to the class dn is the weighted sum of the belief degrees of e_j^k and e_j^{k+1} to dn. By this evidence acquisition process in (20), a total of J pieces of evidence e1,…,ej,...,eJ about c1(i),…,cJ(i) are generated; they are then combined with their weights and reliabilities by the ER rule in (5) to obtain the fused evidence

O(c(i)) = {(dn, p_{n,e(J)}) | n = 1,…,N}   (21)
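The activation step (20) is a linear interpolation between the two neighbouring REM columns. A minimal bisect-based sketch (names are ours; inputs outside the referential range are clamped to the end columns as an assumption):

```python
import bisect

def activate_evidence(x, refs, beta):
    """Eq.(20): the activated evidence for input x is the similarity-weighted
    mix of the two neighbouring REM columns beta[:][k] and beta[:][k+1]."""
    N = len(beta)
    if x <= refs[0]:                       # clamp below the referential range
        return [beta[n][0] for n in range(N)]
    if x >= refs[-1]:                      # clamp above the referential range
        return [beta[n][-1] for n in range(N)]
    k = bisect.bisect_right(refs, x) - 1   # interval [refs[k], refs[k+1]]
    w = (refs[k + 1] - x) / (refs[k + 1] - refs[k])   # alpha_{i,k}
    return [w * beta[n][k] + (1 - w) * beta[n][k + 1] for n in range(N)]
```

The J belief vectors produced this way are then fed into the recursive ER combination of (5), and the class with the maximum fused belief degree is chosen.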
On the basis of the resulting evidence O(c(i)), the decision rule is to assign the sample vector c(i) to the class label with the maximum belief degree. Moreover, in the ER fusion process, the importance weight wj of ej is initialized as its reliability factor rj, since evidence with high reliability should generally carry relatively high importance compared with other evidence. Certainly, wj can be further trained with the training sample data. Here, the optimization objective function is constructed via the Euclidean distance between the estimated belief vector V_c(i) = [p_{1,e(J)},…,p_{n,e(J)},…,p_{N,e(J)}] and the reference belief vector Vi:

ξ_E(W) = (1/I) Σ_{i=1}^{I} d²(O(c(i)), Vi)   (22)

where I is the number of training samples, and Vi has the categorical belief degree assigned to the class dn that c(i) actually belongs to. For instance, in a typical four-class case, if c(i) belongs to the class d2, then Vi = (0, 1, 0, 0). Thus, the corresponding optimization model is given as
min_W ξ_E(W)   (23a)
s.t. 0 ≤ wj ≤ 1, j = 1,…,J   (23b)

where W = {wj | j = 1,…,J} denotes the set of parameters to be optimized. The optimization can also be implemented with the genetic algorithm. Then, as the parameter set W changes, the performance of the ER rule-based classifier is gradually optimized.
3.5 The specific procedure for obtaining the reliability factor and the fusion decision
The classification implementation includes two steps: Step 1 is the training-sample-based parameter optimization and reliability factor acquisition, and Step 2 is the test-sample fusion and classification decision.
In Step 1, given the training sample set, the initial REM of the jth attribute is first constructed by (11)-(13), and the AM-based optimization model in (15) is used to adjust the referential values of the evidence matrix in Table 3, so that the optimized REM has a smaller uncertainty degree and greater reliability. Secondly, the performance indicator r_REMj of the optimal REM is calculated from its ambiguity measure by (17). Meanwhile, the training samples are discretized by the referential values of the optimal REM to form the decision table in Table 4, and the performance indicator r_cj is calculated from the approximation accuracy (AA) and approximation quality (AQ) of rough set theory by (18). Finally, the indicator r_REMj of the evidence and the indicator r_cj of the training sample set are integrated by (19) to generate the final reliability factor of the jth attribute or information source.
In Step 2, a test sample x with J attribute values is cast into the optimal REMs of the J attributes to obtain J pieces of activated evidence as in (20), which are then fused by the ER rule in (5). Finally, the classification decision is made according to the fused result. In the following section, typical experiments are arranged according to this procedure so that readers can comprehensively understand the proposed ER classifier and evaluate its performance.
4. Experiments
This section gives experimental results to illustrate the classification performance of the improved ER classifier. The first experiment, on the classical seeds data, demonstrates in detail the specific procedures of ER-based evidence acquisition with uncertainty quantification and evidence combination with reliability and weight. Secondly, a series of experiments on five popular benchmark databases from UCI [39] compares the proposed classifier with other well-known classifiers. Finally, a refined statistical analysis of the classification results further explains the advantages of the improved ER classifier over the original one.
4.1 Seeds data classification experiment
The seeds data belong to three classes of wheat seeds: Kama (d1), Rosa (d2) and Canadian (d3). The three classes can be identified by seven attributes: area (c1), perimeter (c2), compactness (c3), length of kernel (c4), width of kernel (c5), asymmetry coefficient (c6), and length of kernel groove (c7). Each class has 70 samples. Here 42 (60%) of the 70 samples in each class are randomly taken to build the training sample set U = {x1,…,xi,…,xI}, with xi = [c1(i),…,cJ(i), di], di ∈ D = {d1, d2, d3}, J = 7, I = 126. The remaining 3×28 = 84 (40%) samples are used as the testing sample set. Based on the proposed procedure in Section 3, a trend analysis of the training sample set U is first used to roughly choose the initial referential values of the attributes c1~c7: A1 = {10, 12.75, 17.35, 22} for c1, A2 = {12, 13.85, 16.45, 18} for c2, A3 = {0.8, 0.85, 0.9, 0.92} for c3, A4 = {1, 5.85, 6.15, 6.8} for c4, A5 = {2.6, 2.95, 3.65, 4.2} for c5, A6 = {0, 1.45, 5.95, 9} for c6, and A7 = {4.5, 5.45, 5.95, 6.8} for c7.
Secondly, according to the information transformation in (11) and the normalization of likelihoods in (13), the training samples in U can be cast into Table 2 and then transformed into the initial REM. Thirdly, by the AM-based optimization model in (15), the training sample set U is used to obtain the optimized REMs shown in Tables 5~8 for the attributes c1~c7, where the corresponding optimized referential values are also listed beside each ej^k. By globally minimizing the ambiguity measure of the evidence in the REM, the belief degrees of 57% of the evidence in Tables 5~8 focus on a single class label; for example, e1^4, e2^4, e4^4, e5^4, e6^1 and e7^4 all categorically support class d1 or d2. Hence, the proposed optimization achieves the desired effect of reducing the uncertainty of the evidence, and (17) finally gives rREM1=1, rREM2=0.9808, rREM3=0.5555, rREM4=0.9268, rREM5=0.9269, rREM6=0.6008, rREM7=0.9453.
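Equation (15) minimizes the ambiguity measure (AM) of the referential evidence. For belief distributions focused only on singleton classes, as here, the AM of Jousselme et al. [34] reduces to the Shannon entropy of the belief degrees (this reduction is our simplification; (15) itself also optimizes the referential values):

```python
import math

def ambiguity_measure(beliefs):
    """AM = -sum BetP(d) * log2 BetP(d); for evidence whose mass sits only
    on singleton classes, the pignistic BetP equals the belief degrees."""
    return -sum(p * math.log2(p) for p in beliefs if p > 0)

# Evidence focused on one class (e.g. e4^4 = (0, 1, 0)) has zero ambiguity,
# while evenly spread beliefs are maximally ambiguous.
print(ambiguity_measure([0, 1, 0]))        # 0.0
print(ambiguity_measure([1/3, 1/3, 1/3]))  # ~1.585 = log2(3)
```

Minimizing this quantity over the adjustable referential values is what drives each piece of evidence in Tables 5~8 toward a single class.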
Furthermore, based on the universal method in Section 3.3, the training sample set U can be projected onto the optimized REM and then transformed into the corresponding decision table. Thus, the performance indicators of the training samples can be calculated by (18) as rc1=0.2675, rc2=0.0604, rc3=0.0190, rc4=0.6744, rc5=0.2712, rc6=0.0021, rc7=1. As a result, the reliability factor rj can be synthesized
as r1=0.6338, r2=0.5206, r3=0.2872, r4=0.8006, r5=0.599, r6=0.3014, r7=0.9726 by using (19).

Table 5 The optimized REM for the attributes c1~c2
 c1     value     d1      d2      d3
 e1^1   10        0.0248  0       0.9752
 e1^2   12.7414   0.4851  0.0352  0.4797
 e1^3   17.3363   0.3318  0.6631  0.0051
 e1^4   22        0       1       0
 c2     value     d1      d2      d3
 e2^1   12        0.0411  0       0.9589
 e2^2   13.7482   0.4844  0.0914  0.4242
 e2^3   16.4546   0.2117  0.7865  0.0018
 e2^4   18        0       1       0

Table 6 The optimized REM for the attributes c3~c4
 c3     value     d1      d2      d3
 e3^1   0.8       0.0203  0       0.9797
 e3^2   0.8451    0.3068  0.2623  0.4309
 e3^3   0.9060    0.3955  0.4482  0.1563
 e3^4   0.92      0.5061  0.4939  0
 c4     value     d1      d2      d3
 e4^1   1         0.3206  0.0245  0.6549
 e4^2   5.8285    0.4613  0.1087  0.4300
 e4^3   6.1512    0.0057  0.9943  0
 e4^4   6.8       0       1       0

Table 7 The optimized REM for the attributes c5~c6
 c5     value     d1      d2      d3
 e5^1   2.6       0.0126  0       0.9874
 e5^2   2.9530    0.4596  0.0761  0.4643
 e5^3   3.6654    0.3385  0.6348  0.0267
 e5^4   4.2       0       1       0
 c6     value     d1      d2      d3
 e6^1   0         1       0       0
 e6^2   1.4238    0.4657  0.3486  0.1857
 e6^3   6.0222    0.1964  0.3308  0.4728
 e6^4   9         0.0891  0.0888  0.8221

Table 8 The optimized REM for the attribute c7
 c7     value     d1      d2      d3
 e7^1   4.5       0.4911  0.0112  0.4977
 e7^2   5.4880    0.4620  0.0770  0.4610
 e7^3   5.9203    0.0061  0.9939  0
 e7^4   6.8       0       1       0
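The synthesized reliabilities are consistent with (19) acting as the arithmetic mean of the evidence-level factor rREMj from (17) and the sample-level factor rcj from (18); a quick numerical check under that assumption (the averaging form reproduces all seven reported values):

```python
# Evidence-level and sample-level reliability factors for c1~c7 (Section 4.1)
r_rem = [1, 0.9808, 0.5555, 0.9268, 0.9269, 0.6008, 0.9453]   # from (17)
r_c   = [0.2675, 0.0604, 0.0190, 0.6744, 0.2712, 0.0021, 1]   # from (18)

r = [(a + b) / 2 for a, b in zip(r_rem, r_c)]
# Matches the reported r1..r7 = 0.6338, 0.5206, 0.2872, 0.8006, 0.599, 0.3014, 0.9726
reported = [0.6338, 0.5206, 0.2872, 0.8006, 0.5990, 0.3014, 0.9726]
assert all(abs(x - y) < 1e-4 for x, y in zip(r, reported))
```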
For a sample with attribute values in the training set U, its initial estimated class can be determined from the combined result of the activated evidence given by the ER rule. Furthermore, according to the GA-based optimization method in Section 3.4, the optimal parameter set W can be obtained from the training samples. The importance weights after optimization are w1=0.4825, w2=0.6185, w3=0.0702, w4=0.6360, w5=0.8714, w6=0.5716, w7=0.9354. Following the above procedure, the class label of a testing sample can be estimated using the ER classifier with the optimized parameters A and W. For instance, consider a testing sample [c1=17.36, c2=15.76, c3=0.8785, c4=6.145, c5=3.574, c6=3.526, c7=5.971] whose true class is d2. The activated referential evidence (ARE) and the corresponding similarity degrees (SD) in (11b) are listed in Table 9. By using (20) to calculate the weighted sum of the belief degrees of the AREs, the 7 pieces of evidence e1~e7 activated by c1~c7 are obtained as shown in Table 10.
Table 9 ARE and its SD for each attribute cj
 cj           ARE1   SD for ARE1   ARE2   SD for ARE2
 c1=17.36     e1^3   0.9949        e1^4   0.0051
 c2=15.76     e2^2   0.2566        e2^3   0.7434
 c3=0.8785    e3^2   0.4515        e3^3   0.5485
 c4=6.145     e4^2   0.0193        e4^3   0.9807
 c5=3.574     e5^2   0.1283        e5^3   0.8717
 c6=3.526     e6^2   0.5428        e6^3   0.4572
 c7=5.971     e7^3   0.9423        e7^4   0.0577
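The similarity degrees in Table 9 follow a linear interpolation rule: an attribute value falling between two adjacent referential values activates both, with weights proportional to its closeness to each. A sketch of this rule, which reproduces the Table 9 entries (the linear form of (11b) is inferred from the tabulated numbers):

```python
def activate(value, referential):
    """Return the two activated referential evidence indices (1-based) and
    their similarity degrees, assuming the linear interpolation rule of (11b)."""
    refs = sorted(referential)
    for k in range(len(refs) - 1):
        lo, hi = refs[k], refs[k + 1]
        if lo <= value <= hi:
            sd_hi = (value - lo) / (hi - lo)   # similarity to the upper value
            return (k + 1, 1 - sd_hi), (k + 2, sd_hi)
    raise ValueError("value outside referential range")

# c1 = 17.36 with the optimized referential values of attribute c1 (Table 5):
(k1, sd1), (k2, sd2) = activate(17.36, [10, 12.7414, 17.3363, 22])
print(k1, round(sd1, 4), k2, round(sd2, 4))  # 3 0.9949 4 0.0051
```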
Table 10 The 7 pieces of evidence e1~e7 respectively activated by c1~c7
 cj           ej    d1      d2      d3
 c1=17.36     e1    0.3301  0.6648  0.0051
 c2=15.76     e2    0.2817  0.6081  0.1102
 c3=0.8785    e3    0.3555  0.3643  0.2802
 c4=6.145     e4    0.0145  0.9772  0.0083
 c5=3.574     e5    0.3540  0.5631  0.0829
 c6=3.526     e6    0.3426  0.3404  0.3170
 c7=5.971     e7    0.0058  0.9942  0
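Each activated piece of evidence in Table 10 is the SD-weighted sum of the belief vectors of its two AREs, as in (20). For example, e1 is reproduced from e1^3 and e1^4 of Table 5 with the similarity degrees of Table 9 (the helper name is ours):

```python
def activated_evidence(ares):
    """Combine (belief vector, similarity degree) pairs of the activated
    referential evidence into one piece of evidence, per the weighted
    sum in (20)."""
    n = len(ares[0][0])
    return [sum(sd * b[i] for b, sd in ares) for i in range(n)]

# e1^3 and e1^4 for c1 = 17.36 (beliefs from Table 5, SDs from Table 9)
e1 = activated_evidence([([0.3318, 0.6631, 0.0051], 0.9949),
                         ([0, 1, 0], 0.0051)])
print([round(b, 4) for b in e1])  # [0.3301, 0.6648, 0.0051] as in Table 10
```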
As a result, by using the ER rule in (5) to combine e1~e7 with the reliability factors rj and the importance weights wj, the fused result is obtained as O(c(i))={(d1, 0.0033), (d2, 0.9961), (d3, 0.0006)}, where d2 has the maximum degree of belief. We therefore predict that this testing sample belongs to seed Rosa (d2), which coincides with its actual class. It can be seen that the ER combination mechanism effectively refocuses the belief degrees of e1~e7 on the desired class label. Table 11 shows the confusion matrix over all testing samples given by the optimized ER-based classifier.

Table 11 The confusion matrix of all testing samples
                      Predicted class
 Actual class     d1    d2    d3    Total
 d1               24     1     3     28
 d2                0    28     0     28
 d3                0     0    28     28
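The fusion of e1~e7 described above can be sketched compactly for singleton classes, following the recursive form of the ER rule in Yang and Xu [29] (this is our reading of the rule; the paper's (5) is authoritative). Each piece of evidence is turned into weighted masses plus a residual term for unreliability, then combined pairwise and normalized:

```python
def er_fuse(evidence, weights, reliabilities):
    """Fuse belief vectors over singleton classes with the ER rule.
    evidence: list of belief vectors p_j; weights w_j; reliabilities r_j."""
    def masses(p, w, r):
        c = 1.0 / (1.0 + w - r)              # normalization: masses sum to 1
        return [c * w * pi for pi in p], c * (1.0 - r)  # class masses, residual

    m, m_res = masses(evidence[0], weights[0], reliabilities[0])
    for p, w, r in zip(evidence[1:], weights[1:], reliabilities[1:]):
        n, n_res = masses(p, w, r)
        # For singletons, only identical classes reinforce each other; the
        # residual mass is compatible with every class.
        fused = [m[i] * n[i] + m[i] * n_res + m_res * n[i] for i in range(len(m))]
        m_res *= n_res
        s = sum(fused) + m_res               # conflict removed by normalization
        m, m_res = [x / s for x in fused], m_res / s
    total = sum(m)
    return [x / total for x in m]

# e1~e7 from Table 10 with the optimized r_j and w_j of Section 4.1
E = [[0.3301, 0.6648, 0.0051], [0.2817, 0.6081, 0.1102],
     [0.3555, 0.3643, 0.2802], [0.0145, 0.9772, 0.0083],
     [0.3540, 0.5631, 0.0829], [0.3426, 0.3404, 0.3170],
     [0.0058, 0.9942, 0.0]]
W = [0.4825, 0.6185, 0.0702, 0.6360, 0.8714, 0.5716, 0.9354]
R = [0.6338, 0.5206, 0.2872, 0.8006, 0.5990, 0.3014, 0.9726]
fused = er_fuse(E, W, R)
# d2 receives by far the largest fused belief, matching the reported decision.
```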
To obtain a more reliable assessment of the ER classifier's performance, we repeat the above experiment 100 times, each time randomly selecting 60% of the samples for training and 40% for testing, and then calculate three performance values: the average classification accuracy (ACA) of the initial ER-based classifier on the training set is 92.35%, the ACA of the optimized ER (O-ER) classifier on the training set is 95.03%, and the ACA of the O-ER classifier on the testing set is 91.79%.
4.2 Comparisons with some mainstream classifiers on five benchmark datasets
To further verify the validity of the proposed ER classifier with uncertainty quantification, seven well-known classifiers are compared with it: naive Bayes [40], Bayes net [41], decision tree learner (REP Tree) [42],
random forest [43], one nearest neighbor (1-NN) [44], the DC rule-based classifier (DC-core sample) [19], and the original ER classifier [22-23]. Besides the Seeds dataset tested in Section 4.1, four other datasets from the UC Irvine Machine Learning Repository [39] are used for the comparative study. The Heart dataset (Statlog collection) concerns predicting the presence or absence of heart disease from general patient information and the corresponding test results. The Wine dataset is the result of the chemical analysis of three types of wines grown in the same region in Italy but derived from three different cultivars. The Haberman dataset (Haberman's Survival Dataset) contains cases, from 1958 to 1970 at the University of Chicago's Billings Hospital, on the survival of patients who had undergone surgery for breast cancer. The Iris dataset is perhaps the best-known database in the pattern recognition literature, concerning the classification of three classes of Iris flowers. The Seeds dataset, as mentioned above, comprises the measurements of seven geometric parameters of kernels belonging to three different varieties of wheat. Table 12 summarizes the general information about these datasets.

Table 12 The general information about the five datasets
 Dataset    Samples  Classes  Attributes
 Heart      270      2        13
 Wine       178      3        13
 Haberman   306      2        3
 Iris       150      3        4
 Seeds      210      3        7
Similarly, for each of the above datasets, 60% of the samples are randomly selected to build the training set, while the remaining 40% serve as testing data. Table 13 presents the ACA indices of the O-ER classifier and the other classifiers on the testing set over 100 experiments. It can be seen that no classifier is universally best across all datasets, but as a whole, the ACA of the O-ER classifier is somewhat higher than the others. In particular, the original ER classifier obtains the reliability factor simply by analyzing the classification quality of attribute samples [22-23], whereas the improved O-ER classifier considers two layers of reliability (attribute samples and evidence) more synthetically by using the ambiguity measure together with the approximation accuracy and approximation quality from rough set theory. Hence the latter outperforms the former in classification performance.
Table 13 The ACAs of the different classifiers for the testing set
 Dataset    Naive Bayes  Bayes net  REP Tree  Random forest  1-NN    DC-Core samples  Original ER  O-ER
 Heart      0.8056       0.7593     0.7778    0.8056         0.7500  0.7778           0.8420       0.8481
 Wine       0.9718       0.9859     0.8592    0.9718         0.9437  0.9069           0.9783       0.9731
 Haberman   0.7623       0.7787     0.7377    0.6639         0.6537  0.8000           0.7076       0.7346
 Iris       0.9333       0.9167     0.9167    0.9333         0.9000  0.9667           0.9619       0.9660
 Seeds      0.8810       0.9048     0.8810    0.8926         0.8690  0.9048           0.8956       0.9179
 Average    0.8708       0.8691     0.8345    0.8543         0.8233  0.8712           0.8771       0.8879
4.3 Refined statistical analysis of the classification results
In the above experiments, the percentage of test samples (PT) was fixed at 40% of the total samples; here we further consider more cases (PT=20%, 30%, 40%, 50%), each with 100 random experiments on the Seeds data, to verify the performance of the proposed reliability factor acquisition method in the O-ER classifier. In Table 14, the random forest classifier and the original ER classifier are compared with the O-ER classifier under the different PT cases. Two main statistical indices (SI) are calculated for the three classifiers: the ACA and the standard deviation of classification accuracy (SDCA). The SDCA describes the deviation between the mean classification accuracy and the single classification accuracy obtained in one random experiment; a small SDCA means the corresponding classifier provides relatively stable performance in every test, i.e., there is rarely an overly good or overly poor classification result. From Table 14, it can be seen that O-ER provides a relatively small SDCA compared with the other two methods, while its ACA is always the best in all PT cases. Naturally, as PT increases from 20% to 50%, the percentage of training samples decreases from 80% to 50%, so the number of training samples available for modeling the three classifiers also decreases, which causes their ACAs to reduce gradually. To further illustrate the advantage of the improved ER classifier over the original one, a statistical analysis of the reliability factors is given for the different PT cases. Table 15 shows the mean value M(r) of the reliability factor of each attribute in the Seeds classification problem over 100 random experiments, together with the corresponding standard deviation (SD-M) of the seven mean values.
Obviously, the standard deviations of O-ER are always larger than those of the original ER classifier in all PT cases, which means O-ER can distinguish the different contributions of the seven attributes in the evidence fusion procedure in (5) more clearly than the original ER classifier, because the former synthetically considers the reliabilities of both the attribute samples and the evidence in the proposed reliability factor acquisition method.

Table 14 The statistical analyses of ACA & SDCA for different percentages of test samples (PT)
 PT    SI     Random forest  Original ER  O-ER
 20%   ACA    0.9021         0.9041       0.9233
       SDCA   0.0444         0.0374       0.0339
 30%   ACA    0.8987         0.9032       0.9202
       SDCA   0.0389         0.0306       0.0301
 40%   ACA    0.8926         0.8956       0.9179
       SDCA   0.0317         0.0243       0.0226
 50%   ACA    0.8896         0.8943       0.9167
       SDCA   0.0299         0.0241       0.0239

Table 15 The statistical analyses of reliability factors for different percentages of test samples
 PT    Classifier    M(rc1)  M(rc2)  M(rc3)  M(rc4)  M(rc5)  M(rc6)  M(rc7)  SD-M
 20%   Original ER   0.9174  0.8576  0.6909  0.7823  0.9973  0.6603  0.8603  0.1118
       O-ER          0.6210  0.4830  0.3747  0.5695  0.8728  0.3505  0.6067  0.1638
 30%   Original ER   0.9247  0.8635  0.6883  0.7928  0.9933  0.6562  0.8682  0.1133
       O-ER          0.6310  0.4873  0.3688  0.5972  0.7866  0.3550  0.6870  0.1501
 40%   Original ER   0.9226  0.8663  0.6788  0.7866  0.9861  0.6492  0.8643  0.1149
       O-ER          0.6348  0.4892  0.3676  0.6147  0.7669  0.3511  0.7182  0.1521
 50%   Original ER   0.9260  0.8678  0.6530  0.7651  0.9701  0.6230  0.8383  0.1223
       O-ER          0.6909  0.5310  0.3739  0.6079  0.7364  0.3561  0.7005  0.1447
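The two statistical indices are simply the sample mean and sample standard deviation of the per-run accuracies, e.g. (the accuracy values below are hypothetical, for illustration only):

```python
import statistics

def aca_sdca(accuracies):
    """Average classification accuracy and its (sample) standard deviation
    over repeated random train/test splits."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

acc = [0.91, 0.93, 0.90, 0.94, 0.92]   # hypothetical results of 5 runs
aca, sdca = aca_sdca(acc)
print(round(aca, 3))  # 0.92
```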
5. Conclusion
In this paper, an improved ER rule-based classifier is proposed for classification problems, in which the reliability of evidence is studied mainly from the viewpoint of uncertainty quantification. The advantages of the proposed method are summarized as follows: (1) The information transformation from training samples to evidence is purely data-driven, without requiring knowledge of the specific statistical distributions of the sample data. (2) The reliability of evidence builds the bridge between the preceding evidence acquisition from samples and the following evidence combination with importance weights, so that the performance of DST-based classification methods is improved. (3) Based on the ambiguity measure in DST and the approximation accuracy and approximation quality in rough set theory, a universal method is presented for obtaining the reliability factor by quantifying the uncertainties of attribute samples and the corresponding evidence. The proposed method is suitable for any evidence-based classification case using quantitative data and qualitative knowledge, as long as they can be transformed into multiple pieces of evidence in the given application background. Certainly, some deeper problems need to be researched in the future. For instance, there has so far been little in-depth research on the optimization space of the ER model, which determines which optimization algorithms, besides the genetic algorithm used in this paper, are applicable. Furthermore, in the experiments of Section 4, the numbers of referential values of the attributes are determined by qualitative expert knowledge. In future research, they could also be treated as adjustable structural parameters of the REMs to be tuned by available optimization algorithms, so that the constructed ER model has a better structure and less computational burden.
Acknowledgements
This work was supported by the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization (No. U1709215), the Zhejiang Province Public Welfare Technology Application Research Project (No. LGF20H270004), the NSFC (Nos. 61433001, 61733009, 61903108, 71601180), the Zhejiang Province Key R&D Projects (Nos. 2019C03104, 2018C04020), and the Natural Science Foundation of Zhejiang Province (No. LY15H270013). Corresponding authors: Xiaobin Xu, Yu Bai. Email: [email protected] (Xiaobin Xu), [email protected] (Yu Bai).
References
[1] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley-Interscience, New York, 2000. [2] Z. Liu, Q. Pan, J. Dezert, J.W. Han, Y. He, Classifier fusion with contextual reliability evaluation, IEEE Transactions on Cybernetics 48 (5) (2018) 1605-1618. [3] Z.G. Liu, Q. Pan, J. Dezert, G. Mercier, Hybrid classification system for uncertain data, IEEE Transactions on Systems, Man, and Cybernetics: Systems 47 (10) (2017) 2783-2790. [4] X. Xu, H. Xu, C. Wen, J. Li, P. Hou, J. Zhang, A belief rule-based evidence updating method for industrial alarm system design, Control Engineering Practice 81 (2018) 73-84.
[5] T. Wang, J. Qi, H. Xu, Y. Wang, D. Gao, Fault diagnosis method based on FFT-RPCA-SVM for cascaded-multilevel inverter, ISA Transactions 60 (2015) 156-163. [6] T. Wang, H. Wu, M. Ni, M. Zhang, J. Dong, M.E.H. Benbouzid, An adaptive confidence limit for periodic non-steady conditions fault detection, Mechanical Systems and Signal Processing 72 (2016) 328-345. [7] M. Eshtay, H. Faris, N. Obeid, Improving extreme learning machine by competitive swarm optimization and its application for medical diagnosis problems, Expert Systems with Applications 104 (2018) 134-152. [8] T. Bocklitz, M. Putsche, C. Stüber, J. Käs, A. Niendorf, P. Rösch, J. Popp, A comprehensive study of classification methods for medical diagnosis, Journal of Raman Spectroscopy 40 (12) (2010) 1759-1765. [9] H.Y. Jin, K.S. Jin, K. Dayeon, L. Keondo, C.W. Kyun, Super-high-purity seed sorter using low-latency image-recognition based on deep learning, IEEE Robotics & Automation Letters 3 (4) (2018) 3035-3042. [10] M.L. Zhang, V. Robles, Feature selection for multi-label naive Bayes classification, Information Sciences 179 (19) (2009) 3218-3229. [11] A. Antonucci, M. Zaffalon, Fast algorithm for robust classification with Bayesian nets, International Journal of Approximate Reasoning 44 (3) (2007) 200-223. [12] E. Tu, Y. Zhang, L. Zhu, J. Yang, N. Kasabov, A graph-based semi-supervised k nearest-neighbor method for nonlinear manifold distributed data classification, Information Sciences 367-368 (2016) 673-688. [13] M. Wang, Y. Wan, Z. Ye, X. Lai, Remote sensing image classification based on the optimal support vector machine and modified binary coded ant colony optimization algorithm, Information Sciences 402 (2017) 50-68. [14] B. Krawczyk, M. Woźniak, G. Schaefer, Cost-sensitive decision tree ensembles for effective imbalanced classification, Applied Soft Computing 14 (1) (2014) 554-562. [15] A. Paul, D.P. Mukherjee, P. Das, A. Gangopadhyay, A.R. Chintha, S. 
Kundu, Improved random forest for classification, IEEE Transactions on Image Processing 27 (8) (2018) 4012-4024. [16] X. Gao, L. Wei, M. Loomes, L. Wang, A fused deep learning architecture for viewpoint classification of echocardiography, Information Fusion 36 (2016) 103-113. [17] L.L. Chang, Z.J. Zhou, Y. You, L. Yang, Z. Zhou, Belief rule based expert system for classification problems with new rule activation and weight calculation procedures, Information Sciences 336 (2016) 75-91. [18] M. Ahangaran, N. Taghizadeh, H. Beigy, Associative cellular learning automata and its applications, Applied Soft Computing 53 (2017) 1-18. [19] C. Zhang, Y. Hu, F.T.S. Chan, R. Sadiq, Y. Deng, A new method to determine basic probability assignment using core samples, Knowledge-Based Systems 69 (2014) 140-149. [20] T. Denoeux, A neural network classifier based on Dempster-Shafer theory, IEEE Transactions on Systems Man and Cybernetics-Part A Systems and Humans 30 (2) (2000) 131-150. [21] T. Denoeux, A k-nearest neighbour classification rule based on Dempster-Shafer theory, IEEE Transactions on Systems, Man and Cybernetics 25 (5) (1995) 804-813. [22] X.B. Xu, J. Zheng, J.B. Yang, D.L. Xu, Y.W. Chen, Data classification using evidence reasoning rule, Knowledge Based Systems 116 (2017) 144-151.
[23] J. Zheng, Classification-decision based fault diagnosis using evidence reasoning, Master’s thesis, Hangzhou Dianzi University, 2017. [24] Y. Yang, D.L. Xu, J.B. Yang, Y.W. Chen, An evidential reasoning-based decision support system for handling customer complaints in mobile telecommunications, Knowledge-Based Systems 162 (2018) 202-210. [25] M. Zhou, X.B. Liu, J.B. Yang, Y.W. Chen, J. Wu, Evidential reasoning approach with multiple kinds of attributes and entropy-based weight assignment, Knowledge-Based Systems 163 (2019) 358-375. [26] M. Zhou, X.B. Liu, Y.W. Chen, J.B. Yang, Evidential reasoning rule for MADM with both weights and reliabilities in group decision making, Knowledge-Based Systems 143 (2018) 142-161. [27] X.X. Zhang, Y.M. Wang, S.Q. Chen, J.F. Chu, L. Chen, Gini coefficient-based evidential reasoning approach with unknown evidence weights, Computers & Industrial Engineering 124 (2018) 157-166. [28] X.B. Xu, J. Zheng, J.B. Yang, D.L. Xu, X.Y. Sun, Track irregularity fault identification Based on Evidence Reasoning rule, in: 2016 IEEE International Conference on Intelligent Rail Transportation (ICIRT), Birmingham, UK, 2016, pp. 298-306. [29] J.B. Yang, D.L. Xu, Evidential reasoning rule for evidence combination, Artificial Intelligence 205 (2013) 1-29. [30] J.B. Yang, D.L. Xu, A study on generalising Bayesian inference to evidential reasoning, in: International Conference on belief functions, Springer International Publishing, New York, 2014, pp. 180-189. [31] Z. Pawlak, A. Skowron, Rudiments of rough sets, Information Sciences 177 (1) (2006) 3-27. [32] Z. Pawlak, Rough sets and intelligent data analysis, Information Sciences 147 (1) (2002) 1-12. [33] G. Shafer, A mathematical theory of evidence, Princeton University Press, New Jersey, 1976. [34] A.L. Jousselme, C. Liu, D. Grenier, E. Bosse, Measuring ambiguity in the evidence theory, IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 36 (5) (2006) 890-903. [35] D. 
Harmanec, G.J. Klir, Measuring total uncertainty in Dempster-Shafer theory, International Journal of General Systems 22 (4) (1994) 405-419. [36] J. Dai, Q. Xu, Approximations and uncertainty measures in incomplete information systems, Information Sciences 198 (2012) 62-80. [37] M. Zhou, X.B. Liu, Y.W. Chen, J.B Yang, Evidential Reasoning Rule for MADM with both Weights and Reliabilities in Group Decision Making, Knowledge-based systems 143 (2018) 142-161. [38] S.N. Sivanandam, S.N. Deepa, Introduction to Genetic Algorithm, Springer, Berlin Heidelberg, 2008. [39] K. Bache, M. Lichman, UCI machine learning repository, University of California, School of Information and Computer Science, Irvine, CA, 2013. http://archive.ics.uci.edu/ml. [40] I. Rish, An empirical study of the naive Bayes classifier, in: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Morgan Kaufmann, 2001, pp. 41-46. [41] N. Friedman, D. Geiger, M. Goldszmidt, Bayesian network classifiers, Mach Learn 29 (2-3) (1997) 131-163. [42] Y. Freund, L. Mason, The alternating decision tree learning algorithm, in: Proceedings of the ICML, Morgan Kaufmann, 1999, pp. 124-133.
[43] L. Breiman, Random forests, Mach Learn 45 (1) (2001) 5-32. [44] T. Cover, P. Hart, Nearest neighbour pattern classification, IEEE Trans. Inform. Theory 13 (1) (1967) 21-27.