Extended Naive Bayes classifier for mixed data





Expert Systems with Applications 35 (2008) 1080–1083

Chung-Chian Hsu a, Yan-Ping Huang a,b,*, Keng-Wei Chang a

a Department of Information Management, National Yunlin University of Science and Technology, 123, Section 3, University Road, Douliu, Yunlin 640, Taiwan, ROC
b Department of Information Management, Chin Min Institute of Technology, 110, Hsueh-Fu Road, Tou-Fen, Miao-Li 351, Taiwan, ROC

Abstract

The Naive Bayes induction algorithm is very popular in the classification field. The traditional method for dealing with numeric data is to discretize numeric attributes into symbols, but different discretization criteria have a significant effect on performance. Moreover, several recent studies have employed the normal distribution to handle numeric data, but using only one value to estimate the population easily leads to incorrect estimates. As a result, research on classifying mixed data with Naive Bayes classifiers has not been very successful. In this paper, we propose a classification method, Extended Naive Bayes (ENB), which is capable of handling mixed data. The experimental results demonstrate the efficiency of our algorithm in comparison with other classification algorithms, e.g. CART, DT and MLPs.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Naive Bayes classifier; Classification; Mixed data

1. Introduction

Naive Bayes classifiers are very robust with respect to irrelevant attributes, and they combine evidence from many attributes to make the final prediction. They are generally easy to understand, and their induction is extremely fast, requiring only a single pass through the data. However, the algorithm is limited to categorical or discrete data; in other words, it cannot be applied directly to mixed data, which includes both categorical and numeric attributes. The traditional way of dealing with numeric data is to discretize numeric attributes into symbols, but different discretization criteria have a significant effect on performance. Moreover, several recent studies have employed the normal distribution to handle numeric data, and using only one value to estimate the population easily leads to incorrect estimates. Hence, research on classifying mixed data with Naive Bayes classifiers has not been very successful.

In this paper, we propose a classification method, ENB, which is capable of handling mixed data. For categorical data, we use the original Naive Bayes approach to calculate the probabilities of categorical values. For continuous data, we adopt a statistical approach in which we take into account not only the average but also the variance of the numeric values. For an unknown input pattern, the product of the probabilities and the P-values is calculated, and the class that yields the maximum product is designated as the target class to which the input pattern belongs.

* Corresponding author. Address: Department of Information Management, National Yunlin University of Science and Technology, 123, Sec. 3, University Road, Douliu, Yunlin 640, Taiwan, ROC. Tel.: +886 37627153; fax: +886 97605684. E-mail addresses: [email protected] (C.-C. Hsu), [email protected] (Y.-P. Huang), [email protected] (K.-W. Chang).


2. Naive Bayesian classifier

Bayesian networks have been successfully applied to a great number of classification problems, and there has been a surge of interest in learning Bayesian networks from data. The goal is to induce a network that best captures the dependencies among the variables of the given data.

A Naive Bayesian classifier assumes conditional independence among all attributes given the class variable. It learns from training data the conditional probability of each attribute given its class label (Duda & Hart, 1973; Langley, Iba, & Thompson, 1992). A simple Bayesian network classifier, which in practice often performs surprisingly well, is the Naive Bayesian classifier. This classifier learns the class-conditional probabilities P(X_i = x_i | C = c_l) of each attribute X_i given the class label c_l. A new test case X_1 = x_1, X_2 = x_2, ..., X_n = x_n is then classified by using Bayes' rule to compute the posterior probability of each class c_l given the vector of observed attribute values:

$$P(C = c_l \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{P(C = c_l)\, P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid C = c_l)}{P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)} \qquad (1)$$

The simplifying assumption behind the Naive Bayesian classifier is that the attributes are conditionally independent given the class label:

$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid C = c_l) = \prod_{i=1}^{n} P(X_i = x_i \mid C = c_l) \qquad (2)$$
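To make Eqs. (1) and (2) concrete, the following minimal sketch estimates class priors and class-conditional probabilities from purely categorical training data and scores a test case; it is an illustration rather than the authors' implementation, and the toy attributes and records are hypothetical.

```python
from collections import Counter

def train_naive_bayes(records, labels):
    """Estimate P(C = c) and P(X_i = x_i | C = c) from categorical training data."""
    n = len(labels)
    class_counts = Counter(labels)
    # cond_counts[(i, value, c)] = number of records with attribute i = value and class c
    cond_counts = Counter()
    for record, c in zip(records, labels):
        for i, value in enumerate(record):
            cond_counts[(i, value, c)] += 1
    priors = {c: class_counts[c] / n for c in class_counts}
    def conditional(i, value, c):
        return cond_counts[(i, value, c)] / class_counts[c]
    return priors, conditional

def posterior_scores(x, priors, conditional):
    """Unnormalized P(C = c | x) per Eqs. (1)-(2); the evidence P(x) is omitted
    because it is the same for every class."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for i, value in enumerate(x):
            p *= conditional(i, value, c)
        scores[c] = p
    return scores

# Hypothetical toy data: two categorical attributes (outlook, wind), class = play.
records = [("sunny", "weak"), ("sunny", "strong"), ("rain", "weak"), ("rain", "strong")]
labels = ["yes", "no", "yes", "no"]
priors, conditional = train_naive_bayes(records, labels)
print(posterior_scores(("sunny", "weak"), priors, conditional))
```

The class with the largest unnormalized score is the prediction; normalizing the scores so they sum to 1 recovers the posterior probabilities of Eq. (1).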

This assumption simplifies the estimation of the class-conditional probabilities from the training data. Notice that one does not estimate the denominator in Eq. (1), since it is independent of the class; instead, one normalizes the numerator term to 1 over all classes. Bayesian network models are widely used for discriminative prediction tasks such as classification. In recent years it has been recognized, both theoretically and experimentally, that in many situations it is better to use a matching 'discriminative' or 'supervised' learning algorithm such as conditional likelihood maximization (Friedman, Geiger, & Goldszmidt, 1997; Greiner, Grove, & Schuurmans, 1997; Kontkanen, Myllymaki, & Tirri, 2001; Ng & Jordan, 2001). Naive Bayesian classifiers have proven successful in many domains, despite the simplicity of the model and the restrictiveness of the independence assumptions it makes. However, the Naive Bayesian algorithm handles only categorical data; it cannot reasonably express the probability relation between two numeric values or preserve the structure of numeric values. The Extended Naive Bayesian algorithm is intended as a simple and effective classification algorithm for data mining that properly handles mixed data.

3. Extended Naive Bayesian classification algorithm

The ENB algorithm is a simple and effective classification algorithm that handles mixed data. For a categorical attribute, the conditional probability that an instance belongs to a certain class c given that the instance has attribute value A = a, P(C = c | A = a), is given by

$$P(C = c \mid A = a) = \frac{P(C = c \cap A = a)}{P(A = a)} = \frac{n_{ac}}{n_a} \qquad (3)$$

where n_ac is the number of instances in the training set that have class value c and attribute value a, while n_a is the number of instances that simply have attribute value a. Because the data are horizontally partitioned, each party has partial information about every attribute; each party can locally compute its local count of instances, and the global count is the sum of the local counts. For a numeric attribute, the necessary parameters are the mean μ and variance σ² for each class. Again, the necessary information is split between the parties. To compute the mean, each party sums the attribute values of its instances having the same class value; these local sums are added together and divided by the total number of instances of that class to obtain the mean for that class value.

The ENB classification algorithm proceeds in the following steps.

Step 1. The training dataset consists of instances x_i held by the parties, class values C_k, and attribute values w. For numeric attributes, calculate the mean and the variance; for categorical attributes, tally the occurrence counts of each value.

Step 2. Calculate the probability of an instance having the class and the attribute values. For numeric attributes, the probability of attribute value x_i in class C_k is determined according to

$$p(x_i \mid C_k) = p(C_k) \prod_{i=1}^{att_i} p(w_{i,t} \mid C_k) \times \prod_{j=i+1}^{att_j} p\!\left( z \geq \frac{\lvert \bar{X}_j - \bar{X}'_j \rvert - (\mu_j - \mu'_j)}{\sqrt{\hat{\sigma}_j^{2}/n_j + \hat{\sigma}_j'^{2}/n'_j}} \right) \qquad (4)$$

For categorical attributes, the probability of attribute value x_i taking domain value w_{i,t} in class C_k is used, and the decision rule is: if P(C_i | x) > P(C_j | x), then x is in class C_i; otherwise x is in class C_j. The Bayesian approach to classifying the new instance is to assign the most probable target value, P(C_i | x), given the attribute values {w_1, w_2, ..., w_n} that describe the instance:

$$P(C_i \mid x) = \frac{P(C_i \cap x)}{P(x)} = \frac{P(x \mid C_i)\, P(C_i)}{P(x)} \qquad (5)$$
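The numeric factors in Eq. (4) are P-values of a z statistic built from the class means and variances. Because the exact form of Eq. (4) is difficult to recover from the typeset text, the sketch below is only one plausible reading: it standardizes the distance of an input value from the class mean and converts it to an upper-tail normal probability. The simplified one-sample form and the helper names are assumptions, not the authors' exact formula.

```python
import math

def upper_tail_p(z):
    """P(Z >= z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def numeric_compatibility(x, class_mean, class_std):
    """P-value-style score for a numeric attribute value x under a class whose
    training values have the given mean and standard deviation.  The z statistic
    used here (distance of x from the class mean in standard-deviation units) is
    a simplified stand-in for the two-sample style statistic of Eq. (4)."""
    if class_std == 0:
        return 1.0 if x == class_mean else 0.0
    z = abs(x - class_mean) / class_std
    return upper_tail_p(z)

# Hypothetical example: attribute value 42 against a class with mean 40 and std 5.
print(numeric_compatibility(42.0, 40.0, 5.0))  # larger P-value -> more compatible with the class
```

The important point, which survives the garbling, is that a numeric attribute contributes a value in [0, 1] that shrinks as the observation moves away from the class statistics, so it can be multiplied with the categorical probabilities in the same product.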


Table 1
Scores of the classification algorithms on the five datasets (mean with minimum and maximum values in parentheses) and the resulting ARR.

Algorithm   Credit approval            German credit             Hepatitis                   Horse                       Breast cancer                ARR
ENB         6.9245 (6.2083, 8.1053)    3.0322 (2.6022, 3.8551)   11.9333 (7.0000, 27.0000)   5.2185 (3.9286, 6.6667)     25.0150 (22.1000, 32.0000)   0.7667
MLP's       5.8885 (5.0702, 6.8636)    2.6528 (2.2843, 2.8953)   5.5871 (4.0909, 10.2000)    3.9193 (2.6316, 5.2727)     29.2250 (20.0000, 45.2000)   0.4000
CART        5.4120 (4.1642, 7.2381)    2.5758 (2.2212, 2.9412)   6.9730 (3.3077, 10.2000)    26.4563 (3.3125, 68.0000)   14.0205 (7.8846, 20.0000)    0.3867
DT          6.2979 (4.7667, 7.8718)    2.5758 (2.2212, 2.9412)   8.1958 (4.0909, 17.6667)    21.1898 (5.2727, 33.5000)   14.8546 (8.2400, 22.1000)    0.3567

The ENB classifier makes the simplifying assumption that the attribute values are conditionally independent given the target value:

$$P(x \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) \qquad (6)$$

The categorical values are normalized before calculating the probability of each class. The normalized estimate is

$$P(w_{i,t} \mid C_k) = \frac{1 + \sum_{x_i \in C_k} N(w_{i,t}, x_i)\, p(C_k \mid x_i)}{|V| + \sum_{x_i \in C_k} \sum_{t=1}^{|V|} N(w_{i,t}, x_i)\, p(C_k \mid x_i)} \qquad (7)$$

where |V| is the size of the domain of the attribute value x_i.

Step 3. All parties calculate the probability of each class according to

$$p_i = \prod_{j=1}^{m} p[j] \qquad (8)$$

Step 4. Select the class with the maximal probability,

$$\max_{i=1,\ldots,\text{classCount}} \; p[i] \qquad (9)$$
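Read together, Steps 2-4 amount to: smooth each categorical conditional probability (Eq. (7)), multiply the per-attribute factors for each class (Eq. (8)), and return the class with the maximal product (Eq. (9)). The sketch below is an illustrative reading under that interpretation, not the authors' implementation; the data-structure layout and the Laplace-style simplification of Eq. (7) are assumptions, and it reuses the numeric_compatibility helper sketched above.

```python
from collections import Counter

def smoothed_conditional(value_count, total_count, domain_size):
    """Eq. (7)-style smoothed estimate of P(w | C_k): add 1 to the numerator and
    the domain size |V| to the denominator so that unseen values keep a non-zero
    probability (a Laplace-style reading of the formula; an assumption)."""
    return (1 + value_count) / (domain_size + total_count)

def classify_enb(x, classes, priors, cat_counts, cat_domains, num_stats, numeric_compatibility):
    """Steps 2-4: per-class product of categorical probabilities and numeric
    P-values (Eq. (8)), then the arg-max class (Eq. (9)).
    cat_counts[c][i] is a Counter over values of categorical attribute i in class c,
    cat_domains[i] is the domain size |V| of categorical attribute i,
    num_stats[c][j] is the (mean, std) of numeric attribute j in class c."""
    best_class, best_score = None, -1.0
    for c in classes:
        score = priors[c]
        for i, value in x["categorical"].items():
            counts = cat_counts[c][i]
            score *= smoothed_conditional(counts[value], sum(counts.values()), cat_domains[i])
        for j, value in x["numeric"].items():
            mean, std = num_stats[c][j]
            score *= numeric_compatibility(value, mean, std)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

In this reading the "+1" and "|V|" terms of Eq. (7) act like a Laplace correction: they keep a single unseen categorical value from zeroing out the whole product of Eq. (8).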

4. Experiments and results

The algorithm was developed using Java 2.0, an Access database, and SPSS Clementine 7.2. The experiments present the results of the ENB algorithm on mixed data and compare them with other classification algorithms, e.g. the Decision Tree (DT) (Quinlan, 1993), Classification And Regression Tree (CART) (Breiman, Friedman, Olshen, & Stone, 1984), and Multilayer Perceptrons (MLPs) (Simon, 1999).

The real mixed datasets Australian Credit Approval, German Credit Data, Hepatitis, Horse Colic and Breast Cancer are taken from the UCI repository (Merz & Murphy, 1996). Australian Credit Approval has 690 records with 14 attributes, including eight categorical and six numerical attributes. German Credit Data has 1000 records with 20 attributes, including thirteen categorical and seven numerical attributes. Hepatitis has 155 records with 19 attributes, including thirteen categorical and six numerical attributes. Horse Colic has 368 records with 27 attributes, including twenty categorical and seven numerical attributes. Breast Cancer has 699 records with 9 numerical attributes.

This study uses the average reciprocal rank (ARR) (Quinlan, 1986; Voorhees & Tice, 2000) as the evaluation metric. Assume that the system retrieves n relevant items and that they are ranked r_1, r_2, ..., r_n; then ARR is defined as

$$\mathrm{ARR} = \frac{1}{N} \sum_{i=1}^{n} \frac{1}{r_i} \qquad (10)$$

The per-dataset scores (mean, minimum and maximum values) and the resulting ARR of each algorithm are shown in Table 1. The higher the ARR value, the better the classification result. ENB has the highest ARR score, and the results clearly show that the ENB algorithm has good classification quality.
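Eq. (10) is simple enough to transcribe directly. The sketch below computes ARR from the rank a method obtains on each dataset; the example ranks are hypothetical, and N is taken to equal the number of ranked items.

```python
def average_reciprocal_rank(ranks):
    """Eq. (10): ARR = (1/N) * sum of 1/r_i over the N ranked items."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical example: a method ranked 1st, 1st, 2nd, 1st and 3rd on five datasets.
print(round(average_reciprocal_rank([1, 1, 2, 1, 3]), 4))  # 0.7667
```

A method that ranks first on every dataset attains the maximal ARR of 1.0, so values close to 1 indicate consistently top-ranked performance.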

5. Conclusions and future work

This paper demonstrates that the technique of building classification algorithms from examples is fairly robust, and it proposes an efficient ENB algorithm for classification. The ENB algorithm achieves an ARR value of nearly 76% and endeavors to improve classification accuracy. Further study can apply this algorithm to other time series databases, such as financial databases.


References

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth & Brooks.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. John Wiley & Sons.
Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29(2), 131–163.
Greiner, R., Grove, A., & Schuurmans, D. (1997). Learning Bayesian nets that perform well. In Proceedings of the thirteenth annual conference on uncertainty in artificial intelligence (pp. 198–207). San Francisco.
Kontkanen, P., Myllymaki, P., & Tirri, H. (2001). Classifier learning with supervised marginal likelihood. In Proceedings of the seventeenth conference on uncertainty in artificial intelligence (pp. 277–284). San Francisco.
Langley, P., Iba, W., & Thompson, K. (1992). An analysis of Bayesian classifiers. In Proceedings of the international conference on artificial intelligence.
Merz, C. J., & Murphy, P. (1996). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
Ng, A., & Jordan, M. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and Naive Bayes. Advances in Neural Information Processing Systems, 14, 605–610.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Simon, H. (1999). Neural networks: A comprehensive foundation.
Voorhees, E. M., & Tice, D. M. (2000). The TREC-8 question answering track report. In The Eighth Text REtrieval Conference (TREC-8).