Decision analysis of data mining project based on Bayesian risk

Expert Systems with Applications 36 (2009) 4589–4594 Contents lists available at ScienceDirect Expert Systems with Applications journal homepage: ww...



Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa

Guangli Nie a,b, Lingling Zhang a,b,*, Ying Liu b, Xiuyu Zheng a,b, Yong Shi b

a School of Management, Graduate University of Chinese Academy of Sciences, Beijing 100080, China
b Research Center on Fictitious Economy and Data Science, CAS, Beijing 100080, China

Keywords: Decision analysis; Data mining; Business intelligence

Abstract

Data mining, an efficient method of business intelligence, is a process for extracting knowledge from large-scale data. As enterprises and their data grow, data mining becomes increasingly necessary as a way to make use of the data. Most of the literature, however, focuses only on the algorithms themselves. Few studies examine, from the perspective of the company manager, which conditions should be fulfilled before deciding to undertake data mining. This paper discusses the factors that affect a data mining project. Based on Bayesian risk, we build a model that takes the risk attitude of the top executive into account to help them decide whether or not to do data mining.

© 2008 Published by Elsevier Ltd.

* Corresponding author. Address: School of Management, Graduate University of Chinese Academy of Sciences, Beijing 100080, China. Tel.: +86 10 82680396; fax: +86 10 82680698. E-mail addresses: [email protected] (G. Nie), [email protected] (L. Zhang), [email protected] (Y. Liu), [email protected] (X. Zheng), [email protected] (Y. Shi).
0957-4174/$ - see front matter © 2008 Published by Elsevier Ltd. doi:10.1016/j.eswa.2008.05.014

1. Introduction

In recent years, data mining has been very well researched and a number of algorithms have been proposed (Hope & Korb, 2004; Wu, 1999; Ying, 2005). To prove the effectiveness of an algorithm, researchers test it in terms of accuracy, time cost and space cost; support, confidence, and lift are also used as measures of the interestingness of the rules or patterns in databases (Wang, Strong, & Guarascio, 1994). For example, predictive accuracy is usually used to measure the effectiveness of classification learners, accepting one machine learner as superior to another if its predictive accuracy passes a statistical significance test (Hope & Korb, 2004). However, researchers seldom consider the algorithm from the perspective of the company. Although some papers discuss business issues, their aim is to prove the effectiveness of the algorithm (Strobel & Hrycej, 2006). Given limited resources and data quality, should my company use data mining techniques on our data? What could I get if I launch a data mining project? Current research cannot answer these questions.

Data mining is an application-driven technique (Chen & Liu, 2005; Sim, 2003). It has been widely used in many applications, from tracking criminals to brokering information for supermarkets, and from developing community knowledge for a business to cross-selling and detecting customer churn. Some applications include marketing, financial investment, fraud detection, manufacturing and production, and network management. Data mining is also useful for sky survey cataloging, mapping the datasets of Venus, biosequence databases, and geosciences systems (Sim, 2003).

Although data mining is becoming more widely used, present research has not paid adequate attention to our "God": the people who will use the algorithms. Research papers seldom help managers decide whether to use data mining, taking into account the risk, the financial condition, the data quality, and the effect on the company once data mining is applied. In this paper, we try to build a mechanism to evaluate whether a company is qualified to launch a data mining project. A score based on Bayesian risk is used to measure whether the company should do data mining or not.

The remainder of this paper is organized as follows. Section 2 reviews the application of data mining, the factors affecting data mining, and Bayesian analysis. Section 3 introduces how data quality affects data mining. Section 4 describes how to evaluate the human and finance factors that are important to the success of a data mining project. Section 5 discusses the importance of the support of the top executives. Bayesian risk is presented in Section 6, followed by a case study in Section 7. Section 8 concludes this paper and points out our further work.

2. Literature review

The literature review includes three parts. The first part reviews data mining algorithms and the application of data mining. The second part covers the factors that affect the success of a data mining project. The third part covers Bayesian risk analysis.


2.1. Data mining application

Data mining models can be obtained with many algorithms. The two most common supervised modeling methods are classification and regression (Moreno García et al., 2008). Many unsupervised algorithms have been proposed, such as association rule mining, clustering (Jain, Murty, & Flynn, 1999), and sequence mining. Association rule mining was first introduced by Agrawal et al. in the context of transaction databases (Agrawal, Imielinski, & Swami, 1993a, 1993b). Data clustering has been studied in the statistics (Dubes & Jaiu, 1980; Duda & Hart, 1973; Lee, 1981; Murtagh, 1983), machine learning (Peter, James, & Matthew, 1988) and database (Raymond & Han, 1994) communities with different methods and different emphases. Many papers focus on improving existing algorithms such as naive Bayes and association rule mining (Chen, Liu, Yu, Wei, & Zhang, 2006; Chen et al., 2007). The frequent association mining problem has drawn much attention over the past decade, and many algorithms have been proposed to improve the mining of frequent itemsets (Han, Pei, & Yin, 2000). The class imbalance problem is an important issue in classification, and an approach has been proposed for mining multi-relational imbalanced databases (Lee et al., 2007).

It is the application that drives the development of data mining algorithms. With the development of information technology, organizations can collect data at very low cost, especially service organizations such as banks. The traditional tools used to analyze data cannot meet enterprises' need to process so much data. Many people in both academia and industry have realized the importance of data mining and have made efforts in this area.
Shu-Hsien Liao reviewed the literature from 1995 to 2002 and summarized the applications of data mining as follows: cross-sales, deviation detection, finance, organizational learning, user-guided query construction, interface, consumer behaviors/service, semantic indexing, data quality, health care management, knowledge refinement, prediction of failure, marketing, software integration, knowledge warehouse, grid services, and hypermedia (Liao, 2003). The use of data mining and visualization techniques for decision support in planning and regional-level management of Slovenian public health care has been shown to be applicable and practicable (Lavrac et al., 2007). Financial markets generate large volumes of data. Hui Wang et al. listed nearly 10 kinds of data mining applications for finance, such as portfolio management, probability distribution estimation for financial data, and stock charting (Hui Wang & Andreas S. Weigend). Data mining is frequently adopted to discover valuable information in huge databases (Chen & Lin, 2007). Association rule mining is widely applied to market basket analysis and transaction data analysis (Agrawal, Imielinski & Swami, 1993; Srikant and Agrawal, 1997). An approach based on association rule mining was proposed for product assortment and shelf space allocation. Owing to the increasingly massive amount of biology, chemistry and clinical data, data mining has been used to improve drug delivery by mining public and/or proprietary data (Ekins et al., 2006). Data mining has also been used in project management: a process to refine association rules, based on the generation of unexpected patterns, was proposed with the goal of generating strong association rules between attributes that can be obtained early in the project and the final software size (Moreno García et al., 2004). Data mining can play an important role in marketing, especially in customer relationship management (CRM), for example CRM for e-mail customer churn.
Data mining is also used by Internet service providers. Data mining (decision tree and support vector machine algorithms) can also be used for making cancer predictions through the analysis of gene expression data (Shah & Kusiak, 2007). Data mining classification techniques have been used to detect firms that issue fraudulent financial statements (FFS) and to identify the factors associated with FFS (Kirkos et al., 2007). Data mining technology has been introduced to the fault diagnosis of rotating machinery, where a new method based on the C4.5 decision tree and principal component analysis (PCA) was proposed (Sun, Chen, & Li, 2007). Data mining has been used in mental health (Diederich et al., 2007). A data mining application system was developed and deployed for Motorola to identify the causes of cellular phone failures. Data mining is also a good tool in academia; for example, it has been used to mine chemical information in a GRID environment (Maran, Sild, Kahn, & Takkis, 2007).

All of this research focuses on how to design a new algorithm or improve existing algorithms so as to mine more quickly and accurately. The application studies assume that the real situation, and the data being mined, are suitable for data mining. They fail to take into account the real status of the organization where the algorithms would be implemented.

2.2. The factors affecting the data mining project

Unlike the large number of papers on algorithms, little attention has been paid to the analysis of data mining projects, and there is little literature on what makes a data mining project succeed. Although data quality has been discussed, the main aim of that discussion was how to improve data quality. Data quality problems fall into two categories. The first category is related to inconsistency among systems, such as format, syntax and semantic inconsistencies. The second category is related to inconsistency with reality, as exemplified by missing, obsolete and incorrect data values and outliers.
The usefulness of the results produced by data mining methods can be critically impaired by several factors such as (1) low quality of data, including errors due to contamination or incompleteness due to limited bandwidth for data acquisition, and (2) inadequacy of the data model for capturing complex probabilistic relationships in data. The cost of data mining, including both the payment for the project and the cost of misjudgments by the data mining model, has attracted some attention. Some research shows that factors such as the probability of intrusion and the costs of responding to detected intrusions must be taken into account in order to compare the effectiveness of machine learning algorithms in the intrusion detection domain. There is some work on the fundamental trade-off in distributed data mining, namely the trade-off between the efficiency and cost-effectiveness of a distributed data mining application on one side, and the accuracy and reliability of the resulting predictive system on the other side. To the best of our knowledge, there is little research on the factors that affect the success of a data mining project, such as human factors, finance factors, the support of the chief executives, and the risk attitude of the executive. These factors are discussed separately in Sections 3–6 of this paper.

2.3. The Bayesian analysis

Research on Bayesian analysis is now quite mature, with many research achievements on the topic. We assume that the uncertainties involved in the decision problem can be considered unknown numerical quantities and represent them by h (a vector or matrix), which is commonly called the state of nature; H will be used to denote the set of all possible states of nature. Typically some experiments can be conducted to obtain statistical information about h. In decision theory, an attempt is made to


combine sample information with other relevant aspects of the problem in order to make the best decision. In addition to the sample information, two other types of information are typically relevant. The first is knowledge of the possible consequences of the decisions. Often this knowledge can be quantified by determining the loss that would be incurred for each possible decision and for the various possible values of h. (Decision theorists in economics and business speak instead in terms of gains, or utility.) Since a gain is just a negative loss, there is no real difference between loss and utility. The incorporation of a loss function into statistical analysis was first studied by Abraham Wald (1950), who also reviews earlier work in decision theory. The second is prior information. The approach to statistics which formally seeks to utilize prior information is called Bayesian analysis (named after Bayes (1763)). Bayesian analysis and decision theory go rather naturally together, partly because of their common goal of utilizing nonexperimental sources of information, and partly because of some deep theoretical ties.

Decisions are more commonly called actions. Particular actions will be denoted by a, while the set of all possible actions under consideration will be denoted A. A key element of decision theory is the loss function. If a particular action $a_1$ is taken and $h_1$ turns out to be the true state of nature, then a loss $L(h_1, a_1)$ will be incurred. Thus we assume that a loss function $L(h, a)$ is defined for all $(h, a) \in H \times A$. In practice, once the utility function $U(h, a)$ has been obtained, the loss function can simply be defined as $L(h, a) = -U(h, a)$. The desire to maximize utility then becomes the desire to minimize loss. The forms of the utility function and the loss function are well established. There are several standard loss functions, such as squared-error loss and exponential loss; the latter is used in this paper.
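As a minimal sketch of the relation between utility and loss described above, the snippet below converts a small utility table into a loss table via L(h, a) = -U(h, a) and checks that the action with the largest expected utility is also the one with the smallest expected loss. The utility values, the state probabilities, and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical utilities U[(state, action)]; these numbers are NOT from the paper.
UTILITY = {
    ("h1", "a1"): 10.0, ("h1", "a2"): 4.0,
    ("h2", "a1"): -2.0, ("h2", "a2"): 3.0,
}
# Hypothetical probabilities of the two states of nature.
PROB = {"h1": 0.6, "h2": 0.4}
ACTIONS = ("a1", "a2")


def loss_from_utility(utility):
    """Define the loss as the negative utility, L(h, a) = -U(h, a)."""
    return {key: -value for key, value in utility.items()}


def expected(table, action, prob):
    """Expectation of table[(h, action)] over the states h."""
    return sum(prob[h] * table[(h, action)] for h in prob)


LOSS = loss_from_utility(UTILITY)
best_by_utility = max(ACTIONS, key=lambda a: expected(UTILITY, a, PROB))
best_by_loss = min(ACTIONS, key=lambda a: expected(LOSS, a, PROB))
```

Because the loss is just the sign-flipped utility, the two selections always coincide; the sketch only makes that equivalence concrete.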
Although the state of nature cannot be observed directly, we can observe another (random) variable, denoted X, which is related to it and whose information is easier to collect. Often X will be a vector, $X = (X_1, X_2, \ldots, X_n)$, with the $X_i$ being independent observations from a common distribution. A particular realization will be denoted x. The probability distribution of X will, of course, depend upon the unknown state of nature h (Berger, 1985). The use of the above theory is demonstrated in Section 6 and a case is given in Section 7.

3. Data quality

Many factors can affect the success of a project, including data quality, human factors, manager factors, budget, and algorithm. Some of them are subjective and others are objective. In this section, we focus on data quality, which is the foundation of a successful data mining project. As the proverb says, "If you have no hand you can't make a fist." Data is the base: no good data, no good data mining (Han & Kamber, 2001). The data preparation step takes 60% of all the time spent on a data mining project.

First of all, since data mining is a technique for processing large-scale datasets, data size is the first attribute to be considered. In this paper, we define data size by the number of years over which the data have accumulated. Many other attributes for measuring data quality are listed by Wang et al. (1994); they are shown in Fig. 1. Chen summarizes others' research and proposes a few dimensions to evaluate the quality of data (Chen, 2002). Ballou and Fazer propose four dimensions to evaluate data quality: accuracy, timeliness, completeness, and consistency. Although these dimensions were proposed for information systems, they are applicable to evaluating the quality of data for data mining. In the decision analysis of data mining, we add a data size attribute.

Fig. 1. A conceptual framework of data quality (Wang et al., 1994).
Accessibility data quality: accessibility; access security.
Representation data quality: interpretability; ease of understanding; representational consistency; concise.
Contextual data quality: value-added; relevancy; timeliness; completeness; appropriate amount of data.
Intrinsic data quality: believability; accuracy; objectivity; reputation.

The dimensions are as follows.

3.1. Accuracy

The recorded value is consistent with the actual value. The accuracy dimension is the most straightforward: it is the difference between the correct value and the one actually used.

3.2. Timeliness

The recorded value is not out-of-date. Any data item will become out-of-date as time passes.

3.3. Completeness

All the values of a certain variable are recorded. Completeness can be handled in a satisfactory manner. For example, a default value or an estimated value can be assigned to fill a missing value. However, such assigned values could affect the level of accuracy.

3.4. Consistency

The representation of the data value is consistent in all cases.

In this paper, we propose five dimensions, which are summarized in Table 1. The total score of the data quality is defined in Eq. (1):

$$S_{\text{quality}} = (s_{\text{accuracy}} + s_{\text{timeliness}} + s_{\text{completeness}} + s_{\text{consistency}} + s_{\text{years}})/5 \tag{1}$$

Table 1. The score table of data quality

| Evaluation dimension | Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
|---|---|---|---|---|---|
| Accuracy (%) | 0–20 | 20–40 | 40–60 | 60–80 | 80–100 |
| Timeliness | Worst | Bad | Somewhat | Good | Best |
| Completeness (number of missing values) | Lots of missing values | Many missing values | Several missing values | Few missing values | No missing values |
| Consistency | Very low | Somewhat low | Somewhat | Somewhat high | Very high |
| Data size | Not at all | Somewhat not suitable | Somewhat | Suitable | Very suitable |

4. Human and finance factors

Data mining projects cannot be independent of the strategy and condition of the company. Human factors play an important role: data mining needs experts to select the appropriate attributes for the model and to judge the interestingness and usefulness of the rules discovered. In addition, financial support is also important to success. Experts' evaluation of the human and finance factors of a company is therefore required. Sim (2003) reviewed the human factors:

Sponsor: influential, focused on business value, enthusiastic.
User group: owners of the business and evaluators of the project's success.
Business analyst: experienced in the domain and application.
Data analyst: experienced in exploratory data analysis (EDA) and data mining.
Data management specialist: experienced in database administration, has access to the physical data.
Project manager: experienced in project management.

Besides the listed factors, we believe a knowledge manager is also needed. The knowledge manager must be experienced in knowledge interpretation and knowledge implementation. The people listed above should be sufficiently trained in the process of data mining. The scoring of the above seven human factors is shown in Table 2. The total score is the average of the individual specialist scores:

$$S_{\text{human}} = (s_{\text{sponsor}} + s_{\text{user group}} + s_{\text{business analyst}} + s_{\text{data analyst}} + s_{\text{data management}} + s_{\text{project manager}})/6 \tag{2}$$
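The averaging in Eqs. (1) and (2) can be sketched in a few lines. The helper names below are hypothetical, the treatment of band boundaries in Table 1 (for example, whether exactly 40% scores 2 or 3) is an assumption, and the example scores are illustrative rather than prescribed by the method.

```python
from statistics import mean


def accuracy_to_score(accuracy_pct):
    """Map an accuracy percentage to the 1-5 score bands of Table 1.

    Assumption: a value on a band boundary (e.g. 40) falls in the upper band.
    """
    if not 0 <= accuracy_pct <= 100:
        raise ValueError("accuracy must be a percentage between 0 and 100")
    return min(int(accuracy_pct // 20) + 1, 5)


def average_score(scores):
    """Unweighted mean of 1-5 scores, as in Eqs. (1) and (2)."""
    values = list(scores.values())
    if any(not 1 <= v <= 5 for v in values):
        raise ValueError("each score must lie between 1 and 5")
    return mean(values)


# Illustrative data-quality assessment of a company (hypothetical inputs).
quality_scores = {
    "accuracy": accuracy_to_score(72),  # 60-80 % band -> score 4
    "timeliness": 4,
    "completeness": 3,
    "consistency": 5,
    "years": 4,
}
s_quality = average_score(quality_scores)
```

The same `average_score` helper applies unchanged to a dictionary of human-factor scores for Eq. (2).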

The finance score is given by experts in finance evaluation. All scores are integers; the maximum finance score is five and the minimum is one. We denote the finance score by $S_{\text{finance}}$.

5. The support of the top executives

Since many ERP projects have failed badly for lack of support from the top executives, data mining projects need strong executive support in order to progress smoothly. This support determines whether the project can obtain adequate resources (Tables 3 and 4). To estimate the degree of support, we design a score table (Table 3). First, each original score $v$, which ranges from 1 to 3, is normalized to the range 1 to 5:

$$v' = \frac{v - 1}{3 - 1}(5 - 1) + 1 = 2(v - 1) + 1 \tag{3}$$

$$S_{\text{support}} = \sum_{i=1}^{11} V'_i / 11 \tag{4}$$

The four factors above are integrated into $S_{\text{total}}$:

$$S_{\text{total}} = (S_{\text{quality}} + S_{\text{human}} + S_{\text{finance}} + S_{\text{support}})/4 \tag{5}$$

This score will be used in Section 6. Since data quality is the base of data mining, we judge the company unsuitable for launching a data mining project if the value of any data quality attribute is smaller than 3. Alternatively, if $S_{\text{total}} \ge 3$ and no data quality attribute is smaller than 3, we consider the company suitable for a data mining project. This is the observation X which will be used in Section 6.

6. Decision-making model

The foregoing evaluation is an objective analysis. We now take the risk attitude of the executive into account. The model is an application of Bayesian analysis, a mature approach in decision analysis. In this decision analysis, the decision-maker has two possible actions: doing data mining or not doing data mining. The action set is $A = \{a_1, a_2\}$, where $a_1$ means doing data mining and $a_2$ means not doing it. The set of states of nature is $N = \{h_1, h_2\}$, where $h_1$ means the company is actually suitable for data mining and $h_2$ means it is unsuitable. We assume that the decision-maker is rational and risk averse, and that the utility function is exponential, as in Eq. (6):

$$u(y) = -c\,b^{a} + c\,(y + b)^{a}, \qquad 0 < a < 1 \tag{6}$$

The loss function of the decision-maker is given in Eq. (7):

$$l(h, a) = \sup_{h \in N} \sup_{a \in A} u(h, a) - u(h, a) \tag{7}$$

Through a simple evaluation of the data quality and the support of the top executives of the enterprise, we can obtain the prior probabilities $p(h_1)$ and $p(h_2)$. The observation in this decision analysis is the evaluation of whether the organization is suitable for data mining, $X = \{x_1, x_2\}$, computed as described in Section 5: $x_1$ denotes that the evaluation result is suitable for data mining, and $x_2$ that it is unsuitable. Eq. (8) gives the Bayesian risk (Ting, 1987):

$$r(p, d^{p}) = \inf_{d \in D} \sum_{h_i \in N} \sum_{x_j \in X} l(h_i, d(x_j))\, f(x_j \mid h_i)\, p(h_i) \tag{8}$$
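As a rough illustration of Eq. (8), the sketch below enumerates every decision rule d that maps an observation to an action, evaluates the double sum for each rule, and keeps the rule with the smallest risk. The prior, likelihood, and loss values are hypothetical placeholders, and the function names are invented for this example.

```python
from itertools import product

STATES = ("h1", "h2")
OBSERVATIONS = ("x1", "x2")
ACTIONS = ("a1", "a2")

# All numbers below are hypothetical placeholders.
PRIOR = {"h1": 0.6, "h2": 0.4}                          # p(h_i)
LIKELIHOOD = {("x1", "h1"): 0.85, ("x2", "h1"): 0.15,   # f(x_j | h_i)
              ("x1", "h2"): 0.15, ("x2", "h2"): 0.85}
LOSS = {("h1", "a1"): 0.0, ("h1", "a2"): 0.5,           # l(h_i, a)
        ("h2", "a1"): 1.0, ("h2", "a2"): 0.0}


def bayes_risk(rule):
    """Double sum of Eq. (8) for one rule, given as a dict x_j -> action."""
    return sum(LOSS[(h, rule[x])] * LIKELIHOOD[(x, h)] * PRIOR[h]
               for h in STATES for x in OBSERVATIONS)


def best_rule():
    """Enumerate every rule d: X -> A and return the one with minimum risk."""
    rules = [dict(zip(OBSERVATIONS, choice))
             for choice in product(ACTIONS, repeat=len(OBSERVATIONS))]
    return min(rules, key=bayes_risk)


optimal = best_rule()
optimal_risk = bayes_risk(optimal)
```

With only two observations and two actions there are four candidate rules, so exhaustive enumeration is sufficient; larger problems would normally minimize the posterior expected loss separately for each observation instead.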

Table 2. The score table of human factor

| Human resource needed | Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
|---|---|---|---|---|---|
| Sponsor | Lack (a) | Weak | Somewhat | Good | Influential/enthusiastic |
| User group | Lack | Weak | Regular | Good | Experienced |
| Business analyst | Lack | Weak | Somewhat | Familiar | Experienced |
| Data analyst | Lack | Weak | Somewhat | Familiar | Experienced |
| Data management specialist | Lack | Weak | Somewhat | Familiar | Experienced |
| Project manager | Lack | Weak | Somewhat | Familiar | Experienced |
| Knowledge manager | Lack | Weak | Somewhat | Familiar | Experienced |

(a) There is a shortage of such specialists, or employees have no knowledge of data mining.

Table 3. The score table of the support from the executives

| Question | Score 1 | Score 2 | Score 3 |
|---|---|---|---|
| 1. How long have you known data mining (years) | <0.5 | 0.5–2 | >2 |
| 2. Do you know the aim of DM | Do not know | Somewhat | Quite clearly |
| 3. Do you think the analysis of the past data will improve your decision making | Do not think so | Somewhat | Yes |
| 4. Do you know the process of DM | Do not know | Somewhat | Quite clearly |
| 5. Could you accept if the preprocessing of the data cost more than 60% of the planned time | Not at all | Somewhat | Yes |
| 6. Do you think human is more important than algorithm and machine in DM | Don't think so | Somewhat | Yes |
| 7. If the DM project needs the confidential data of your company, would you provide it | Reject | Decide based on the condition | Yes |
| 8. If DM meets a huge setback, low accuracy for example, would you stop the project soon | Yes | Decide based on the condition | Go on supporting the project |
| 9. The management level of the project manager you intend to choose | Department level | Medium level | Top level |
| 10. Would you mind paying a lot for the DM software | Yes | Neutral | Not at all |
| 11. Would you support the payment for the DM after the project | No | Neutral | Yes |

Table 4. Score of the support from the executives

| Attribute | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Original | 2 | 3 | 3 | 1 | 2 | 1 | 1 | 3 | 2 | 3 | 3 |
| Normalized | 3 | 5 | 5 | 1 | 3 | 1 | 1 | 5 | 3 | 5 | 5 |
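The normalization of Eq. (3) and the averages of Eqs. (4) and (5) can be sketched as follows, using the original answers from Table 4 as input. The function names are hypothetical, and the suitability rule is read here as "total score at least 3 and no data-quality attribute below 3", which is one interpretation of the rule stated in Section 5.

```python
def normalize(v, low=1, high=3, new_low=1, new_high=5):
    """Rescale an answer from [low, high] to [new_low, new_high], as in Eq. (3)."""
    return (v - low) / (high - low) * (new_high - new_low) + new_low


def support_score(original_scores):
    """Mean of the normalized answers, as in Eq. (4)."""
    normalized = [normalize(v) for v in original_scores]
    return sum(normalized) / len(normalized)


def total_score(s_quality, s_human, s_finance, s_support):
    """Unweighted mean of the four factor scores, as in Eq. (5)."""
    return (s_quality + s_human + s_finance + s_support) / 4


def is_suitable(s_total, quality_attributes, threshold=3):
    """Observation x1 (suitable) if the total score and every data-quality
    attribute reach the threshold; otherwise x2 (unsuitable).
    This reading of the rule is an assumption."""
    return s_total >= threshold and min(quality_attributes) >= threshold


# Original answers to the 11 questions, copied from Table 4.
original = [2, 3, 3, 1, 2, 1, 1, 3, 2, 3, 3]
s_support = support_score(original)
```

Applied to the Table 4 answers, `normalize` reproduces the normalized row of the table and `s_support` evaluates to 37/11.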

Let $p_{ij} = f(x_j \mid h_i)$. Owing to the page limitation, we do not introduce $f(x_j \mid h_i)$ in detail. Taking $p_{11}$ as an example, $p_{11}$ is the probability that a company which is actually suitable for data mining is evaluated as suitable. The posterior probability is

$$p(h_i \mid x_j) = \frac{f(x_j \mid h_i)\, p(h_i)}{\sum_{i} f(x_j \mid h_i)\, p(h_i)} \tag{9}$$

In this project,

$$m(x_1) = p_{11}\, p(h_1) + p_{21}\, p(h_2), \qquad m(x_2) = p_{12}\, p(h_1) + p_{22}\, p(h_2) \tag{10}$$

$$p(h_1 \mid x_1) = \frac{p_{11}\, p(h_1)}{m(x_1)}, \qquad p(h_1 \mid x_2) = \frac{p_{12}\, p(h_1)}{m(x_2)} \tag{11}$$

$$p(h_2 \mid x_1) = \frac{p_{21}\, p(h_2)}{m(x_1)}, \qquad p(h_2 \mid x_2) = \frac{p_{22}\, p(h_2)}{m(x_2)} \tag{12}$$

Denote the decision-making criterion by $d(x)$. If the evaluation result is suitable ($x_1$), the company decides to do data mining ($a_1$); if the evaluation result is not suitable ($x_2$), the company decides not to do data mining ($a_2$). The decision-making criterion is as follows:

$$a_1 = d_1(x_1), \qquad a_2 = d_1(x_2) \tag{13}$$

Then the expected loss (Dyer & Sarin, 1982) is given in Eqs. (14) and (15):

$$E^{h}\, l(H, d_1(x_1)) = l(h_1, d_1(x_1))\, p(h_1 \mid x_1) + l(h_2, d_1(x_1))\, p(h_2 \mid x_1) \tag{14}$$

$$E^{h}\, l(H, d_1(x_2)) = l(h_1, d_1(x_2))\, p(h_1 \mid x_2) + l(h_2, d_1(x_2))\, p(h_2 \mid x_2) \tag{15}$$

The Bayesian risk (Dyer & Sarin, 1982) is given in Eq. (16):

$$r(p, d_1) = E^{h}\, l(H, d_1(x_1))\, m(x_1) + E^{h}\, l(H, d_1(x_2))\, m(x_2) \tag{16}$$

In the same way, we can compute the Bayesian risk of the other decision-making criterion $d_2(x)$. Comparing the two values, we choose the criterion with the lower risk. The executives can use this model to help them make the decision. If the chosen criterion is $d_1(x)$, the company does data mining when the evaluation result is suitable and does not when it is unsuitable. If the chosen criterion is $d_2(x)$, the company's choice is the opposite of $d_1(x)$: it does data mining when the evaluation result is unsuitable.

7. Case study

To demonstrate how the decision-making process works, this section gives a simple example. Assume N is an Internet company in China whose executive would like to decide whether to start a data mining project to retain customers. Following the process described above, the data quality, the human and finance factors, and the support of the executive must be evaluated first. The results of the evaluation are given below.

Data quality:

$$s_{\text{accuracy}} = 4, \quad s_{\text{timeliness}} = 4, \quad s_{\text{completeness}} = 3, \quad s_{\text{consistency}} = 5, \quad s_{\text{years}} = 4$$

$$S_{\text{quality}} = (4 + 4 + 3 + 5 + 4)/5 = 4$$

Factor of human:

$$s_{\text{sponsor}} = 4, \quad s_{\text{user group}} = 2, \quad s_{\text{business analyst}} = 4, \quad s_{\text{data analyst}} = 3, \quad s_{\text{data management}} = 5, \quad s_{\text{project manager}} = 5$$

$$S_{\text{human}} = (4 + 1 + 4 + 3 + 5 + 4)/6 = 3.5$$

Factor of finance: $S_{\text{finance}} = 4$.

Support of the executives (Table 4):

$$S_{\text{support}} = \sum_{i=1}^{11} V'_i / 11 = 37/11 = 3.36$$

$$S_{\text{total}} = (S_{\text{quality}} + S_{\text{human}} + S_{\text{finance}} + S_{\text{support}})/4 = (4 + 3.5 + 4 + 3.36)/4 = 14.86/4 = 3.715 \ge 3$$

Since $S_{\text{total}} \ge 3$ and no data quality attribute is smaller than 3, the result of the evaluation is that N is suitable to launch the data mining project. We assume a prior probability of 0.6 that N is suitable for data mining, i.e. $p(h_1) = 0.6$ and $p(h_2) = 0.4$. If the accuracy of the evaluation is 0.85, then

$$f(x_1 \mid h_1) = 0.85, \quad f(x_1 \mid h_2) = 0.15, \quad f(x_2 \mid h_2) = 0.85, \quad f(x_2 \mid h_1) = 0.15$$


Table 5. The yield of N

| Action | Status: actually suitable | Status: actually unsuitable |
|---|---|---|
| Do data mining project | $110,000 | $0 |
| Not do data mining project | $70,000 | $30,000 |
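Given the prior and the likelihoods of the case study, the marginals and posteriors of Eqs. (9)–(12) can be evaluated with a short sketch. The dictionary layout and function names are hypothetical; only the numerical inputs come from the case study.

```python
PRIOR = {"h1": 0.6, "h2": 0.4}                          # p(h_i)
LIKELIHOOD = {("x1", "h1"): 0.85, ("x2", "h1"): 0.15,   # f(x_j | h_i)
              ("x1", "h2"): 0.15, ("x2", "h2"): 0.85}


def marginal(x):
    """m(x_j) = sum_i f(x_j | h_i) p(h_i), as in Eq. (10)."""
    return sum(LIKELIHOOD[(x, h)] * p for h, p in PRIOR.items())


def posterior(h, x):
    """p(h_i | x_j) by Bayes' rule, as in Eqs. (9), (11) and (12)."""
    return LIKELIHOOD[(x, h)] * PRIOR[h] / marginal(x)


# Print the marginal and the rounded posteriors for each observation.
for x in ("x1", "x2"):
    print(x, round(marginal(x), 2),
          {h: round(posterior(h, x), 2) for h in PRIOR})
```

Rounded to two decimals, these inputs give marginals of 0.57 and 0.43 and posteriors of 0.89 and 0.11 for x1 and 0.21 and 0.79 for x2.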

According to Eqs. (10)–(12), $m(x_1) = 0.57$ and $m(x_2) = 0.43$, and

$$p(h_1 \mid x_1) = 0.89, \quad p(h_1 \mid x_2) = 0.21, \quad p(h_2 \mid x_1) = 0.11, \quad p(h_2 \mid x_2) = 0.79$$

Based on our analysis, the company will get $110,000 (y) if it chooses to do data mining and is actually suitable for it (h = suitable); $0 if it chooses to do data mining but is actually unsuitable (h = unsuitable); $70,000 if it chooses not to do data mining but is actually suitable (h = suitable); and $30,000 if it chooses not to do data mining and is actually unsuitable (h = unsuitable). These yields are shown in Table 5. Assume that the utility function of the executive of N is

$$u(y) = \frac{1}{0.936}\left(1 - e^{-0.025y}\right)$$

so that the loss function is

$$l(y) = 1 - u(y) = \frac{1}{0.936}\left(e^{-0.025y} - 0.064\right)$$

According to Eqs. (14) and (15), $E^{h}\, l(H, d_1(x_1)) = 0.401$ and $E^{h}\, l(H, d_1(x_2)) = 0.21$, so the Bayesian risk of $d_1(x)$ is $r(p, d_1) = 0.401 \times 0.57 + 0.21 \times 0.43 = 0.319$; the Bayesian risk of $d_2(x)$, computed in the same way, is 0.585. Since $r(p, d_1) < r(p, d_2)$, criterion $d_1$ should be accepted. This result means that the executive should decide to start the data mining project if the evaluation result is suitable (Ting, 1987).

8. Conclusion

With the development of data mining, many companies are now hesitating over whether to use data mining analysis in their business decisions. In this paper, we discussed the factors that must be evaluated before the top manager of a company decides whether to start data mining. We proposed four important factors: data quality, human factors, finance budget, and support of the executives. After a preliminary evaluation of these conditions, we obtain the observation value X. Based on deeper research or on observation of the mentioned factors, we can compute the posterior probability. With the prior and posterior probabilities, we applied Bayesian analysis to obtain the decision criterion with the lowest Bayesian risk.
With the model and the evaluation, the manager can decide whether or not the company or organization should use data mining analysis in its decisions.

9. Further work

This paper gives only a purely parametric model. Determining the parameters of the formulas is an important task for the future. We only specify what the observation X is; we did not study the specific form of $f(x_j \mid h_i)$. Further work will also provide a case to test the model, which is likewise related to the specific form of the formula.

Acknowledgements

This research has been partially supported by a grant from National Natural Science Foundation of China (#70621001, #70531040, #70501030, #70472074), Beijing Natural Science Foundation (#9073020), 973 Project #2004CB720103, Ministry of Science and Technology, China, and BHP Billiton Co., Australia.

References

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In ACM SIGMOD conference (pp. 254–259), Washington, DC, USA.
Agrawal, R., Imielinski, T., & Swami, A. (1993a). Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5, 914–925.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. New York: Springer.
Chen, C.-Y. (2002). A framework for optimizing data quality given limited resources. Arizona State University.
Chen, Y.-L. et al. (2007). A novel approach for discovering retail knowledge with price information from transaction databases. Expert Systems with Applications. doi:10.1016/j.eswa.2007.03.006.
Chen, M.-C., & Lin, C.-P. (2007). Expert Systems with Applications, 32, 976–986.
Chen, S. Y., & Liu, X. (2005). Data mining from 1994 to 2004: An application-orientated review. International Journal of Business Intelligence and Data Mining, 1(1), 4–21.
Chen, G., Liu, H., Yu, L., Wei, Q., & Zhang, X. (2006). Decision Support Systems, 42, 674–689.
Diederich, J. et al. (2007). Ex-ray: Data mining and mental health. Applied Soft Computing, 7, 923–928.
Dubes, R., & Jain, A. K. (1980). Clustering methodologies in exploratory data analysis. In M. C. Yovits (Ed.), Advances in computers (Vol. 19). New York: Academic Press.
Duda, R., & Hart, P. E. (1973). Pattern classification and scene analysis. Wiley.
Dyer, J. S., & Sarin, R. K. (1982). Relative risk aversion. Management Science.
Ekins, S. et al. (2006). Application of data mining approaches to drug delivery. Advanced Drug Delivery Reviews, 58, 1409–1430.
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD conference (pp. 1–12).
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann Publishers, Inc.
Hope, L. R., & Korb, K. B. (2004). A Bayesian metric for evaluating machine learning. In AI 2004, LNAI 3339 (pp. 991–997).
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3).
Kirkos, E. et al. (2007). Data mining techniques for the detection of fraudulent financial statements. Expert Systems with Applications, 32, 995–1003.
Lavrac, N., Bohanec, M., Pur, A., Cestnik, B., Debeljak, M., & Kobler, A. (2007). Data mining and visualization for decision support and modeling of public healthcare resources. Journal of Biomedical Informatics, 40, 438–447.
Lee, R. C. T. (1981). Clustering analysis and its applications. In J. T. Tou (Ed.), Advances in information systems science (Vol. 8, pp. 169–292). New York: Plenum Press.
Lee, C.-I. et al. (2007). An approach to mining the multi-relational imbalanced database. Expert Systems with Applications. doi:10.1016/j.eswa.2007.05.048.
Liao, S.-H. (2003). Knowledge management technologies and applications – literature review from 1995 to 2002. Expert Systems with Applications, 25, 155–164.
Maran, U., Sild, S., Kahn, I., & Takkis, K. (2007). Mining of the chemical information in GRID environment. Future Generation Computer Systems, 23, 76–83.
Moreno García, M. N. et al. (2004). Building knowledge discovery-driven models for decision support in project management. Decision Support Systems, 38, 305–317.
Moreno García, M. N. et al. (2008). Association rule mining method for estimating the impact of project management policies on software quality, development time and effort. Expert Systems with Applications, 34, 522–529.
Murtagh, F. (1983). A survey of recent advances in hierarchical clustering algorithms. The Computer Journal.
Peter, C., James, K., Matthew, S., et al. (1988). AutoClass: A Bayesian classification system. In Proceedings of the fifth international conference on machine learning. Morgan Kaufmann, June.
Raymond, T. Ng, & Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In Proceedings of the VLDB.
Shah, S., & Kusiak, A. (2007). Cancer gene search with data-mining and genetic algorithms. Computers in Biology and Medicine, 37, 251–261.
Sim, J. (2003). Critical success factors in data mining projects. PhD thesis, University of North Texas.
Srikant, R., & Agrawal, R. (1997). Mining generalized association rules. Future Generation Computer Systems, 13(2–3), 161–180.
Strobel, C. M., & Hrycej, T. (2006). A data mining approach to the joint evaluation of field and manufacturing data in automotive industry. In PKDD 2006, LNAI 4213 (pp. 625–632).
Sun, W., Chen, J., & Li, J. (2007). Decision tree and PCA-based fault diagnosis of rotating machinery. Mechanical Systems and Signal Processing, 21, 1300–1317.
Ting, C. (1987). Decision analysis (in Chinese).
Wang, R., Strong, D., & Guarascio, L. (1994). Beyond accuracy: What data quality means to data consumers. Total Data Quality Management Research Program, 42.
Wu, H. (1999). Choice of techniques in data mining: Three experiments in applications. Master's thesis, College of Arts and Sciences, American University.
Ying, L. (2005). High performance data mining techniques for large databases. PhD thesis, Computer Science, Northwestern University.
