Integrating in-process software defect prediction with association mining to discover defect pattern

Information and Software Technology 51 (2009) 375–384
Ching-Pao Chang a,*, Chih-Ping Chu a, Yu-Fang Yeh b

a Department of Computer Science and Information Engineering, National Cheng-Kung University, No. 1, Ta-Hsueh Road, Tainan 701, Taiwan
b Department of Business Education, National Changhua University of Education, Changhua 500, Taiwan

Article info

Article history: Received 29 April 2006; Received in revised form 24 April 2008; Accepted 29 April 2008; Available online 4 May 2008

Keywords: Software defect prediction; Association rule; Multi-interval discretization

Abstract

Rather than detecting defects at an early stage to reduce their impact, defect prevention means that defects are prevented from occurring in advance. Causal analysis is a common approach to discovering the causes of defects and taking corrective actions. However, selecting defects to analyze among large amounts of reported defects is time consuming and requires significant effort. To address this problem, this study proposes a defect prediction approach in which the reported defects and performed actions are utilized to discover the patterns of actions that are likely to cause defects. The proposed approach is adapted from Action-Based Defect Prediction (ABDP), an approach that uses decision tree classification to build a prediction model, and performs association rule mining on the records of actions and defects. An action is defined as a basic operation used to perform a software project, while a defect is defined as a software flaw that can arise at any stage of the software process. Association rule mining finds the maximum rule set with specified minimum support and confidence, so the discovered knowledge can be utilized to interpret the prediction models and software process behaviors. The discovered patterns can then be applied to predict the defects generated by subsequent actions, so that necessary corrective actions can be taken to avoid defects. The proposed defect prediction approach applies association rule mining to discover defect patterns, and multi-interval discretization to handle the continuous attributes of actions. The approach is applied to a business project, giving excellent prediction results and revealing its efficiency. The main benefit of this approach is that the discovered defect patterns can be used to evaluate subsequent actions for in-process projects, and to reduce the variance of reported data resulting from different projects. Additionally, the discovered patterns can be used in causal analysis to identify the causes of defects for software process improvement.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Defect prevention focuses mainly on preventing defects from occurring, and has been applied to improve software quality and productivity in many organizations, in which causal analysis is the major activity used to identify the causes of reported defects [1,2]. The main challenge is that causal analysis attempts to identify the root cause of defects among a broad range of possible causes. To facilitate the analysis process, many tools can be utilized, such as cause–effect diagrams and control charts. Orthogonal defect classification (ODC) is commonly used to select analysis items, where the reported defects are grouped and weighted according to the classification schema [3]. The difficulty with applying this approach is that reported defects may fall into different categories, and the defect pattern is hard to investigate when the defect classification schema is complicated [4], resulting in significant effort in identifying the actual cause.

To reduce the difficulty of causal analysis, the reported defects can be used to build defect models, which can then be utilized to locate modules that may cause defects. Classification tree models can be applied to multiple releases of software products to build prediction models, which can be used to locate the modules that are likely to cause defects [5]. Clustering techniques can be used to group software modules according to the metrics of the modules, so that faulty modules with similar attributes are grouped together; these groups can be utilized to predict the faulty modules [6]. In addition to the reported defects, the change history of modules can also help predict the number of defects generated by a specific module in the future [7]. Although these approaches provide the locations (modules) where defects may occur, they do not indicate how and when the defects occur. Defects are hard to identify in a module with a very large number of operations, since corrective actions cannot be taken on every operation. Action-Based Defect Prediction


(ABDP) performs classification mining on the records of defects and actions to construct prediction models [8]. The ABDP approach includes a set of attributes to describe the actions within a software process, where an action is defined as an operation performed as part of a task in the Work Breakdown Structure (WBS) of the project, while a defect is defined as a software flaw, such as an error in documents, a bug in code or designs, a software system failure, or another problem in a process causing software failure. The defects defined here can arise at any stage of the software process. The main limitation of using classification mining is that many useful rules for describing the behavior of the software process may be lost. To enhance ABDP, this study proposes an Association Rule based Defect Prediction (ARDP) approach to predict subsequent defects. ARDP applies classification-based association rule mining [9,10] to data sets of performed actions to discover rule sets that may cause large numbers of defects, and uses the rule sets to predict whether subsequent actions are likely to generate high defects. Although classification mining can be applied to discover a small set of rules, association rule mining finds all rules with support greater than a predetermined threshold, and can be utilized to describe the relationship between the attributes of actions and the generated defects [11]. Two major problems in classification mining can be solved by using association rule mining. First, a useful rule set can be obtained by applying different levels of minimum support, and under-sampling can be utilized to achieve excellent prediction results [12]. Second, the unrelated attributes problem, which may affect the accuracy of prediction and is usually addressed using feature subset selection techniques, can be addressed using a proper minimum confidence, where only large itemsets of attributes are discovered by association rule mining [13]. ARDP applies a bottom-up interval merging technique to the continuous attributes of actions to improve prediction accuracy [14]. Once actions with a high probability of causing defects are identified, the project manager can review these actions carefully and take appropriate corrective actions. Newly performed actions are continually appended to the historic data to build a new rule set for subsequent actions. This iterative process forms the in-process software defect prediction, which can reduce the variance of data from different projects. To demonstrate the performance of ARDP, this study applies it to a data set from a business project developing an Attendance Management System for the Customs Office of the Ministry of Finance of Taiwan (AMS-COMFT) [8]. The rest of this paper is organized as follows. Section 2 gives an overview of related work on defect prediction. Section 3 describes the technologies of the ARDP approach, while Section 4 discusses the data set to be analyzed and presents the analytical results and discussion. Finally, Section 5 draws conclusions.

2. Related work

2.1. Software defect prediction

To achieve the aims of a project, a set of tasks needs to be planned in advance, where the planned tasks constitute the work breakdown structure (WBS) of the entire project. The software process can be treated as a sequence of steps performed to finish these tasks [15], while the execution of the process can be regarded as a series of actual actions performed by developers to finish all tasks of the project. This study defines defects as software flaws, such as errors in documents, bugs in code and designs, software system failures and other problems in processes causing software failure. The defect is an important factor affecting the performance of software process execution [16].

Prediction models that estimate the number of defects of software products (or components) are applied with two objectives: defect detection and defect prevention. In defect detection, the prediction models are utilized to predict the number of defects remaining in completed software products (or to identify the modules that contain defects) to facilitate the defect detection process. For instance, the capture–recapture model [17], defect profiling [18] and defect detection time [19] can be applied to the data accumulated from the inspection process to estimate the number of defects. Neural networks can be used with software metrics (such as object-oriented metrics) to predict the number of defects and evaluate the quality of software products [20,21]. Regression techniques can be applied to the change history of software products to accurately predict the number of faults in modules [22,23], or to test cases to estimate the number of defects that can be detected [24]. Clustering can be applied to data specified by experts according to the complexity metrics of software products, to separate fault-prone modules from non-fault-prone modules [6]. However, software defects may occur at all stages of the software process, such as the requirements development, design, coding and support stages. Moreover, some modules are hard to express as single components.

Instead of detecting the number of defects that have been injected into software products, defect prevention tries to predict possible defects in advance. The prediction results can be utilized to plan corrective actions to avoid possible defects. For example, classification tree models can be applied to data collected from multiple releases of software products to build a prediction model for defects generated in subsequent releases, in which the metrics used to collect data include call graph metrics, control flow metrics, statement metrics, software process metrics and software execution metrics [5]. A Bayesian Net (BN) can be applied in each development lifecycle to predict the number of defects of software products to facilitate software testing [25]. Besides identifying the components that may generate defects, Action-Based Defect Prediction (ABDP) applies classification tree techniques to the records of performed actions to identify subsequent actions that are likely to cause defects [8].

Association rule mining and classification tree are two common data mining approaches that can be used to discover knowledge of the software process. Both approaches involve data selection, data preprocessing, data mining and assimilation of results [26]. ABDP uses the C4.5 algorithm to construct decision trees based on the gain ratio criterion, in which the training data are split into two groups and each attribute is measured according to normalized Information Entropy [27]. Several issues must be considered when applying decision trees to build prediction models. First, classification mining is used to obtain the minimum set of rules, while association rule mining finds the maximum rule set with specified minimum support and confidence; thus, many useful rules for describing the behavior of the software defects may be lost when using the classification approach. Second, an imbalanced data set needs additional processing, such as under-sampling or over-sampling, when constructing decision trees. Furthermore, multi-interval discretization can provide more accurate prediction than entropy-based binary discretization in most cases for continuous attributes in C4.5. The following subsections describe these enhancements of ABDP.

2.2. Software defect patterns

The first advantage of using association rule mining is that the mined rule set can be used to describe software defect behaviors. To facilitate the description of defect behaviors, this study defines defect patterns as a set of attribute values that can be used to describe and predict the occurrence of defects. Defect patterns can


be derived by applying statistical models to products with defects. The defect patterns of products consisting of program code are similar to bug patterns, which are defined as sections of code that may cause errors [28]. To determine whether a product contains defects, the attributes of the products and activities are compared with defect patterns. The attributes used to measure work products include size, complexity, effort and change history [29–31]. However, the aim of defect prediction in this study is to detect products that are likely to generate defects, rather than to find the products with defects. Therefore, the attributes of the actions performed on work products are measured to obtain specific defect patterns for prediction. Many approaches can be used to record the execution of the software process: for example, TychoMetrics uses the Measurement Object Model to define a measurement schema [32], and the Task Element Decomposition (TED) divides tasks into operations based on their size and complexity [33]. This study uses the same attributes to record actions and reported defects as ABDP [8].

2.3. Association rule mining

The main aim of association rule mining is to identify interesting associations between items in the collected data set. The raw data set needs to be transformed into a format that can be recognized by the mining engine. Transactional and relational data are the two major data formats. Association rule mining techniques are usually used to discover association rules from a transactional data set, while classification mining can be used to build decision tree models from a relational data set [34]. Classification-Based Association rule mining (CBA) integrates classification and association rule mining techniques to discover association rules from relational data sets [35]. Fig. 1 shows an example of association rule mining on a relational data set. The attributes are categorized as antecedent attributes and subsequent attributes. The antecedent attributes can be collected before executing the action, and can be used to predict the subsequent attributes. Here the antecedent attributes are the Action Complexity (denoted as C) and Developer Experience (denoted as E), while the subsequent attribute is the number of defects (denoted as D). To identify the rules that cause high defects (D=H), the support and confidence are calculated as in Fig. 1. The rule "C=H and E=L implies D=H" indicates that assigning low-experience developers to high-complexity actions may cause high defects with 67% confidence. The identified rule set can be used to help understand the causes of software defects.

Another problem with using classification mining in the software process is the rarity problem, which occurs when the number of actions causing defects is very small [36]. Under-sampling can be used to address rarity problems. Tree pruning or feature subset selection can be used to address the over-fitting problem in classification mining [37,38]. For association rule mining, a small minimum support may cause too many rules to be generated, resulting in over-fitting; this problem can be resolved by pruning redundant rules [39].

No    Action Complexity    Developer Experience    Number of Defects
1     H                    L                       L
2     M                    L                       H
3     H                    L                       H
4     H                    L                       H
5     L                    H                       L
6     L                    L                       M

Support(C=H and E=L) = 3/6
Support(C=H and E=L and D=H) = 2/6 = 0.33
Confidence(C=H and E=L → D=H) = (2/6)/(3/6) = 2/3 = 0.67

Fig. 1. An example of association rule mining.
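As a minimal runnable sketch of the computation in Fig. 1 (the data set is taken from the figure; the class and method names are illustrative, not part of ARDP), the following Java fragment counts matching transactions to obtain the support and confidence of a rule:

import java.util.List;
import java.util.Map;

public class RuleMetrics {
    // One transaction maps attribute ids to values, e.g. {C=H, E=L, D=H}.
    record Transaction(Map<String, String> values) {
        boolean matches(Map<String, String> itemset) {
            return itemset.entrySet().stream()
                .allMatch(e -> e.getValue().equals(values.get(e.getKey())));
        }
    }

    // support(X) = |transactions containing X| / |all transactions|
    static double support(List<Transaction> data, Map<String, String> itemset) {
        long hits = data.stream().filter(t -> t.matches(itemset)).count();
        return (double) hits / data.size();
    }

    // confidence(X -> Y) = support(X and Y) / support(X)
    static double confidence(List<Transaction> data,
                             Map<String, String> antecedent,
                             Map<String, String> consequent) {
        Map<String, String> both = new java.util.HashMap<>(antecedent);
        both.putAll(consequent);
        return support(data, both) / support(data, antecedent);
    }

    public static void main(String[] args) {
        // The six actions of Fig. 1.
        List<Transaction> data = List.of(
            new Transaction(Map.of("C", "H", "E", "L", "D", "L")),
            new Transaction(Map.of("C", "M", "E", "L", "D", "H")),
            new Transaction(Map.of("C", "H", "E", "L", "D", "H")),
            new Transaction(Map.of("C", "H", "E", "L", "D", "H")),
            new Transaction(Map.of("C", "L", "E", "H", "D", "L")),
            new Transaction(Map.of("C", "L", "E", "L", "D", "M")));
        // Prints 0.666..., the 67% confidence of C=H and E=L -> D=H.
        System.out.println(confidence(data,
            Map.of("C", "H", "E", "L"), Map.of("D", "H")));
    }
}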


Both classification and association rule mining must handle continuous attributes. Discretization algorithms for continuous attributes can be divided into two categories: top-down algorithms apply binary partitioning recursively to the whole interval [40], while bottom-up algorithms consider each value of an input attribute as a single interval, and use merge criteria to select adjacent interval pairs for merging [41]. The first challenge of using bottom-up interval merging is choosing the best interval pairs to merge. To solve this problem, ChiMerge uses the χ² values to rank the interval pairs [42], while StatDisc uses the Φ value to rank the interval pairs [43]. InfoMerge uses the entropy of intervals to determine the information loss of interval pairs by group, where the loss value is calculated by subtracting the information gain of the interval pairs before merging from the information gain of the merged interval; the groups of interval pairs with the minimum loss value are chosen for merging. The number of intervals of a continuous attribute may affect the support and confidence of the discovered rules, a dilemma described as the catch-22 problem. To determine the optimal number of partitions of numeric attributes, Srikant and Agrawal define a partial completeness level over itemsets to compute the optimal number of intervals of continuous attributes [44]. A predetermined threshold can also be used to limit the number of merged intervals in some applications. Another problem with interval merging is that merging an interval affects the measurement of certain other intervals, meaning that the measures of interval pairs must be recalculated and re-ranked. To address this problem, Wang et al. used M-Trees to index the interval pairs, using a goodness chain to rank interval pairs by loss values, and an adjacent chain to index the adjacent interval pairs; their method only requires recalculation for interval pairs that are adjacent to the merged interval pairs [45].

A rule may contain both category and numeric attributes, which can be processed in different phases. For example, the Action Complexity (as depicted in Fig. 1) is a category attribute, while the Expected Effort (not shown) used to execute the action is a numeric attribute. If an itemset is large (the support of the itemset is greater than the minimum support), then its category and continuous parts are also large. For instance, if C ∪ N is a large itemset, then N is also large, where "∪" denotes the combination of two sets of attributes, N denotes the set of all numeric attributes and C denotes the set of category attributes. In real applications, only specific goal classes (such as high-defect actions) are of interest, and C → H can be used as a template for the mining process to reduce the number of large itemsets generated [46]. Although the number of rules can be significantly reduced by using a template with a bounded number of merged intervals of continuous attributes, it may still be large, making the rule set difficult for an analyst to interpret. To overcome this problem, rule pruning can be utilized to eliminate redundant rules and reduce the number of rules discovered. For example, let A and B denote sets of attributes, and let C denote the goal class. Define the strength of a rule as its confidence, and call two rules similar if the difference in their strengths is less than a given threshold ε. If the strengths of the two rules A → C and A ∪ B → C are similar, then A ∪ B → C is redundant.
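A minimal Java sketch of this strength-based pruning (Rule and its fields are illustrative names; the threshold corresponds to the ε above):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class RulePruner {
    // Antecedent items such as "5:(4;8]" or "3='0'"; the consequent is
    // always the goal class {H}, so it is omitted from the record.
    record Rule(Set<String> antecedent, double confidence) {}

    // Keep a rule A ∪ B -> H only if its confidence differs by at least
    // epsilon from every kept rule A -> H with a more general antecedent.
    static List<Rule> pruneRedundant(List<Rule> rules, double epsilon) {
        List<Rule> sorted = new ArrayList<>(rules);
        // Shorter (more general) antecedents first, so generals are kept.
        sorted.sort((a, b) -> a.antecedent().size() - b.antecedent().size());
        List<Rule> kept = new ArrayList<>();
        for (Rule r : sorted) {
            boolean redundant = kept.stream().anyMatch(g ->
                r.antecedent().containsAll(g.antecedent())
                && Math.abs(r.confidence() - g.confidence()) < epsilon);
            if (!redundant) kept.add(r);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
            new Rule(Set.of("5:(4;8]", "3='0'"), 0.45),
            new Rule(Set.of("5:(4;8]", "2='-'", "3='0'"), 0.48));
        // The second rule is redundant for epsilon = 0.1 and is pruned.
        System.out.println(pruneRedundant(rules, 0.1));
    }
}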

3. The Association Rule Based Defect Prediction

The proposed ARDP approach is adapted from the Action-Based Defect Prediction (ABDP) approach and contains four components, namely Action Collection, Action Prediction, Defect Reporting and


Rule Generation. The interactions between these components are illustrated in Fig. 2 and described in the following subsections.

3.1. The Action Collection and Defect Reporting

Action Collection is used to manage an action and its related information, such as information on projects, tasks, resources and products, while Defect Reporting captures the reported defects produced by actions performed throughout the software process. The defects can be detected by developers or users at any stage. This study defines a defect as an abnormality of a software product under Configuration Management (CM) control; bugs and errors encountered while executing an action (when the work product is still being worked on) are not considered defects. The configuration management process is beyond the scope of this study, and is not discussed in further detail. Before being removed, a defect detected in a configuration-managed product has to be reported to the Defect Reporting component, where it is analyzed to discover factors such as the action causing the defect, the impact of correcting it and the estimated effort involved in addressing it. The action-based model used to describe actions and defects is given in Tables 1 and 2 of [8]. This study predicts the effects of an action, such as the number of defects generated by the action. To build the prediction rule set, Action Collection needs to generate a relational data set based on the schema of the rule engine. The collected records of actions and defects are transformed into a data set according to the features defined in Table 3 of [8]. This data set groups the features into antecedent features (features 1–20) and a subsequent feature (feature 21). The subsequent feature can be treated as the effect of an action (the number of generated defects), and is not known before the action is completed. The antecedent features can be input and used to generate a rule set, which can be utilized to predict the subsequent feature of an action before execution. The evaluation results are returned to the person who submits the action for modification. The corrective actions may depend on the project team's defect prevention process.

3.2. Rule Generation

Rule Generation builds a rule set using the relational data set generated by Action Collection. A rule for the high-defect pattern can be expressed as P → C, where P = {pi | pi is a subset of attribute values of an action} and C = {L, H} is the class. The format of the elements of P is ID = 'value', where ID represents the identification of the attribute, and value denotes the value of the attribute. An example of such a rule is shown as follows.

{2 = 'N', 4 = '10', 5 = '3', 8 = '3'} → {H}    (1)

Fig. 2. The main components of the ARDP: Action Collection manages the actions to be predicted and generates the data set for analysis; Defect Reporting records detected, analyzed and fixed defects; Rule Generation builds the prediction models; and Action Prediction returns the predicted results, which can be used for causal analysis, defect prevention and process improvement.

According to the attribute IDs listed in Table 3 of [8], Attribute 2 denotes the Action_Type with the value 'new'; Attribute 4 represents the Action_Complexity with the value 'high'; Attribute 5 represents the Object_Type with the value 'Application'; and Attribute 8 represents the Action_Target with the value 'DD' (Detailed Design). The goal class value in Eq. (1) is H (high), which means that a highly complex action creating a module in the detailed design stage may cause high defects. Moreover, any submitted action with the attribute values shown in Eq. (1) is predicted as a high-defect action. Eq. (1) contains only categorical attributes. To extend Eq. (1) to represent numeric attributes, the attribute format can be extended to the form ID:[value1;value2], where ID denotes the identification of the numeric attribute; value1 and value2 denote the value range of the attribute; '[' and ']' indicate that the endpoints are included, and '(' and ')' indicate that the endpoints are excluded. An example of a rule containing numeric attributes is shown as Eq. (2), where Attribute 6 denotes the Effort_Expected, and Attribute 9 denotes the number of objects.

{2 = 'N', 5 = '3', 8 = '3', 6:(5;8], 9:(0;1]} → {H}    (2)

Eqs. (1) and (2) indicate that a submitted action matching certain defect patterns may cause too many defects. The next subsection describes in detail the procedure used to evaluate a submitted action. The procedure used to generate the rule set consists of the following steps. The first step specifies the thresholds and generates the set of category attributes and the set of continuous attributes. The specified thresholds include the subsequent feature (high-defect), the minimum support (minsupp), the minimum confidence (minconf), the maximum number of intervals of continuous attributes (MaxAttrNumber) and the maximum number of large itemsets (MaxGeneratedItems) generated for category and numeric attributes. The second step generates the large itemsets of category attributes using the classification-based association mining algorithm with the goal class specified in Step 1; the generated large itemset of the category attributes is denoted as the variable citemset. The third step generates the intervals for each continuous attribute, where the intervals do not overlap. Fig. 3 presents the algorithm for finding the elementary intervals for each continuous attribute. The first two lines find and sort the endpoints of the intervals for each continuous attribute; the endpoints are values appearing in the data set. To improve the efficiency of the mining process, ARDP only chooses endpoints from transactions in the goal class (i.e. high-defect transactions). Lines 4–12 generate all intervals for every continuous attribute. The leftmost endpoint is combined as the string "15:(-;2]", where 15 represents the attribute id and "(-;2]" represents the interval of values less than or equal to 2. The rightmost endpoint is combined as the string "15:(12;+)", which is the interval of values greater than 12. All midpoints are combined as strings such as "15:(2;5]", where the right endpoint is included in the interval, while the left endpoint is not.

01 Find all possible values for each continuous attribute
02 Sort possible values for each continuous attribute
03 ...
04 For each continuous attribute index i do {
05     String[] attr = get possible values of attribute[i];
06     String inter = ":(-;" + attr[0] + "]";
07     ...
08     for (int j = 0; j < attr.length - 1; j++)
09         inter = inter + ":(" + attr[j] + ";" + attr[j+1] + "]";
10     inter = inter + ":(" + attr[attr.length-1] + ";+)";
11     ...
12 }

Fig. 3. Generate the intervals of continuous attributes.
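The interval strings of Fig. 3 can also be rendered as a runnable Java method (a sketch that assumes, as the elisions in the figure suggest, that each elementary interval is kept as its own string rather than concatenated; names are illustrative):

import java.util.ArrayList;
import java.util.List;

public class IntervalGenerator {
    // Build the elementary interval strings for one continuous attribute
    // from its sorted endpoint values.
    static List<String> intervals(int attrId, List<Double> sortedValues) {
        List<String> out = new ArrayList<>();
        out.add(attrId + ":(-;" + sortedValues.get(0) + "]"); // leftmost
        for (int j = 0; j < sortedValues.size() - 1; j++)     // midpoints
            out.add(attrId + ":(" + sortedValues.get(j) + ";"
                    + sortedValues.get(j + 1) + "]");
        out.add(attrId + ":("
                + sortedValues.get(sortedValues.size() - 1) + ";+)"); // rightmost
        return out;
    }

    public static void main(String[] args) {
        // The example used in Step 4 below: attribute id 5, values {5, 6, 8, 16}.
        System.out.println(intervals(5, List.of(5.0, 6.0, 8.0, 16.0)));
        // [5:(-;5.0], 5:(5.0;6.0], 5:(6.0;8.0], 5:(8.0;16.0], 5:(16.0;+)]
    }
}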


01 for each interval pair index i do {
02     String IntPair(i) = the interval pair i;
03     calculate the loss value of IntPair(i);
04 }
05 while (number of intervals > MaxAttrNumber) do {
06     sort the interval pairs by loss values;
07     merge the interval pair with the lowest loss value;
08     recalculate the loss of the interval pairs affected;
09 }

Fig. 4. Algorithm used to merge adjacent intervals.
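The loss criterion used in Fig. 4 can be sketched concretely as follows (a simplified Java version with illustrative names; it weights entropies by interval size in the InfoMerge style, and rescans all pairs each round instead of using the indexed goodness chain of Wang et al.):

import java.util.ArrayList;
import java.util.List;

public class IntervalMerger {
    // An interval with its class counts: h = transactions in the goal class
    // (high-defect), l = all other transactions.
    static class Interval {
        double lo, hi;
        int h, l;
        Interval(double lo, double hi, int h, int l) {
            this.lo = lo; this.hi = hi; this.h = h; this.l = l;
        }
    }

    static double entropy(int h, int l) {
        int n = h + l;
        if (n == 0 || h == 0 || l == 0) return 0.0;
        double ph = (double) h / n, pl = (double) l / n;
        return (-ph * Math.log(ph) - pl * Math.log(pl)) / Math.log(2);
    }

    // Loss of merging a and b: entropy of the merged interval minus the
    // size-weighted entropy of the pair before merging.
    static double loss(Interval a, Interval b) {
        int na = a.h + a.l, nb = b.h + b.l, n = na + nb;
        double before = (na * entropy(a.h, a.l) + nb * entropy(b.h, b.l)) / n;
        return entropy(a.h + b.h, a.l + b.l) - before;
    }

    // Merge the adjacent pair with the smallest loss until at most
    // maxAttrNumber intervals remain (cf. Fig. 4).
    static List<Interval> merge(List<Interval> intervals, int maxAttrNumber) {
        List<Interval> cur = new ArrayList<>(intervals);
        while (cur.size() > maxAttrNumber && cur.size() >= 2) {
            int best = 0;
            for (int i = 1; i < cur.size() - 1; i++)
                if (loss(cur.get(i), cur.get(i + 1))
                        < loss(cur.get(best), cur.get(best + 1)))
                    best = i;
            Interval a = cur.get(best), b = cur.remove(best + 1);
            a.hi = b.hi; a.h += b.h; a.l += b.l;
        }
        return cur;
    }
}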


The fourth step generates the interval pairs for merging, where interval pairs are adjacent intervals that can be considered for merging. For instance, for an attribute with id = 5 and a set of appearing values {5, 6, 8, 16}, the intervals generated in Step 3 are 5:(-;5], 5:(5;6], 5:(6;8], 5:(8;16] and 5:(16;+), while the interval pairs generated in Step 4 are 5:(-;5] 5:(5;6], 5:(5;6] 5:(6;8], 5:(6;8] 5:(8;16] and 5:(8;16] 5:(16;+). The fifth step merges the adjacent intervals, where the merge algorithm is presented in Fig. 4. Lines 1–4 compute the loss value for every interval pair, where the loss value is defined as the difference in information gain after and before merging. Lines 5–8 sort the interval pairs and merge the interval pair with the smallest loss value; a recalculation is needed for the interval pairs affected by the merged intervals. In the example presented at Step 4, after merging the interval pair 5:(5;6] 5:(6;8], the new intervals are 5:(-;5], 5:(5;8], 5:(8;16] and 5:(16;+), while the interval pairs are 5:(-;5] 5:(5;8], 5:(5;8] 5:(8;16] and 5:(8;16] 5:(16;+). The interval pairs 5:(-;5] 5:(5;8] and 5:(5;8] 5:(8;16] have changed, and their loss values need to be recalculated. The merge process is performed continuously until the number of intervals is less than the maximum number of intervals of continuous attributes specified by the user in Step 1.

Step 6 generates the large itemset for the continuous attributes, where the generation process is similar to the process for category attributes, except that the minimum support has to be adjusted when the number of large itemsets exceeds the maximum number of large itemsets specified in Step 1. The procedure for adjusting the minimum support is listed in Fig. 5, where the support increment can be predetermined or obtained from the minimum support (for example, 1% of the minimum support is taken as the increment). The large itemsets for both continuous and category attributes need to be pruned when the minimum support is altered.

while (litem.size() > MaxGeneratedItems) {
    minsup = minsup + SupportIncrement;
    litem.PruneItems(minsup);
    citemset.PruneItems(minsup);
    nitemset.PruneItems(minsup);
}

Fig. 5. Adjust the minimum support.

The final step generates the rule set by combining the large itemsets generated in Steps 2 (category large itemsets) and 6 (continuous large itemsets). The rule set is pruned using the minimum confidence. To reduce the number of rules, the redundant rules are pruned using strength checking as in [34], where only Rule 1 is used because only rules with the goal class as the right-hand-side item are discovered. A rule composed from the intervals of continuous attributes and from category attributes can be expressed as:

{5:(5;16], 9:(0;1], 18:(-;39], 3 = '5', 7 = '4'} → H

The prefix number in each item represents the attribute index, and the right-hand-side item represents the goal class. The attribute id prefixed to every interval determines whether two intervals are adjacent (i.e. two intervals of different attributes are not adjacent). Pruning the mined rule set can reduce the number of rules and help in understanding the results. Since the subsequent item of the mined rules is the single item {H}, pruning can be performed by comparing the antecedent items and rule confidences. For instance, suppose a rule set contains two rules whose antecedent items are {5:(4;8], 3 = '0'} and {5:(4;8], 2 = '-', 3 = '0'}. If these two rules have similar confidence (i.e. the difference is below 0.1), then the second rule is redundant and can be pruned.

3.3. Action Prediction

An action is a basic operation performed in executing a software project. The action information is input, transformed into the features shown in Table 3 of [8], and evaluated before execution. For a defect pattern P = {p1,...,pn}, let F(pi) represent the ID part of item pi. A submitted action A = {a1,...,am} is said to satisfy P (and is predicted as a high-defect action) if for each pi (1 ≤ i ≤ n) there exists an aj (1 ≤ j ≤ m) whose value is covered by the interval of pi and j = F(pi). Assume that Eq. (2) is a high-defect pattern, and that a submitted action is expressed as follows, in which only the first 10 attributes are listed; the last symbol "?" denotes the subsequent feature to be predicted.

{1, N, R, 0, 3, 7, 4, 3, 1, 26, ?}    (3)

The action depicted in Eq. (3) is predicted as a high-defect action, since the values of Action_Type, Object_Type, Effort_Expected, Action_Target and Num_of_action_objects are covered by Eq. (2).
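A minimal Java sketch of this matching test (Item, its fields and the interval handling are illustrative assumptions; the attribute ids follow Eqs. (2) and (3)):

import java.util.List;
import java.util.Map;

public class ActionPredictor {
    // One pattern item: an attribute id plus either an exact categorical
    // value or a half-open numeric interval (lo; hi].
    record Item(int id, String value, Double lo, Double hi) {
        boolean covers(String actionValue) {
            if (value != null) return value.equals(actionValue);
            double v = Double.parseDouble(actionValue);
            return (lo == null || v > lo) && (hi == null || v <= hi);
        }
    }

    // An action A satisfies pattern P if every item p is covered by the
    // action's value for attribute F(p), i.e. p.id().
    static boolean satisfies(Map<Integer, String> action, List<Item> pattern) {
        return pattern.stream().allMatch(p ->
            action.containsKey(p.id()) && p.covers(action.get(p.id())));
    }

    public static void main(String[] args) {
        // Eq. (2): {2 = 'N', 5 = '3', 8 = '3', 6:(5;8], 9:(0;1]} -> {H}
        List<Item> highDefect = List.of(
            new Item(2, "N", null, null), new Item(5, "3", null, null),
            new Item(8, "3", null, null), new Item(6, null, 5.0, 8.0),
            new Item(9, null, 0.0, 1.0));
        // Eq. (3): the first 10 attribute values of the submitted action.
        Map<Integer, String> action = Map.of(1, "1", 2, "N", 3, "R", 4, "0",
            5, "3", 6, "7", 7, "4", 8, "3", 9, "1", 10, "26");
        System.out.println(satisfies(action, highDefect)); // true: high-defect
    }
}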

4. The experimental results and discussions

To assess the defect prediction of the proposed ARDP, it was run on a data set obtained from the AMS-COMFT project. The project was performed with 682 actions, and 413 defects were reported. To facilitate the analysis, high-defect actions were defined as actions causing more than three defects, while all other actions were defined as low-defect actions. According to this definition, the project included 41 high-defect actions. The attributes used to describe the actions and defects in the AMS-COMFT project are listed in Tables 1 and 2 of [8], and the transformed features are shown in Table 3 of [8], in which attributes 1–10 can be collected from the actions, while the remaining attributes have to be determined from the tasks and defects of the action. For example, the Task_Status indicates whether the current action is within or over schedule, and is calculated by comparing the action performed date with the scheduled date of the task. The Task_Actions variable represents the number of actions used to execute the same task, and can be computed while generating the data set for the rule engine. Attributes 14–20 are calculated in a similar way to Task_Actions. The Task_Progress represents the progress of the task when the action is ready to be performed, and is calculated by dividing the used efforts of all performed actions of the task by the expected efforts of the task. The total_defect_num represents the number of defects resulting from the action, and is used to classify an action as low- or high-defect. The number of defects used to build the rule set was computed while generating the data set, and the number of defects used to assess the prediction was calculated at the end of the project.

4.1. The accuracy evaluation

To demonstrate the accuracy of the ARDP approach applied to the prediction of defects caused by actions of the software process, the collected data set was sorted by action date and divided into many segments, each containing 10 actions, except that the first


segment contained 20 actions and the last segment contained 2 actions. The last action in each segment was the check point, except that the last segment was only utilized for testing. ARDP was evaluated at every check point using all the actions before the check point as training data, and the next 10 actions as testing data. The rule set was renewed at each check point, and used to predict the subsequent 10 actions. For example, the first evaluation used actions 1–20 as the training data, and actions 21–30 to test the accuracy of the rule set. The second check point used actions 1–30 as the training data, and actions 31–40 as the testing data. The final check point used actions 1–680 as the training data, and actions 681–682 as testing data. Hence, 67 data sets were generated with 662 actions to be predicted, including 39 high-defect actions (the first training set contained two high-defect actions, which therefore could not be used as test data to evaluate the accuracy of ARDP). In using ARDP to predict the actions causing many defects, a high-defect prediction can be treated as positive, while a low-defect prediction is treated as negative. A true-positive (TP) is defined as a correct positive prediction, and a false-positive (FP), or false alarm, is defined as an incorrect positive prediction. Similarly, a true-negative (TN) is defined as a correct negative prediction, and a false-negative (FN) is defined as an incorrect negative prediction. The accuracy is defined as the percentage of correct predictions (TP + TN) among all predictions (TP + TN + FP + FN). The precision represents the percentage of correct predictions (TP) among the positive predictions (TP + FP). The recall is defined as the percentage of high-defect actions that have been discovered. The specificity denotes the percentage of correctly predicted low-defect actions. Over all check points, the total of 662 predictions (39 positive and 623 negative cases) can be used to evaluate the accuracy of the mined rule set.
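The evaluation procedure can be sketched in Java as follows (a minimal walk-forward loop under the setup just described; Action, RuleSet and Miner are assumed interfaces standing in for the components of Section 3):

import java.util.List;

public class WalkForwardEval {
    interface Action { int defectCount(); }
    interface RuleSet { boolean predictsHigh(Action a); }
    interface Miner { RuleSet mine(List<Action> training); }

    // At each check point the rule set is rebuilt from all earlier actions
    // and tested on the next 10 actions.
    static double[] evaluate(List<Action> actions, Miner miner) {
        int tp = 0, fp = 0, tn = 0, fn = 0;
        for (int cp = 20; cp < actions.size(); cp += 10) {
            RuleSet rules = miner.mine(actions.subList(0, cp));
            for (Action a : actions.subList(cp, Math.min(cp + 10, actions.size()))) {
                boolean predHigh = rules.predictsHigh(a);
                boolean isHigh = a.defectCount() > 3; // high-defect definition
                if (predHigh && isHigh) tp++;
                else if (predHigh) fp++;
                else if (isHigh) fn++;
                else tn++;
            }
        }
        return new double[] {
            (double) (tp + tn) / (tp + tn + fp + fn), // accuracy
            (double) tp / (tp + fp),                  // precision
            (double) tp / (tp + fn),                  // recall
            (double) tn / (tn + fp)                   // specificity
        };
    }
}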

4.2. The results

Table 1 summarizes the analytical results using 0.01 as the minimum support (supp), 0.4 as the minimum confidence (conf), 150 as the maximum number of intervals of continuous attributes (maxA) and 450 as the maximum number of large itemsets (maxI); only the first 20 check points are shown. The first column shows the check point number; the second and third columns show the H class predicted as H or L; and the fourth and fifth columns show the L class predicted as H or L.

Table 1
The results of prediction using supp = 0.01, conf = 0.4, maxA = 150, maxI = 450

Check point    H class (H, L)    L class (H, L)    Training    Testing    Rule #
1              0, 0              2, 8              1–20        21–30      164
2              1, 0              1, 8              1–30        31–40      176
3              1, 0              4, 5              1–40        41–50      225
4              2, 0              2, 6              1–50        51–60      296
5              0, 0              4, 6              1–60        61–70      282
6              0, 0              2, 8              1–70        71–80      259
7              0, 0              7, 3              1–80        81–90      263
8              0, 0              7, 3              1–90        91–100     276
9              2, 0              0, 8              1–100       101–110    257
10             0, 0              0, 10             1–110       111–120    87
11             2, 0              1, 7              1–120       121–130    115
12             1, 0              1, 8              1–130       131–140    157
13             0, 0              2, 8              1–140       141–150    171
14             3, 0              3, 4              1–150       151–160    166
15             0, 0              2, 8              1–160       161–170    198
16             1, 0              1, 8              1–170       171–180    195
17             0, 1              1, 8              1–180       181–190    189
18             0, 0              0, 10             1–190       191–200    197
19             0, 0              1, 9              1–200       201–210    165
20             0, 0              0, 10             1–210       211–220    78

Classified as →    H      L
H class            39     0
L class            559    62

Accuracy = 15.26%; Precision = 6.50%; Recall = 100.00%; Specificity = 9.95%; Rule avg. # = 114

Fig. 6. The summation of Table 1.

The sixth and seventh columns, respectively, show the training and testing data sets, and the last column shows the number of rules generated using the training data, after similar rules are pruned. Fig. 6 summarizes the results of all 67 check points, where all high-defect actions were predicted correctly, with a recall rate of 100%. However, the precision was only 6.50%, signifying that almost 94 false alarms were obtained for every 100 positive predictions. The high false alarm rate means that almost every action is suspected of causing high defects and has to be verified. Fig. 6 indicates that almost all predictions were H class, because the rules generated for ARDP focus only on the H class: an action is predicted as L class only if no rule fits it. Therefore, generating too many rules may affect the predictions (the average number of generated rules is about 114). To reduce the number of rules, the confidence level used to select the rule set can be increased (a confidence level of 0.5 is commonly used for Rule Generation). To demonstrate the effect of different confidence levels on prediction, Table 2 summarizes the results of predictions using a minimum support of 0.01 and confidence levels in the range 0.1–0.9. Columns 1 (conf), 6 (acc.), 7 (prec.), 8 (rec.) and 9 (spec.), respectively, show the confidence, accuracy, precision, recall and specificity. The accuracy and precision both increase as the confidence increases. The recall falls as the number of predicted L (low-defect) actions increases, which also decreases the false alarm rate. Although in this study the accuracy could be increased up to 91% at conf = 0.9, the number of discovered high-defect actions was then only 4: 35 high-defect actions were not predicted, and 21 low-defect actions were incorrectly identified as high-defect. ARDP largely focuses on accurately predicting the actions causing high defects, which is measured by recall. Additionally, the false alarm rate must be kept within an acceptable range to reduce the effort involved in verifying suspect actions. Fig. 7 summarizes the accuracy, precision, recall and specificity presented in Table 2. The curves for accuracy and specificity are very similar because the H class is rare (TP + FN = 39) compared to the whole data set (TP + FN + TN + FP = 662). Therefore, the values of the following two expressions are almost the same:

Accuracy = (TP + TN)/(TP + FN + FP + TN) ≈ TN/(FP + TN) = Specificity

According to Fig. 7, the recall dropped significantly (from 87% to 72%) between confidence 0.4 and 0.5. Therefore, the minimum confidence can be set to around 0.4 (or 0.5).

Table 2
The summary of results of prediction using supp = 0.01, conf = 0.1–0.9

Conf    H class (H, L)    L class (H, L)    Acc.     Prec.    Rec.      Spec.    Rules #
0.1     39, 0             561, 62           15.26    6.50     100.00    9.95     114
0.2     35, 4             275, 348          57.85    11.29    89.74     55.86    102
0.3     35, 4             197, 426          69.64    15.09    89.74     68.38    100
0.4     34, 5             131, 492          79.46    20.61    87.18     78.97    104
0.5     28, 11            95, 528           83.99    22.76    71.79     84.75    86
0.6     21, 18            60, 563           88.22    25.93    53.85     90.37    55
0.7     17, 22            44, 579           90.03    27.87    43.59     92.94    33
0.8     11, 28            29, 595           91.40    27.50    28.21     95.35    25
0.9     4, 35             21, 602           91.54    16.00    10.26     96.63    24

Fig. 7. The results using different confidences (accuracy, precision, recall, specificity and false alarm rate plotted against confidence levels 0.1–0.9).

Table 3 shows the results of using different levels of minimum support, which are summarized in Fig. 8. The precision and recall did not change significantly for minimum support values between 0.005 and 0.02, but the recall dropped significantly when the minimum support was increased to 0.05. The minimum support should be set to a small value when building the rule set, to detect rare cases and ensure that the H cases can be used to build the rule set. In addition to the minimum support and confidence, the number of intervals of continuous attributes can be determined by the partial completeness level [47]. For instance, with partial completeness level K = 17, minimum support supp = 0.01 and n = 12 continuous attributes, the number of intervals is computed as follows.

Number of intervals = 2n/(supp × (K − 1)) = (2 × 12)/(0.01 × 16) = 150

Table 4 shows the results using 0.01 as the minimum support, 0.4 as the minimum confidence, and different numbers of intervals of continuous attributes (n).

Table 3
The results of prediction using conf = 0.4, supp = 0.005–0.05

Supp     H class (H, L)    L class (H, L)    Acc.     Prec.    Rec.     Spec.    Rules #
0.005    35, 4             183, 440          71.75    16.06    89.74    70.63    225
0.010    34, 5             131, 492          79.46    20.61    87.18    78.97    104
0.020    33, 6             109, 514          82.63    23.24    84.62    82.50    40
0.030    33, 6             84, 539           86.40    28.21    84.62    86.52    19
0.040    31, 8             73, 550           87.76    29.81    79.49    88.28    10
0.050    23, 16            53, 570           89.58    30.26    58.97    91.49    7

Fig. 8. The results using different supports (accuracy, precision, recall, specificity and false alarm rate plotted against minimum support 0.005–0.05).

Table 4
The prediction results using conf = 0.4, supp = 0.01, n = 30–180

n      H class (H, L)    L class (H, L)    Acc.     Prec.    Rec.     Spec.    Rules #
30     31, 8             136, 487          78.25    18.56    79.49    78.17    79
60     33, 6             140, 483          77.95    19.08    84.62    77.53    93
90     34, 5             135, 488          78.85    20.12    87.18    78.33    99
120    34, 5             135, 488          78.85    20.12    87.18    78.33    105
150    34, 5             131, 492          79.46    20.61    87.18    78.97    104
180    34, 5             131, 492          79.46    20.61    87.18    78.97    104

Although all high-defect actions can be identified using these settings, the accuracy with around 150–180 intervals was much better than the accuracy with other numbers of intervals. The maximum number of large itemsets of category and continuous attributes restricts the number of combinations of the category and continuous large itemsets. For example, setting a maximum of 450 itemsets restricts the number of rules to about 2 × 10⁵ (before pruning). Table 5 shows the results of using different maximum numbers of large itemsets (m). Table 6 shows the rule set mined using the first 530 records as training data with 0.4 as the minimum confidence and 0.01 as the minimum support. Rule 3 indicates that an action causes high defects when the expected effort falls between 6 and 8, the Object_Type is Application, and the originator is a programmer. Similar defect patterns can be seen in Rules 4 and 5, in which Rule 4 contains the item 11 = '0' (Task_Status is within schedule) and Rule 5 contains the item 12:(40;120] (the expected effort for the task falls between 40 and 120). In this case, although the last items of Rules 3, 4 and 5 may all be contained in one action, the item 7 = '4' achieves the highest support (0.019) for high-defect prediction. The manager can perform the evaluation and decision for every prediction based on the mined rule set. The recall (79%) of the predictions using 0.04 as the minimum support in Table 3 reveals that about 80% of high-defect actions can be predicted in advance. The high accuracy (about 88%) also indicates that most actions can be predicted correctly. However, the high specificity (about 88%) together with the low precision (about 30%) indicates that most correct predictions are of low-defect actions (too many low-defect actions are predicted as high-defect actions). The low precision may result in too many false alarms and stall the software project. Precision can be improved simply by considering the confidence of the rules while evaluating an action. To reduce the difference between the numbers of high-defect actions and low-defect actions, the low-defect actions can be further split into two groups: median-defect actions (1 or 2 defects) and no-defect actions (0 defects). The new categorization grouped the actions into 41 high-defect actions, 177 median-defect actions and 464 no-defect actions. The numbers of high-defect, median-defect and no-defect actions used for testing were 39, 175 and 448, respectively. Besides building a rule set for high-defect actions, this categorization generates rule sets for median-defect and no-defect action predictions. The multiple rule sets predict an action as either high-defect or no-defect. More than one rule can fit an action.

Table 5
The prediction results using conf = 0.4, supp = 0.01, n = 150, m = 250–450

m      H class (H, L)    L class (H, L)    Acc.     Prec.    Rec.     Spec.    Rules #
250    23, 16            53, 578           89.70    30.26    58.97    91.60    7
300    23, 16            53, 578           89.70    30.26    58.97    91.60    7
350    23, 16            53, 578           89.70    30.26    58.97    91.60    7
400    34, 5             131, 492          79.46    20.61    87.18    78.97    104
450    34, 5             131, 492          79.46    20.61    87.18    78.97    104


For example, Rules 1, 2, 3 and 4 in Table 6 predict the action {1, 3, -, 5, 3, 7, 4, 1, 0, 26, 0, 40, 6, 0, 0, 0, 6, 3, 3, 0.00} as high-defect. Table 7 shows the prediction results of 10 testing actions (301–310), in which AC denotes the actual class; PC denotes the predicted class; R denotes the number of rules satisfied; MS denotes the maximum support, and MC denotes the maximum confidence of all satisfied rules. The minimum support and confidence in the rule sets for high-defect and median-defect actions were 0.05 and 0.4, while 0.4 and 0.8 were used as the minimum support and confidence, respectively, to mine the rule set for no-defect actions. Actions 2 and 4 were predicted as high-defect actions when using a single rule set for high-defect actions. However, the prediction for action 2 needs to be reconsidered when using the rule sets for median-defect and no-defect actions. Fig. 9 depicts the prediction results using multiple rule sets, in which the precision increased to 34.48%, while the recall dropped to 51.28% (many high-defect actions were also evaluated as low-defect actions) in comparison with the predictions shown in Table 3. Evaluations performed by different experts may produce different results. Table 8 presents the results of using a single rule set for no-defect action prediction, in which an action is evaluated as high- or median-defect (denoted as MH) if no rule for low-defect prediction is satisfied. The precision of these results is around 36%. The major cause of low precision is an imbalanced data set. Under-sampling is a technique that can be used to address the imbalance problems of data sets.

Table 6
The mining results using the first 530 records (conf = 0.4, supp = 0.01, maxA = 150, maxI = 450)

No    Rule                                     Supp     Conf
1     6:(6;8], 3 = '-'                         0.023    0.429
2     6:(6;8], 4 = '5'                         0.017    0.474
3     6:(6;8], 5 = '3', 7 = '4'                0.019    0.417
4     6:(6;8], 5 = '3', 11 = '0'               0.011    0.429
5     6:(6;8], 5 = '3', 12:(40;120]            0.011    0.462
6     6:(6;8], 7 = '4', 12:(40;120]            0.011    0.462
7     9:(0;1], 3 = '-', 4 = '0'                0.013    0.467
8     9:(0;1], 3 = '-', 7 = '4'                0.043    0.426
9     9:(0;1], 3 = '-', 11 = '0'               0.026    0.438
10    9:(0;1], 3 = '-', 5 = '3', 8 = '4'       0.032    0.405
11    9:(0;1], 4 = '5', 7 = '4', 11 = '0'      0.019    0.500
12    9:(0;1], 7 = '4', 8 = '4', 11 = '0'      0.015    0.400
13    12:(40;120], 3 = '-', 4 = '5'            0.017    0.409
14    12:(40;120], 3 = '-', 11 = '0'           0.023    0.444
15    12:(40;120], 4 = '5', 7 = '4'            0.017    0.409
16    12:(40;120], 4 = '5', 11 = '0'           0.015    0.400
17    12:(120;320], 1 = '0'                    0.019    0.417
18    12:(120;320], 3 = '-', 7 = '4'           0.023    0.500
19    14:(-;0], 3 = '-', 4 = '5', 10 = '30'    0.013    0.412
20    14:(-;0], 3 = '-', 8 = '3', 11 = '0'     0.015    0.400

Classified as →    H     L
H class            20    19
L class            38    585

Accuracy = 91.39%; Precision = 34.48%; Recall = 51.28%; Specificity = 93.90%

Fig. 9. The prediction results using multiple rule sets.

Table 8
The prediction results using the no-defect rule set (supp = 0.1, conf = 0.6–0.8, n = 150, m = 250–450)

Conf    MH class (MH, N)    N class (MH, N)    Acc.     Prec.    Rec.     Spec.    Rule #
0.6     25, 189             52, 396            63.60    32.47    11.68    88.39    52
0.7     94, 120             162, 286           57.40    36.72    43.93    63.84    43
0.8     141, 73             229, 219           54.38    38.11    65.89    48.88    35

To demonstrate the use of under-sampling to improve prediction precision, 49 low-defect actions were selected randomly from the 641 low-defect actions of the original data set. The 49 low-defect actions and the 41 high-defect actions form a balanced data set with a total of 90 actions. The evaluation process was the same as that used in Section 4.1; for example, actions 1–20 were used to build the prediction models, and actions 21–30 were used as the testing data set. Table 9 shows the prediction results using 0.1 as the minimum support and 0.9 as the minimum confidence. The number of incorrect predictions of low-defect actions was only 6, which is smaller than the number of correct predictions of high-defect actions (24). Table 10 presents the accuracy, precision, recall and specificity of these results for confidence levels 0.3–0.9 (the conf = 0.9 row corresponds to Table 9). The precision was 81.82%, while the recall was 87.10%. Hence, the low precision due to imbalance in the data set can be addressed by under-sampling. To decrease the impact on prediction accuracy, the data set used to build the prediction models has to be checked for imbalance, and under-sampling should be performed on the collected data when the data set is imbalanced.
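A minimal Java sketch of this balancing step (generic; the sizes follow the text, and the seed is an illustrative detail):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class UnderSampler {
    // Keep all high-defect actions and draw a random subset of the
    // low-defect actions to form a balanced training set.
    static <A> List<A> balance(List<A> high, List<A> low,
                               int lowSampleSize, long seed) {
        List<A> shuffled = new ArrayList<>(low);
        Collections.shuffle(shuffled, new Random(seed));
        List<A> balanced = new ArrayList<>(high);
        balanced.addAll(shuffled.subList(0, lowSampleSize));
        // e.g. 41 high-defect + 49 sampled low-defect = 90 actions
        return balanced;
    }
}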

Table 9
The results using under-sampling (supp = 0.1, conf = 0.9, maxA = 150, maxI = 450)

Data set (Training, Testing)    H Class (H, L)    L Class (H, L)    Rule #
1–20, 21–30                     5, 0              2, 3              61
1–30, 31–40                     1, 3              1, 5              25
1–40, 41–50                     5, 0              0, 5              17
1–50, 51–60                     4, 0              2, 4              31
1–60, 61–70                     5, 0              1, 4              27
1–70, 71–80                     2, 1              0, 7              14
1–80, 81–90                     2, 3              0, 5              19
Total                           24, 7             6, 33

Table 7
The prediction results of actions 301–310 (conf = 0.4, supp = 0.05, maxA = 150, maxI = 450)

Action    AC    PC    High-defect (R, MS, MC)    Median-defect (R, MS, MC)    No-defect (R, MS, MC)
1         L     L     0, –, –                    2, 0.07, 0.44                2, 0.46, 0.85
2         L     H     2, 0.05, 0.47              4, 0.07, 0.57                1, 0.44, 0.85
3         L     L     0, –, –                    2, 0.07, 0.44                1, 0.44, 0.85
4         H     H     4, 0.05, 0.47              8, 0.07, 0.57                0, –, –
5         L     L     0, –, –                    2, 0.07, 0.44                2, 0.46, 0.85
6         L     L     0, –, –                    2, 0.07, 0.44                1, 0.44, 0.85
7         L     L     0, –, –                    2, 0.07, 0.44                2, 0.46, 0.85
8         L     L     0, –, –                    0, –, –                      2, 0.46, 0.85
9         L     L     0, –, –                    1, 0.07, 0.44                0, –, –
10        L     L     0, –, –                    1, 0.07, 0.44                0, –, –

Table 10
The results using under-sampling (supp = 0.1, conf = 0.3–0.9, maxA = 150, maxI = 450)

Conf    H class (H, L)    L class (H, L)    Acc.     Prec.    Rec.      Spec.    Rules #
0.3     31, 0             39, 0             44.29    44.29    100.00    0.00     84
0.4     31, 0             39, 0             44.29    44.29    100.00    0.00     80
0.5     31, 0             39, 0             44.29    44.29    100.00    0.00     73
0.6     28, 3             29, 10            54.29    49.12    90.32     25.64    56
0.7     27, 4             18, 21            68.57    60.00    87.10     53.85    49
0.8     27, 4             13, 26            75.71    67.50    87.10     66.67    39
0.9     27, 4             6, 33             85.71    81.82    87.10     84.62    28


5. Conclusions

Defect prediction is an important activity for defect prevention and software process improvement. This study presents the Association Rule based Defect Prediction (ARDP) approach, which applies association rule mining. The proposed approach can be applied in the software development process to predict the actions that are likely to cause high defects. The data set used to mine the rule sets is collected from the reports of operations and defects of the project, and can be used to predict the defects generated by subsequent actions. To increase the accuracy of prediction, ARDP applies multi-interval discretization to the continuous attributes of the data set. The prediction results can be applied in the causal analysis process and in corrective action planning to prevent defects from occurring. The main advantage of ARDP is in-process prediction, in which the training data for building the rule set can be obtained from the project in execution; the in-process analysis can also reduce the variance between different projects. Second, the attributes used in ARDP to build the rule sets can be adapted from the existing process, which may reduce the effort involved in modifying existing processes for ARDP and data collection. Third, the mined rule set can accurately predict subsequent actions and yield a quick response. Fourth, the multi-interval discretization adopted in ARDP can automatically handle any number of continuous attributes. Additionally, ARDP can be merged into an existing process for defect prediction.

To demonstrate its accuracy, this study applied ARDP to the AMS-COMFT project, revealing the following results. (1) Actions that cause high defects may be a rare class in the repositories of the software process, and can be identified with a small minimum support level; the rarity problem of the high-defect class can be addressed using the under-sampling technique. Since the mined rule set concentrates on the high-defect class, the number of rules may influence the number of false alarms. The minimum support, minimum confidence, number of intervals and maximum number of large itemsets must be selected carefully to reduce the number of false alarms and increase the recall rate. (2) The number of intervals of the continuous attributes may affect the number of discovered rules. The threshold for selecting the number of intervals can be determined from the partial completeness and minimum support levels. (3) Although the accuracy of prediction using ARDP reaches 85%, understanding the characteristics of the data set helps determine these thresholds. Additionally, patterns among actions causing many defects can be modeled using data mining techniques, and the rule set built by the multi-interval association mining can also be used to describe the causes of reported defects. Future work on identifying these patterns will include using multi-dimensional association techniques to improve the prediction accuracy.

Acknowledgement

This work is partially supported by the National Science Council of Taiwan, ROC, under Grant NSC-92-2213-E-309-005, and partially sponsored by the Ministry of Economic Affairs of Taiwan, under Grant 93-EC-17-A-02-S1-029.

References

[1] S. Mohapatra, B. Mohanty, Defect prevention through defect prediction: a case study at Infosys, IEEE International Conference on Software Maintenance (2001) 206–272.


[2] R.G. Mays, C.L. Jones, G.J. Holloway, D.P. Studinski, Experiences with defect prevention, IBM Systems Journal 29 (1990) 4–32.
[3] R. Chillarege, I.S. Bhandari, J.K. Chaar, M.J. Halliday, D.S. Moebus, B.K. Ray, M.-Y. Wong, Orthogonal defect classification – a concept for in-process measurements, IEEE Transactions on Software Engineering 18 (1992) 943–956.
[4] M. Leszak, D.E. Perry, D. Stoll, Classification and evaluation of defects in a project retrospective, The Journal of Systems and Software 61 (2002) 173–187.
[5] T.M. Khoshgoftaar, E.B. Allen, W.D. Jones, J.P. Hudepohl, Classification-tree models of software-quality over multiple releases, IEEE Transactions on Reliability 49 (2000) 4–11.
[6] S. Zhong, T.M. Khoshgoftaar, N. Seliya, Analyzing software measurement data with clustering techniques, IEEE Intelligent Systems 19 (2004) 20–27.
[7] T.L. Graves, A.F. Karr, J.S. Marron, H. Siy, Predicting fault incidence using software change history, IEEE Transactions on Software Engineering 26 (2000) 653–661.
[8] C.-P. Chang, C.-P. Chu, Defect prevention in software processes: an action-based approach, Journal of Systems and Software 80 (4) (2007) 559–570.
[9] B. Liu, Y. Ma, C.-K. Wong, P.S. Yu, Scoring the data using association rules, Applied Intelligence 18 (2) (2003) 119–135.
[10] C. Westphal, T. Blaxton, Data Mining Solutions: Methods and Tools for Solving Real-World Problems, John Wiley & Sons, 1998.
[11] D. Hand, H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, Cambridge, MA, 2001.
[12] G.M. Weiss, Mining with rarity: a unifying framework, ACM SIGKDD Explorations Newsletter 6 (2004) 7–19.
[13] M.A. Hall, G. Holmes, Benchmarking attribute selection techniques for discrete class data mining, IEEE Transactions on Knowledge and Data Engineering 15 (3) (2003) 1437–1447.
[14] A.A. Freitas, S.H. Lavington, Speeding up knowledge discovery in large relational databases by means of a new discretization algorithm, in: Proceedings of the 14th British National Conference on Databases, Edinburgh, UK, 1996, pp. 124–133.
[15] R.S. Pressman, Software Engineering: A Practitioner's Approach, McGraw-Hill, NJ, 2001.
[16] B. Boehm, V. Basili, Software defect top 10 lists, IEEE Computer 34 (1) (2001) 2–6.
[17] K.E. Emam, O. Laitenberger, Evaluating capture–recapture models with two inspectors, IEEE Transactions on Software Engineering 27 (2001) 851–864.
[18] M.C. Ohlsson, A. Amschler Andrews, C. Wohlin, Modelling fault-proneness statistically over a sequence of releases: a case study, Journal of Software Maintenance 13 (3) (2001) 167–199.
[19] C.-P. Chang, C.-P. Chu, Y.-F. Yeh, Time approach for estimating the defect number, Journal of Software Engineering Studies 2 (1) (2007) 138–150.
[20] M.M. Thet Thwin, T.-S. Quah, Application of neural networks for software quality prediction using object-oriented metrics, The Journal of Systems and Software 76 (2005) 147–156.
[21] T.M. Khoshgoftaar, D.L. Lanning, A neural network approach for early detection of program modules having high risk in the maintenance phase, Journal of Systems and Software 29 (1) (1995) 85–91.
[22] S. Bibi, G. Tsoumakas, I. Stamelos, I. Vlahavas, Regression via classification applied on software defect estimation, Expert Systems with Applications 34 (2008) 2091–2101.
[23] T. Graves, J.A. Karr, H.S. Marron, Predicting fault incidence using software change history, IEEE Transactions on Software Engineering 26 (7) (2000) 653–661.
[24] A.M. Salem, K. Rekab, J.A. Whittaker, Prediction of software failures through logistic regression, Information and Software Technology 46 (2004) 519–523.
[25] N. Fenton, M. Neil, W. Marsh, P. Hearty, D. Marquez, P. Krause, R. Mishra, Predicting software defects in varying development lifecycles using Bayesian nets, Information and Software Technology 49 (2007) 32–43.
[26] W. Klösgen, J.M. Zytkow, Knowledge discovery in databases terminology, Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, Cambridge, MA, 1996.
[27] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, USA, 1993.
[28] C.L. Huntley, Organizational learning in open-source software projects: an analysis of debugging data, IEEE Transactions on Engineering Management 50 (4) (2003) 485–493.
[29] R.R. Lutz, I.C. Mikulski, Operational anomalies as a cause of safety-critical requirements evolution, Journal of Systems and Software 65 (2) (2003) 155–161.
[30] M. Pighin, An empirical quality measure based on complexity values, Information and Software Technology 40 (14) (1998) 861–864.
[31] J.C. Munson, A.P. Nikora, J.S. Sherif, Software faults: a quantifiable definition, Advances in Engineering Software 37 (5) (2006) 327–333.
[32] J. Lawler, B. Kitchenham, Measurement modeling technology, IEEE Software 12 (3) (2003) 68–75.
[33] M. Cusumano, Japan's Software Factories: A Challenge to U.S. Management, Oxford University Press, New York, 1991.
[34] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA, 2000.
[35] P. Perner, S. Trautzsch, Multi-interval discretization methods for decision tree learning, in: Proceedings of the Joint IAPR International Workshops on Advances in Pattern Recognition, 1998, pp. 475–482.
[36] N. Japkowicz, S. Stephan, The class imbalance problem: a systematic study, Intelligent Data Analysis 6 (5) (2002) 429–449.


[37] K. Michalak, H. Kwaśnicka, Correlation-based feature selection strategy in classification problems, International Journal of Applied Mathematics and Computer Science 16 (4) (2006) 503–511.
[38] R. Rastogi, K. Shim, A decision tree classifier that integrates building and pruning, Data Mining and Knowledge Discovery 4 (4) (2000) 315–344.
[39] J. Li, H. Shen, R. Topor, Mining informative rule set for prediction, Journal of Intelligent Information Systems 22 (2) (2004) 155–174.
[40] T.M.A. Basile, F. Esposito, N.D. Mauro, S. Ferilli, Handling continuous-valued attributes in incremental first-order rules learning, Lecture Notes in Computer Science 3673 (2005) 430–441.
[41] M. Boullé, Khiops: a statistical discretization method of continuous attributes, Machine Learning 55 (1) (2004) 53–69.
[42] F.E.H. Tay, A modified chi2 algorithm for discretization, IEEE Transactions on Knowledge and Data Engineering 14 (3) (2002) 666–670.
[43] L. Shen, G.J. Hwang, F. Li, A dynamic method for discretization of continuous attributes, Lecture Notes in Computer Science 2412 (2002) 181–193.
[44] R. Srikant, R. Agrawal, Mining quantitative association rules in large relational tables, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1996, pp. 1–12.
[45] B. Liu, W. Hsu, S. Chen, Y. Ma, Analyzing the subjective interestingness of association rules, IEEE Intelligent Systems 15 (5) (2000) 47–55.
[46] R. Rastogi, K. Shim, Mining optimized association rules with categorical and numeric attributes, IEEE Transactions on Knowledge and Data Engineering 14 (1) (2002) 29–50.
[47] K. Wang, S.H.W. Tay, B. Liu, Interestingness-based interval merger for numeric association rules, in: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD'98), New York, 1998, pp. 121–128.