Expert Systems with Applications 36 (2009) 6019–6024
Finding "persistent rules": Combining association and classification results

Karthik Rajasethupathy a, Anthony Scime b,*, Kulathur S. Rajasethupathy b, Gregg R. Murray c
a Department of Mathematics, 310 Malott Hall, Cornell University, Ithaca, NY 14853-4201, United States
b Department of Computer Science, The College at Brockport, State University of New York, 350 New Campus Dr., Brockport, NY 14420-2933, United States
c Department of Political Science, Texas Tech University, Box 41015, Lubbock, TX 79409, United States
Keywords: Association mining; Classification; Persistent rules; Strong rules
Abstract

Different data mining algorithms applied to the same data can result in similar findings, typically in the form of rules. These similarities can be exploited to identify especially powerful rules, in particular those that are common to the different algorithms. This research focuses on the independent application of association and classification mining algorithms to the same data to discover common or similar rules, which are deemed "persistent-rules". The persistent-rule discovery process is demonstrated and tested against two data sets drawn from the American National Election Studies: one data set used to predict voter turnout and the second used to predict vote choice.

© 2008 Elsevier Ltd. All rights reserved.
1. Introduction

Data mining is a process of inductively analyzing data to find interesting patterns and previously unknown relationships in the data. Typically, these relationships can be translated into rules that are used to predict future events or to provide knowledge about interrelationships among data. Data mining methodologies often lead to a large number of rules that need to be evaluated to find the most interesting and useful. These methodologies reduce the number of rules by pruning or by imposing thresholds on support and confidence. A domain expert further reduces the rules by identifying those that are physically impossible, redundant, or not meaningful to the issue under consideration. The rules that remain are the interesting and useful rules.

However, different data mining methodologies process a data set differently, which yields results in different forms. For example, the a priori association mining algorithm presents results as strong rules, while the C4.5 classification algorithm creates a decision tree that can be converted into rules. Although these algorithms create different sets of rules, some of the individual rules may be similar. When this is the case, these independently identified yet common rules may be considered "persistent-rules". Persistent-rules improve decision making by narrowing the focus to rules that are the most robust, consistent, and noteworthy.

In this research, the concept of persistent-rules is introduced. Further, the persistent-rule discovery process is demonstrated in the area of voting behavior, which is a complex process subject to a wide variety of factors. Given the high stakes often involved in elections, researchers, campaigns, and political parties devote considerable effort and resources to trying to understand the dynamics of voting and vote choice (Edsall, 2006). This research concludes by showing how persistent-rules, found using both association and classification data mining algorithms, can be used to identify likely voters and for whom they will vote.
2. Related work

Researchers have attempted to identify the most useful and interesting rules by applying other methodologies to data prior to the application of a data mining algorithm. For example, Deshpande and Karypis (2002) and Padmanabhan and Tuzhilin (2000) improved classification rules by first using association mining techniques. In this approach, the association mining creates itemsets that are selected based on achieving a given support threshold. The original data set then has an attribute added to it for each selected itemset, where the attribute values are true or false: true if the instance contains the itemset and false otherwise. The classification algorithm is then executed on this modified data set to find the interesting rules (a sketch of this augmentation step appears at the end of this section).

Jaroszewicz and Simovici (2004) employed user background knowledge to determine the interestingness of sets of attributes using a Bayesian network prior to association mining. In their research, the interestingness of a set of attributes is indicated by the absolute difference between the attributes' support as calculated in a Bayesian network and as an association itemset.

Data dimensionality reduction (DDR) can be used to reduce the number of rules by simplifying the data. Fu and Wang (2005) employed DDR to improve classification using neural networks and to produce concise, accurate, and interesting rules. Murray, Riley, and Scime (2007) and Scime and Murray (2007) used expert knowledge to reduce data dimensionality while iteratively creating classification models.
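The itemset-augmentation step described for Deshpande and Karypis (2002) can be pictured with a short sketch. The Python fragment below is illustrative only; the toy data set, the itemset, and the attribute names are assumptions and are not taken from the cited papers.

```python
# A minimal sketch of itemset augmentation: for each frequent itemset, add a
# Boolean attribute that is True when an instance contains every item in it.
instances = [
    {"C1": "x", "C2": "g", "C3": "a"},
    {"C1": "y", "C2": "h", "C3": "a"},
]
frequent_itemsets = [{("C2", "g"), ("C3", "a")}]  # selected by a support threshold

def augment(instances, frequent_itemsets):
    """Return copies of the instances with one Boolean attribute per itemset."""
    augmented = []
    for row in instances:
        new_row = dict(row)
        for i, itemset in enumerate(frequent_itemsets):
            new_row[f"itemset_{i}"] = all(
                row.get(attr) == val for attr, val in itemset
            )
        augmented.append(new_row)
    return augmented

print(augment(instances, frequent_itemsets))
# The first instance contains (C2 = g, C3 = a), so itemset_0 is True;
# the second instance does not, so itemset_0 is False.
```

Classification would then be run on the augmented data set rather than on the original one.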
3. Data mining: association and classification methodologies

In data mining, there are a number of methodologies used to analyze data. Association mining is used to find patterns of data that show conditions where sets of attribute-value pairs occur frequently in the data set. It is often used to determine relationships in transaction data. Classification mining, on the other hand, is used to find models of data for categorizing instances of, for example, objects, events, or persons. It is often used for predicting future events from historical data (Han & Kamber, 2001). Typically, the choice of methodology is determined by both the goal of the data mining and the data. However, similar results have been obtained by mining the same data set using both methodologies. For example, Bagui (2006) mined crime data using association and classification techniques that yielded the same conclusions about criminal activity and enforcement.

More specifically, association mining evaluates data for relationships among attributes in the data set (Witten & Frank, 2005). The a priori association rule mining algorithm finds itemsets within the data set at user-specified minimum support and confidence levels. The size of the itemsets is increased as the algorithm proceeds until no larger itemsets satisfy the minimum support level. The support of an itemset is the number of instances that contain all the attributes in the itemset. The largest supported itemsets are converted into rules in which each item implies and is implied by every other item in the itemset. For example, given an itemset of three items (C1 = x, C2 = g, C3 = a), six rules are generated:
IF (C1 = x AND C2 = g) THEN C3 = a    (1)
IF (C1 = x AND C3 = a) THEN C2 = g    (2)
IF (C2 = g AND C3 = a) THEN C1 = x    (3)
IF C1 = x THEN (C2 = g AND C3 = a)    (4)
IF C2 = g THEN (C1 = x AND C3 = a)    (5)
IF C3 = a THEN (C1 = x AND C2 = g)    (6)
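As a concrete illustration of this itemset-to-rules step, the short Python sketch below enumerates every rule implied by the three-item itemset; the representation and function name are illustrative assumptions, not the output of any particular a priori implementation.

```python
from itertools import combinations

def generate_rules(itemset):
    """Every non-empty proper subset of the itemset becomes a premise;
    the remaining items become the consequent."""
    items = list(itemset)
    rules = []
    for size in range(1, len(items)):            # premise sizes 1 .. n-1
        for premise in combinations(items, size):
            consequent = tuple(p for p in items if p not in premise)
            rules.append((premise, consequent))
    return rules

# The three-item example from the text: (C1 = x, C2 = g, C3 = a).
itemset = [("C1", "x"), ("C2", "g"), ("C3", "a")]
for premise, consequent in generate_rules(itemset):
    lhs = " AND ".join(f"{a} = {v}" for a, v in premise)
    rhs = " AND ".join(f"{a} = {v}" for a, v in consequent)
    print(f"IF {lhs} THEN {rhs}")                # the six rules (1)-(6)
```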
Classification mining, on the other hand, uses an algorithm such as C4.5 to generate a decision tree. The goal of classification is to determine the likely value of a class variable (the outcome or dependent variable) given values for the other attributes of the data. This is accomplished by constructing a decision tree from data containing the dependent variable and its values. The decision tree consists of decision nodes and leaf nodes, beginning with a root decision node, connected by edges. Each decision node is an attribute of the data, and the edges represent the attribute values. The leaf nodes represent the dependent variable, that is, the expected classification result for each data instance. Using the three items from above with C1 as the dependent variable, Fig. 1 represents a possible tree.

Fig. 1. Classification decision tree.

The branches of the decision tree can be converted into rules, all of which have as the consequent the dependent variable with one of its legal values. The rules for the tree in Fig. 1 are
IF C3 = a AND C2 = g THEN C1 = x    (7)
IF C3 = a AND C2 = h THEN C1 = y    (8)
IF C3 = b THEN C1 = z    (9)
IF C3 = c THEN C1 = y    (10)
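The branch-to-rule conversion can likewise be sketched in a few lines. The nested-dictionary tree below mirrors Fig. 1; the representation and function name are illustrative assumptions, not the output format of any particular tool.

```python
# Each decision node maps (attribute, value) edges to a subtree or a leaf label.
tree = {
    ("C3", "a"): {("C2", "g"): "C1 = x", ("C2", "h"): "C1 = y"},
    ("C3", "b"): "C1 = z",
    ("C3", "c"): "C1 = y",
}

def tree_to_rules(node, conditions=()):
    """Yield (conditions, leaf) pairs, one per root-to-leaf branch."""
    if isinstance(node, str):                       # leaf: class label
        yield conditions, node
        return
    for (attribute, value), child in node.items():  # follow each edge
        yield from tree_to_rules(child, conditions + ((attribute, value),))

for conditions, label in tree_to_rules(tree):
    premise = " AND ".join(f"{a} = {v}" for a, v in conditions)
    print(f"IF {premise} THEN {label}")             # Rules (7)-(10)
```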
4. Rule reduction and supersession

The need to reduce the number of rules is common to classification and association techniques. This reduction may take place because a rule is not physically possible, because the rule's confidence falls below the established threshold level, or because the rule can be combined with other rules. In association mining, a minimum confidence level is set for the rules; those rules whose confidence falls below that level are eliminated. In classification mining, a pruning process combines decision tree nodes to reduce the size of the tree while having a minimum effect on the classification result (Witten & Frank, 2005). It is possible that physically impossible or obviously coincidental rules remain after the algorithms reduce the number of rules. These rules should be identified by a domain expert and eliminated as well.

Furthermore, in association mining one rule may have all the properties of another rule. As a rule's premise takes on more conditions, the confidence of the rule generally increases. For example, consider two rules, with the confidence level given after each rule:
IF A1 = r AND A2 = s THEN A3 = t  (conf. .90)    (11)
IF A1 = r THEN A3 = t  (conf. .80)    (12)
Rule (11) contains all the conditions of Rule (12) plus one more. The additional condition in Rule (11) increases the confidence; however, if a confidence level of .80 is sufficient, Rule (12) can supersede Rule (11), and Rule (11) is eliminated.

5. Persistent-rule discovery

Data mining methodologies can be complementary. For example, association mining has been used to strengthen the results found in classification mining. Deshpande and Karypis (2002) added the resulting association itemsets as Boolean attributes of the data set being classified. Persistent-rules, in contrast, are those that are obtained across independent data mining methods. That is, they are the subset of rules common to more than one method. If an association rule and a classification rule are similar, then the rule is robust across methods and is considered persistent.

The only association rules that can be compared to classification rules are those that contain the same premise and consequent, such as Rules (3) and (7), in which the premise is C2 = g AND C3 = a and the consequent is C1 = x. Commonly, a classification rule contains many conditions, one for each node traversed in constructing the rule. As long as the entire association rule premise is present in a classification rule, the association rule can supersede the classification rule. When the classification rule drops the conditions that are not present in the association rule, it becomes a rule-part.
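A minimal sketch of the two relationships just described, supersession between rules and reducing a classification rule to a rule-part, follows. The rule representation and function names are illustrative assumptions rather than part of any published implementation.

```python
def supersedes(general, specific):
    """A more general rule supersedes a more specific one when it has the
    same consequent and its premise is a subset of the specific premise."""
    g_premise, g_consequent = general
    s_premise, s_consequent = specific
    return g_consequent == s_consequent and set(g_premise) <= set(s_premise)

def rule_part(classification_rule, association_rule):
    """Drop the classification-rule conditions that do not appear in the
    association rule, keeping only the shared conditions (the rule-part)."""
    c_premise, c_consequent = classification_rule
    a_premise, _ = association_rule
    return [cond for cond in c_premise if cond in a_premise], c_consequent

# Rules (11) and (12) from the text, written as (premise, consequent) pairs.
rule_11 = ([("A1", "r"), ("A2", "s")], ("A3", "t"))
rule_12 = ([("A1", "r")], ("A3", "t"))
print(supersedes(rule_12, rule_11))   # True: Rule (12) supersedes Rule (11)

# Rules (3) and (7) share premise and consequent, so the rule-part of the
# classification rule is the association rule itself.
assoc_rule_3 = ([("C2", "g"), ("C3", "a")], ("C1", "x"))
class_rule_7 = ([("C3", "a"), ("C2", "g")], ("C1", "x"))
print(rule_part(class_rule_7, assoc_rule_3))
```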
However, there may be many identical rule-parts. The process of finding the rule-parts that match an association rule involves the following steps:

(1) Find association rules with the classification dependent variable as the consequent.
(2) Find those classification rules that contain the same conditions as the association rule.
(3) Create rule-parts by deleting the classification rule conditions that are not conditions in the association rule.

6. Rule reversal

Classification mining begins with a goal class or dependent variable toward which the construction of the tree is oriented. As a result, all classification rule consequents contain the same attribute, although this attribute may have different values. In contrast, association mining creates candidate rules by considering all the possible combinations of attribute-value pairs in the data as premises and consequents. Rules are selected from the candidate rules by determining the number of instances that satisfy the candidate rule and then comparing that number to a threshold value. As a result, association and classification rules can only be compared when the premise and consequent of the rules match.

An association rule may have the classification rule's consequent as one of its conditions. In this case, the association rule needs to be reversed before it can be compared to its corresponding classification rule. To reverse a rule, apply the following Boolean logic:
IF (X => Y) THEN (NOT Y => NOT X)

For example, reverse Rule (4) (IF C1 = x THEN (C2 = g AND C3 = a)):
IF NOT (C2 = g AND C3 = a) THEN NOT C1 = x    (13)
A simple application of De Morgan's law, followed by decomposition, leads to
IF (NOT C2 = g) OR (NOT C3 = a) THEN NOT C1 = x    (14)
IF (NOT C2 = g) THEN NOT C1 = x    (15)

and

IF (NOT C3 = a) THEN NOT C1 = x    (16)
Rules (15) and (16) combined state "if any values other than C2 = g and C3 = a, then anything except C1 = x". Or,
IF C1 = x OR C1 = y OR C1 = z OR C2 = h OR C3 = b OR C3 = c THEN C1 = y OR C1 = z OR C2 = g OR C2 = h OR C3 = a OR C3 = b OR C3 = c    (17)
which can be decomposed into 42 rules. However, the only rules that will match the classification tree rules are the ones that conclude with C1 = y or C1 = z, of which there are 12. Of those 12 rules, the six with C1 as a condition will not appear in the classification rule set, leaving six rules that may match part of the classification tree:
IF C2 = h THEN C1 = y    (18)
IF C3 = b THEN C1 = y    (19)
IF C3 = c THEN C1 = y    (20)
IF C2 = h THEN C1 = z    (21)
IF C3 = b THEN C1 = z    (22)
IF C3 = c THEN C1 = z    (23)
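This enumeration can be mechanized. The sketch below is illustrative Python: the attribute domains are assumptions made only for this running example. It expands the reversed rule into single-condition candidates whose consequent is a value of C1 other than x, reproducing Rules (18)-(23), and then checks each candidate against the classification rules of Fig. 1.

```python
# Attribute domains assumed for the running example (illustrative only).
domains = {"C1": ["x", "y", "z"], "C2": ["g", "h"], "C3": ["a", "b", "c"]}

# Reversed Rule (4): IF NOT (C2 = g AND C3 = a) THEN NOT C1 = x.
excluded_premise = {("C2", "g"), ("C3", "a")}
excluded_consequent = ("C1", "x")

# Single-condition candidates on C2 or C3 implying a non-excluded value of C1.
candidates = [
    ((attr, val), ("C1", cls))
    for attr in ("C2", "C3")
    for val in domains[attr] if (attr, val) not in excluded_premise
    for cls in domains["C1"] if ("C1", cls) != excluded_consequent
]                                      # six candidates: Rules (18)-(23)

# Classification rules (7)-(10) from Fig. 1 as (conditions, consequent) pairs.
tree_rules = [
    ([("C3", "a"), ("C2", "g")], ("C1", "x")),
    ([("C3", "a"), ("C2", "h")], ("C1", "y")),
    ([("C3", "b")], ("C1", "z")),
    ([("C3", "c")], ("C1", "y")),
]

for condition, consequent in candidates:
    matched = any(condition in conds and consequent == cls
                  for conds, cls in tree_rules)
    print(f"IF {condition} THEN {consequent}: match={matched}")
```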
Rule (18) matches part of Rule (8), Rule (20) matches Rule (10), and Rule (22) matches Rule (9). The other three rules are no longer of interest. In this example, then, there are four persistent-rules: Rule (3) is directly a persistent-rule, while Rules (18), (20), and (22) are persistent-rules based on rule reversal.

7. The ANES data and the data mining application

The 1948-2004 ANES cumulative data file (ANES, 2005) is a single file composed of the merged cases and attributes from each of the ANES studies conducted since 1948 (47,438 records). The file includes most, but not all, of the questions that have been asked in three or more ANES surveys conducted during the multi-decade time period. It is composed, therefore, of more than 900 attributes, which, for comparability, have been coded in a consistent manner from year to year. Because the data set is prepared for analysis, all the attribute values are coded numerically with predefined meanings. This study uses ANES data that had been previously selected and cleaned for data mining (Murray et al., 2007; Scime & Murray, 2007); see the Appendix for attribute definitions.

The ANES attributes are of two types: discrete and continuous. Discrete-value attributes contain a single defined value, such as party identification, which is indicated as Democrat, Republican, or other. Continuous-value attributes take on an infinite number of values, such as the 0-100-scale "feeling thermometers", which measure affect toward a specified target, and subtractive scales, which indicate the number of "likes" minus the number of "dislikes" mentioned about a target. In the previous studies the continuous-value attributes were left as continuous attributes.

As a result of the previous data mining methodology studies, the data sets had been cleaned and prepared for classification mining. To ensure discrete attributes were not misinterpreted as numeric values, an "a" or "A" was prepended to each value. Because association mining only uses discrete attributes, the continuous attributes were discretized. In this study, the WEKA (Waikato Environment for Knowledge Analysis) (Witten & Frank, 2005) software implementations of the a priori association mining algorithm and the C4.5 classification mining algorithm were used. Shannon's entropy method was used to discretize the continuous attributes.

8. Demonstrating persistent-rules: predicting vote choice

The persistent-rule discovery process was first applied to the data set used in the presidential vote choice studies (Scime & Murray, 2007; Murray & Scime, in press). This data set consists of 14 attributes and 6677 instances from the ANES. The a priori association algorithm was run on the data set, which generated 29 rules with a minimum 0.80 confidence level and 0.15 support level. All 29 rules concluded with the race attribute having the value "white". This suggested that the number of white voters in the data set was sufficiently large to skew the results. Further examination of the data set revealed that 83.5% of the voters were white. The domain expert concluded that race is not a useful indicator for association. The data were recleaned to remove the race attribute, and the a priori algorithm was rerun. This resulted in 33 rules with confidence levels between 0.60 and 0.85 and a support level of 0.15. Though the confidence levels had decreased, the rule consequents were varied and reasonable.

Next, the C4.5 classification algorithm using three folds was applied to the same data set to which the a priori association algorithm had been applied (i.e., the data set that excluded the race attribute).
Following Scime and Murray (2007), the dependent variable was the political party for which the voter reported voting (depvarvotewho). The classification tree had more complex rules than those
obtained from association mining. For example, one branch of the tree was

apid = a2
  affrepcand = '(-0.5 to 0.5]'
    demtherm = '(-inf to 42.5]'
      aeduc = a1: NotVote

(The numbers are the ranges of values found in the discretization process. A value next to a parenthesis is not included in the range, while a value next to a square bracket is included; '-inf' and 'inf' represent negative and positive infinity, respectively.)

This branch of the tree translates into the rule:

IF Party identification (apid) = weak or leaning Democratic (a2)
AND Affect towards Republican candidate (affrepcand) = no affect, '(-0.5 to 0.5]'
AND Democratic thermometer (demtherm) = not favorable, '(-inf to 42.5]'
AND Education of respondent (aeduc) = 8 grades or less (a1)
THEN Dependent variable, party voted for (depvarvotewho) = NotVote

Recall that persistent-rules must have identical rule consequents generated independently by both data mining methodologies. Because vote choice (depvarvotewho) was the subject of the classification mining, only rules with that consequent among the association mining results are candidates for identification as persistent-rules. Ten of the 33 association rules met this requirement; two of these are superseded by another, leaving eight possibly persistent-rules. For example, one of the eight association rules states:

IF Affect towards Republican candidate (affrepcand) = extreme like, '(2.5 to inf)'
THEN Dependent variable, party voted for (depvarvotewho) = Republican

A review of the tree rules reveals that there are six classification rules whose premises and consequents match the premise and consequent of this association rule. The other rules are not considered further, because to be classified along a branch an instance must satisfy all the conditions (attribute-value pairs) of the branch. By supersession, the instances that satisfy the branch would also satisfy the association rule being evaluated. The six classification rules that incorporate the association rule have the rule-part:

IF affrepcand = '(2.5 to inf)' THEN REP

The persistent-rule discovery process was repeated on all eight association rules. The persistent-rules are

IF the affect towards the Republican Party is mostly positive THEN the respondent votes Republican    (24)
IF feelings about the Republican presidential candidate are positive THEN the respondent votes Republican    (25)
IF the affect towards the Democratic candidate is negative THEN the respondent votes Republican    (26)
IF the affect towards the Republican candidate is positive THEN the respondent votes Republican    (27)
IF the feeling about the Democratic presidential candidate is positive THEN the respondent votes for the Democratic candidate    (28)
IF the feeling about the Democratic presidential candidate is negative THEN the respondent votes Republican    (29)
IF the respondent identifies him or herself as a strong Democrat THEN the respondent votes for the Democratic candidate    (30)
IF the affect towards the Democratic Party is positive THEN the respondent votes Democratic    (31)

Four association rules were also generated that had the class attribute, depvarvotewho, in the condition portion of the rule. Given rule reversal, these rules were candidates for further analysis:

IF reptherm = '(79.5 to inf)' AND depvarvotewho = REP THEN awhoelect = a2    (32)
IF affdemcand = '(-inf to -1.5]' AND depvarvotewho = REP THEN awhoelect = a2    (33)
IF aintelect = a2 AND depvarvotewho = REP THEN awhoelect = a2    (34)
IF depvarvotewho = REP THEN awhoelect = a2    (35)

Rule (35) supersedes Rules (32)-(34) because the consequent of all the rules is awhoelect = a2 and the conditions of all the rules contain depvarvotewho = REP. Applying rule reversal to Rule (35):

IF NOT awhoelect = a2 THEN NOT depvarvotewho = REP    (36)

Of the rules that can be derived from Rule (36), the only rules of interest are those that have the classification dependent variable (depvarvotewho) as the consequent and have as a condition the consequent attribute (awhoelect) of the original rule (Rule (35)). That is

IF awhoelect = a1 OR awhoelect = a7 THEN depvarvotewho = DEM OR depvarvotewho = Not Vote OR depvarvotewho = MAJ (.73)    (37)

Rule (37) corresponds to eight classification rules with the following rule-parts:

IF awhoelect = a1 THEN DEM
IF awhoelect = a1 THEN Not Vote
IF awhoelect = a7 THEN MAJ
IF awhoelect = a1 THEN REP

Therefore, Rule (37) is also a persistent-rule. This rule states, "If the respondent thought either a Democrat or another candidate (not Republican) would most likely win the election, then he/she either voted for a Democrat or did not vote at all". In this example, then, there are nine persistent-rules. Rules (24)-(31) are directly persistent-rules, while Rule (37) is a persistent-rule based on rule reversal.

9. Demonstrating persistent-rules: identifying likely voters

The persistent-rule discovery process was next applied to the data set used in the likely voter study (Murray et al., 2007). This data set consists of three attributes and 3899 instances from the ANES. A threefold C4.5 classification algorithm generated a tree with three rules. Association a priori analysis resulted in three
rules. The resulting rules were compared and evaluated following the process detailed for the vote choice rules. The focus of this analysis is voter turnout – whether the respondent is expected to vote or not. As such, the only association rules that could be compared to the classification rules were those that concluded with voter turnout. Interestingly, none of the rules satisfied this requirement. However, two of the three association rules included voter turnout in the premise.

The classification tree follows:

A_Intent = A1
  A_Prevvote = A0: A0
  A_Prevvote = A1: A1
A_Intent = A0: A0

The association rules include
IF A_Voteval_V2 = A1 AND A_Prevvote = A1 THEN A_Intent = A1    (38)
IF A_Voteval_V2 = A1 THEN A_Intent = A1    (39)
IF A_Prevvote = A1 THEN A_Intent = A1    (40)
Rule (40) is eliminated because it does not include voter turnout. Rule (39) supersedes Rule (38). Rule (39) (if the respondent voted, then he/she intended to vote) is reversed and becomes: if the respondent did not intend to vote, then he/she did not vote:
IF NOT (A_Intent = A1) THEN NOT (A_Voteval_V2 = A1)    (41)
Both attributes' values in Rule (41) are binary: A_Intent is either A1 or A0 (the respondent either intended to vote or not), and A_Voteval_V2 is either A1 or A0 (the respondent either voted or did not vote). Decomposition of Rule (41) leads to only one rule that concludes with the classification dependent variable (A_Voteval_V2) and matches a classification rule. The rule is
IF A_Intent = A0 THEN A_Voteval_V2 = A0    (42)
Hence, if a person does not intend to vote, then it is very likely he/she will not vote. This is the only persistent-rule from this analysis.
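Because both attributes here are binary, the reversal is mechanical: negating a value simply selects the complementary value. The sketch below (illustrative Python; the representation and function names are assumptions, not from the study) derives Rule (42) from Rule (39) in this way.

```python
domains = {"A_Intent": {"A0", "A1"}, "A_Voteval_V2": {"A0", "A1"}}

def negate(attribute, value):
    """For a binary attribute, NOT value is the other value in its domain."""
    (other,) = domains[attribute] - {value}
    return other

def reverse(rule):
    """Contrapositive of a single-condition rule (premise, consequent)."""
    (p_attr, p_val), (c_attr, c_val) = rule
    return (c_attr, negate(c_attr, c_val)), (p_attr, negate(p_attr, p_val))

# Rule (39): IF A_Voteval_V2 = A1 THEN A_Intent = A1
rule_39 = (("A_Voteval_V2", "A1"), ("A_Intent", "A1"))
premise, consequent = reverse(rule_39)
print(f"IF {premise[0]} = {premise[1]} THEN {consequent[0]} = {consequent[1]}")
# -> IF A_Intent = A0 THEN A_Voteval_V2 = A0, which is Rule (42)
```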
10. Conclusion

Data mining typically results in a set of rules that can be applied to future events or that can provide knowledge about interrelationships among data. This set of rules is most useful when it can be dependably applied to new data. Dependability is the strength of the rule. Generally, a rule's strength is measured by its confidence level. Strong association-mined rules are those that meet the minimum confidence level set by the domain expert (Han & Kamber, 2001). The higher the confidence level, the stronger the rule and the more likely the rule will be successfully applied to new data. Classification mining generates a decision tree, and resulting rules, that has been pruned to a minimal set of rules. Each rule also has a confidence rating suggesting its ability to correctly classify future data.

This research demonstrates a process to identify especially powerful rules. These powerful rules, which are deemed "persistent-rules", are those that are common to different algorithms. Persistent-rules are discovered by the independent application of association and classification mining to the same data set. These rules have been identified as strong by the association mining algorithm and have met the minimum confidence level established for the classification algorithm. While persistent-rules may have a lower confidence than similar association rules and may not classify all future instances of data, they improve decision making by narrowing the focus to rules that are the most robust, consistent, and noteworthy.

In this case, the persistent-rule discovery process is demonstrated in the area of voting behavior. In the vote choice data set, mining and analysis resulted in nine persistent-rules out of the 33 total rules that were generated through association mining. In the likely voter data set, the process resulted in one persistent-rule out of the two rules that were generated through association mining. The persistent-rule discovery process suggests these 10 rules are the most robust, consistent, and noteworthy of the much larger potential rule sets.

Appendix. ANES survey items in the vote choice data set

Discrete-valued questions (attribute names)
What is the highest degree that you have earned? (aeduc)
1 8 grades or less.
2 9-12 grades, no diploma/equivalency.
3 12 grades, diploma or equivalency.
4 12 grades, diploma or equivalency plus non-academic training.
5 Some college, no degree; junior/community college level degree (AA degree).
6 BA level degrees.
7 Advanced degrees including LLB.

Some people do not pay much attention to political campaigns. How about you, would you say that you have been/were very much interested, somewhat interested, or not much interested in the political campaigns this year? (aintelect)
1 Not much interested.
2 Somewhat interested.
3 Very much interested.

Some people seem to follow what is going on in government and public affairs most of the time, whether there is an election going on or not. Others are not that interested. Would you say you follow what is going on in government and public affairs most of the time, some of the time, only now and then, or hardly at all? (aintpubaff)
1 Hardly at all.
2 Only now and then.
3 Some of the time.
4 Most of the time.

How do you identify yourself in terms of political parties? (apid)
-3 Strong Republican
-2 Weak or leaning Republican
0 Independent
2 Weak or leaning Democrat
3 Strong Democrat
In addition to being American, what do you consider your main ethnic group or nationality group? (arace)
1 White
2 Black
3 Asian
4 Native American
5 Hispanic
7 Other

Who do you think will be elected President in November? (awhoelect)
1 Democratic candidate
2 Republican candidate
7 Other candidate

Continuous-valued questions

Feeling thermometer questions. A measure of feelings. Ratings between 50 and 100 degrees mean a favorable and warm feeling; ratings between 0 and 50 degrees mean the respondent does not feel favorably. The 50 degree mark is used if the respondent does not feel particularly warm or cold:

Feeling about Democratic Presidential Candidate. (demtherm)
Discretization ranges: (-inf to 42.5], (42.5 to 54.5], (54.5 to 62.5], (62.5 to 77.5], (77.5 to inf)

Feeling about Republican Presidential Candidate. (reptherm)
Discretization ranges: (-inf to 42.5], (42.5 to 53.5], (53.5 to 62.5], (62.5 to 79.5], (79.5 to inf)

Feeling about Republican Vice Presidential Candidate. (repvptherm)
Discretization ranges: (-inf to 32.5], (32.5 to 50.5], (50.5 to 81.5], (81.5 to inf)

Affect questions. The number of 'likes' mentioned by the respondent minus the number of 'dislikes' mentioned:

Affect toward the Democratic Party. (affdem)
Discretization ranges: (-inf to -1.5], (-1.5 to -0.5], (-0.5 to 0.5], (0.5 to 1.5], (1.5 to inf)

Affect toward Democratic presidential candidate. (affdemcand)
Discretization ranges: (-inf to -1.5], (-1.5 to -0.5], (-0.5 to 0.5], (0.5 to 2.5], (2.5 to inf)

Affect toward Republican Party. (affrep)
Discretization ranges: (-inf to -2.5], (-2.5 to -0.5], (-0.5 to 0.5], (0.5 to 2.5], (2.5 to inf)

Affect toward Republican presidential candidate. (affrepcand)
Discretization ranges: (-inf to -2.5], (-2.5 to -0.5], (-0.5 to 0.5], (0.5 to 2.5], (2.5 to inf)

ANES survey items in the likely voter data set

Was respondent's vote validated? (A_Voteval_V2)
0 No record of respondent voting.
1 Yes.

"On the coming Presidential election, do you plan to vote?" (A_Intent)
0 No
1 Yes

"Do you remember for sure whether or not you voted in that [previous] election?" (A_Prevvote)
0 Respondent did not vote in previous election or has never voted
1 Voted: Democratic/Republican/Other
References

American National Election Studies (ANES) (2005). Center for Political Studies. Ann Arbor, MI: University of Michigan.

Bagui, S. (2006). An approach to mining crime patterns. International Journal of Data Warehousing and Mining, 2(1), 50-80.

Deshpande, M., & Karypis, G. (2002). Using conjunction of attribute values for classification. In Proceedings of the eleventh international conference on information and knowledge management, McLean, VA (pp. 356-364).

Edsall, T. B. (2006). Democrats' data mining stirs an intraparty battle. The Washington Post, March 8: A1.

Fu, X., & Wang, L. (2005). Data dimensionality reduction with application to improving classification performance and explaining concepts of data sets. International Journal of Business Intelligence and Data Mining, 1(1), 65-87.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Boston, MA: Morgan Kaufmann.

Jaroszewicz, S., & Simovici, D. A. (2004). Interestingness of frequent itemsets using Bayesian networks as background knowledge. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, WA (pp. 178-186).

Murray, G. R., & Scime, A. (in press). Micro-targeting and electorate segmentation: Data mining the American National Election Studies. Journal of Political Marketing.

Murray, G. R., Riley, C., & Scime, A. (2007). A new age solution for an age-old problem: Mining data for likely voters. Presented at the 62nd annual conference of the American Association for Public Opinion Research, May 17-20, 2007, Anaheim, CA.

Padmanabhan, B., & Tuzhilin, A. (2000). Small is beautiful: Discovering the minimal set of unexpected patterns. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, Boston, MA (pp. 54-63).

Scime, A., & Murray, G. R. (2007). Vote prediction by iterative domain knowledge and attribute elimination. International Journal of Business Intelligence and Data Mining, 2(2), 160-176.

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco, CA: Morgan Kaufmann.