Intelligent Data Analysis 2 (1998) 245-255 www.elsevier.com/locate/ida

A Review of the Fourteenth International Conference on Machine Learning

Lewis Frey *, Cen Li ¹, Douglas Talbert ², Doug Fisher ³

Department of Computer Science, Vanderbilt University, Nashville, TN 37235, USA

Abstract

We briefly review each paper of the Fourteenth International Conference on Machine Learning, along with some general observations on the conference as a whole. The major topics of the papers include data reduction, feature selection, ensembles of classifiers, natural language learning, text categorization, inductive logic programming, stochastic models, and reinforcement learning. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Inductive concept learning; Ensembles; Natural language learning; Inductive logic programming; Stochastic models; Reinforcement learning

1. Introduction

The Fourteenth International Conference on Machine Learning (ICML-97) was held at Vanderbilt University, Nashville, Tennessee, July 8-12, 1997. It was co-located with the Tenth Annual Conference on Computational Learning Theory. This review provides a very brief description of each paper that was presented at the conference, along with some general observations on the conference as a whole. We organize the ICML-97 conference papers under six headings: general issues of inductive concept learning (e.g., data reduction), ensembles of classifiers, natural language learning and text categorization, inductive logic programming, stochastic models, and reinforcement learning. The headings are employed as a convenient framework for reviewing the ICML-97 papers, but they are not mutually exclusive. A number of papers address selected issues of inductive concept learning including data reduction, feature selection, constructive induction, noise tolerance and overfitting, improving minority class prediction, and extending the space of efficiently learnable tasks. Many of these papers are motivated by recent interest in data mining and a resurgent interest in language learning and text categorization.

* Corresponding author. URL: http://cswww.vuse.vanderbilt.edu/~freyl/.
¹ URL: http://cswww.vuse.vanderbilt.edu/~cenli/.
² URL: http://cswww.vuse.vanderbilt.edu/~dat/.
³ URL: http://cswww.vuse.vanderbilt.edu/~dfisher/.

1088-467X/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved. PII: S1088-467X(98)00027-4


Ensembles of classifiers combine the predictions of multiple hypotheses to classify objects. Papers concerned with this theme describe methods of learning (and pruning) ensembles, schemes for representing ensembles, and explanations for the effectiveness of ensembles. Natural language learning (NLL) and text categorization are explored in a number of papers. NLL automates the learning of natural language using textual corpora. ICML-97 papers deal with classifying spelling errors, feature selection in text, and integrating methods from information retrieval. Inductive logic programming (ILP) utilizes preclassified training descriptions and background knowledge to learn predicate definitions. Papers explore ILP for software fault diagnosis and learning through exercises. Stochastic models represent probabilistic dependencies between variables. This section includes papers that investigate probabilistic measures, and representations such as belief networks and dependency trees. Reinforcement learning (RL) determines good (ideally, optimal) policies, based on reward functions, for sequential decision and control problems. Papers in this section address new methods for learning reward functions, partially observable environments, sequential subgoals, PAC learners, and applications of RL methods.

2. General Issues of Inductive Concept Learning

Inductive concept learning, typically supervised, is the most broadly studied area of machine learning. Many of the papers presented at ICML-97 focus on different aspects of this general problem, such as data reduction, feature selection, noise tolerance, efficiency, and expanding learnable domains.

2.1. Data Reduction, Feature Selection, and Constructive Induction

Most work in concept learning assumes that the data are represented as a two-dimensional table of values, where columns correspond to variables over which each datum is described, and rows correspond to objects (examples, instances). Removing columns (feature selection) or removing rows (data reduction) prior to and/or during learning can reduce the cost of learning and increase the effectiveness of learning in terms of the error rate or complexity of the resultant classifier. Data points and variables can also be added by prototype construction and constructive induction, respectively, in an effort to improve learning effectiveness. Oates and Jensen empirically show that there is often a linear relationship between the growth of the training set and the growth of a learned decision tree, even when no significant improvement in accuracy accompanies the larger structure. Furthermore, randomly removing data results in a (linear) decrease in decision tree complexity. They argue that all reduction techniques should use such a random-removal "straw man" to estimate the true effectiveness of the technique. Wilson and Martinez experiment with three algorithms for reducing the space requirements of nearest-neighbor learning algorithms by discarding training instances based on comparisons with neighboring instances. Rather than removing instances, Datta and Kibler describe a data reduction method for an instance-based learner that learns multiple prototypes for each class. Kubat and Matwin discuss how data reduction can mitigate problems that some learners face when confronted by data in which one class is extremely underrepresented relative to the remaining class(es).
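Wilson and Martinez's specific retention criteria are not reproduced here, but the flavor of neighbor-based instance pruning can be sketched with a classic baseline, Hart's condensed nearest neighbor rule, which keeps only the instances that the 1-NN rule built so far misclassifies (the function names and toy data below are illustrative, not from the reviewed papers):

```python
import math

def nn_label(store, x):
    """Label of the nearest stored instance (Euclidean distance)."""
    best = min(store, key=lambda s: math.dist(s[0], x))
    return best[1]

def condense(training):
    """Keep only instances that the 1-NN rule built so far misclassifies.

    A classic instance-reduction heuristic (Hart's condensed nearest
    neighbor); the ICML-97 algorithms of Wilson and Martinez use
    different, more refined retention criteria.
    """
    store = [training[0]]
    changed = True
    while changed:
        changed = False
        for x, y in training[1:]:
            if nn_label(store, x) != y and (x, y) not in store:
                store.append((x, y))
                changed = True
    return store

data = [((0.0, 0.0), 'a'), ((0.1, 0.0), 'a'), ((1.0, 1.0), 'b'), ((0.9, 1.0), 'b')]
reduced = condense(data)  # a much smaller store that still classifies the data
```

The reduced store trades a small risk of boundary error for a large saving in memory and classification time, which is the motivation shared by all three of Wilson and Martinez's algorithms.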


So-called imbalanced datasets can occur when one is learning how to identify occurrences of rare medical ailments and in natural language applications. Koller and Sahami exploit feature (keyword) selection to categorize documents in a "topic hierarchy." Their technique is context-dependent in that different features are used to categorize documents within different subtrees of the hierarchy. Devaney and Ram use automated feature selection, traditionally used for supervised learning, in the unsupervised task of clustering. Defining the performance improvement due to feature selection is challenging in the unsupervised context. Like Kubat and Matwin, Cardie and Howe examine ways to improve minority class prediction, but their approach uses feature weighting, which generalizes feature selection. Their approach assigns weights in an instance-specific fashion and mitigates the imbalanced dataset problem in three natural language processing datasets (see Section 4). Gama modifies a top-down decision tree learner by embedding a constructive induction algorithm that uses linear discriminant functions to form additional features at each decision point in the tree. Zupan, Bohanec, Bratko, and Demšar present a method for automatically decomposing complex problems into less complex subproblems by using the training data to construct a problem/concept hierarchy. This function decomposition work is quite unique within machine learning, but is closest to constructive induction, where we might view the intermediate concepts introduced by the system as discovered "hidden variables." Baxter does not address feature selection or constructive induction per se, but looks at the very general problem of learning "domain-appropriate" distance functions between feature vectors. He tests his technique in speech and character recognition domains.

2.2. Noise and Overfitting

Noise tolerance and overfitting/underfitting avoidance have long been studied in machine learning and are sometimes referred to as the model selection problem. Several papers continue in this tradition. Schuurmans, Ungar, and Foster examine several model selection techniques for least squares regression problems. They present some new insights into these methods that allow them to characterize model selection problems as "hard" or "easy." They also present a new selection technique that can outperform many standard strategies on "hard" problems by exploiting prior statistical knowledge. Mansour addresses overfitting in decision trees with a pessimistic pruning technique. This method takes into account the size of the subtree at a given point to determine and exploit an upper bound on the true error at a given node. Cross validation (CV) is often used to select the best hypothesis. Ng explains how the hypothesis with the lowest CV error rate may not be the hypothesis with the lowest generalization error rate, particularly when one is selecting among many hypotheses with relatively little training data - the selected model may have the best CV error rate "by chance." Decatur presents a technique for learning in the probably-approximately-correct (PAC) model when noise is present. Typical PAC learners address either noise-free problems or oversimplified models of noise. Decatur describes a new, more realistic model of noise, in which examples can be misclassified according to varying noise rates. Opitz's paper deals with redundancy in neural networks. He argues that a standard technique for preventing overfitting in neural networks, the use of penalty terms, does not properly handle redundancy.


This can lead to inaccurate estimates of effective network size and can hamper efforts to prevent overfitting, such as feature selection. Auer describes an algorithm for learning axis-parallel, high-dimensional boxes from multi-instance examples, which represent each example by multiple views. The work is related to previous work on learning to classify certain synthetic molecules [9]. The paper advocates the derivation of practical systems from theoretical considerations.

2.3. Regression

Regression is the problem of learning to predict the value of a continuously-valued attribute. The "classifier" that results may be in the form of a prediction equation, "decision" or regression tree, or some other structure. Regression trees have become popular non-linear prediction structures. Early work on regression trees (e.g., Breiman et al.'s CART system [1]) would label leaves with the mean of the dependent-variable values of the objects stored at that leaf (just as typical decision trees store the mode value). Torgo shows that regression-tree prediction accuracy can be improved by using techniques other than averaging the values found at the leaves, such as nearest-neighbor techniques, linear regression, and kernel regression. The Relief [12] and ReliefF [13] algorithms evaluate and weight features in advance of supervised learning. The Relief-family evaluation algorithms rely on class membership information to evaluate and rank feature informativeness. Robnik-Šikonja and Kononenko have developed a probabilistic modification that enables Relief to be used when the predicted value is continuous. Todorovski and Dzeroski present a method to assist scientific discovery systems in scaling up to large hypothesis spaces. Their grammar-based declarative bias uses domain-specific knowledge and common mathematical operators to reduce the search space of prediction equations.
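As a concrete illustration of the Relief-family weighting scheme, here is a minimal sketch of the original Relief update for numeric features, assuming features are scaled to [0, 1] (the data and function names are illustrative; ReliefF and the regression variant of Robnik-Šikonja and Kononenko refine this basic nearest-hit/nearest-miss update):

```python
import math

def relief_weights(X, y):
    """One-pass Relief feature scoring (Kira and Rendell [12]).

    For each instance, find its nearest neighbor of the same class (hit)
    and of a different class (miss); a feature earns weight when it
    differs on the miss more than on the hit.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for i, x in enumerate(X):
        others = [j for j in range(n) if j != i]
        hit = min((j for j in others if y[j] == y[i]),
                  key=lambda j: math.dist(X[j], x))
        miss = min((j for j in others if y[j] != y[i]),
                   key=lambda j: math.dist(X[j], x))
        for f in range(d):
            w[f] += (abs(x[f] - X[miss][f]) - abs(x[f] - X[hit][f])) / n
    return w

# Feature 0 separates the classes; feature 1 is noise.
X = [(0.0, 0.3), (0.1, 0.9), (1.0, 0.2), (0.9, 0.8)]
y = ['a', 'a', 'b', 'b']
w = relief_weights(X, y)  # w[0] positive, w[1] negative
```

Because the update depends on class membership, extending Relief to continuous prediction targets, as Robnik-Šikonja and Kononenko do, requires replacing the hit/miss distinction with a probabilistic one.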
Moore, Schneider, and Deng improve the efficiency of instance-based learners that use locally weighted polynomial regression (LWPR) to learn continuous non-linear mappings. The authors present efficient and exact approximation techniques, based on a multiresolution search of an augmented kd-tree, that reduce the cost of computing solutions.

2.4. Induction over Structured Data

Machine learning research typically assumes that each datum is a feature vector or set of attribute-value pairs. This is particularly true in neural network research. Botta, Giordana, and Piola present an extension of Radial Basis Function Networks that is capable of representing and refining a knowledge base in a restricted first-order logic representation (Horn clauses). This increases the algorithm's ability to handle a broader range of data. We will return to the problem of structured induction when we survey inductive logic programming papers in Section 5.

3. Classifier Ensembles

A classifier ensemble exploits a number of "base" classifiers. There are at least three ways of defining ensemble learners, which are not necessarily mutually exclusive.


1. The base classifiers may be built using learners with different biases (e.g., neural network and decision tree induction). An observation is classified via one or more of the base classifiers by looking to "preferences" that might be hand coded or learned themselves.

2. The base classifiers may be binary classifiers that are combined to implement a multiclass learner (i.e., where the number of class labels is greater than 2). For example, Dietterich and Bakiri [7] map each class label onto a bit string prior to learning. Bit strings for class labels are designed to be well separated, thus serving as error-correcting output codes (ECOC). An off-the-shelf system for learning binary classifications (e.g., 0 or 1) can be used to build multiple classifiers, one for each bit in the output code. An instance is classified by predicting each bit of its output code (i.e., label), and classifying the instance as the label with the "closest" matching output code.

3. The base classifiers may be learned from different subsets of the training data. For example, two well known ensemble methods of this type are bagging [2] and boosting [10,19]. Bagging builds each base classifier from data that are drawn from the training data with replacement. On each draw, each training instance has an equal probability of being drawn. The construction of the base classifiers is independent of one another. In contrast, boosting builds its base classifiers sequentially. On each trial a boosting technique draws examples following a probability distribution that ensures that instances misclassified by a classifier constructed on a previous trial are more likely to be drawn when constructing the classifier on the current trial. When classifying a new data object using bagging or boosting, the classification decision of the combined model is made by taking a (weighted or unweighted) vote of the base classifiers.
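As a rough sketch of the third scheme, the bagging side of the bagging/boosting contrast might look as follows, using a simple threshold "stump" on one-dimensional data as the base learner (all names and data are illustrative; boosting would instead reweight the sampling distribution after each round and weight the votes):

```python
import random

def stump(sample):
    """Fit the best threshold classifier sign * (x > t) on 1-D data."""
    best = None
    for t in sorted({x for x, _ in sample}):
        for sign in (1, -1):
            err = sum(1 for x, y in sample if sign * (1 if x > t else -1) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x: sign * (1 if x > t else -1)

def bagging(data, rounds, rng):
    """Each base classifier sees a bootstrap sample drawn uniformly with
    replacement; the base classifiers are built independently and combined
    by an unweighted vote, as in Breiman's bagging [2]."""
    models = [stump([rng.choice(data) for _ in data]) for _ in range(rounds)]
    return lambda x: 1 if sum(m(x) for m in models) > 0 else -1

rng = random.Random(0)
data = [(0.1, -1), (0.2, -1), (0.4, -1), (0.6, 1), (0.8, 1), (0.9, 1)]
vote = bagging(data, rounds=11, rng=rng)
```

Because each bootstrap sample perturbs the training set, an unstable base learner yields varied stumps whose vote is more robust than any single fit; this is the standard explanation of why bagging helps.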
More generally, stacking [21] refers to the presence of a second-level classifier (e.g., a decision tree) that accepts the output of the base classifiers and produces a single output. In general, experimental investigations indicate that classifier ensembles improve classification accuracy over a variety of single-classifier systems, but ensembles of classifiers are more "complex." ICML-97 papers cover various aspects of the classifier ensemble approach.

3.1. Ensemble Methods

Schapire presents a new algorithm that adapts boosting to multiclass learning problems using ECOC, and proves theoretical upper bounds on the algorithm's training error and generalization error. Mayoraz and Moreira present a modified ECOC algorithm for transforming multiclass problems into binary-class problems. Previous ECOC work mapped class labels onto error-correcting output codes informed only by the number of classes. Mayoraz and Moreira describe a strategy that uses the training data to guide the definition of codes so as to simplify the resulting base classifiers. Asker and Maclin use an ensemble approach to identify volcanoes on Venus from Synthetic Aperture Radar images of the planet from the Magellan dataset. Each of 48 neural network classifiers was trained on data with different combinations of principal components and noise pixel replacement levels. The resulting ensemble is comparable to expert performance. The authors stress that careful feature selection in data representation is of particular importance. Drucker compares boosting-like and bagging-like approaches for building ensembles of regressors (see Section 2.3). The base regressors are CART-generated regression trees. In all cases, boosting is equivalent to or better than bagging in terms of prediction error on the tested datasets.


3.2. Ensemble Size

Margineantu and Dietterich experiment with various methods to reduce the number of base classifiers in an ensemble, thus reducing the memory requirements of the ensemble. In some cases ensemble accuracy is actually improved by the "ensemble pruning" process, suggesting that overfitting occurs in ensemble learning, but that it can be mitigated as well. Schapire, Freund, Bartlett, and Lee give a theoretical explanation for the recurring phenomenon observed with boosting that test error often does not increase (sometimes decreasing) as ensemble size becomes very large, even after training error reaches zero (see also Oates and Jensen in Section 2.1). They argue that this occurs because confidence in training set prediction is still increasing, thus increasing generalization (test-set) accuracy.

3.3. Ensemble Representation

Kohavi and Kunz augment regular decision trees with option nodes [3], which combine evidence for differing classifications from each subtree rooted by a child of the option node. The technique is similar to decision-tree classification of data with missing values [17]. An option node combines a forest of decision (sub)trees (rooted by children of the option node) within a single tree structure, which may be more easily comprehended by humans than an explicit forest of (regular) decision trees. Vilalta and Rendell's work integrates feature construction with multiple classifiers into a decision tree inducer. At each node of the decision tree, subsets of the examples covered by the node are obtained by bagging or boosting, and a logical combination of features is generated for each of the node's subsets. Experiments reveal that incorporating classifier ensemble techniques locally at each node compares favorably with ensembles obtained via (standard) boosting and bagging strategies. Domingos describes a method intended to replicate the behavior of an ensemble, but by using a single classifier.
In particular, once a base classifier ensemble is constructed, examples are labeled by the ensemble, and the labeled data serve as training data to build a second-level classifier. Subsequent classification is done by the resultant "second-level" classifier only, and the original ensemble is discarded, thus differentiating Domingos' method from stacking. Ting and Witten also build a second-level classifier, but adhere to the stacking method. They use sampling with replacement, as in bagging, as well as samples that are mutually exclusive (which they call dagging), from which to learn the base classifiers.

4. Natural Language Learning and Text Categorization

In one of the ICML-97 keynote addresses, Raymond Mooney advocated increased interaction between the machine learning and natural language learning (NLL) communities. NLL is concerned with automating the development of natural language processing systems by learning from natural language corpora. Mooney suggested that researchers from machine learning apply their methods to challenging problems of NLL. The conference also contained a number of papers concerned with text categorization, which assigns category labels to text documents. Papers in this section address machine learning as


applied to NLL and document retrieval. In addition, Cardie and Howe (in Section 2.1) developed methods of feature weighting that they applied to NLL tasks. Mangu and Brill describe a method that acquires rules for selecting the correct word from an often-confused pair (e.g., principle/principal, then/than, etc.). The learner constructs rules that choose the correct word for the context. Mangu and Brill's algorithm was competitive with alternative strategies, with the advantage that their algorithm generates relatively small rule sets. Ristad and Yianilos examine learning edit distance functions between pairs of strings, which they define as the (possibly weighted) number of insertions, deletions, and substitutions to transform one string into the other. A comparative study between the learned functions and the untrained Levenshtein distance function, which weights insertions, deletions, and substitutions equally, in word pronunciation domains reveals that identification is better informed by learned distance functions than by Levenshtein distance. Feature selection (see Section 2.1) can be expensive in text categorization, where the size of the feature space can be very large when unique words are features. Yang and Pedersen compare aggressive feature selection methods that remove large numbers of words from document descriptions for purposes of improving the precision of information/document retrieval. They found that infrequent words may not be as important as frequent words in categorizing documents. Menczer describes ARACHNID, a collection of information discovery agents that operate on the WWW. The user supplies a list of keywords to the agents and a list of starting points. Based on relevance ratings of the links out of the document, the agent picks the best link and follows it. The relevant words on a new page are used to compute the agent's reward. If the agent gets enough reward it produces offspring, which expands the search.
The algorithm works well in environments that exhibit a semantic topology. Joachims explores the use of Rocchio's relevance feedback algorithm [20] in categorizing text documents from twenty newsgroups and the Reuters database. Each document is represented as a weighted word vector, and document classes are represented by prototypical vectors. Test documents are placed in the category with the closest prototype. This strategy compares favorably with the Naive Bayesian classifier.
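The unit-cost Levenshtein distance that serves as Ristad and Yianilos's untrained baseline can be computed with the standard dynamic program; a minimal sketch:

```python
def levenshtein(a, b):
    """Unit-cost edit distance: the minimum number of insertions,
    deletions, and substitutions turning a into b. Ristad and Yianilos
    learn per-operation costs instead of fixing them all to 1."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

levenshtein("then", "than")  # one substitution
```

A learned variant replaces the three unit costs with weights estimated from aligned string pairs, which is what gives the trained distance its edge in the word pronunciation experiments.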

5. Inductive Logic Programming

Inductive Logic Programming (ILP) is a branch of machine learning that uses the representational language of predicate calculus. The aim of ILP is to learn predicate definitions given a set of preclassified training descriptions and background knowledge. Some of the better known ILP systems include Golem [15], Cigol [14], and FOIL [18]. Mooney mentioned in his talk that ILP is well suited for learning unbounded context representations such as lists, strings, or trees, which occur in the natural language domain. He suggests using ILP on problems such as semantic parsing and information extraction from text documents. There were two papers at the conference that explored ILP outside the context of NLL. Cohen and Devanbu examine the use of ILP in learning to predict the existence of faults in C++ classes. Variations on their system, FLIPPER [4], are able to prune literals and clauses, and the degree of non-determinism of a clause is used as an additional attribute for classification. Reddy and Tadepalli explore inductive teaching of search control knowledge that exploits increasingly difficult problems. Learned rules are represented as (roughly) function-free Horn programs. The problems are presented to the learner in a


bottom-up order as determined by the goal-subgoal hierarchy of the domain, and after each problem is solved, the acquired knowledge is either incorporated into an existing rule via least-general generalization or added as a new rule.

6. Stochastic Models

A stochastic model facilitates probabilistic inferences [5,16]. Typically, a stochastic model is represented using a probabilistic network structure, with nodes representing random variables and links between nodes representing dependencies between variables. A probability distribution is associated with each node (random variable). Learning stochastic models is a process of determining the (inter-connection) structure of the network and the probability distribution of variable values at each node (conditioned on the node's parents) from data. Papers in this section focus on belief networks and other probabilistic models. Prior work on constructing Bayesian networks from incomplete data had focused on determining the probability distributions for fixed networks. In contrast, Friedman's system learns belief network structure from incomplete data (i.e., in the presence of missing values or hidden variables) by extending the Expectation-Maximization procedure to search for both the best network structure and the conditional probability distributions. The contribution of Baluja and Davies is to exploit dependencies between parameters in combinatorial optimization. They use incrementally learned pairwise statistics to create model probability distributions, which are used to generate new data for evaluation. The performance of the combinatorial optimization algorithms consistently improves as the accuracy of their statistical models increases. Sakr, Levitan, Chiarulli, Horne, and Giles compare the performance of first- and second-order hidden Markov models, a linear model, and a time delay neural network (TDNN) in predicting processor-memory access patterns in a multiprocessor environment. Since the TDNN gives uniformly good results for all three problems, they hypothesize that the TDNN has the best chance of adapting to different memory access patterns stemming from other applications.

7. Reinforcement Learning

The goal of reinforcement learning (RL) is to find an optimal policy for choosing the best action to take from a given state. Detailed descriptions of RL are given in Kaelbling, Littman, and Moore [11]. This section covers papers that extend RL techniques, examine theoretical and empirical approaches, and apply RL methods in varying domains. Suematsu, Hayashi, and Li use history tree models, which are statistical models with variable memory length, as the model of the RL environment. The problem of finding the most probable history tree model is cast in a Bayesian framework. To mitigate problems of computational complexity, an incremental algorithm for finding the most probable model is given to approximate the solution. Precup and Sutton extend Exponential Gradient methods to RL. Exponential Gradient (EG) methods have been shown empirically to converge faster than the Least Mean Squares rule for on-line linear regression problems with many irrelevant features. The Exponential Gradient method adjusts the learner's


hypothesis (represented by a vector of weights) by a multiplicative update of the weights, as compared to the additive update used in the Least Mean Squares rule. Kimura, Miyazaki, and Kobayashi examine the task of a robot moving its body with a two-joint arm. One of the difficulties of this task, as in many real-world tasks, is hidden states. They use partially observable Markov decision processes to model this process, and stochastic gradient ascent to improve the policy of the learner. Explanation-Based Reinforcement Learning (EBRL) combines Explanation-Based Learning, which generalizes experiences using a domain theory, and RL, which learns optimal/good actions [6]. EBRL allows the results of RL to be better transferred between tasks. Tadepalli and Dietterich apply EBRL to achieve sequences of subgoals, possibly with weak interactions among the subgoals, in an optimal fashion. Fiechter measures the performance of an on-line RL algorithm by the difference between the total reward received by a learning agent and the total reward received by an optimal (non-learning) agent. The efficiency of on-line RL can be measured by the speed with which this difference levels off during learning. Based on this model of efficiency, Fiechter shows how an off-line PAC RL algorithm can be translated into an efficient on-line RL algorithm. Scheffer, Greiner, and Darken compare an algorithm that copies the behavior of a "teacher" with a learning algorithm that repeatedly experiments by trying its action policy and modifying it towards optimality. The advantage of the experimental learner is that it obtains experience in a larger proportion of the search space. For an aeration tank problem the behavior-mimicking algorithm learned the optimal action for 15% of the sample states, while the experimental learner performed the optimal action for 25% of the states. Atkeson and Schaal examine a task in which a robot must swing a pendulum to the top of an arc and balance it there.
The system uses two different planners, both based on learning a model and a reward function. It uses a human demonstration of the task to seed the performance of the two models. An important lesson is that simple mimicry of human demonstration is not adequate to perform the task. Mahadevan, Marchalleck, Das, and Gosavi explore continuous-time average-reward RL in which decisions are only allowed at discrete points. They integrate their system into two commercial discrete-event simulators, in hopes of being able to apply the approach to many other factory optimization problems. The authors demonstrate that their approach outperforms well known reliability heuristics from industrial engineering.
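None of the ICML-97 systems above is reproduced here, but the basic value-based RL loop they build on can be sketched as tabular Q-learning (see the survey [11]) on a toy corridor problem; all names, parameters, and the environment are illustrative:

```python
import random

def q_learning(states, actions, step, episodes, rng,
               alpha=0.5, gamma=0.9, epsilon=0.2):
    """Tabular Q-learning with epsilon-greedy exploration.

    `step(s, a)` returns (next_state, reward, done); each episode starts
    from states[0] and runs until `done`.
    """
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s, done = states[0], False
        while not done:
            a = (rng.choice(actions) if rng.random() < epsilon
                 else max(actions, key=lambda b: q[(s, b)]))
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
    return q

# A 4-state corridor: moving right from state 3 earns reward 1 and ends.
def step(s, a):
    s2 = min(3, s + 1) if a == 'right' else max(0, s - 1)
    return (s2, 1.0, True) if (s == 3 and a == 'right') else (s2, 0.0, False)

q = q_learning([0, 1, 2, 3], ['left', 'right'], step, episodes=200,
               rng=random.Random(0))
# The greedy policy read off q moves right in every state.
```

Fiechter's regret-style measure would compare the reward accumulated while this table is still converging against that of an agent that always moved right from the start.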

8. General Observations A wide variety of papers were presented at ICML 97. The distribution of topics over the 49 presented papers corresponds roughly to the distribution of 150 submitted papers. ICML-97 affiliated workshops and tutorials, some highlighted below, rounded out the conference. To some extent, the major topics represented at the conference correspond to current directions recently outlined by Dietterich [8]. Many of the papers summarized in Section 2 and Section 4 would fall under Dietterich’s methods for scaling up supervised learning algorithms. To a large extent, applications in data mining, natural language, and information retrieval are driving considerable research in machine learning as reflected in the ICML 97 proceedings. In addition, an ICML-97 affiliated workshop on ML Application in the Real World highlighted many application-oriented concerns, and a second workshop


on Automata Induction, Grammatical Inference, and Language Acquisition detailed machine learning's contributions to these areas and NLL. Ensembles are also an important area of concentration at ICML 97, and one highlighted by Dietterich. Ensembles are a promising approach for improving classifier accuracy, but important questions remain. What base classifiers are best, conditioned on ensemble construction (e.g., sampling) method and domain? We might expect, for example, that unstable classifiers such as decision trees (i.e., where small perturbations in the training set may lead to very different classifier forms) would make ideal base classifiers, because their instability ensures that base classifiers of varied forms will support ensemble classification. Ensembles may also lead to improved accuracy, but be less comprehensible. A general observation about ICML 97 is that error rate was not the primary measure of success in many papers; rather, considerable work was more concerned with decreasing classifier complexity or increasing comprehensibility without sacrificing accuracy. This is especially true of the ensemble papers, but not exclusively. Reinforcement learning was a third area highlighted by Dietterich for which there was strong representation at ICML 97. RL is becoming an increasingly important part of research on intelligent agents on the World Wide Web and other media. An ICML 97 affiliated workshop on RL included many additional papers. Finally, Dietterich highlights stochastic models. There was little work in this area represented at ICML 97. Undoubtedly, this is because of a strong conference, Uncertainty in Artificial Intelligence, that attracts much of this work (http://www.auai.org/). Our final area of inductive logic programming was not highlighted by Dietterich, but is nonetheless gaining prominence, supported by the International Conference on Inductive Logic Programming (http://www-ai.ijs.si/ilpnet.html).
The ICML 97 Web site (http://cswww.vuse.vanderbilt.edu/~mlccolt/) provides an overview of the conference and links to tutorials (e.g., ILP, PAC), the workshops mentioned above, and the site of the Tenth Annual Conference on Computational Learning Theory (COLT 97), which was co-located with ICML 97. A final observation, suggested by our survey, is the influence of COLT research on many ICML papers. COLT is one important source, though not the only source, of ensemble research, for example, and papers in other areas reflect a growing concern with efficiency and, in some cases, provable bounds on efficiency and accuracy. In general, ICML-97 was a conference that benefited from the intermingling of theory and application.

References

[1] Breiman, L., Friedman, J., Olshen, R., and Stone, C., Classification and Regression Trees, Wadsworth Inc., Belmont, California, 1984.
[2] Breiman, L., Bagging Predictors, Machine Learning, 24 (2) (1996) 123-140.
[3] Buntine, W., Learning Classification Trees, Statistics and Computing, 2 (2) (1992) 63-73.
[4] Cohen, W.W., Learning to Classify English Text with ILP Methods, in Advances in ILP (L. De Raedt, ed.), IOS Press, 1995.
[5] Cooper, G.F. and Herskovits, E., A Bayesian Method for the Induction of Probabilistic Networks, Machine Learning, 9 (1992) 309-347.
[6] Dietterich, T.G. and Flann, N., Explanation-based Learning and Reinforcement Learning: A Unified View, in Proceedings of the Twelfth International Conference on Machine Learning, Morgan Kaufmann, Tahoe City, California, 176-184, 1995.
[7] Dietterich, T.G. and Bakiri, G., Solving Multiclass Learning Problems via Error-correcting Output Codes, Journal of Artificial Intelligence Research, 2 (1995) 263-286.


[8] Dietterich, T.G., Machine Learning Research: Four Current Directions, AI Magazine, 18 (4) (1997) 97-136.
[9] Dietterich, T.G., Lathrop, R.H., and Lozano-Pérez, T., Solving the Multiple-instance Problem with Axis-parallel Rectangles, Artificial Intelligence, 89 (1-2) (1997) 31-71.
[10] Freund, Y. and Schapire, R.E., Experiments with a New Boosting Algorithm, in Proceedings of the Thirteenth International Conference on Machine Learning, Morgan Kaufmann, San Francisco, California, 148-156, 1996.
[11] Kaelbling, L.P., Littman, M.L., and Moore, A.W., Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, 4 (1996) 237-285.
[12] Kira, K. and Rendell, L.A., A Practical Approach to Feature Selection, in Proceedings of the Ninth International Conference on Machine Learning, Morgan Kaufmann, Aberdeen, Scotland, 249-256, 1992.
[13] Kononenko, I., Šimec, E., and Robnik-Šikonja, M., Overcoming the Myopia of Inductive Learning Algorithms with RELIEFF, Applied Intelligence, 7 (1994) 39-55.
[14] Muggleton, S. and Buntine, W., Machine Invention of First-order Predicates by Inverting Resolution, in Proceedings of the Fifth International Conference on Machine Learning, Morgan Kaufmann, Ann Arbor, Michigan, 339-352, 1988.
[15] Muggleton, S. and Feng, C., Efficient Induction of Logic Programs, in Inductive Logic Programming, Academic Press, 1992.
[16] Pearl, J., Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Mateo, California, 1988.
[17] Quinlan, J.R., Decision Trees as Probabilistic Classifiers, in Proceedings of the Fourth International Workshop on Machine Learning, Morgan Kaufmann, Irvine, California, 31-37, 1987.
[18] Quinlan, J.R., Learning Logical Definitions from Relations, Machine Learning, 5 (3) (1990) 239-266.
[19] Quinlan, J.R., Bagging, Boosting, and C4.5, in Proceedings of the Thirteenth National Conference on Artificial Intelligence, AAAI Press, Portland, Oregon, 725-730, 1996.
[20] Rocchio, J., Relevance Feedback in Information Retrieval, in The SMART Retrieval System: Experiments in Automatic Document Processing (Salton, ed.), Prentice-Hall, 313-323, 1971.
[21] Wolpert, D.H., Stacked Generalization, Neural Networks, 5 (1992) 241-259.