Hyper-parameter Optimization in Classification: To-do or Not-to-do
Highlights

• We found that hyper-parameter tuning is not well justified in many cases, but still very useful in a few.
• We propose a framework to address the problem of deciding to-tune or not-to-tune.
• We implemented a prototype of the framework with 486 datasets and 4 algorithms.
• The results indicate that our framework is effective at avoiding the effects of ineffective tuning.
• Our framework enables a life-long learning approach to the problem.
Ngoc Tran (a,c,∗), Jean-Guy Schneider (a,b), Ingo Weber (a,c), A.K. Qin (a)

a Department of Computer Science and Software Engineering, Swinburne University of Technology, Hawthorn VIC 3122, Australia
b School of Information Technology, Deakin University, Burwood VIC 3125, Australia
c Data61, CSIRO, Eveleigh NSW 2015, Australia
Abstract

Hyper-parameter optimization is a process to find suitable hyper-parameters for predictive models. It typically incurs highly demanding computational costs, because the time-consuming model training process has to be run to determine the effectiveness of each set of candidate hyper-parameter values. A priori, there is no guarantee that hyper-parameter optimization leads to improved performance. In this work, we propose a framework to address the problem of whether one should apply hyper-parameter optimization or use the default hyper-parameter settings for traditional classification algorithms. We implemented a prototype of the framework, which we use as the basis for a three-fold evaluation with 486 datasets and 4 algorithms. The results indicate that our framework is effective at supporting modeling tasks in avoiding the adverse effects of ineffective optimizations. The results also demonstrate that incrementally adding training datasets improves the predictive performance of framework instantiations and hence enables "life-long learning."

Keywords: hyper-parameter optimization, framework, Bayesian optimization, machine learning, incremental learning

∗ Corresponding author. Email addresses: [email protected] (Ngoc Tran), [email protected] (Jean-Guy Schneider), [email protected] (Ingo Weber), [email protected] (A.K. Qin)
In machine learning, and specifically for traditional classification algorithms, although hyper-parameters and parameters are often treated as practically interchangeable, the difference between them is theoretically significant: hyper-parameters need to be specified by practitioners prior to training, while parameters are estimated from the data during training. For example, the number of hidden nodes is one of the hyper-parameters that has to be specified for a Neural Network (NN) before training, while the weights and biases of the NN are parameters and will be learnt from the data during training. Given a learning algorithm A, the goal is to find a model M that maps from examples X to corresponding labels y of a finite dataset D. We have

    y ≈ M_H(Θ | X)                                                    (1)

where H is a set of hyper-parameters and Θ is a set of parameters. By specifying different sets of hyper-parameters, different models will be generated. As H governs the characteristics of the models, it might be useful to search for a set of hyper-parameters H∗ that yields an optimal model M∗. In the literature, this problem is referred to as hyper-parameter optimization [1].
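In this notation, hyper-parameter optimization amounts to searching the space of hyper-parameter settings for the configuration that maximizes an estimate of predictive performance. As a sketch of the standard formulation (the k-fold cross-validation estimate and the generic performance measure L are introduced here only for illustration; they are not the specific choices made later in this paper):

    H∗ = argmax_H (1/k) · Σ_{j=1}^{k} L( M_H(Θ | D_j^(train)), D_j^(valid) )

where D_j^(train) and D_j^(valid) denote the training and validation parts of the j-th fold of D.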
Several solutions for this problem have been proposed in the literature, including manual tuning, grid search, randomized search [1], particle swarm optimization [2], and Bayesian optimization (BO) methods [3].

Regardless of which hyper-parameter optimization method is used, this task is generally very expensive in terms of computational costs. For example, in a study by Thornton et al. [4], the average amount of time to optimize the hyper-parameters for each dataset was approximately 13 single-core CPU hours. That being said, [5] revealed that the optimization task was ineffective in many cases, where there was little or no improvement in the performance of the models. This result is of particular interest to practitioners: if they knew that a dataset and/or an algorithm was unlikely to benefit from hyper-parameter optimization, they could skip the task and save the time it takes for optimization. Conversely, if they knew that tuning improves the performance, possibly in a significant way, they would be more likely to run the hyper-parameter optimization than not. However, there currently is no such guideline or recommender system for this specific problem. Instead, practitioners often make this decision on a case-by-case basis based on their experience, hoping that their decisions will be appropriate for the current tasks. Establishing a well-defined guideline to mitigate such practice in the community is certainly a long and arduous journey. As a step towards this goal, we propose a framework to systematically address the problem of "to-tune-or-not-to-tune" for traditional classification algorithms.

To examine the proposed framework in detail, we instantiated it with 4 learning algorithms and a collection of 486 datasets, all having different characteristics. Our prototypical implementation of the framework reveals a critical effect of hyper-parameter optimization that has been neglected in related work, namely over-fitting. To verify and validate the proposed framework, we evaluated the prototype in a three-fold evaluation. The evaluation results indicate two advantages of the proposed framework. First, it is effective at supporting modeling tasks by reducing the time spent on unwarranted optimization as well as indicating possible improvements from tuning. Second, our framework enables a life-long learning approach, as incrementally adding training datasets improves the predictive performance.

The remainder of this paper is structured as follows. Section 1 introduces some background. In Section 2, related approaches are discussed. Section 3 presents the framework. In Section 4, we describe a case study of the framework. The evaluation of the case study is in Section 5. The work is discussed in Section 6, and Section 7 concludes the paper with a summary of the main contributions as well as future work.
1. Background

In this section, we present some background information on meta-learning, on which our proposed approach is based. In addition, we briefly introduce the particular hyper-parameter tuning method that we used in our prototypical implementation of the framework, namely Bayesian optimization.
1.1. Meta-learning

In contrast to base-learning, where the goal is to accumulate learning experience from a specific learning task, meta-learning focuses on a higher level of learning that exploits the accumulated learning experience gained from previous base-learning tasks. The knowledge that meta-learning learns from is called meta-knowledge. By learning from meta-knowledge accumulated over past learning tasks, meta-learning provides supplementary information for various kinds of selection tasks, for example, suggesting appropriate classifiers for classification tasks.

Meta-knowledge is formed by characteristics of datasets and performance measures of algorithms on the data. Regarding meta-features, three typical approaches to compute measures from the data have been proposed, namely simple, statistical and information-theoretic measures; model-based measures; and landmarkers. These approaches measure the data from different perspectives. Simple, statistical and information-theoretic meta-features gather simple descriptive measures of the data (e.g. number of instances, number of classes) [6], statistical measures (e.g. degree of correlation between numeric attributes, skewness of numeric attributes) [7], as well as information-theoretic measures (e.g. class entropy). The model-based approach, on the other hand, characterizes the data indirectly, as meta-features are extracted from properties of models induced from the data. Finally, the landmarking approach exploits fast and simple learning algorithms to obtain a quick estimate of the performance on a given dataset [8].

Within the scope of the to-tune-or-not problem, meta-learning is of interest because we can use it to develop meta-learning systems which, although far from ideal, mimic how humans learn from previous experience to improve their decision making under new circumstances. Intuitively, a person who has encountered a problem multiple times is likely to have more experience and hence to make better decisions in new circumstances. In the same manner, meta-learning systems learn meta-models from the meta-knowledge of previous learning experiences. For a new problem, such a system characterizes it by a set of measurements and then uses the trained meta-models to make a prediction based on those characteristics.
1.2. Bayesian optimization

In machine learning, one of our main concerns is to solve the problem

    max_x f(x)                                                        (2)

More often than not, f is a black-box function [9] whose structure (e.g. concavity, linearity) we do not know. If evaluations of f are computationally cheap, we can use simple methods, such as grid search and random search, to draw many evaluation samples. However, if f is computationally expensive to evaluate, it is critical to minimize the number of evaluations of f.

Bayesian optimization (BO) is a method designed for the global optimization of expensive black-box functions. Since the structure of f is unknown, it uses a surrogate model to approximate f, and an acquisition function to draw samples from the search space. Although there are other methods that also use a surrogate of the objective function, the key difference of BO is that it uses Bayesian statistics to build the surrogate. For this reason, it is able to minimize the number of evaluations by defining a prior belief about f and iteratively updating this prior with the drawn samples.

Conceptually, the BO algorithm works as follows (a minimal implementation sketch is given below):

1. Given t = 0, build a surrogate model of f.
2. For t = 1, 2, ..., N, repeat:
   (a) Select a promising data point xt that yields the best value of the acquisition function on the surrogate model.
   (b) Evaluate f at xt.
   (c) Update the surrogate model with the new sample.
3. Return the xt that yielded the largest f(xt).
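To make this loop concrete, here is a minimal, self-contained sketch of BO with a Gaussian-process surrogate and an Upper Confidence Bound (UCB) acquisition function, the combination chosen later in Section 4.3. It is an illustration only, not the authors' implementation; the objective f, the candidate pool, and the UCB weight of 2.0 are placeholders chosen for the example.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):                        # placeholder black-box objective (1-D)
    return -(x - 0.3) ** 2

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)   # finite candidate pool

# Step 1: initial design and surrogate model of f
X = rng.uniform(0, 1, size=(5, 1))
y = np.array([f(x[0]) for x in X])
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for t in range(5):                                # Step 2: BO iterations
    gp.fit(X, y)                                  # (re)fit the surrogate
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + 2.0 * sigma                        # (a) UCB acquisition
    x_next = candidates[np.argmax(ucb)]
    y_next = f(x_next[0])                         # (b) evaluate f at x_t
    X = np.vstack([X, x_next])                    # (c) update the surrogate data
    y = np.append(y, y_next)

best = X[np.argmax(y)]                            # Step 3: best point found
print("best x:", best, "f(x):", y.max())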
2. Related work

Given a training dataset, building machine learning models typically involves two steps: selecting a learning algorithm and then optimizing its hyper-parameters to maximize model performance. These two steps are often referred to as the algorithm selection and hyper-parameter optimization problems. The problem studied in this paper, deciding to-tune or not-to-tune, fits in between those two steps, as depicted in Figure 1. The literature review for this study thus consists of an overview of selection methods for machine learning algorithms and/or hyper-parameters, and of techniques for deciding to-tune or not-to-tune.

Figure 1: Overview of the studied problem, deciding to-tune or not-to-tune (i.e. the grey shaded box), and its relationships with the related problems of learning algorithm selection and hyper-parameter optimization.

2.1. Selection methods for machine learning algorithms and/or hyper-parameters

There is a plethora of techniques that support automatic selection of either machine learning algorithms or hyper-parameter values for particular algorithms. For the algorithm selection problem, some early works propose to perform algorithm selection based on a rough estimate of the performance of the algorithms on the data, obtained by running simplified versions of the algorithms on a data sample, such as the landmarking approach [8] and fast subsampling [10]. Later, meta-learning was proposed as a method to exploit learning experiences from previous learning problems to enhance the selection task; examples include fast pairwise comparison [11] and loss-time curves [12].

In the same manner, the proposed techniques for the problem of automatic hyper-parameter value selection can essentially be classified into two distinct categories. One is the group of methods that are independent of past learning problems, including sequential hyper-parameter optimization [13], random search [1] and Bayesian optimization [3]. The other group consists of methods that are capable of exploiting previous experiences to enhance future tasks, such as collaborative hyper-parameter tuning [14] and multi-task Bayesian optimization [15]. While all of those studies concentrate on selecting either algorithms or hyper-parameter values, a few studies propose techniques to choose both the learning algorithm and its associated hyper-parameter values. For example, Auto-WEKA, based on WEKA (see https://www.cs.waikato.ac.nz/ml/weka/), is considered the earliest work in this area [4]. Similar to Auto-WEKA, hyperopt-sklearn [16] is a Python library that supports algorithm and hyper-parameter selection for scikit-learn [17].

2.2. Methods for deciding to-tune or not-to-tune

Interestingly, and to some extent surprisingly, the effectiveness of hyper-parameter optimization on the performance of classification models has only been investigated by a few studies. Sun and Pfahringer [18] showed that the impact of hyper-parameter optimization is not universal among datasets: approximately 20% of the datasets showed no improvement at all, and more than 90% of the datasets showed less than a 5% improvement. However, deciding "to-tune or not-to-tune" was not addressed in that study. Ridd and Giraud-Carrier [5] constructed meta-models to predict whether hyper-parameter optimization could improve the performance of models beyond a single threshold. This study is considered the first that exploited meta-learning techniques to decide whether to tune or not. Still, the study constructed only a single prediction model for all learning algorithms. The low performance of the meta-model on the test sets of that study demonstrates that the blanket application of a single model to multiple learning algorithms is inappropriate and, therefore, suggests constructing dedicated models for each algorithm. Mantovani et al. [19] studied the validity of hyper-parameters at the algorithm level, specifically for the Support Vector Machine (SVM), using a pool of 143 datasets. Mantovani et al. were concerned with multi-class problems, but using accuracy as the default metric is deemed inappropriate, since multi-class skewness cannot be reflected. Furthermore, the optimization was performed only on a single fold instead of the whole training set; for this reason, comparing the performance of tuned and default models could be highly biased. More recently, Sanders and Giraud-Carrier [20] studied the ability of meta-learning techniques to inform the decision of whether "to-tune or not-to-tune" based on the expected performance improvement and computational costs. The results of that study are promising, indicating that meta-learning is useful to support the task. Nevertheless, the over-fitting issue is neglected in the study, since only resampling techniques were used to estimate the performance of the base models.

In general, all of the studies mentioned above address the to-tune-or-not-to-tune problem only for particular cases. In our study, we propose a framework that is extensive enough to accommodate various algorithms, datasets and optimizers. We examine our framework by means of a prototypical implementation with 4 learning algorithms and 486 datasets, all having different characteristics. We also inspect the effectiveness of hyper-parameter optimization by evaluating tuned and default base models on unseen data. By doing this, we can investigate in depth a critical problem in machine learning that, surprisingly, has been neglected in the relevant works, namely over-fitting. As will be shown in this inspection, over-fitting is substantial in tuning, as tuned base models can actually perform worse than default base models on unseen data. Moreover, while the mentioned studies only use two-class predictive meta-models, we propose predictive meta-models that handle a three-class prediction, to reflect the various effects of tuning that we reveal in the inspection.
3. The proposed framework

In Figure 2, we illustrate the conceptual model of the proposed framework. The framework consists of two distinct phases: the meta-model training phase and the application phase. In the training phase, the aim is to induce meta-models that are capable of deriving usable information from meta-knowledge in order to make a recommendation in the application phase, that is, to give the user feedback on the anticipated effect of hyper-parameter optimization for a given learning algorithm Ak. The framework contains a number of hot-spots (or variation points) that allow practitioners to adjust the framework to their specific needs.
Figure 2: Overview of the proposed framework. In the training phase, it outputs the metamodel WAk to predict the effectiveness of hyper-parameter optimization for a learning algorithm Ak of interest. In the application phase, it outputs the predicted effect of tuning Ak on the data set Dnew . Besides, Dnew can also be added to the dataset repository (indicated by the dashed line with arrow from “New dataset” to “Repository of datasets”) for future re-training.
In the meta-model training phase of the framework, we take the learning algorithm of interest Ak and apply it to an initial repository of N data sets Di, both with default hyper-parameters ("Default" in Figure 2) as well as using an optimizer of choice to tune the hyper-parameters ("Tuned" in Figure 2). The process to evaluate the performance of Ak begins with splitting a data set Di into two sets with a ratio of choice, namely a training set Di^(training) and a test set Di^(test). We then train Ak with its default hyper-parameters on the training set Di^(training) to obtain the default model, and tune Ak using the selected optimizer, also on Di^(training), to obtain the tuned model. We proceed by assessing the performance of both the default and the tuned model on the test set Di^(test) and use the attained values to compare their performance. In order to avoid a potential bias from splitting Di into a training and a test set, we repeat the process of evaluating the performance of Ak on Di – using different training and test sets – until a termination criterion of choice is met. The performance comparisons of the individual repetitions are then aggregated (e.g. using mean or median values).

In addition to evaluating the performance of the algorithm of interest Ak on the data set Di, we also extract attributes of Di to create meta-features. By associating the meta-features of Di with the aggregated performance measures, we obtain meta-knowledge consisting of (i) the data set Di's meta-features as the attributes and (ii) the aggregated performance measures of Ak as the target variable. This process is repeated for all N data sets in the repository in order to obtain the corresponding meta-knowledge (i.e. combined meta-features and aggregated performance measures). The entire meta-knowledge is then used to train a meta-model WAk (using a learning algorithm of choice) as a predictor of the effectiveness of hyper-parameter optimization for algorithm Ak.

The reader may note that the dataset repository is not fixed, but can be incrementally extended over time in order to add new meta-knowledge. Once sufficient new meta-knowledge has been added, the meta-model WAk can be re-trained to exploit the additional meta-knowledge. We will discuss the effect of this incremental learning in Section 5.3.
In the application phase of the framework, the aim is to determine the effectiveness of tuning the hyper-parameters of Ak on a new learning set Dnew. We extract the meta-features of Dnew and use the meta-model WAk induced in the training phase to predict the effectiveness of hyper-parameter optimization for Ak on Dnew. In addition, we can add Dnew to the dataset repository for future meta-model re-training.

In the following section, we will demonstrate the applicability of the framework by presenting a prototype implementation thereof, using a number of different learning algorithms as well as specific choices for the variation points.
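As a compact, runnable sketch of the two phases described above, the following Python listing mirrors the framework's structure. All concrete choices in it (synthetic datasets as the repository, a decision tree as the base learner Ak, grid search as the stand-in optimizer, three toy meta-features, a random forest as the meta-model WAk) are placeholders for illustration only, not the choices made in our prototype.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

def meta_features(X, y):
    """Characterize a dataset D_i (here only three very simple meta-features)."""
    return [X.shape[0], X.shape[1], len(np.unique(y))]

def relative_improvement(X, y, seed=0):
    """Default vs. tuned base model, compared on a held-out test split."""
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, stratify=y, random_state=seed)
    default = DecisionTreeClassifier(random_state=seed).fit(Xtr, ytr)
    tuned = GridSearchCV(DecisionTreeClassifier(random_state=seed),
                         {"max_depth": [2, 5, 10, None]}, cv=3).fit(Xtr, ytr)
    f_def = f1_score(yte, default.predict(Xte), average="macro")
    f_opt = f1_score(yte, tuned.predict(Xte), average="macro")
    return (f_opt - f_def) / f_def

def label(y_rel):
    """Three-class target, anticipating the categorization of Section 4.5."""
    if y_rel <= 0:
        return "do-not-tune"
    return "recommend-to-tune" if y_rel < 0.05 else "strongly-recommend-to-tune"

# Training phase: build meta-knowledge over a repository of datasets, train W_Ak.
repository = [make_classification(n_samples=200 + 30 * i, n_features=10, random_state=i)
              for i in range(20)]
C = np.array([meta_features(X, y) for X, y in repository])            # meta-features
targets = [label(relative_improvement(X, y)) for X, y in repository]  # aggregated effect of tuning
W = RandomForestClassifier(random_state=0).fit(C, targets)            # meta-model W_Ak

# Application phase: predict the effect of tuning A_k on a new dataset D_new.
X_new, y_new = make_classification(n_samples=750, n_features=10, random_state=99)
print(W.predict([meta_features(X_new, y_new)]))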
4. A prototypical implementation of the proposed framework

In order to validate the proposed framework and gain further insights into its applicability, we implemented a prototype instantiation. In Figure 2, we annotate 6 items to be specified for the implementation. They are:

1. Item 1: a target algorithm for which the to-tune or not-to-tune decision is to be made.
2. Item 2: a collection of training datasets for the target algorithm.
3. Item 3: a performance measurement strategy to train and evaluate tuned and default models.
4. Item 4: an approach to characterize the training datasets.
5. Item 5: a meta-knowledge base to train our prospective meta-model.
6. Item 6: an approach to fit a meta-model on the meta-knowledge.

In this section, we describe our implementation of these 6 items in 6 corresponding sub-sections (Sections 4.1 to 4.6).

4.1. Base learning algorithms (Item 1 - Figure 2)
We selected four base learning algorithms for our prototypical implementation, namely Decision Tree (DT), Random Forest (RF), Neural Network (NN) and Support Vector Machine (SVM). We chose these algorithms because they are generally popular and readily available in machine learning libraries, such as scikit-learn [17] and caret [21]. Moreover, these algorithms represent different types of machine learning methods: DT belongs to the class of tree-based algorithms; RF belongs to the class of bagging algorithms; NN belongs to the class of multi-layer perceptron algorithms; and SVM belongs to the class of maximum-margin hyperplane algorithms. Table 1 summarizes the hyper-parameters for each chosen learning algorithm, their default values, and the tuning ranges used, respectively. Although we used more than one algorithm to demonstrate the framework, there are still many other algorithms; practitioners who prefer different algorithms and/or hyper-parameters can always instantiate the framework with their own settings.

Table 1: Description of hyper-parameters for the chosen learning algorithms. (P: number of predictor variables in a dataset)

Decision Tree (DT)
- cp (default: 1e-2; tuning range: [1e-2, 1e-1]): the amount of relative improvement that a model needs to split a node. Small values result in large trees and over-fitting; large values result in small trees and under-fitting.
- min-split (default: 20; tuning range: [2, 32]): the minimum number of observations per node required to split. Decreasing it could lead to over-fitting, because the model would learn relationships that are very specific to some training samples.
- max-depth (default: 30; tuning range: [1, 30]): the maximum depth of any node; used to control the complexity of the model.
- split (default: gini; tuning range: {gini, information}): the node splitting criterion.

Random Forest (RF)
- ntree (default: 5e2; tuning range: [1e2, 1e3]): the number of trees in the forest. Increasing the number of trees generally makes the model more stable; however, a large number of trees is computationally expensive to fit.
- mtry (default: √P; tuning range: [1, P/2]): the number of variables randomly sampled as candidates for splitting at each node. It affects the importance estimates of the variables.
- node-size (default: 1; tuning range: [1, 20]): the minimum size of terminal nodes. Large values result in smaller trees being grown.

Neural Network (NN)
- size (default: P; tuning range: [1, P + 10]): the number of hidden nodes. Increasing the number of hidden nodes could lead to over-fitting.
- threshold (default: 1e-2; tuning range: [1e-3, 5e-1]): the threshold for stopping the training iterations; tuned to control the fitting time, which can be computationally expensive for NN.
- learning-rate (default: 1e-2; tuning range: [1e-5, 5e-1]): the learning rate of backpropagation; tuned to control the convergence of the models.
- act-func (default: logistic; tuning range: {logistic, tanh}): the activation function.

Support Vector Machine (SVM)
- C (default: 1; tuning range: (0, 1]): the regularization parameter that controls the trade-off between a smooth decision boundary and the number of correctly classified training samples.
- tolerance (default: 1; tuning range: [1e-2, 1]): the training stopping criterion, used to control training time, especially for non-linear kernels where model fitting is computationally expensive.
- gamma (default: 1/P; tuning range: (0, 1)): the parameter controlling the degree of variance of the non-linear SVM model. Increasing gamma may lead to over-fitting.
- coef-0 (default: 0; tuning range: (0, 1)): a parameter of the polynomial kernel, used to control the influence of the degree of the polynomial on the model.
- poly-degree (default: 3; tuning range: [2, 5]): the degree of the polynomial kernel function. It is important for controlling training time and over-fitting, since larger degree values consume more computational resources and can lead to over-fitting.
- kernel (default: radial; tuning range: {linear, poly, radial, sigmoid}): the kernel type.
4.2. Base datasets (Item 2 - Figure 2)

We collected all of the base datasets used for the evaluation from OpenML [22]. The datasets retrieved from OpenML are widely used in the literature and come from various sources, including UCI (the UC Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/), KDD (http://kdd.ics.uci.edu/) and WEKA (the WEKA collections of datasets, https://www.cs.waikato.ac.nz/ml/weka/). From an initial candidate set of more than 1,000 datasets, we filtered out datasets that did not meet the following criteria: (i) no missing values and (ii) between 100 and 100,000 instances. The filtering process resulted in 486 datasets from a variety of different domains (finance, biology, healthcare, etc.). Some example datasets are German credit risk (https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)), Run/walk information (https://www.openml.org/d/40922) and KDD Cup 2009 (https://www.openml.org/d/1112).
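This filtering step can also be expressed programmatically. The sketch below assumes the openml Python client is available; the listing call and the quality-field names (NumberOfInstances, NumberOfMissingValues) follow OpenML's data-quality naming and may differ slightly between client versions, so treat it as an illustration rather than a recipe.

import openml

# List all OpenML datasets and apply the two filtering criteria used above.
datasets = openml.datasets.list_datasets(output_format="dataframe")
filtered = datasets[
    (datasets["NumberOfMissingValues"] == 0)        # (i) no missing values
    & (datasets["NumberOfInstances"] >= 100)        # (ii) at least 100 instances
    & (datasets["NumberOfInstances"] <= 100000)     #      and at most 100,000 instances
]
print(len(filtered), "candidate datasets")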
4.3. Performance measure for base algorithms (Item 3 - Figure 2)

In this section, we discuss our procedure to measure the performance of every base algorithm Ak on each of the base datasets Di in order to obtain the required meta-knowledge for the meta-learning. Essentially, we had to measure the performance of the base algorithms using both default and optimized (i.e. tuned) hyper-parameters, so that we could compare the corresponding default and optimized performances, respectively. In order to make this comparison reliable, we emphasized the choice of an unbiased performance measurement procedure.

Hyper-parameter optimization method for base models. For our prototype implementation, we chose Bayesian Optimization (BO) methods [3] to perform hyper-parameter optimization for each of the base learning algorithms. More specifically, we used Gaussian Processes (GP) [23] as the surrogate model and Upper Confidence Bounds (UCB) [24] as the acquisition function for BO. We heuristically configured our optimizer to use 5 initial points and 5 iterations, respectively, and further set a time limit of at most 4 hours for each optimization task.

Performance metric for base models. We measured the performance of both optimized and default base models by the macro-averaged F-measure [25]. Essentially, for a dataset with q classes, we calculated fλ with respect to each class λ using the one-against-all approach, that is, treating samples of class λ as positive and all other samples as negative (cf. Equation 3). The macro-averaged F-measure f is the average of all fλ values (cf. Equation 4).
    f_λ = (1 + β²)·tp_λ / ( (1 + β²)·tp_λ + β²·fn_λ + fp_λ )          (3)

    f = (1/q) · Σ_{λ=1}^{q} f_λ(tp_λ, fp_λ, tn_λ, fn_λ)               (4)

where tp_λ denotes the true positives, fp_λ the false positives, tn_λ the true negatives, and fn_λ the false negatives for class λ. We chose β = 1.
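As an illustration of Equations 3 and 4 with β = 1, the following sketch computes the per-class F-measures one-against-all and averages them; the final line shows that this coincides with scikit-learn's f1_score with average="macro". The labels and predictions are invented for the example.

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])   # toy labels for illustration
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 2])

def macro_f_measure(y_true, y_pred, beta=1.0):
    scores = []
    for lam in np.unique(y_true):              # one-against-all per class lambda
        tp = np.sum((y_pred == lam) & (y_true == lam))
        fp = np.sum((y_pred == lam) & (y_true != lam))
        fn = np.sum((y_pred != lam) & (y_true == lam))
        denom = (1 + beta**2) * tp + beta**2 * fn + fp   # denominator of Equation (3)
        scores.append((1 + beta**2) * tp / denom if denom else 0.0)
    return np.mean(scores)                     # Equation (4): average over classes

print(macro_f_measure(y_true, y_pred))         # manual computation
print(f1_score(y_true, y_pred, average="macro"))   # same value via scikit-learn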
Performance measure process. Figure 3 illustrates the flow of the performance measurement process as implemented in our prototype. We divided every dataset Di of the repository, using stratified sampling, into 70% for training, Di^(training), and 30% for testing, Di^(test) (Step 1). Please note that we used stratified sampling to ensure that both sets have approximately the same ratio of examples in each class as the original dataset.

We then performed training and performance evaluation using Di^(training) and Di^(test) – as indicated in Figure 2 – for both the default hyper-parameters (Steps 2 and 3) as well as the Bayesian Optimization (Steps 4 to 6). Please note that BO used 10-fold cross-validation on the training set Di^(training) (Step 4) to optimize the hyper-parameters of Ak, and subsequently used the resulting optimized hyper-parameters Hopt for model building (Step 5). In Step 3, we computed the macro-averaged F-measure value f_Di^(d) of the default model Md. Similarly, the macro-averaged F-measure f_Di^(opt) of the optimized model Mopt is determined in Step 6.

In order to reduce the effect of bias in the splitting (Step 1), the splitting of the dataset, model training and model evaluation were repeated 20 times. Please note that, for each of the four learning algorithms Ak, the same 20 splits into training and evaluation sets were used for the corresponding runs.
Figure 3: Overview of the procedure to measure the performance of every base learning algorithm Ak on each dataset Di∈{1,...,351} , under two scenarios, default and optimized. For each dataset, we repeat the procedure 20 times.
The resulting 20 macro-averaged F-measure values (one for the default and one for the optimized model in each run) were averaged over the 20 runs as given below:

    f̄_Di^(opt) = (1/20) · Σ_{j=1}^{20} f_Di,j^(opt)                   (5)

    f̄_Di^(d) = (1/20) · Σ_{j=1}^{20} f_Di,j^(d)                       (6)
Given the mean values of the performance of the default and optimized models, we defined the relative improvement

    y_Di = ( f̄_Di^(opt) − f̄_Di^(d) ) / f̄_Di^(d)                       (7)

as a measure of how much the hyper-parameter optimization improved (or possibly decreased) the performance of a learning algorithm Ak.
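For completeness, the aggregation in Equations 5 to 7 is straightforward to compute; the sketch below assumes the 20 per-run F-measure pairs are already available as two arrays, and the numbers shown are invented for the example.

import numpy as np

f_opt = np.array([0.82, 0.80, 0.83, 0.81, 0.84] * 4)   # 20 tuned-model F-measures (example values)
f_def = np.array([0.79, 0.78, 0.80, 0.77, 0.81] * 4)   # 20 default-model F-measures (example values)

f_opt_mean = f_opt.mean()                               # Equation (5)
f_def_mean = f_def.mean()                               # Equation (6)
y = (f_opt_mean - f_def_mean) / f_def_mean              # Equation (7): relative improvement y_Di
print(round(y, 4))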
4.4. Meta-features extraction (Item 4 - Figure 2)

For this work, we used a set of 95 meta-features of a number of different types to characterize the datasets Di in the repository: 12 general meta-features [26], 45 statistical meta-features [27], 24 model-based meta-features [28], and 14 landmarking meta-features [8]. In Table A.7, we list the details of all meta-features we used. Moreover, 28 of the initial 95 meta-features contained NaN (Not a Number) or NA (not available) values and were subsequently removed, resulting in a total of 68 meta-features that were used for meta-model building.
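The meta-features themselves were computed with the R package mfe (see Appendix A). As an illustration of what this extraction step looks like, the sketch below uses pymfe, a Python counterpart of the R mfe package; this substitution is ours for illustration purposes only, and the group names follow pymfe's documentation rather than our actual tooling.

import numpy as np
from pymfe.mfe import MFE
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# General, statistical, model-based and landmarking meta-feature groups,
# summarized with mean and standard deviation for multi-valued features.
mfe = MFE(groups=["general", "statistical", "model-based", "landmarking"],
          summary=["mean", "sd"])
mfe.fit(X, y)
names, values = mfe.extract()
print(len(names), "meta-features, e.g.", names[:3], np.round(values[:3], 3))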
4.5. Meta-knowledge (Item 5 - Figure 2)

We standardized each meta-feature of the meta-knowledge CAk to zero mean and unit variance, resulting in CAk^(preprocessed). The purpose of this standardization is to counter the effect that meta-features with larger variance could dominate other features in terms of scale rather than contribution.

Class categorization. In the literature, the effect of hyper-parameter optimization on models is generally described by two classes. For example, Ridd and Giraud-Carrier [5] used a single binary classifier to predict for all algorithms, and Mantovani et al. [19] trained a binary classifier to predict for SVM. However, based on the analysis of our results, using two classes is insufficient. In Figure 4, we illustrate the distribution density of the relative performance improvement y_Di for all concerned algorithms Ak and all 486 datasets. The shape of the distributions implies that hyper-parameter optimization has harmful effects in some instances (i.e. the performance decreases), a minimal positive effect (0 < y_Di < 0.05) in many instances, and a clear positive performance improvement (y_Di ≥ 0.05) for the remaining instances. Table 2 summarizes the relative distributions for all four learning algorithms.

Using only two classes (for y_Di < 0 and y_Di ≥ 0) would neglect the effect of a minimal positive improvement for a significant number of instances. Therefore, in this study, we use three classes to characterize the effects of hyper-parameter optimization:

• Do-not-tune: y_Di ≤ 0;
• Recommend-to-tune: 0 < y_Di < 0.05; and
• Strongly-recommend-to-tune: y_Di ≥ 0.05.
Figure 4: Distribution densities of the relative performance improvement yDi of all algorithms. The 0 and 0.05 cut-off lines are highlighted by red dashed lines.
Table 2: Effects of hyper-parameter optimization on the learning algorithms (percentage of datasets per class).

    Algorithm                   y_Di ≤ 0    0 < y_Di < 0.05    y_Di ≥ 0.05
    Decision Tree                 51%             42%               7%
    Neural Network                45%             32%              23%
    Random Forest                 51%             39%              10%
    Support Vector Machine        28%             17%              55%
Intuitively, the Do-not-tune class represents instances where optimized models perform worse than default base models, and hence hyper-parameter optimization is not recommended, whereas the Strongly-recommend-to-tune class represents instances where optimization clearly results in better performance.

For the "middle ground", that is, the Recommend-to-tune class, context information about the classification problem at hand will determine whether a relative performance improvement of less than 0.05 is significant enough to merit the cost of hyper-parameter optimization or not. The reader may note that, while it is reasonable to place the first cut-off at zero to distinguish negative from positive effects of hyper-parameter optimization, the value of the second cut-off at 0.05 is to some extent "arbitrary" (but justifiable given the distribution density of the relative performance improvement shown in Figure 4); depending on the specific modeling conditions, a different cut-off point may be more appropriate.
Consequently, we translate the variable y_Di into a discrete variable encoding the Do-not-tune, Recommend-to-tune, and Strongly-recommend-to-tune classes, and use CAk^(preprocessed) as the predictor variables and this discrete variable as the dependent variable for the meta-learning step, respectively.

4.6. Meta-model training (Item 6 - Figure 2)

Before training the meta-model, we used a stratified split to divide the meta-knowledge dataset into two subsets: 85% of the meta-knowledge data for training, and 15% for hold-out testing. The learning algorithm xgboost [29] was chosen to learn the corresponding meta-model. To find settings for the xgboost models, we used random search; the model search was based on 10-fold cross-validation results of the meta-models. The configuration of our search included a runtime limit of 2 hours, building no more than 100 models, and using log-loss as the stopping metric. The tuning ranges of the hyper-parameters for xgboost are listed in Table 3.

Table 3: Hyper-parameter tuning ranges for the xgboost meta-models.
    #   Hyper-parameter description                                    Tuning range
    1   Row sampling rate per tree                                     [2e-1, 1]
    2   Column sampling rate per tree                                  [2e-1, 1]
    3   Column sampling rate per split                                 [2e-1, 1]
    4   Min. relative error improvement threshold to make a split      [1e-10, 1e-1]
    5   L1 regularization                                              [0, 5e-1]
    6   L2 regularization                                              [0, 5e-1]
    7   Learning rate                                                  [1e-5, 5e-1]
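To make this search procedure concrete, the sketch below tunes an xgboost meta-model with random search over ranges analogous to Table 3, using 10-fold cross-validation and log-loss as the selection metric. It relies on xgboost's scikit-learn interface; the parameter names (subsample, colsample_bytree, colsample_bylevel, gamma, reg_alpha, reg_lambda, learning_rate) are our mapping of the table rows, not necessarily the names used in the authors' implementation, and C_meta and labels stand in for the real meta-knowledge.

import numpy as np
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
C_meta = rng.normal(size=(300, 68))                 # placeholder standardized meta-features
labels = rng.integers(0, 3, size=300)               # placeholder three-class targets

param_distributions = {
    "subsample": uniform(0.2, 0.8),                 # row sampling rate per tree, [0.2, 1]
    "colsample_bytree": uniform(0.2, 0.8),          # column sampling rate per tree
    "colsample_bylevel": uniform(0.2, 0.8),         # column sampling rate per split/level
    "gamma": uniform(1e-10, 1e-1),                  # min. loss reduction to make a split
    "reg_alpha": uniform(0.0, 0.5),                 # L1 regularization
    "reg_lambda": uniform(0.0, 0.5),                # L2 regularization
    "learning_rate": uniform(1e-5, 0.5),            # learning rate
}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="mlogloss"),
    param_distributions,
    n_iter=100,                                     # at most 100 candidate models
    scoring="neg_log_loss",                         # log-loss as the selection metric
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(C_meta, labels)
print(search.best_params_)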
5. Prototype evaluation

In this section, we evaluate all meta-models using three evaluations. In the first evaluation, the aim is to estimate the predictive performance of the meta-models. In the second evaluation, our goal is to shed some light on the impact of the proposed framework on base modeling tasks, in terms of how our approach would help in avoiding unjustified optimizations. Finally, in the third evaluation, we investigate the possibility of using our approach as an ongoing learning solution.
Table 4: Performance evaluation of the meta-models on the hold-out test set. For the precision and recall of each class, we highlight the values that are important according to the assumptions given in Section 5.1.

    Base                   Class-I          Class-II         Class-III
    algorithm  Accuracy    Pre.    Rec.     Pre.    Rec.     Pre.    Rec.
    DT           0.80      0.87    0.83     0.63    0.75     1.00    0.67
    RF           0.78      0.81    0.79     0.76    0.76     0.67    1.00
    NN           0.80      0.67    0.67     0.80    1.00     0.86    0.60
    SVM          0.91      1.00    0.50     1.00    0.82     0.88    1.00
5.1. Predictive performance evaluation

For the first evaluation (performance prediction), we trained meta-models (one for each of the 4 learning algorithms) on the training sets (using the settings described in Section 4.6) and then evaluated their predictive performance on the test datasets. By reserving a portion of the data for testing, instead of using resampling techniques such as cross-validation, we can draw a more objective conclusion regarding the performance of our meta-models [30].

Table 4 summarizes the performance measures of all meta-models on the 15% hold-out test set. The highest performance belongs to the meta-model for SVM with 91% accuracy, followed by the meta-models for NN and DT with 80% accuracy each. The meta-model for RF demonstrated the lowest performance, with 78% accuracy. The reader may note that the accuracy of each meta-model is significantly better than the approximately 33% a random classifier would achieve.

However, the accuracy metric does not tell the full story; let us look at the precision and recall of each meta-model for each of the three classes. In a situation where a dataset does not show a performance improvement from hyper-parameter optimization (i.e. Class-I in Table 4), running an unnecessary optimization is likely to be of more concern than missing out on an improved performance. Hence, False Positives are more problematic than False Negatives, and precision rather than recall is the more relevant metric. For the other two classes (Recommend-to-tune, Class-II; Strongly-recommend-to-tune, Class-III), missing out on an improved performance is likely to be a less desirable outcome than running an unnecessary optimization. Thus, False Negatives are more of a concern than False Positives and, consequently, recall is more relevant than precision.
425
our proposed approach would be meaningful to base modeling tasks. Therefore, we evaluate the impact of the meta-models from that perspective. We propose two modeling scenarios S1 and S2, corresponding to two distinguished expectations that modeling tasks may have. First, S1 is the improvement-aspriority scenario. The critical factor of the modeling tasks in this scenario is
430
improvement. This scenario is common in practice where any positive model improvement will always be desired. In contrast, S2 is the time-as-priority scenario where time is the most critical factor and optimization will be performed only for substantial improvement (i.e. yDi ≥ 0.05). 5.2.1. Modeling impact assumptions
435
In Table 5, we present a decision table according to the two proposed scenarios. For both scenarios, we have the actual outcome, prediction outcome and associated impact outcome, respectively. In the scenario S1, for example, given the actual outcome is Do-not-tune, then if the prediction is correct, the impact will be “saved time” because optimization is avoided and time will not be spent
22
Table 5: Decision table for two scenarios of alternatively using time and improvement as the priority in modeling. (Act. is actual, Pred. is prediction, imp. is improvement).
S1: Improvement-as-priority Act.
Pred.
Impact
Act.
Pred.
Impact
I
I
saved time
I
I
saved time
II
wasted time
II
saved time
III
wasted time
III
wasted time
I
missed imp.
I
saved time
II
claimed imp.
II
saved time
III
claimed imp.
III
wasted time
I
missed imp.
I
missed imp.
II
claimed imp.
II
missed imp.
III
claimed imp.
III
claimed imp.
II
III
440
S2: Time-as-priority
II
III
5.2.2. Impact measurement

We calculated the impact of the prediction methods in terms of claimed improvement and saved time, γ^p and γ^t, using Equations 8 and 9 over the whole test set. Essentially, γ^p and γ^t represent the percentage of the potential improvement that is claimed and the percentage of the optimization time that is saved, respectively, by a prediction method on the test datasets:

    γ^p = Σ p_claimed / ( Σ p_claimed + Σ p_missed )                  (8)

    γ^t = Σ t_saved / ( Σ t_saved + Σ t_wasted )                      (9)
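The following sketch ties Table 5 and Equations 8 and 9 together for scenario S1: it maps each (actual, predicted) pair to an impact category and then aggregates the per-dataset improvements and optimization times into γ^p and γ^t. The example records and their improvement and time values are invented placeholders.

# Impact of a prediction method under scenario S1 (improvement as priority).
# Classes: 1 = Do-not-tune, 2 = Recommend-to-tune, 3 = Strongly-recommend-to-tune.
def s1_impact(actual, predicted):
    if actual == 1:
        return "saved_time" if predicted == 1 else "wasted_time"
    return "missed_imp" if predicted == 1 else "claimed_imp"

# (actual class, predicted class, potential improvement, optimization time in hours)
records = [(1, 1, 0.00, 2.0), (1, 3, 0.00, 4.0), (2, 2, 0.03, 1.5),
           (3, 1, 0.12, 3.0), (3, 3, 0.20, 2.5)]          # invented examples

p_claimed = p_missed = t_saved = t_wasted = 0.0
for actual, predicted, improvement, time in records:
    impact = s1_impact(actual, predicted)
    if impact == "claimed_imp":
        p_claimed += improvement
    elif impact == "missed_imp":
        p_missed += improvement
    elif impact == "saved_time":
        t_saved += time
    else:
        t_wasted += time

gamma_p = p_claimed / (p_claimed + p_missed)     # Equation (8)
gamma_t = t_saved / (t_saved + t_wasted)         # Equation (9)
print(round(gamma_p, 2), round(gamma_t, 2))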
5.2.3. Prediction methods

Three methods are used in this evaluation. The first method is META, which uses our meta-learning based approach to make a prediction. The second method is TRI; this method makes a random selection among the three classes Do-not-tune, Recommend-to-tune and Strongly-recommend-to-tune, with an equal probability of 33.33% for each class. The third method is BIN, representing a random selection between to-tune and not-to-tune with a 50% probability for each. This method is common in practice but, because it only considers two classes, it is not applicable to scenario S2.

5.2.4. Results

In Table 6, we present the evaluation results. Generally, it is apparent that META outperformed both TRI and BIN in both scenarios S1 and S2 for all base algorithms in terms of improvement. With regard to the time measurement, META had better performance than the others in most cases. Specifically, in scenario S1, where improvement is prioritized, META always claimed 98% - 100% of the potential improvement for all models across all algorithms. Although S1 focuses on performance improvement, META still helped to avoid 59% - 72% of the time spent on unnecessary optimizations; indeed, it performed better than the others on two out of the 4 algorithms with regard to the time measurement. On the other hand, although TRI and BIN had better outcomes than META in some cases, their performance fluctuated highly. Moreover, in scenario S2, where time is prioritized over performance improvement, META had a much better performance than TRI in both measurements for all algorithms: META avoided 96% - 99% of the unnecessary optimization time and still helped to claim 83% - 97% of the potential improvement, depending on the context.
Table 6: Measure of the impact on the hold-out test set in both scenarios for all prediction methods. We highlight the best values for the improvement and time measurements. (META is our proposed approach; TRI is a random selection with a 33.33% chance for each class; BIN is a binary random selection between optimization and non-optimization with a 50% chance for each. Due to its binary selection, the BIN method is not applicable to S2.)

    Scenario   Base         Improvement                 Time
               algorithm    META    TRI     BIN         META    TRI     BIN
    S1         DT           1.00    0.61    0.84        0.59    0.34    0.68
               RF           0.97    0.15    0.32        0.59    0.27    0.55
               NN           1.00    0.67    0.73        0.72    0.64    0.10
               SVM          0.98    0.81    0.20        0.57    0.60    0.62
    S2         DT           0.97    0.33    n/a         0.96    0.74    n/a
               RF           0.95    0.00    n/a         0.99    0.50    n/a
               NN           0.87    0.17    n/a         0.97    0.76    n/a
               SVM          0.83    0.33    n/a         0.98    0.79    n/a
5.3. Performance enhancement over time evaluation

In this section, we present an evaluation to validate the life-long learning capability of our proposed framework (i.e. incrementally adding more datasets to generate new meta-knowledge). Since we are limited in terms of the number of available datasets, we propose to train meta-models with increasing amounts of training data and to test them on the same, fixed test set. Algorithm 1 describes the technique used to perform this evaluation. We set aside nbPts = 150 data points for the incremental training. In the initial run, we train with only nbTrainingPts = N − nbPts data points, where N is the total number of data points in the original training set. For each run, we add 2 more data points from the set-aside points to the previous training set, re-train the meta-model WAk on the newly created training set, and test it on the identical test set.
Algorithm 1: Performance measurement with different training-set sizes.

Data: γ_tr the training set, γ_test the unseen test set, N the number of data points in γ_tr.
Result: mse, the mean square error on γ_test for each run.

begin
    nbPts ← 150
    nbPtsPerRun ← 2
    nbTrainingPts ← N − nbPts
    nbRuns ← nbPts / nbPtsPerRun
    for i ← 1 to nbRuns do
        γ_tr_sub ← γ_tr[1 : nbTrainingPts]
        m ← Train(algorithm = a, data = γ_tr_sub)
        mse[i] ← Evaluate(model = m, testdata = γ_test)
        nbTrainingPts ← nbTrainingPts + nbPtsPerRun
    end
end
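A minimal Python rendering of this evaluation loop is sketched below; the meta-knowledge matrix, the labels, the random-forest meta-model and the misclassification-error metric are placeholders (the paper tracks the MSE of the xgboost meta-models), so only the growing-training-set mechanics mirror Algorithm 1.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
C_tr, y_tr = rng.normal(size=(486, 68)), rng.integers(0, 3, size=486)    # placeholder meta-knowledge
C_test, y_test = rng.normal(size=(80, 68)), rng.integers(0, 3, size=80)  # fixed hold-out test set

nb_pts, nb_pts_per_run = 150, 2
nb_training_pts = len(C_tr) - nb_pts
errors = []
for run in range(nb_pts // nb_pts_per_run):
    model = RandomForestClassifier(random_state=0).fit(
        C_tr[:nb_training_pts], y_tr[:nb_training_pts])
    errors.append(np.mean(model.predict(C_test) != y_test))   # test error for this run
    nb_training_pts += nb_pts_per_run                          # grow the training set by 2

print(len(errors), "runs; final test error:", errors[-1])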
We applied this technique to all algorithms studied in this work and collected the mean square error (MSE) of all their corresponding meta-models on the test sets over a total of 75 runs, that is, each run increasing the number of training datasets by 2. Figure 5 shows the resulting trend lines of meta-model performance. We can see that the MSE values gradually diminish for all 4 algorithms. Although this evaluation is performed using only a relatively small number of datasets for incremental training, the trend lines of the MSE indicate that adding more datasets is likely to enhance the predictive performance of the meta-models over time. Further investigations with an increased number of training datasets will be required to provide additional evidence for this claim, though.
6. Discussion

Since real-world applications are generally time-sensitive, deciding whether to tune or not remains critical. As shown in this study, typical random guessing methods, although common in the community, are unreliable for learning algorithms. Using them to decide whether to tune or not could eventually lead to either wasting time on unwarranted optimization or missing potential improvement in model performance.
Figure 5: Performance trend lines of meta-models with different amounts of training sets.
In contrast, the evaluations show that our approach is far more reliable and effective. By using it to decide the to-tune-or-not problem, modeling tasks could claim 98% - 100% of the potential improvement in performance and avoid 59% - 72% of the unplanned optimization time, if the expectation of the modeling tasks is to always improve model performance. And if time is prioritized instead of performance, modeling tasks could claim 83% - 97% of the potential improvement in performance and avoid 98% - 99% of the unplanned optimization time. More importantly, the third evaluation shows that our approach is a life-long learning solution and is likely to increase its performance as more training data is added.

Hyper-parameter optimization, although computationally expensive, is a common practice for tuning machine learning algorithms. However, the prototype of the framework has confirmed, and additionally revealed, that the effect of this task is non-uniform. The results of our 360-hour experiment (the approximate total run-time on all 4 Microsoft Azure Standard F32s v2 clusters; see https://azure.microsoft.com/en-us/pricing/details/batch/) in the implementation (Section 4.3) are two-fold. First, they confirm that tuning cannot be well justified in many cases, but is still an extremely useful treatment in several cases where models can be improved substantially [19]. Second, they add an important insight that, although not new in the community, has been neglected in relevant works: optimization can have harmful effects, as in a number of cases tuned models perform worse than the default models on unseen data.
Although we have demonstrated the meta-models on unseen data points and obtained reasonable to high performance, it is necessary to emphasize that the meta-models were learnt from data points of a certain "region". In Figure 6, we use t-SNE [31] to visualize the multi-dimensional space of the data used specifically for the DT algorithm in two dimensions. The region we intuitively refer to is a two-dimensional space bounded by all training points; specifically, in the figure it is the 2D region constrained by x ∈ [−30.00, 40.00] and y ∈ [−50.05, 51.02]. In the experiment, we tested the meta-model with test data points (i.e. the red points) that come from the same region, and we have shown that the meta-model for the DT algorithm works well for data points from this region. For data points outside this region, that is, points with x ∉ [−30.00, 40.00] and/or y ∉ [−50.05, 51.02], we conjecture that the predictive capability of the meta-model will be less accurate. Further experiments are necessary to provide further justification for this claim.
Figure 6: t-SNE visualization of the meta-knowledge for Decision Tree specifically.
Moreover, we are aware that there are a number of factors in real-world applications that could differ from our case study, including the characteristics of the datasets as well as the plethora of available learning algorithms and optimizers. Although we have employed 486 datasets from various domains and 4 different learning algorithms in the case study, there are certainly many other kinds of datasets and learning algorithms with different characteristics. Additionally, hyper-parameter optimization techniques other than BO could also behave differently from what we have investigated in our experiments. Having said that, these factors are not an inherent limitation of our proposed framework. In fact, the framework is suitable for different classification algorithms, optimizers and datasets, and allows for incrementally adding meta-knowledge to improve the performance of the hyper-parameter optimization predictor.

7. Conclusions and Future Work

In this work, we have proposed a framework for predicting the effectiveness of hyper-parameter optimization for traditional classification algorithms. We have also illustrated the framework with a prototype covering 4 different learning algorithms and 486 datasets. Our empirical evaluation results indicate that the framework can be used to systematically and incrementally address the problem of "to-tune-or-not-to-tune."

In future work, we plan to implement our approach in the form of a supporting component for autonomous data mining systems. In particular, with such a system it will not be necessary to specify machine learning algorithms beforehand, as required in our proposed approach; instead, the system could automatically plan data mining processes for every learning problem according to user criteria as well as the characteristics of the data. Furthermore, the framework presented in this paper could be instantiated with a larger collection of datasets and/or more expensive algorithms, in order to support the validation of the framework under various conditions.

References
[1] J. Bergstra, Y. Bengio, Random Search for Hyper-parameter Optimization, Journal of Machine Learning Research 13 (1) (2012) 281–305. 29
[2] J. Kennedy, R. Eberhart, Particle Swarm Optimization, in: Proceedings of the 1995 International Conference on Neural Networks (Perth, Australia), Vol. 4 of ICNN’95, IEEE, Piscataway, NJ, United States, 1995, pp. 1942– 580
1948. doi:10.1109/ICNN.1995.488968. [3] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. de Freitas, Taking the Human Out of the Loop: A Review of Bayesian Optimization, Proceedings of the IEEE 104 (1) (2016) 148–175. doi:10.1109/JPROC.2015.2494218. [4] C. Thornton, F. Hutter, H. H. Hoos, K. Leyton-Brown, Auto-WEKA: Com-
585
bined Selection and Hyperparameter Optimization of Classification Algorithms, in: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chicago, IL, USA), KDD’13, ACM, New York, NY, USA, 2013, pp. 847–855. doi:10.1145/ 2487575.2487629.
590
[5] P. Ridd, C. Giraud-Carrier, Using Metalearning to Predict when Parameter Optimization is Likely to Improve Classification Accuracy, in: Proceedings of the 2014 International Conference on Meta-learning and Algorithm Selection (Prague, Czech Republic), MLAS'14, CEUR-WS, Aachen, Germany, 2014, pp. 18–23.

[6] R. Engels, C. Theusinger, Using a Data Metric for Preprocessing Advice for Data Mining Applications, in: Proceedings of the 13th European Conference on Artificial Intelligence (Brighton, UK), ECAI'98, IOS Press, Amsterdam, The Netherlands, 1998, pp. 430–434.

[7] J. Gama, P. Brazdil, Characterization of Classification Algorithms, in: Proceedings of the 7th Portuguese Conference on Artificial Intelligence (Madeira Island, Portugal), EPIA'95, Springer, Berlin, Heidelberg, 1995, pp. 189–200. doi:10.1007/3-540-60428-6_16.

[8] B. Pfahringer, H. Bensusan, C. G. Giraud-Carrier, Meta-Learning by Landmarking Various Learning Algorithms, in: Proceedings of the 17th International Conference on Machine Learning (Haifa, Israel), ICML'00, Morgan Kaufmann, San Francisco, CA, USA, 2000, pp. 743–750.

[9] J. Mockus, Application of Bayesian Approach to Numerical Methods of Global and Stochastic Optimization, Journal of Global Optimization 4 (4) (1994) 347–365. doi:10.1007/BF01099263.

[10] J. Petrak, Fast Subsampling Performance Estimates for Classification Algorithm Selection, in: Proceedings of the ECML-00 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination (Barcelona, Spain), ECML'00, 2000, pp. 3–14.

[11] R. Leite, P. Brazdil, Active Testing Strategy to Predict the Best Classification Algorithm via Sampling and Metalearning, in: Proceedings of the 19th European Conference on Artificial Intelligence (Lisbon, Portugal), ECAI'10, IOS Press, Amsterdam, The Netherlands, 2010, pp. 309–314.

[12] J. N. van Rijn, S. M. Abdulrahman, P. Brazdil, J. Vanschoren, Fast Algorithm Selection Using Learning Curves, in: Proceedings of the 14th International Symposium on Intelligent Data Analysis (Saint-Etienne, France), IDA'15, Springer, Berlin, Heidelberg, 2015, pp. 298–309.

[13] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for Hyperparameter Optimization, in: Proceedings of the 24th International Conference on Neural Information Processing Systems (Granada, Spain), NIPS'11, Curran Associates Inc., Red Hook, NY, USA, 2011, pp. 2546–2554.

[14] R. Bardenet, M. Brendel, B. Kégl, M. Sebag, Collaborative Hyperparameter Tuning, in: Proceedings of the 30th International Conference on Machine Learning (Atlanta, GA, USA), ICML'13, JMLR.org, 2013, pp. 199–207.

[15] K. Swersky, J. Snoek, R. P. Adams, Multi-task Bayesian Optimization, in: Proceedings of the 26th International Conference on Neural Information Processing Systems (Lake Tahoe, Nevada), NIPS'13, Curran Associates Inc., Red Hook, NY, USA, 2013, pp. 2004–2012.

[16] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, D. D. Cox, Hyperopt: a Python Library for Model Selection and Hyperparameter Optimization, Computational Science & Discovery 8 (1).

[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.

[18] Q. Sun, B. Pfahringer, Pairwise Meta-rules for Better Meta-learning-based Algorithm Ranking, Machine Learning 93 (1) (2013) 141–161. doi:10.1007/s10994-013-5387-y.

[19] R. G. Mantovani, A. L. D. Rossi, J. Vanschoren, B. Bischl, A. C. P. L. F. Carvalho, To Tune or not to Tune: Recommending When to Adjust SVM Hyper-parameters via Meta-learning, in: Proceedings of the 2015 International Joint Conference on Neural Networks (Killarney, Ireland), IJCNN'15, IEEE, Piscataway, NJ, USA, 2015, pp. 1–8. doi:10.1109/IJCNN.2015.7280644.

[20] S. Sanders, C. Giraud-Carrier, Informing the Use of Hyperparameter Optimization Through Metalearning, in: Proceedings of the 2017 IEEE International Conference on Data Mining (New Orleans, LA, USA), ICDM'17, IEEE, Piscataway, NJ, USA, 2017, pp. 1051–1056. doi:10.1109/ICDM.2017.137.

[21] M. Kuhn, Building Predictive Models in R Using the caret Package, Journal of Statistical Software 28 (5) (2008) 1–26. doi:10.18637/jss.v028.i05. URL https://www.jstatsoft.org/v028/i05

[22] J. Vanschoren, J. N. van Rijn, B. Bischl, L. Torgo, OpenML: Networked Science in Machine Learning, SIGKDD Explorations 15 (2) (2013) 49–60. doi:10.1145/2641190.2641198.

[23] C. E. Rasmussen, Gaussian Processes in Machine Learning, Springer, Berlin, Heidelberg, 2004, Ch. 4, pp. 63–71. doi:10.1007/978-3-540-28650-9_4.

[24] P. Auer, N. Cesa-Bianchi, P. Fischer, Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning 47 (2-3) (2002) 235–256. doi:10.1023/A:1013689704352.

[25] G. Forman, An Extensive Empirical Study of Feature Selection Metrics for Text Classification, Journal of Machine Learning Research 3 (2003) 1289–1305.

[26] C. Castiello, G. Castellano, A. M. Fanelli, Meta-data: Characterization of Input Features for Meta-learning, in: Proceedings of the 2nd International Conference on Modeling Decisions for Artificial Intelligence (Tsukuba, Japan), MDAI'05, Springer, Berlin, Heidelberg, 2005, pp. 457–468. doi:10.1007/11526018_45.

[27] S. Ali, K. A. Smith, On Learning Algorithm Selection for Classification, Applied Soft Computing 6 (2) (2006) 119–138. doi:10.1016/j.asoc.2004.12.002.

[28] Y. Peng, P. A. Flach, C. Soares, P. Brazdil, Improved Dataset Characterisation for Meta-learning, in: Proceedings of the 5th International Conference on Discovery Science (Lubeck, Germany), DS'02, Springer, Berlin, Heidelberg, 2002, pp. 141–152.

[29] T. Chen, C. Guestrin, XGBoost: A Scalable Tree Boosting System, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA), KDD'16, ACM, New York, NY, USA, 2016, pp. 785–794. doi:10.1145/2939672.2939785.

[30] N. Japkowicz, M. Shah, Evaluating Learning Algorithms: A Classification Perspective, Cambridge University Press, New York, NY, USA, 2011.

[31] L. van der Maaten, G. Hinton, Visualizing Data Using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.

[32] A. Rivolli, L. P. F. Garcia, C. Soares, J. Vanschoren, A. C. P. L. F. de Carvalho, Towards Reproducible Empirical Research in Meta-Learning, CoRR abs/1808.10406.
Appendix A. List of meta-features

Table A.7: Types of meta-features used to characterize datasets. Meta-features can be either single- or multi-valued; for multi-valued meta-features, we use both the mean and the standard deviation of the values. In this study, we use the R package mfe [32] to compute these meta-features (an illustrative usage sketch follows the table). Each entry below gives the description followed by its type: (Single) or (Multi: mean, sd).

General
1. Ratio of the number of attributes per the number of instances (Single)
2. Ratio of the number of categorical attributes per the number of numeric attributes (Single)
3. Proportion of the classes values (Multi: mean, sd)
4. Ratio of the number of instances per the number of attributes (Single)
5. Number of attributes (Single)
6. Number of binary attributes (Single)
7. Number of categorical attributes (Single)
8. Number of classes (Single)
9. Number of instances (Single)
10. Number of numeric attributes (Single)
11. Ratio of the number of numeric attributes per the number of categorical attributes (Single)

Statistical
1. Canonical correlations between the predictive attributes and the class (Multi: mean, sd)
2. Center of gravity, which is the distance between the instance in the center of the majority class and the instance-center of the minority class (Single)
3. Absolute attributes correlation, which measures the correlation between each pair of the numeric attributes in the dataset (Multi: mean, sd)
4. Absolute attributes covariance, which measures the covariance between each pair of the numeric attributes in the dataset (Multi: mean, sd)
5. Number of the discriminant functions (Single)
6. Eigenvalues of the covariance matrix (Multi: mean, sd)
7. Geometric mean of attributes (Multi: mean, sd)
8. Harmonic mean of attributes (Multi: mean, sd)
9. Interquartile range of attributes (Multi: mean, sd)
10. Kurtosis of attributes (Multi: mean, sd)
11. Median absolute deviation of attributes (Multi: mean, sd)
12. Maximum value of attributes (Multi: mean, sd)
13. Mean value of attributes (Multi: mean, sd)
14. Median value of attributes (Multi: mean, sd)
15. Minimum value of attributes (Multi: mean, sd)
16. Number of attribute pairs with high correlation (Multi: mean, sd)
17. Number of attributes with normal distribution; the Shapiro-Wilk normality test is used to assess whether an attribute is normally distributed (Multi: mean, sd)
18. Number of attributes with outlier values; Tukey's boxplot algorithm is used to determine whether an attribute has outliers (Multi: mean, sd)
19. Range of attributes (Multi: mean, sd)
20. Standard deviation of the attributes (Multi: mean, sd)
21. Statistical test for homogeneity of covariances (Single)
22. Skewness of attributes (Multi: mean, sd)
23. Attributes sparsity, which represents the degree of discreteness of each attribute in the dataset (Multi: mean, sd)
24. Trimmed mean of attributes (Multi: mean, sd)
25. Attributes variance (Multi: mean, sd)
26. Wilks Lambda (Single)

Model-based
1. Number of leaves of the DT (decision tree) model (Single)
2. Size of branches, which consists of the levels of all leaves of the DT model (Multi: mean, sd)
3. Leaves corroboration, which is the proportion of examples that belong to each leaf of the DT model (Multi: mean, sd)
4. Homogeneity, which is the number of leaves divided by the structural shape of the DT model (Multi: mean, sd)
5. Leaves per class, which is the proportion of leaves of the DT model associated with each class (Multi: mean, sd)
6. Number of nodes of the DT model (Single)
7. Ratio of the number of nodes of the DT model per the number of attributes (Single)
8. Ratio of the number of nodes of the DT model per the number of instances (Single)
9. Number of nodes of the DT model per level (Multi: mean, sd)
10. Repeated nodes, which is the number of repeated attributes that appear in the DT model (Multi: mean, sd)
11. Tree depth, which is the level of all tree nodes and leaves of the DT model (Multi: mean, sd)
12. Tree imbalance (Multi: mean, sd)
13. Tree shape, which is the probability of arriving at each leaf given a random walk; we refer to this as the structural shape of the DT model (Multi: mean, sd)
14. Variable importance, calculated using the Gini index to estimate the amount of information used in the DT model (Multi: mean, sd)

Land-marking
1. Construct a single decision tree node model induced by the most informative attribute to establish the linear separability (Multi: mean, sd)
2. Elite nearest neighbor, which uses the most informative attributes in the dataset to induce the 1-nearest neighbor; with this subset of informative attributes, the model is expected to be noise tolerant (Multi: mean, sd)
3. Apply the Linear Discriminant classifier to construct a linear split in the data to establish the linear separability (Multi: mean, sd)
4. Evaluate the performance of the Naive Bayes classifier (Multi: mean, sd)
5. Evaluate the performance of the 1-nearest neighbor classifier; it uses the Euclidean distance to the nearest neighbor to determine how noisy the data is (Multi: mean, sd)
6. Construct a single decision tree node model induced by a random attribute (Multi: mean, sd)
7. Construct a single decision tree node model induced by the least informative attribute (Multi: mean, sd)
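As noted in the table caption, the meta-features above are computed with the R package mfe [32]. The following is a minimal sketch, for illustration only: it assumes the Python port pymfe (and scikit-learn for a stand-in dataset), which offers a comparable interface but is not the authors' pipeline, and its group naming and coverage may differ slightly from the R package.

```python
# Illustrative sketch only: the paper uses the R package mfe [32]; the Python
# port pymfe is assumed here purely for illustration (pip install pymfe scikit-learn).
from sklearn.datasets import load_iris
from pymfe.mfe import MFE

# Any labeled tabular dataset works; iris is used only as a stand-in.
X, y = load_iris(return_X_y=True)

# Request the four groups of Table A.7 and summarize multi-valued
# meta-features by their mean and standard deviation, as in the caption.
extractor = MFE(
    groups=["general", "statistical", "model-based", "landmarking"],
    summary=["mean", "sd"],
)
extractor.fit(X, y)
names, values = extractor.extract()

for name, value in zip(names, values):
    print(f"{name}: {value}")
```

The printed meta-feature names and values should correspond to the rows of Table A.7, up to naming differences between the two packages.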
Ngoc Tran received the B.E. degree from Vietnam National University, Ho Chi Minh, Vietnam, in 2006 and the M.S. degree from Vrije Universiteit Brussel, Brussels, Belgium, in 2015. He is currently a Ph.D. student at Swinburne University of Technology, Melbourne, Australia. His current research interests mainly focus on machine learning and domain-specific visual languages.

Jean-Guy Schneider is a Professor at Deakin University, Melbourne, Australia.

Ingo Weber is a Principal Research Scientist and Team Leader of the Architecture & Analytics Platforms (AAP) team at Data61, CSIRO in Sydney, Australia.

A. K. Qin is an Associate Professor in the Department of Computer Science and Software Engineering and a core member of the Data Science Research Institute at Swinburne. He is currently leading the machine learning research group based in the Data Science Research Institute.
Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: