
Hyper-parameter Optimization in Classification: To-do or Not-to-do
Ngoc Tran, Jean-Guy Schneider, Ingo Weber, A.K. Qin

PII: S0031-3203(20)30051-0
DOI: https://doi.org/10.1016/j.patcog.2020.107245
Reference: PR 107245
To appear in: Pattern Recognition
Received date: 25 June 2019
Revised date: 19 October 2019
Accepted date: 25 January 2020

Please cite this article as: Ngoc Tran, Jean-Guy Schneider, Ingo Weber, A.K. Qin, Hyperparameter Optimization in Classification: To-do or Not-to-do, Pattern Recognition (2020), doi: https://doi.org/10.1016/j.patcog.2020.107245


Highlights
• We found that hyper-parameter tuning is not well justified in many cases but still very useful in a few.
• We propose a framework to address the problem of deciding to-tune or not-to-tune.
• We implemented a prototype of the framework with 486 datasets and 4 algorithms.
• The results indicate that our framework is effective at avoiding the effects of ineffective tuning.
• Our framework enables a life-long learning approach to the problem.


Hyper-parameter Optimization in Classification: To-do or Not-to-do
Ngoc Tran (a,c,*), Jean-Guy Schneider (a,b), Ingo Weber (a,c), A.K. Qin (a)

a Department of Computer Science and Software Engineering, Swinburne University of Technology, Hawthorn VIC 3122, Australia
b School of Information Technology, Deakin University, Burwood VIC 3125, Australia
c Data61, CSIRO, Eveleigh NSW 2015, Australia

Abstract

Hyper-parameter optimization is the process of finding suitable hyper-parameters for predictive models. It typically incurs high computational costs because the time-consuming model training process must be repeated to determine the effectiveness of each set of candidate hyper-parameter values. A priori, there is no guarantee that hyper-parameter optimization leads to improved performance. In this work, we propose a framework to address the problem of whether one should apply hyper-parameter optimization or use the default hyper-parameter settings for traditional classification algorithms. We implemented a prototype of the framework, which we use as a basis for a three-fold evaluation with 486 datasets and 4 algorithms. The results indicate that our framework is effective at supporting modeling tasks in avoiding the adverse effects of ineffective optimizations. The results also demonstrate that incrementally adding training datasets improves the predictive performance of framework instantiations and hence enables "life-long learning."

Keywords: hyper-parameter optimization, framework, Bayesian optimization, machine learning, incremental learning

* Corresponding author. Email addresses: [email protected] (Ngoc Tran), [email protected] (Jean-Guy Schneider), [email protected] (Ingo Weber), [email protected] (A.K. Qin)

In machine learning, and specifically for traditional classification algorithms, hyper-parameters and parameters are often treated as practically interchangeable, yet the difference between them is theoretically significant: hyper-parameters have to be specified by practitioners prior to training, while parameters are estimated from the data during training. For example, the number of hidden nodes is a hyper-parameter that has to be specified for Neural Networks (NN) before training, while the weights and biases of a NN are parameters and are learnt from the data during training. Given a learning algorithm A, the goal is to find a model M that maps examples X to the corresponding labels y of a finite dataset D. We have

y ≈ M_H(Θ | X)    (1)

where H is a set of hyper-parameters and Θ is a set of parameters. By specifying different sets of hyper-parameters, different models will be generated. As H governs the characteristics of models, it might be useful to search for a set of

hyper-parameters H* that yields an optimal model M*. In the literature, this problem is referred to as hyper-parameter optimization [1]. Examples of solutions to this problem in the literature include manual tuning, grid search, randomized search [1], particle swarm optimization [2], and Bayesian optimization (BO) methods [3]. Regardless of which hyper-parameter optimization method is used, this task is generally very expensive in terms of computational costs. For example, in a study by Thornton et al. [4], the average amount of time to optimize hyper-

parameters for each dataset was approximately 13 single core CPU hours. That being said, [5] revealed that the optimization task was ineffective in many cases where there was little or no improvement in the performance of models. This result is of particular interest to practitioners: if they knew that a dataset and/or an algorithm was unlikely to benefit from hyper-parameter optimization, they


could skip the task and save the time it takes for optimization. Conversely, if they knew that tuning improves performance, possibly in a significant way, they would be more likely to run the hyper-parameter optimization. However, there is currently no such guideline or recommender system for this specific problem. Instead, practitioners often make this decision on a case-by-case basis

based on their experience, hoping that their decisions will be appropriate for the current tasks. Developing a well-defined guideline to mitigate such practice in the community is certainly a long and arduous journey. As a step towards this goal, we propose a framework to systematically address the "to-tune-or-not-to-tune" problem for traditional classification algorithms. To examine the

proposed framework in detail, we instantiated the framework with 4 learning algorithms and a collection of 486 datasets, all having different characteristics. Our prototypical implementation of the framework reveals a critical effect of hyper-parameter optimization that has been neglected in relevant work, namely over-fitting. To verify and validate the proposed framework, we evaluated the


prototype in a three-fold evaluation. The evaluation results indicate two advantages of the proposed framework. First, it is effective at supporting modeling tasks in reducing the time spent on unplanned optimization as well as indicating possible improvements from tuning. Second, our framework enables a life-long learning approach, as incrementally adding training datasets improves the predictive per-


formance. The remainder of this paper is structured as follows. Section 1 introduces some background. In Section 2, related approaches are discussed. Section 3 presents the framework. In Section 4, we describe a case study of the framework. The evaluation of the case study is in Section 5. The work is discussed in


Section 6, and Section 7 concludes the paper with a summary of the main contributions as well as future work.

1. Background

In this section, we present some background information on meta-learning, on which our proposed approach is based. In addition, we briefly introduce the particular hyper-parameter tuning method that we used in our prototypical implementation of the framework, namely Bayesian optimization.


1.1. Meta-learning

In contrast to base-learning, where the goal is to accumulate learning experience from a specific learning task, meta-learning focuses on a higher level

of learning that exploits accumulated learning experience gained from previous base-learning tasks. The knowledge that meta-learning learns from is called meta-knowledge. Based on learning from meta-knowledge accumulated from past learning tasks, meta-learning provides supplementary information on the selection of various kinds of learning tasks, for example, suggesting appropriate


classifiers for classification tasks. Meta-knowledge is formed by characteristics of datasets and performance measures of algorithms on those datasets. Regarding meta-features, three typical approaches to computing measures from the data have been proposed: simple, statistical and information-theoretic measures; model-based measures; and


landmarkers. Those identified approaches measure the data from different perspectives. Simple, statistical and information-theoretic meta-features gather simple descriptive measures of the data (e.g. number of instances, number of classes) [6], statistical measures (e.g. degree of correlation between numeric attributes, skewness of numeric attributes) [7] as well as information theory


measures (e.g. class entropy). The model-based approach, on the other hand, characterizes the data indirectly: its meta-features are extracted from properties of models induced from the data. Finally, the landmarking approach exploits fast and simple learning algorithms to obtain a quick estimate of the performance on a given dataset [8].


Within the scope of the to-tune-or-not problem, meta-learning is of interest because we can use it to develop meta-learning systems which, although far from ideal, mimic how humans learn from previous experience to improve decision making under new circumstances. Intuitively, a person who has encountered a problem multiple times will likely have more experience and make


better decisions under new circumstances. In the same manner, the meta-learning systems learn meta-models from the meta-knowledge of previous learning experiences. For a new problem, the systems characterize it by a set of measurements

and then use the trained meta-models to make predictions based on those characteristics.

1.2. Bayesian optimization

In machine learning, one of our main concerns is to solve the problem

max_x f(x)    (2)

More often than not, f is a black-box function [9] whose structure (e.g. concavity, linearity) we do not know. If evaluations of f are computationally cheap, we could use simple methods, such as grid search and random search, to

draw many evaluation samples. However, if f is computationally expensive to evaluate, it is critical to minimize the number of evaluations of f. Bayesian optimization (BO) is a method designed for the global optimization of expensive black-box functions. Since the structure of f is unknown, it uses a surrogate model to approximate f, and an acquisition function to draw samples from the


search space. Although there are other methods that also use a surrogate of the objective function, the key difference of BO is that it uses Bayesian statistics to develop surrogates. For this reason, it is able to minimize the number of evaluations by defining a prior belief about f and iteratively updating the prior with drawn samples.


Conceptually, the BO algorithm works as follows:
1. For t = 0, build an initial surrogate model of f.
2. For t = 1, 2, ..., N, repeat:
   (a) Select a promising data point xt that yields the best value of the acquisition function on the surrogate model.
   (b) Evaluate f at xt.
   (c) Update the surrogate model with the new sample.
3. Return the xt that yielded the largest f(xt).
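To make the loop concrete, the following is a minimal, self-contained sketch of such a GP-based BO loop with a UCB acquisition function, written with scikit-learn. The toy objective, the one-dimensional search space and all parameter values are illustrative assumptions rather than part of any implementation described in this paper.

```python
# Minimal Bayesian-optimization sketch: GP surrogate + UCB acquisition (assumptions noted above).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def objective(x):
    # Stand-in for an expensive black-box function f(x).
    return -(x - 0.3) ** 2 + np.sin(5 * x)

def ucb(mu, sigma, kappa=2.0):
    # Upper Confidence Bound: favour points with high predicted mean or high uncertainty.
    return mu + kappa * sigma

# Step 1: initial surrogate built from a few random evaluations (t = 0).
X = rng.uniform(0, 1, size=(5, 1))
y = np.array([objective(x[0]) for x in X])
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

# Step 2: iteratively pick the most promising candidate, evaluate f, update the surrogate.
candidates = np.linspace(0, 1, 1000).reshape(-1, 1)
for t in range(20):
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(ucb(mu, sigma))]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

# Step 3: return the best point found.
best = X[np.argmax(y)]
print("best x:", best, "f(x):", y.max())
```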


2. Related work

Given a training dataset, building machine learning models typically involves

two steps: selecting a learning algorithm and then optimizing its hyper-parameters to maximize model performance. These two steps are often referred to as the algorithm selection and hyper-parameter optimization problems. The problem studied in this paper, deciding to-tune or not-to-tune, fits in between those two steps, as depicted in Figure 1. The literature review for this study thus


consists of an overview of selection methods for machine learning algorithms and/or hyper-parameters, and techniques for deciding to-tune or not-to-tune.

Figure 1: Overview of the studied problem, deciding to-tune or not-to-tune (i.e. grey shaded box), and its relationships with related problems – learning algorithm selection and hyperparameter optimization.

2.1. Selection methods for machine learning algorithms and/or hyper-parameters

There is a plethora of techniques that support the automatic selection of either machine learning algorithms or hyper-parameter values for particular al-

gorithms. For the algorithm selection problem, some early works propose to perform algorithm selection based on a rough estimate of the performance of algorithms on the data, obtained by running simplified versions of the algorithms on a data sample, such as the landmarking approach [8] and fast subsampling [10]. Later, meta-learning was proposed as a method to exploit learning experiences


from previous learning problems to enhance the selection task. Some examples include fast pairwise comparison [11] and loss-time curves [12]. In the same manner, the proposed techniques for the problem of automatic hyper-parameter value selection can essentially be classified into two dis-

tinct categories. One of them is a group of methods that is independent of

past learning problems, including sequential hyper-parameter optimization [13], random search [1] and Bayesian optimization [3]. The other group is methods that are capable of exploiting previous experiences to enhance future tasks, such as collaborative hyper-parameter tuning [14] and multi-task Bayesian optimization [15]. While all of those mentioned studies concentrate on selecting either


algorithms or hyper-parameter values, there are a few studies that propose techniques to choose both learning algorithms and their associated hyper-parameter values. For example, Auto-WEKA, based on WEKA1 , is considered as the earliest work in this area [4]. Similar to Auto-WEKA, hyperopt-sklearn [16] is a Python library that supports algorithm and hyper-parameter selection for


scikit-learn [17].

2.2. Methods for deciding to-tune or not-to-tune

Interestingly, and to some extent surprisingly, the effectiveness of hyper-parameter optimization on the performance of classification models has only been investigated by a few studies. Sun and Pfahringer [18] showed that the im-


pact of hyper-parameter optimization is not universal among datasets: approximately 20% of the datasets showed no improvement, and more than 90% of the datasets showed less than a 5% improvement. However, deciding "to-tune or not-to-tune" was not addressed in that study. Ridd and Giraud-Carrier [5] constructed meta-models to predict whether hyper-parameter optimization could improve the per-


formance of models beyond a single threshold. This study is considered the first to exploit meta-learning techniques to decide whether to tune or not. Still, the study constructed only a single prediction model for all learning algorithms. The low performance of the meta-model on the test sets of that study demonstrates that the blanket application of a single model to multiple learning algorithms


is inappropriate and, therefore, suggests constructing dedicated models for each algorithm. Mantovani et al. [19] studied the validity of hyper-parameters at

1 See https://www.cs.waikato.ac.nz/ml/weka/.

the algorithm level for the Support Vector Machine (SVM) specifically, using a pool of 143 datasets. Mantovani et al. were concerned with multi-class problems, but using accuracy as the default metric is deemed inappropriate since

multi-class skewness cannot be reflected. Furthermore, the optimization was performed only on one single fold instead of the whole training set. For this reason, comparing the performance between tuned and default models could be highly biased. More recently, Sanders and Giraud-Carrier [20] studied the ability of meta-learning techniques to inform the decision of whether "to-tune or not-to-tune" based on the expected performance improvement and computational costs. The results of the study are promising, indicating that meta-learning is useful to support the task. Nevertheless, the over-fitting issue is neglected in the study, since only resampling techniques were used to estimate the performance of base models.


In general, all of those mentioned studies are only concerned with the to-tune-or-not-to-tune problem in particular cases. In our study, we propose a framework that is extensive enough to accommodate various algorithms, datasets and optimizers. We examine our framework by using a prototypical implementation of it with 4 learning algorithms and 486 datasets, all having


different characteristics. We also inspect the effectiveness of hyper-parameter optimization by evaluating tuned and default base models on unseen data. By doing this, we can investigate in depth a critical problem in machine learning that, surprisingly, has been neglected in the relevant works, namely over-fitting. As will be shown in the inspection, over-fitting is substantial in tuning when


tuned base models actually perform worse than default base models on the unseen data. Moreover, while the mentioned studies only use two-class predictive meta-models, we propose predictive meta-models that handle three-class prediction to reflect the various effects of tuning that we reveal in the inspection.


3. The proposed framework

In Figure 2, we illustrate the conceptual model of the proposed framework. The framework consists of two distinct phases: the meta-model training phase and the application phase. In the training phase, the aim is to induce meta-models that are capable of deriving utilizable information from meta-knowledge to make a recommendation in the application phase, that is,


to give a user feedback on the anticipated effect of hyper-parameter optimization for a given learning algorithm Ak . The framework contains a number of hot-spots (or variation points) that allow practitioners to adjust the framework to their specific needs.

Figure 2: Overview of the proposed framework. In the training phase, it outputs the metamodel WAk to predict the effectiveness of hyper-parameter optimization for a learning algorithm Ak of interest. In the application phase, it outputs the predicted effect of tuning Ak on the data set Dnew . Besides, Dnew can also be added to the dataset repository (indicated by the dashed line with arrow from “New dataset” to “Repository of datasets”) for future re-training.

In the meta-model training phase of the framework, we take the learning algorithm of interest Ak and apply it to an initial repository of N data sets Di, both with default hyper-parameters ("Default" in Figure 2) as well as using an optimizer of choice to tune the hyper-parameters ("Tuned" in Figure 2). The process to evaluate the performance of Ak begins with splitting a data set Di into two sets with a ratio of choice, namely a training set Di(training) and a test set Di(test). We then train Ak with its default hyper-parameters on the training set Di(training) to obtain the default model, and tune Ak using the selected optimizer, also on Di(training), to attain the tuned model. We proceed by assessing the performance of both the default and tuned models on the test set Di(test) and use the attained values to compare their performance. In order to avoid a potential bias of splitting Di into a training and test set, we repeat the process of evaluating the performance of Ak on Di – using different training and test sets – until a termination criterion of choice is met. The performance comparisons for each of the repetitions are then aggregated (e.g. using mean or median values).


In addition to evaluating the performance of the algorithm of interest Ak on the data set Di , we also extract attributes of Di to create meta-features. By associating the meta-features of Di to the aggregated performance measures, we obtain meta-knowledge consisting of (i) the data set’s Di meta-features as the attributes and (ii) the aggregated performance measures of Ak as target


variable. This process is now repeated for all N data sets in the repository in order to obtain the corresponding meta-knowledge (i.e. combined meta-features and aggregated performance measures). The entire meta-knowledge is then used to train a meta-model WAk (using a learning algorithm of choice) as a predictor


of the effectiveness of hyper-parameter optimization on algorithm Ak . The reader may note that the dataset repository is not fixed, but can be incrementally increased over time in order to add new meta-knowledge. Once sufficient new meta-knowledge has been added, the meta-model WAk can be re-trained to exploit the additional meta-knowledge. We will discuss the effect


of this incremental learning in Section 5.3.

In the application phase of the framework, the aim is to determine the effectiveness of tuning the hyper-parameters of Ak on a new data set Dnew. We extract the meta-features for Dnew and use the meta-model WAk induced from the training phase to predict the effectiveness of hyper-parameter optimization

for Ak on Dnew . Besides, we can also add Dnew to the dataset repository for future meta-model re-training. In the following section, we will demonstrate the applicability of the framework by presenting a prototype implementation thereof using a number of different learning algorithms as well as specific choices of variation points, respec-


tively.
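To illustrate the application phase, the following is a minimal sketch under stated assumptions: a trained meta-model WAk and a meta-feature extraction routine are assumed to be available as Python objects, and the names recommend_tuning and extract_meta_features are ours, not the paper's.

```python
# Sketch of the application phase: characterise a new dataset and query the trained meta-model.
import numpy as np

def recommend_tuning(meta_model, extract_meta_features, X_new, y_new):
    """Predict the anticipated effect of tuning the base algorithm A_k on a new dataset D_new."""
    features = extract_meta_features(X_new, y_new)                  # meta-features of D_new
    prediction = meta_model.predict(np.asarray(features).reshape(1, -1))[0]
    return prediction  # predicted effectiveness of hyper-parameter optimization for A_k on D_new
```

After the prediction has been made, D_new (with its meta-features and, eventually, its measured tuning outcome) can also be appended to the dataset repository for future re-training of WAk, as indicated by the dashed arrow in Figure 2.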

4. A prototypical implementation of the proposed framework

In order to validate the proposed framework and gain further insights into its applicability, we implemented a prototype instantiation. In Figure 2, we annotate 6 items to be specified for the implementation. They are:

1. Item 1: a target algorithm for which the to-tune-or-not-to-tune decision is to be made.
2. Item 2: a collection of training datasets for the target algorithm.
3. Item 3: a performance measurement strategy to train and evaluate tuned and default models.
4. Item 4: an approach to characterize the training datasets.
5. Item 5: a meta-knowledge base to train our prospective meta-model.
6. Item 6: an approach to fit a meta-model on the meta-knowledge.

Thus, in this section, we describe our implementation of all of these 6 items in 6 corresponding sub-sections (Sections 4.1 - 4.6).

4.1. Base learning algorithms (Item 1 - Figure 2)


We selected four base learning algorithms for our prototypical implementation, namely Decision Tree (DT), Random Forest (RF), Neural Network (NN) and Support Vector Machine (SVM). We chose these algorithms because they


are generally popular and readily available in machine learning libraries, such as scikit-learn [17] and caret [21]. Moreover, these algorithms represent different types of machine learning methods. In particular, DT belongs to the class of tree-based algorithms; RF belongs to the class of bagging algorithms; NN belongs to the class of multi-layer perceptron algorithms; and SVM belongs to the class of maximum-margin hyperplane algorithms. Table 1 summarizes the hyper-parameters for each chosen learning algorithm, their default values,

and the used tuning ranges, respectively. Although we used more than one algorithm to demonstrate the framework, there are still many other algorithms. Practitioners who prefer different algorithms and/or hyper-parameters can always instantiate the framework with their own settings.

Table 1: Description of hyper-parameters for the chosen learning algorithms (P: number of predictor variables in a dataset).

DT
  cp — the amount of relative improvement that a model needs to split a node. Small values result in large trees and over-fitting; large values result in small trees and under-fitting. Default: 1e-2. Tuning range: [1e-2, 1e-1].
  min-split — the minimum number of observations per node to split. Increasing it could lead to over-fitting because the model would learn relationships that are very specific to some training samples. Default: 20. Tuning range: [2, 32].
  max-depth — the maximum depth of any node, used to control the complexity of the model. Default: 30. Tuning range: [1, 30].
  split — the node splitting criterion. Default: gini. Tuning range: {gini, information}.

RF
  ntree — the number of trees in the forest. Generally, increasing the number of trees could make the model more stable; however, a large number of trees is computationally expensive to fit. Default: 5e2. Tuning range: [1e2, 1e3].
  mtry — the number of variables randomly sampled for splitting at each node. It affects the importance estimates of the variables. Default: √P. Tuning range: [1, P/2].
  node-size — the minimum size of terminal nodes. Large values result in smaller trees being grown. Default: 1. Tuning range: [1, 20].

NN
  size — the number of hidden nodes. Increasing the number of hidden nodes could lead to over-fitting. Default: P. Tuning range: [1, P + 10].
  threshold — the threshold to stop training iterations, tuned to control the fitting time, which could be computationally expensive for NN. Default: 1e-2. Tuning range: [1e-3, 5e-1].
  learning-rate — the learning rate of backpropagation, tuned to control the convergence of the models. Default: 1e-2. Tuning range: [1e-5, 5e-1].
  act-func — the activation function. Default: logistic. Tuning range: {logistic, tanh}.

SVM
  C — the regularization parameter that controls the trade-off between a smooth decision boundary and the number of correctly classified training samples. Default: 1. Tuning range: (0, 1].
  tolerance — the training stopping criterion, used to control training time, especially for non-linear kernels where model fitting is computationally expensive. Default: 1. Tuning range: [1e-2, 1].
  gamma — the parameter controlling the variance of the non-linear SVM model. Increasing gamma may lead to over-fitting. Default: 1/P. Tuning range: (0, 1).
  coef-0 — a parameter specific to the polynomial kernel, used to control the influence of the degree of the polynomial on the model. Default: 0. Tuning range: (0, 1).
  poly-degree — the degree of the polynomial kernel function, important for controlling training time and over-fitting, since larger degree values consume more computational resources and can lead to over-fitting. Default: 3. Tuning range: [2, 5].
  kernel — the kernel type. Default: radial. Tuning range: {linear, poly, radial, sigmoid}.

4.2. Base datasets (Item 2 - Figure 2) 270

We collected all of the base datasets used for the evaluation from OpenML [22]. The data sets retrieved from OpenML are widely used in the literature

14

and come from various sources, including UCI2 , KDD3 and WEKA4 . From an initial candidate set of more than 1,000 datasets, we filtered out datasets that did not meet the following criteria: (i) no missing values and (ii) between 100 275

and 100,000 instances. The filtering process resulted in 486 datasets from a variety of different domains (finance, biology, healthcare, etc.). Some example datasets are German credit risk5 , Run/walk information6 and KDD Cup 20097 . 4.3. Performance measure for base algorithms (Item 3 - Figure 2) In this section, we discuss our procedure to measure the performance of ev-

280

ery base algorithm Ak on each of the base datasets Di in order to obtain the required meta-knowledge for the meta-learning. Essentially, we had to measure the performance of base algorithms using both default and optimized (i.e. tuned) hyper-parameters so that we can compare the corresponding default and optimized performances, respectively. In order to make our comparison reliable,

285

we emphasized the choice of a non-biased performance measure procedure. Hyper-parameter optimization method for base models. For our prototype implementation, we chose Bayesian Optimization (BO) methods [3] to perform hyper-parameter optimization for each of the base learning algorithms. More specifically, we used Gaussian Processes (GP) [23] as the surrogate function

290

and Upper Confidence Bounds (UCB) [24] as the acquisition function for BO. We also heuristically configured our optimizer to have 5 initial points and 5 iterations, respectively. We further set a time limit of max. 4 hours for each optimization task. Performance metric for base models. We measured the performance of both op-

295

timized and default base models by macro-averaged F-measure [25]. Essentially, 2 UC

Irvine Machine Learning Repository, see http://archive.ics.uci.edu/ml/ Knowledge Discovery in Databases, see http://kdd.ics.uci.edu/ 4 WEKA Collections of Datasets, see https://www.cs.waikato.ac.nz/ml/weka/ 5 see https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data) 6 see https://www.openml.org/d/40922 7 see https://www.openml.org/d/1112 3 UCI

15

for a dataset with q classes, we calculated fλ with respect to each class λ using the one-against-all approach, that is, positive for samples of the λ class and negative for all other samples (cf. Equation 3). The macro-averaged F-measure f is the average value of all fλ values (cf. Equation 4).

fλ = (1 + β²) · tpλ / ((1 + β²) · tpλ + β² · fnλ + fpλ)    (3)

f = (1/q) · Σλ=1..q fλ(tpλ, fpλ, tnλ, fnλ)    (4)


where tpλ denotes the true positives, fpλ the false positives, tnλ the true negatives, and fnλ the false negatives for class λ. We chose β = 1.

Performance measure process. Figure 3 illustrates the flow of the performance measure process as implemented in our prototype. We divided every dataset Di of the repository, in a stratified manner, into 70% for training (Di(training)) and 30% for testing (Di(test)) (Step 1). Please note that we used stratified sampling to ensure that both sets have approximately the same ratio of examples in each class as the original dataset.

We then performed training and performance evaluation using Di(training) and Di(test) – as indicated in Figure 2 – for both the default hyper-parameters (Steps 2 and 3) as well as the Bayesian Optimization (Steps 4 to 6). Please note that BO used 10-fold cross-validation on the training set Di(training) (Step 4) to optimize the hyper-parameters of Ak, and subsequently used the resulting optimized hyper-parameters Hopt for model building (Step 5). In Step 3, we computed the macro-averaged F-measure value fDi(d) of the default model Md. Similarly, the macro-averaged F-measure fDi(opt) of the optimized model Mopt is determined in Step 6. In order to reduce the effect of bias in the splitting (Step 1), the splitting of the dataset, model training and model evaluation were repeated 20 times. Please note that for each of the four learning algorithms Ak, the same 20 splits into training and evaluation sets were used for the corresponding runs.
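For illustration, the following condenses this per-dataset measurement procedure into a sketch (20 stratified 70/30 splits, default versus tuned models, macro-averaged F-measure, and the relative improvement defined in Eqs. (5)–(7) below). The helper tune_with_bo stands in for the BO-based tuner and is an assumption, not the paper's code.

```python
# Sketch of the per-dataset evaluation of default vs. tuned models under stated assumptions.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import f1_score
from sklearn.base import clone

def relative_improvement(estimator, tune_with_bo, X, y, n_repeats=20, seed=0):
    splitter = StratifiedShuffleSplit(n_splits=n_repeats, test_size=0.3, random_state=seed)
    f_default, f_tuned = [], []
    for train_idx, test_idx in splitter.split(X, y):
        X_tr, X_te, y_tr, y_te = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
        # Default hyper-parameters.
        default_model = clone(estimator).fit(X_tr, y_tr)
        f_default.append(f1_score(y_te, default_model.predict(X_te), average="macro"))
        # Hyper-parameters tuned on the training part only (e.g. BO with 10-fold CV),
        # returning an unfitted estimator configured with the tuned values.
        tuned_model = tune_with_bo(clone(estimator), X_tr, y_tr).fit(X_tr, y_tr)
        f_tuned.append(f1_score(y_te, tuned_model.predict(X_te), average="macro"))
    f_d, f_opt = np.mean(f_default), np.mean(f_tuned)
    return (f_opt - f_d) / f_d   # y_Di as in Eq. (7)
```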

Figure 3: Overview of the procedure to measure the performance of every base learning algorithm Ak on each dataset Di∈{1,...,351} , under two scenarios, default and optimized. For each dataset, we repeat the procedure 20 times.

The resulting 20 macro-averaged F-measure values (one each for the default and optimized models) were averaged over the 20 runs as given below:

f̄Di(opt) = (1/20) · Σi=1..20 fDi(opt)    (5)

f̄Di(d) = (1/20) · Σi=1..20 fDi(d)    (6)

Given the mean values of the performance of the default and optimized models, we defined the relative improvement

yDi = (f̄Di(opt) − f̄Di(d)) / f̄Di(d)    (7)

as a measure of how much the hyper-parameter optimization improved (or possibly decreased) the performance of a learning algorithm Ak.

4.4. Meta-features extraction (Item 4 - Figure 2)

For this work, we used a set of 95 meta-features of a number of different types that characterize the datasets Di in the repository: 12 general meta-


features [26], 45 statistical meta-features [27], 24 model-based meta-features [28], and 14 landmarking meta-features [8]. In Table A.7, we list the details of all meta-features we used. Moreover, 28 out of the initial 95 meta-features contained NaN (Not a Number) or NA (not available) values and were subsequently removed, resulting in a total of 68 meta-features that were used for meta-model


building.

4.5. Meta-knowledge (Item 5 - Figure 2)

We standardized each meta-feature of the meta-knowledge CAk to obtain zero mean and unit variance, respectively, resulting in CAk(preprocessed). The purpose of this standardization process is to counter the effect that meta-features with

larger variance could dominate other features in terms of scale rather than contribution. Class categorization. In the literature, the effect of hyper-parameter optimization on models is generally determined by two classes. For example, Ridd and Giraud-Carrier [5] used a single binary classifier to predict for all algorithms,

345

and Montovani et al. [19] trained a binary classifier to predict for SVM. However, based on the analysis of our results, using two classes is insufficient. In Figure 4, we illustrate the distribution density of the relative performance improvement yAk for all concerned algorithms Ak and all 486 datasets. The shape of the distributions implies that hyper-parameter optimization has harmful ef-

350

fects in some instances (i.e. the performance decreases), minimal positive effect (0 < yDi < 0.05) in many instances, and a positive performance improvement (yDi ≥ 0.05) for the remaining instances. Table 2 summarizes the relative distributions for all four learning algorithms. Using two classes only (for yDi < 0 and yDi ≥ 0) will neglect the effect of

355

minimal positive improvement for a significant number of instances. Therefore, in this study, we use three classes to characterize the effects of hyper-parameter optimization: • Do-not-tune: for yDi ≤ 0; 18

Figure 4: Distribution densities of the relative performance improvement yDi of all algorithms. The 0 and 0.05 cut-off lines are highlighted by red dashed lines.

Table 2: Effects of hyper-parameter optimization on the learning algorithms.

Algorithm                 | yDi ≤ 0 | 0 < yDi < 0.05 | yDi ≥ 0.05
Decision Tree             | 51%     | 42%            | 7%
Neural Network            | 45%     | 32%            | 23%
Random Forest             | 51%     | 39%            | 10%
Support Vector Machine    | 28%     | 17%            | 55%

• Recommend-to-tune: for 0 < yDi < 0.05; and 360

• Strongly-recommend-to-tune: yDi ≥ 0.05 Intuitively, the Do-not-tune class represents instances where optimized models perform worse than default base models, and hence hyper-parameter optimization is not recommended, whereas the Strongly-recommend-to-tune class represents instances where optimization clearly results in a better performance.

365

For the “middle ground”, that is, the Recommend-to-tune class, context information about the classification problem at hand will determine whether a relative performance improvement of less than 0.05 is significant enough to merit the cost of hyper-parameter optimization or not. The reader may note that while it is seemingly reasonable to make the first

370

cut-off at zero to distinguish negative and positive effects of hyper-parameter optimization, the value of the second cut-off at 0.05 is to some extent “arbitrary” (but justifiable given the distribution density of the relative performance 19

improvement as shown in Figure 4), and depending on the specific modeling condition, a different cut-off point may be more appropriate. 375

Consequently, we translate the variable yAk into a discrete variable yAk encoding the Do-not-tune, Recommend-to-tune, and Strongly-recommend-to-tune (preprocessed)

classes, and use CAk

as the predictor variables and yAk as the depen-

dent variable for the meta-lerning step, respectively. 4.6. Meta-model training (Item 6 - Figure 2) 380

Before training meta-model, we used a stratified split to divide the metaknowledge dataset into two subsets: 85% of the meta-knowledge data for training, and 15% for hold-out testing. The learning algorithm xgboost [29] was chosen to learn the corresponding meta-model. To find settings for xgboost models, we used random search. Model searching was based on 10-fold cross-

385

validation results of meta-models. The configuration of our search included runtime limit of 2 hours, building no more than 100 models and using log-loss as the stopping metric. The tuning range of hyper-parameters for xgboost is also listed in Table 3. Table 3: Hyper-parameter tuning for xgboost meta-models.

Hyper-parameter description

Tune range

1 Row sampling rate per tree

[2e-1, 1]

2 Column sampling rate per tree

2e-1, 1]

3 Column sampling rates per split

[2e-1, 1]

4 Min. relative error improvement threshold to make a split [1e-10, 1e-1] 5 L1 regularization

[0, 5e-1]

6 L2 regularization

[0, 5e-1]

7 Learning rate

[1e-5, 5e-1]

5. Prototype evaluation 390

In this section, we evaluate all meta-models using three evaluations. In the first evaluation, the aim is to estimate the predictive performance of the 20

Table 4: Performance evaluattion of the meta-models on the hold-out test set. For precision and recall of each class, we highlight values that are important according to our given assumptions in Section 5.1.

Base algorithm

Accuracy

Class-I

Class-II

Class-III

Pre.

Rec.

Pre.

Rec.

Pre.

Rec.

DT

0.80

0.87

0.83

0.63

0.75

1.00

0.67

RF

0.78

0.81

0.79

0.76

0.76

0.67

1.00

NN

0.80

0.67

0.67

0.80

1.00

0.86

0.60

SVM

0.91

1.00

0.50

1.00

0.82

0.88

1.00

meta-models. In the second evaluation, our goal is to shed some light on the impact of the proposed framework on base modeling tasks in terms of how our approach would help in avoiding unjustified optimizations. Finally, in the third 395

evaluation, we investigate the possibility of using our approach as an ongoing learning solution. 5.1. Predictive performance evaluation For the first evaluation (performance prediction), we trained meta-models (one for each of the 4 learning algoriths) on the training sets (using the setting

400

described in Section 4.6) and then evaluated their predictive performance on the test datasets. By reserving a portion of data for testing instead of using resampling techniques such as cross-validation, we can draw a more objective conclusion regarding the performance of our meta-models [30]. Table 4 summarizes the performance measures of all meta-models on the

405

15% hold-out test set. The highest performance belongs to the meta-model for SVM with 91% accuracy, followed by the meta-model for NN and DT with 80% accuracy. The meta-model for RF demonstrated the least performance with 78% accuracy. The reader may note that the accuracy of each meta-model is significantly better than the app. 33% a random classifier would achieve.

410

However, the accuracy metric does not tell the full story – lets look at the

21

precision and recall for each meta-model for each of the three classes. In the situation that a dataset does not show a performance improvement for hyperparameter optimization (i.e. Class-I in Table 4), running an unnecessary optimization is likely to be of more concern than missing out on an improved perfor415

mance. Hence, False Positives are more problematic than False Negatives, and precision rather than recall is a more relvant metric. For the other two classes (Recommend-to-tune – Class-II; Strongly-recommend-to-tune – Class-III), missing out on an improved performance is likely to be a less desirable outcome than running an unnecessary optimization. Thus, False Negatives are more of

420

a concern than False Positives and, consequently, recall is more relevant than precision. 5.2. Meta-model impact on modeling evaluation Although the performance estimation has shown that the meta-models have reasonable to good performance, we can see that it is still inexplicit about how

425

our proposed approach would be meaningful to base modeling tasks. Therefore, we evaluate the impact of the meta-models from that perspective. We propose two modeling scenarios S1 and S2, corresponding to two distinguished expectations that modeling tasks may have. First, S1 is the improvement-aspriority scenario. The critical factor of the modeling tasks in this scenario is

430

improvement. This scenario is common in practice where any positive model improvement will always be desired. In contrast, S2 is the time-as-priority scenario where time is the most critical factor and optimization will be performed only for substantial improvement (i.e. yDi ≥ 0.05). 5.2.1. Modeling impact assumptions

435

In Table 5, we present a decision table according to the two proposed scenarios. For both scenarios, we have the actual outcome, prediction outcome and associated impact outcome, respectively. In the scenario S1, for example, given the actual outcome is Do-not-tune, then if the prediction is correct, the impact will be “saved time” because optimization is avoided and time will not be spent

22

Table 5: Decision table for two scenarios of alternatively using time and improvement as the priority in modeling. (Act. is actual, Pred. is prediction, imp. is improvement).

S1: Improvement-as-priority Act.

Pred.

Impact

Act.

Pred.

Impact

I

I

saved time

I

I

saved time

II

wasted time

II

saved time

III

wasted time

III

wasted time

I

missed imp.

I

saved time

II

claimed imp.

II

saved time

III

claimed imp.

III

wasted time

I

missed imp.

I

missed imp.

II

claimed imp.

II

missed imp.

III

claimed imp.

III

claimed imp.

II

III

440

S2: Time-as-priority

II

III

on an ineffective optimization. If the prediction is either Recommend-to-tune or Strongly-recommend-to-tune, however, an optimization will be performed. This is likely to result in “wasted time” as an ineffective optimization is performed. Lets consider another situation where the base model’s actual outcome is Recommend-to-tune. If the prediction is Do-not-tune, the outcome “miss per-

445

formance” as an opportunity to improve the performance is missed. On the other hand, if the prediction is either Recommend-to-tune or Strongly-recommend-totune, optimization will be performed. For those cases, the outcome will be “claimed performance” because the optimization will be performed to improve the performance of the base models. The rest of the table can be interpreted in

450

the same manner. 5.2.2. Impact measurement We calculated the impacts of prediction methods in terms of saved time and claimed improvement γ p and γ t using equations 8 and 9 for the whole test set. Essentially both γ p and γ t represent the percentage of saved time and

455

claimed improvement respectively, that a prediction method could impact on 23

the datasets. γp = P t

5.2.3. Prediction methods

P

pclaimed P + pmissed

pclaimed

γ =P

P

tsaved P tsaved + twasted

(8)

(9)

There are three methods to be used in this evaluation. The first method is META, which is using our meta-learning based approach to make a prediction. 460

The second method is TRI. This method makes a random selection of each of three classes Do-not-tune, Recommend-to-tune and Strongly-recommend-totune with an equal probability 33.33% for each class. The third method is BIN, representing a random selection between to-tune and not-to-tune with a 50% probability for each. This method is common in practice but because it only

465

considers two classes, it is not applicable to the scenario S2. 5.2.4. Results In Table 6, we present the evaluation results. Generally, it is apparent that META outperformed both TRI and META in both scenarios S1 and S2 for all base algorithms in terms of improvement. With regards to time measurement,

470

META had better performance than the others in most cases. Specifically in scenario S1 where improvement is prioritized, META always claimed 98% 100% potential improvement for all models across all algorithms. Although S1 focuses on performance improvement, META still helped to avoid 59% 72% time on unnecessary optimizations. Indeed, it performed better than the

475

others in two out of the 4 algorithms with regards to time measurement. On the hand, although TRI and BI had better outcomes than META in some cases, their performance highly fluctuated. Moreover, in the scenario S2 where time is prioritized over performance improvement, META had a much better performance than TRI in both measurements for all algorithms. META avoided

24

480

96% - 99% unnecessary optimization time and still helped to claim 83% - 97% of potential improvement, depending on the context. Table 6: Measure of the impact on the hold-out test set in both scenarios for all prediction methods. We highlight best values for performance and time measurements. (META is our proposed approach, TRI is random selection with 33.33% of chance for each class, BIN is binary random selection for optimization and non-optimization with 50% of chance for each. Due to binary selection, BIN method is not applicable to S2).

Scenario

S1

S2

Base algorithm

Improvement

Time

META

TRI

BIN

META

TRI

BIN

DT

1.00

0.61

0.84

0.59

0.34

0.68

RF

0.97

0.15

0.32

0.59

0.27

0.55

NN

1.00

0.67

0.73

0.72

0.64

0.10

SVM

0.98

0.81

0.20

0.57

0.60

0.62

DT

0.97

0.33

0.96

0.74

RF

0.95

0.00

0.99

0.50

NN

0.87

0.17

0.97

0.76

SVM

0.83

0.33

0.98

0.79

5.3. Performance enhancement over time evaluation In this section, we present an evaluation to validate the life-long learning capability of our proposed framework (i.e. incrementally adding more data sets 485

to generate new meta-knowledge). Since we are limited in terms of the number of available datasets, we propose to train meta-models with increasing amounts of datasets and test using the same, fixed test set. Algorithm 1 describes our technique to perform this evaluation. We save nbP ts = 150 data points for our incremental training. In the initial run, we train with only nbT rainingP ts =

490

N − nbP ts data points where N is the total number data points in the original training set. For each run, we add 2 more data points from the kept set to the previous training set, retrained the meta-model WAk with newly created training set and tested it on the identical test set.

25

Algorithm 1: Performance measure with different training amounts. Data: γtr the training set, γtest the unseen test set, N the number data points of γtr . Result: mse mean square error. 1

begin nbP ts ← 150

2

nbP tsP erRun ← 2

3

nbT rainingP ts ← N − nbP ts

4

nbRuns ← nbP ts/nbP tsP erRun

5

for i ← 1 to nbRuns do

6

begin

7

← γ[1 : nbT rainingP ts]

8

γsub

9

m ← Train(algorithm = a, data = γsub

tr )

mse ← Predict(model = m, newdata = γtest )

10

nbT rainingP ts ← nbT rainingP ts + nbP tsP erRun

11

end

12

end

13 14

tr

end

We applied the technique to all algorithms studied in this work and collected 495

the mean square error (MSE) of all their corresponding meta-models on the test sets over a total of 75 runs, that is, each run increasing the number of training datasets by 2. Figure 5 describes the trend lines of meta-model performance. We can see that MSE values gradually diminish for all 4 algorithms. Although this evaluation is only performed using a relatively small number of datasets for

500

incremental training, the trend lines of MSE indicate that adding more datasets is likely to enhance the predictive performance of the meta-models over time. Further investigations with an increased number training data will be required to provide additional evidence for this claim, though.

6. Discussion 505

Since real-world applications are generally time-sensitive, deciding whether to tune or not is still critical. As shown in the study, typical random guessing methods, although are common in the community, are unreliable for learning algorithms. Using them to decide tuning or not could eventually lead to either wasting time on unexpected optimization or missing potential improvement in

26

Figure 5: Performance trend lines of meta-models with different amounts of training sets.

510

model performance. In contrast, the evaluations show that our approach is far more reliable and effective. By using it to decide to-tune-or-not problem, modeling tasks could claim 98% - 100% potential improvement in performance and avoid 59% - 72% amount of unplanned optimization time, if the expectation of modeling tasks is always improving model performance. And if time is priori-

515

tized instead of performance, modeling tasks could claim 83% - 97% potential improvement in performance and avoid 98% - 99% amount of unplanned optimization time. More importantly, the third evaluation shows that our approach is a life-long learning solution and is likely to increase its performance if more training data is added.

520

Hyper-parameter optimization, although being computationally expensive, is a common practice to tune machine learning algorithms. However, the prototype of the framework has confirmed and additionally revealed that the effect of this task is non-uniform. The results of our 360-hour experiment8 in the implementation (Section 4.3) are two-fold. First, it confirms knowledge that tuning

525

could not be well justified in many cases, but still an extremely useful treatment to models in several cases when models could be improved substantially [19]. Second, it adds an important insight, although not new in the community, but has been neglected in relevant works: optimization can have harmful effects as 8 The

approximate total run-time of all 4 Microsoft Azure clusters Standard F32s v2 (see

https://azure.microsoft.com/en-us/pricing/details/batch/

27

in a number of cases, tuned models perform worse than default on unseen data. 530

Although we have demonstrated the meta-models with unseen data points and received reasonable to high performance, it is necessary for us to emphasize that the meta-models were learnt from data points of a “region”. In Figure 6, we used t-SNE [31] to visualize the multi-dimensional space of the data specifically used for DT algorithm in two dimensions. The region we intuitively refer to

535

is a two dimensional space constrained by all training points of the region. Specifically in the figure, it is a 2D region constrained by x ∈ [−30.00, 40.00] and y ∈ [−50.05, 51.02]. In the experiment, we have tested the meta-model with test data points (i.e. red points) come from the same region. We have showed that the meta-models for DT algorithm works well for data points come

540

from this region. For data points that are outside this region, that is, ones with x ∈ / [−30.00, 40.00] and/or y ∈ / [−50.05, 51.02], we conjecture that the predictive capability of the meta-model to be less accurate. Further experiments are necessary to provide further justification to this claim.

Figure 6: t-SNE visualization of the meta-knowledge for Decision Tree specifically.
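For reference, a two-dimensional t-SNE projection of the (standardised) meta-features, as used for Figure 6, can be produced along the following lines; the perplexity value and plotting details are assumptions.

```python
# Illustrative t-SNE projection of the meta-knowledge into two dimensions.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_meta_knowledge(meta_features, labels):
    # `labels` are assumed to be numerically encoded so they can be used as colours.
    embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(meta_features)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=10)
    plt.xlabel("t-SNE dimension 1")
    plt.ylabel("t-SNE dimension 2")
    plt.show()
```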

Moreover, we are aware that there are a number of factors in real-world ap545

plications that could be different compared to our case study. A few of them include characteristics of datasets, as well as the plethora of available learning algorithms and optimizers, respectively. Although we have employed 486 datasets from various domains and 4 different learning algorithms in the case 28

study, there are certainly many kinds of datasets and learning algorithms with 550

different characteristics. Additionally, learning algorithms that employ optimization techniques other than BO could also behave differently than what we have investigated in our experiments. Having said that, those mentioned factors are not an inherent limitation of our proposed framework. In fact, the framework is suitable for different classification algorithms, optimizers and datasets, and

555

allows for incrementally adding meta-knowledge to improve the performance of the hyper-parameter optimization predictor. 7. Conclusions and Future Work In this work, we have proposed a framework for predicting the effectiveness of hyper-parameter optimization for traditional classification algorithms. We

560

have also illustrated the framework with a prototype of 4 different learning algorithms and 486 datasets. Our empirical evaluation results indicate that the framework technique can be used to systematically and incrementally determine the problem “to-tune-or-not-to-tune.” In future work, we plan to implement our approach in form of a supporting

565

component for autonomous data mining systems. Particularly, with such system, it will not be necessary to specify machine learning algorithms beforehand as required in our proposed approach. Instead, the system could automatically plan data mining processes for every learning problem according to user criteria as well as characteristics of the data. Furthermore, the framework in this paper

570

could be further implemented by using a larger collection of datasets and/or more expensive algorithms, in order to support the validation of the framework in various conditions. References References

575

[1] J. Bergstra, Y. Bengio, Random Search for Hyper-parameter Optimization, Journal of Machine Learning Research 13 (1) (2012) 281–305.
[2] J. Kennedy, R. Eberhart, Particle Swarm Optimization, in: Proceedings of the 1995 International Conference on Neural Networks (Perth, Australia), Vol. 4 of ICNN'95, IEEE, Piscataway, NJ, USA, 1995, pp. 1942–1948. doi:10.1109/ICNN.1995.488968.
[3] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. de Freitas, Taking the Human Out of the Loop: A Review of Bayesian Optimization, Proceedings of the IEEE 104 (1) (2016) 148–175. doi:10.1109/JPROC.2015.2494218.
[4] C. Thornton, F. Hutter, H. H. Hoos, K. Leyton-Brown, Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms, in: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chicago, IL, USA), KDD'13, ACM, New York, NY, USA, 2013, pp. 847–855. doi:10.1145/2487575.2487629.
[5] P. Ridd, C. Giraud-Carrier, Using Metalearning to Predict when Parameter Optimization is Likely to Improve Classification Accuracy, in: Proceedings of the 2014 International Conference on Meta-learning and Algorithm Selection (Prague, Czech Republic), MLAS'14, CEUR-WS, Aachen, Germany, 2014, pp. 18–23.
[6] R. Engels, C. Theusinger, Using a Data Metric for Preprocessing Advice for Data Mining Applications, in: Proceedings of the 13th European Conference on Artificial Intelligence (Brighton, UK), ECAI'98, IOS Press, Amsterdam, The Netherlands, 1998, pp. 430–434.
[7] J. Gama, P. Brazdil, Characterization of Classification Algorithms, in: Proceedings of the 7th Portuguese Conference on Artificial Intelligence (Madeira Island, Portugal), EPIA'95, Springer, Berlin, Heidelberg, 1995, pp. 189–200. doi:10.1007/3-540-60428-6_16.
[8] B. Pfahringer, H. Bensusan, C. G. Giraud-Carrier, Meta-Learning by Landmarking Various Learning Algorithms, in: Proceedings of the 17th International Conference on Machine Learning (Haifa, Israel), ICML'00, Morgan Kaufmann, San Francisco, CA, USA, 2000, pp. 743–750.
[9] J. Mockus, Application of Bayesian Approach to Numerical Methods of Global and Stochastic Optimization, Journal of Global Optimization 4 (4) (1994) 347–365. doi:10.1007/BF01099263.
[10] J. Petrak, Fast Subsampling Performance Estimates for Classification Algorithm Selection, in: Proceedings of the ECML-00 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination (Barcelona, Spain), ECML'00, 2000, pp. 3–14.
[11] R. Leite, P. Brazdil, Active Testing Strategy to Predict the Best Classification Algorithm via Sampling and Metalearning, in: Proceedings of the 19th European Conference on Artificial Intelligence (Lisbon, Portugal), ECAI'10, IOS Press, Amsterdam, The Netherlands, 2010, pp. 309–314.
[12] J. N. van Rijn, S. M. Abdulrahman, P. Brazdil, J. Vanschoren, Fast Algorithm Selection Using Learning Curves, in: Proceedings of the 14th International Symposium on Intelligent Data Analysis (Saint-Etienne, France), IDA'15, Springer, Berlin, Heidelberg, 2015, pp. 298–309.
[13] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for Hyper-parameter Optimization, in: Proceedings of the 24th International Conference on Neural Information Processing Systems (Granada, Spain), NIPS'11, Curran Associates Inc., Red Hook, NY, USA, 2011, pp. 2546–2554.
[14] R. Bardenet, M. Brendel, B. Kégl, M. Sebag, Collaborative Hyperparameter Tuning, in: Proceedings of the 30th International Conference on Machine Learning (Atlanta, GA, USA), ICML'13, JMLR.org, 2013, pp. 199–207.
[15] K. Swersky, J. Snoek, R. P. Adams, Multi-task Bayesian Optimization, in: Proceedings of the 26th International Conference on Neural Information Processing Systems (Lake Tahoe, Nevada), NIPS'13, Curran Associates, Red Hook, NY, USA, 2013, pp. 2004–2012.
[16] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, D. D. Cox, Hyperopt: a Python Library for Model Selection and Hyperparameter Optimization, Computational Science & Discovery 8 (1).
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[18] Q. Sun, B. Pfahringer, Pairwise Meta-rules for Better Meta-learning-based Algorithm Ranking, Machine Learning 93 (1) (2013) 141–161. doi:10.1007/s10994-013-5387-y.
[19] R. G. Mantovani, A. L. D. Rossi, J. Vanschoren, B. Bischl, A. C. P. L. F. Carvalho, To Tune or not to Tune: Recommending When to Adjust SVM Hyper-parameters via Meta-learning, in: Proceedings of the 2015 International Joint Conference on Neural Networks (Killarney, Ireland), IJCNN'15, IEEE, Piscataway, NJ, USA, 2015, pp. 1–8. doi:10.1109/IJCNN.2015.7280644.
[20] S. Sanders, C. Giraud-Carrier, Informing the Use of Hyperparameter Optimization Through Metalearning, in: Proceedings of the 2017 IEEE International Conference on Data Mining (New Orleans, LA, USA), ICDM'17, IEEE, Piscataway, NJ, USA, 2017, pp. 1051–1056. doi:10.1109/ICDM.2017.137.
[21] M. Kuhn, Building Predictive Models in R Using the caret Package, Journal of Statistical Software 28 (5) (2008) 1–26. doi:10.18637/jss.v028.i05. URL https://www.jstatsoft.org/v028/i05
[22] J. Vanschoren, J. N. van Rijn, B. Bischl, L. Torgo, OpenML: Networked Science in Machine Learning, SIGKDD Explorations 15 (2) (2013) 49–60. doi:10.1145/2641190.2641198.
[23] C. E. Rasmussen, Gaussian Processes in Machine Learning, Springer, Berlin, Heidelberg, 2004, Ch. 4, pp. 63–71. doi:10.1007/978-3-540-28650-9_4.
[24] P. Auer, N. Cesa-Bianchi, P. Fischer, Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning 47 (2-3) (2002) 235–256. doi:10.1023/A:1013689704352.
[25] G. Forman, An Extensive Empirical Study of Feature Selection Metrics for Text Classification, Journal of Machine Learning Research 3 (2003) 1289–1305.
[26] C. Castiello, G. Castellano, A. M. Fanelli, Meta-data: Characterization of Input Features for Meta-learning, in: Proceedings of the 2nd International Conference on Modeling Decisions for Artificial Intelligence (Tsukuba, Japan), MDAI'05, Springer, Berlin, Heidelberg, 2005, pp. 457–468. doi:10.1007/11526018_45.
[27] S. Ali, K. A. Smith, On Learning Algorithm Selection for Classification, Applied Soft Computing 6 (2) (2006) 119–138. doi:10.1016/j.asoc.2004.12.002.
[28] Y. Peng, P. A. Flach, C. Soares, P. Brazdil, Improved Dataset Characterisation for Meta-learning, in: Proceedings of the 5th International Conference on Discovery Science (Lübeck, Germany), DS'02, Springer, Berlin, Heidelberg, 2002, pp. 141–152.
[29] T. Chen, C. Guestrin, XGBoost: A Scalable Tree Boosting System, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA), KDD'16, ACM, New York, NY, USA, 2016, pp. 785–794. doi:10.1145/2939672.2939785.
[30] N. Japkowicz, M. Shah, Evaluating Learning Algorithms: A Classification Perspective, Cambridge University Press, New York, NY, USA, 2011.
[31] L. v. d. Maaten, G. Hinton, Visualizing Data Using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.
[32] A. Rivolli, L. P. F. Garcia, C. Soares, J. Vanschoren, A. C. P. L. F. de Carvalho, Towards Reproducible Empirical Research in Meta-Learning, CoRR abs/1808.10406.

Appendix A. List of meta-features

Table A.7: List of the types of meta-features used to characterize datasets. Meta-features can be either single- or multi-valued; for multi-valued meta-features, we used both the mean and the standard deviation. In the study, we use the R package mfe [32] to compute these meta-features.

General meta-features:
1. Ratio of the number of attributes per the number of instances (single)
2. Ratio of the number of categorical attributes per the number of numeric attributes (single)
3. Proportion of the class values (multi: mean, sd)
4. Ratio of the number of instances per the number of attributes (single)
5. Number of attributes (single)
6. Number of binary attributes (single)
7. Number of categorical attributes (single)
8. Number of classes (single)
9. Number of instances (single)
10. Number of numeric attributes (single)
11. Ratio of the number of numeric attributes per the number of categorical attributes (single)
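For illustration, the general meta-features above can be approximated with a few lines of standard data-frame operations. The sketch below is not the mfe implementation used in the study; it is a minimal Python/pandas approximation, and the function name, feature names, and the "two distinct values" rule for binary attributes are assumptions made for this example.

```python
import pandas as pd

def general_meta_features(df: pd.DataFrame, target: str) -> dict:
    """Minimal sketch of the 'General' meta-features (not the mfe implementation)."""
    X = df.drop(columns=[target])
    y = df[target]

    numeric = X.select_dtypes(include="number")
    categorical = X.select_dtypes(exclude="number")
    # Binary attributes: exactly two distinct values (a simplifying assumption).
    n_binary = int((X.nunique() == 2).sum())

    class_prop = y.value_counts(normalize=True)
    return {
        "nr_instances": len(X),
        "nr_attributes": X.shape[1],
        "nr_numeric": numeric.shape[1],
        "nr_categorical": categorical.shape[1],
        "nr_binary": n_binary,
        "nr_classes": y.nunique(),
        "attr_to_inst": X.shape[1] / len(X),
        "inst_to_attr": len(X) / X.shape[1],
        # Guard against division by zero when one attribute type is absent.
        "cat_to_num": categorical.shape[1] / max(numeric.shape[1], 1),
        "num_to_cat": numeric.shape[1] / max(categorical.shape[1], 1),
        "class_prop_mean": float(class_prop.mean()),
        "class_prop_sd": float(class_prop.std()),
    }
```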

Statistical meta-features:
1. Canonical correlations between the predictive attributes and the class (multi: mean, sd)
2. Center of gravity: the distance between the instance at the center of the majority class and the instance at the center of the minority class (single)
3. Absolute attribute correlation, which measures the correlation between each pair of numeric attributes in the dataset (multi: mean, sd)
4. Absolute attribute covariance, which measures the covariance between each pair of numeric attributes in the dataset (multi: mean, sd)
5. Number of discriminant functions (single)
6. Eigenvalues of the covariance matrix (multi: mean, sd)
7. Geometric mean of attributes (multi: mean, sd)
8. Harmonic mean of attributes (multi: mean, sd)
9. Interquartile range of attributes (multi: mean, sd)
10. Kurtosis of attributes (multi: mean, sd)
11. Median absolute deviation of attributes (multi: mean, sd)
12. Maximum value of attributes (multi: mean, sd)
13. Mean value of attributes (multi: mean, sd)
14. Median value of attributes (multi: mean, sd)
15. Minimum value of attributes (multi: mean, sd)
16. Number of attribute pairs with high correlation (single)
17. Number of attributes with a normal distribution; the Shapiro-Wilk normality test is used to assess whether an attribute is normally distributed (single)
18. Number of attributes with outlier values; Tukey's boxplot algorithm is used to determine whether an attribute has outliers (single)
19. Range of attributes (multi: mean, sd)
20. Standard deviation of the attributes (multi: mean, sd)
21. Statistical test for homogeneity of covariances (single)
22. Skewness of attributes (multi: mean, sd)
23. Attribute sparsity, which represents the degree of discreteness of each attribute in the dataset (multi: mean, sd)
24. Trimmed mean of attributes (multi: mean, sd)
25. Attribute variance (multi: mean, sd)
26. Wilks' Lambda (single)
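A few of the statistical meta-features (skewness, kurtosis, interquartile range, pairwise absolute correlation, covariance eigenvalues) can likewise be sketched directly with NumPy/SciPy. Again, this is an illustrative approximation rather than the mfe code; the function and dictionary keys are invented for the example, and the aggregation into mean and standard deviation follows the convention stated in the table caption.

```python
import numpy as np
from scipy import stats

def statistical_meta_features(X: np.ndarray) -> dict:
    """Sketch of a subset of the 'Statistical' meta-features on a numeric matrix X
    (rows = instances, columns = attributes); multi-valued measures are summarized
    by their mean and standard deviation."""
    skew = stats.skew(X, axis=0)
    kurt = stats.kurtosis(X, axis=0)
    iqr = stats.iqr(X, axis=0)

    # Absolute correlations between each pair of numeric attributes
    # (upper triangle of the correlation matrix, diagonal excluded).
    corr = np.abs(np.corrcoef(X, rowvar=False))
    pair_corr = corr[np.triu_indices_from(corr, k=1)]

    # Eigenvalues of the attribute covariance matrix (symmetric, so eigvalsh).
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))

    def summarize(v):
        return float(np.mean(v)), float(np.std(v))

    return {
        "skewness_mean_sd": summarize(skew),
        "kurtosis_mean_sd": summarize(kurt),
        "iqr_mean_sd": summarize(iqr),
        "abs_corr_mean_sd": summarize(pair_corr),
        "eigenvalues_mean_sd": summarize(eigvals),
    }
```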

Model-based meta-features (extracted from a decision tree (DT) model induced on the dataset):
1. Number of leaves of the DT model (single)
2. Size of branches, i.e., the level of each leaf of the DT model (multi: mean, sd)
3. Leaves corroboration: the proportion of examples that belong to each leaf of the DT model (multi: mean, sd)
4. Homogeneity: the number of leaves divided by the structural shape of the DT model (multi: mean, sd)
5. Leaves per class: the proportion of leaves of the DT model associated with each class (multi: mean, sd)
6. Number of nodes of the DT model (single)
7. Ratio of the number of nodes of the DT model per the number of attributes (single)
8. Ratio of the number of nodes of the DT model per the number of instances (single)
9. Number of nodes of the DT model per level (multi: mean, sd)
10. Repeated nodes: the number of repeated attributes that appear in the DT model (multi: mean, sd)
11. Tree depth: the level of all tree nodes and leaves of the DT model (multi: mean, sd)
12. Tree imbalance (multi: mean, sd)
13. Tree shape: the probability of arriving at each leaf given a random walk; we refer to this as the structural shape of the DT model (multi: mean, sd)
14. Variable importance, calculated using the Gini index to estimate the amount of information used in the DT model (multi: mean, sd)
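Most of the model-based measures are simple properties of a fitted decision tree. The sketch below uses scikit-learn's tree internals to recover a handful of them (leaf and node counts, node-to-attribute and node-to-instance ratios, per-node depth, Gini-based variable importance); it approximates the corresponding mfe measures rather than reproducing the code used in the study, and the function and key names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def model_based_meta_features(X: np.ndarray, y: np.ndarray) -> dict:
    """Sketch of a few 'Model-based' meta-features extracted from a decision tree."""
    dt = DecisionTreeClassifier(random_state=0).fit(X, y)
    tree = dt.tree_

    n_leaves = int(dt.get_n_leaves())
    n_nodes = int(tree.node_count)

    # Depth of every node, obtained by walking down from the root;
    # children_left/children_right store -1 for leaf nodes.
    depths = np.zeros(n_nodes, dtype=int)
    stack = [(0, 0)]
    while stack:
        node, depth = stack.pop()
        depths[node] = depth
        if tree.children_left[node] != -1:
            stack.append((tree.children_left[node], depth + 1))
            stack.append((tree.children_right[node], depth + 1))

    importances = dt.feature_importances_  # Gini-based variable importance

    return {
        "nr_leaves": n_leaves,
        "nr_nodes": n_nodes,
        "nodes_per_attr": n_nodes / X.shape[1],
        "nodes_per_inst": n_nodes / X.shape[0],
        "tree_depth_mean_sd": (float(depths.mean()), float(depths.std())),
        "var_importance_mean_sd": (float(importances.mean()), float(importances.std())),
    }
```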

Land-marking meta-features:
1. Construct a single decision tree node model induced by the most informative attribute, to establish the linear separability (multi: mean, sd)
2. Elite nearest neighbor: the most informative attribute in the dataset is used to induce a 1-nearest neighbor model; with this subset of informative attributes, the model is expected to be noise tolerant (multi: mean, sd)
3. Apply the Linear Discriminant classifier to construct a linear split in the data, to establish the linear separability (multi: mean, sd)
4. Evaluate the performance of the Naive Bayes classifier (multi: mean, sd)
5. Evaluate the performance of the 1-nearest neighbor classifier; the Euclidean distance to the nearest neighbor is used to determine how noisy the data is (multi: mean, sd)
6. Construct a single decision tree node model induced by a random attribute (multi: mean, sd)
7. Construct a single decision tree node model induced by the least informative attribute (multi: mean, sd)
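The landmarking measures are cross-validated scores of cheap "landmarker" models. A hedged sketch using scikit-learn stand-ins (a depth-1 decision stump, 1-nearest neighbor, Gaussian Naive Bayes, and linear discriminant analysis) is shown below; the exact landmarkers and their attribute-selection variants in mfe differ in detail, and the number of folds is an assumption, so treat this as an approximation rather than the study's implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def landmarking_meta_features(X: np.ndarray, y: np.ndarray, cv: int = 10) -> dict:
    """Sketch of landmarking meta-features: per-fold accuracies of simple models,
    summarized by their mean and standard deviation."""
    landmarkers = {
        "decision_stump": DecisionTreeClassifier(max_depth=1, random_state=0),
        "one_nn": KNeighborsClassifier(n_neighbors=1),
        "naive_bayes": GaussianNB(),
        "linear_discriminant": LinearDiscriminantAnalysis(),
    }
    features = {}
    for name, model in landmarkers.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        features[f"{name}_mean"] = float(np.mean(scores))
        features[f"{name}_sd"] = float(np.std(scores))
    return features
```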

Ngoc Tran received the B.E. degree from Vietnam National University, Ho Chi Minh, Vietnam, in 2006 and the M.S. degree from Vrije Universiteit Brussel, Brussels, Belgium, in 2015. He is currently a Ph.D. student at Swinburne University of Technology, Melbourne, Australia. His current research interests mainly focus on machine learning and domain-specific visual languages.

Jean-Guy Schneider is a Professor at Deakin University, Melbourne, Australia.

Ingo Weber is a Principal Research Scientist and Team Leader of the Architecture & Analytics Platforms (AAP) team at Data61, CSIRO in Sydney, Australia.

A. K. Qin is an Associate Professor in the Department of Computer Science and Software Engineering and a core member of the Data Science Research Institute at Swinburne. He is currently leading the machine learning research group based in the Data Science Research Institute.


Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.