Information Sciences 261 (2014) 237–262
Towards UCI+: A mindful repository design

Núria Macià * and Ester Bernadó-Mansilla
Grup de Recerca en Sistemes Intelligents, La Salle – Universitat Ramon Llull, 08022 Barcelona, Spain
E-mail: [email protected] (N. Macià), [email protected] (E. Bernadó-Mansilla)
* Corresponding author.
doi: 10.1016/j.ins.2013.08.059
Article history: Received 10 January 2012; received in revised form 11 April 2013; accepted 24 August 2013; available online 5 September 2013.

Keywords: Data repository; Data complexity; Classification; Synthetic data set
Abstract

Public repositories have contributed to the maturation of experimental methodology in machine learning. Publicly available data sets have allowed researchers to empirically assess their learners and, jointly with open source machine learning software, they have favoured the emergence of comparative analyses of learners' performance over a common framework. These studies have brought standard procedures to evaluate machine learning techniques. However, current claims, such as the superiority of enhanced algorithms, are biased by unsupported assumptions embedded in common practice. In this paper, the early steps of the methodology, which refer to data set selection, are inspected. In particular, the exploitation of the most popular data repository in machine learning, the UCI repository, is examined. We analyse the type, complexity, and use of UCI data sets. The study recommends the design of a mindful data repository, UCI+, which should include a set of properly characterised data sets consisting of a complete and representative sample of real-world problems, enriched with artificial benchmarks. The ultimate goal of the UCI+ is to lay the foundations of a well-supported methodology for learner assessment.

© 2013 Elsevier Inc. All rights reserved.
1. Introduction

Machine learning (ML) is much more than running algorithms on apparently ready-to-use data sets [32]. The ML discipline emerged to gain insight into complex tasks such as reasoning, problem solving, and language processing [20]. However, the adoption of the attribute-value data representation changed and limited the course of ML research. The easy access to classification and regression problems, coupled with the proliferation of experimental science, led practitioners to focus not only on the development of algorithms for such tasks, but also on a repetitive performance competition, which persists but makes little sense nowadays.

Historically, the STATLOG project [28] compared classification algorithms (also referred to as classifiers or learners) on large real-world problems and tried to develop a set of data descriptors to help decide which algorithms were best suited to solve a given problem. The project took many years to complete because most of the data sets and algorithms followed their own format. Any comparison was extremely expensive and, therefore, the need for standardisation was acknowledged by the community. Kohavi et al. [17] proposed the first common framework: the machine learning library in C++ (MLC++). Unfortunately, the platform choice, the SGI C++ compiler, made the widespread use of the tool difficult. Researchers had to wait for other implementations such as Weka [37], PRTools [13], or KEEL [2]. Once several tools were available, many researchers focused on what in the past was the holy grail of ML studies, the possibility of massive comparisons, although the No-Free-Lunch (NFL) theorem [38] came along and demonstrated the weakness of such analyses. For instance, Lim et al. [21] carried out a comparison of prediction accuracy, complexity, and training time of 33 classification algorithms. Their study contains
[email protected] (N. Macià),
[email protected] (E. Bernadó-Mansilla). 0020-0255/$ - see front matter Ó 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.ins.2013.08.059
what is now a deeply-rooted practice, which consists of randomly selecting at most twenty data sets from public repositories, applying all the considered algorithms to these data, and comparing all the results. The algorithmic refinement obtained by such a brute-force approach has been gamed, though, and the current experimental test bed is not representative enough to sustain most of the improvements proposed in the literature. Moreover, the ML community should aim to extract structures or patterns that can be reasoned about and used for prediction, instead of just comparing numeric outcomes. To this end, we need solid methodologies based on thoughtful, structured repositories.

The purpose of this paper is to critically examine the use of the most popular public repository, the UCI Machine Learning Repository [11], and its role in the dissemination of the experimental assessment methodology. We revise the type of data sets contained in the UCI and whether they form a complete and representative sample. Based on this study, we propose a set of changes concerning the characterisation and standardisation of data sets and the enrichment of the sample with real-world problems and artificial benchmarks.

The remainder of this paper is organised as follows. Section 2 presents how the UCI repository is linked with the experimental community. The study starts from the belief that this repository appeared with the aim of providing data to researchers as well as encouraging them to investigate specific problems. Nevertheless, it came into vogue as a benchmark source to evaluate and compare the performance of incoming learning techniques. This use was grounded on the wrong assumption of a complete, reliable framework composed of standardised real samples. Thus, we analysed the type and representativeness of the UCI repository. First, Section 3 revises the format, standardisation, and consistency of the data sources provided by the UCI. To continue with further analyses regarding the coverage of UCI data sets, a preprocessing stage was performed, as detailed in Section 4. Then, Section 5 shows the coverage of the UCI sample by estimating its diversity with respect to classifiers' performance and data complexity. Section 6 discusses the UCI's actual potential if endowed with artificial data sets and a well-founded learner assessment methodology. In closing, Section 7 points out some final remarks regarding the experimental methodology and envisages research trends.
2. The role of the UCI in the learner assessment methodology

The UCI repository has been the most well-known collection of data sets in ML since the early nineties. Although it includes synthetic problems and data generators, the UCI is mostly known for its set of real-world problems. The fact that a great number of data sets come from data sources of real applications has given prestige to the UCI and cemented its status in the experimental methodology. This section reviews the repercussions of this data sharing initiative and anticipates some threats.

2.1. Contributions of the UCI

In the last few decades, the fast increase of computing power has allowed researchers to perform large computations involving a performance test of a given algorithm over a varied set of problems. The algorithm of interest could also be easily compared with other well-known algorithms. Thus, both the development of software tools providing public implementations of the most common learners and the release of data set collections promoted experimental research in ML. Without being explicitly designed or properly discussed, a common practice was consolidated to assess the performance and competence of new learners.

In this context, the major contribution of the UCI repository was to provide the community with real-world or quasi-real-world problems, mostly for classification. The availability of these problems became a breakthrough for ML researchers, since they stopped worrying about collecting data sets and concentrated on algorithm development. Each new algorithm proposal could be applied and tested on existing data sets. The UCI then became an essential part of experiments from which claims about the superiority and competence of given learners were supposedly well founded. Later on, some studies added criticism of the experimental methodology and suggested the use of a moderate test bed size and the support of proper statistical tests, such as multiple comparison procedures and non-parametric statistics [7,12]. These recommendations do not suffice, however, to amend the deficiencies of the methodology, which deserve much attention. The following sections probe the components of the most common practice.

2.2. The UCI test

One of the established conventions for the assessment of a learner is briefly stated as follows:

1. Select a number of data sets from the UCI repository.
2. Select a number of other well-known classifiers as reference learners, such as C4.5 (decision tree) and SVM (support vector machine).
3. Run the classifiers over the collection of data sets and compare the performance (usually classification accuracy) of the classifier with respect to the reference classifiers.
4. Apply statistical tests to investigate the competence of the given classifier, that is, whether it significantly improves on the others.
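For concreteness, the following is a minimal sketch of these four steps, using scikit-learn and SciPy stand-ins rather than the Weka implementations usually employed in the literature; the choice of data sets, reference learners, and candidate learner is purely illustrative.

```python
# Steps 1-4 of the "UCI test" in miniature: three UCI-origin data sets,
# two reference learners, one candidate learner, ten-fold cross-validation,
# and a non-parametric test over the resulting results table.
from scipy.stats import friedmanchisquare
from sklearn.datasets import load_breast_cancer, load_iris, load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

datasets = {"iris": load_iris(), "wine": load_wine(), "wdbc": load_breast_cancer()}
learners = {"C4.5-like": DecisionTreeClassifier(random_state=0),   # reference
            "SVM": SVC(),                                          # reference
            "candidate": KNeighborsClassifier()}                   # learner under study

# Step 3: the usual results table (rows: data sets, columns: learners).
table = {name: [cross_val_score(clf, d.data, d.target, cv=10).mean()
                for d in datasets.values()]
         for name, clf in learners.items()}

# Step 4: Friedman test across learners, as recommended in [7,12].
stat, p = friedmanchisquare(*table.values())
print(table)
print(f"Friedman chi2 = {stat:.3f}, p = {p:.3f}")
```

With so few data sets the test has almost no statistical power, which is precisely one of the criticisms developed below.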
If the classifier outperforms the others on the selected data sets, it represents an improvement which should merit publication. Hence, passing what we coin the UCI test is currently a necessary condition to report significant contributions to the scientific community. However, certain decisions that have to be taken at each phase call the reliability of the UCI test into question, as described below.

2.2.1. Selected data sets

How many data sets should be selected? Which ones? Why? These questions arise at the first step of the methodology. A previous study [25] that investigated papers from the proceedings of the International Conference on Machine Learning and the Pattern Recognition journal in the period 2008–2010 reported that only eight data sets are selected in the experiments on average. Table 1 (a summary of the aforementioned study) points out that 55.3% of the papers claimed the superiority of a given classifier based on a performance study over 2–10 data sets, and 16.7% of the papers used between 11 and 30 data sets. On the other hand, the four most popular problems are Iris, Adult, Wine, and Breast Cancer Wisconsin (see Table 2). Yet, there is no explanation of why such a small number of data sets is used, nor why this particular selection of problems is made. These questions remain unanswered and raise the feeling that the use of the UCI has lately been flawed by the danger of benchmarks [33]: a tailored development for a cherry-picked test set, which entails shallow experiments and unfounded conclusions.

2.2.2. Selected learners

The selection of reference classifiers from standard implementations provided in software tools also poses significant doubts. In particular, we do not know (1) whether the selection of learners is representative enough and (2) whether the selected learners are properly configured to work at their best performance. It is often the case that the algorithm under study is thoroughly designed and tuned, while the remaining algorithms are run with their baseline configuration. This can be attributed to the difficulty of understanding the details of each algorithm, even though some tools are open source and the code is available. In any case, researchers do not pay much attention to the tuning of the selected reference algorithms, which may consequently bias the results in favour of the proposed algorithm.

2.2.3. Running learners

Since the de facto establishment of the UCI test, some contributions related to algorithm design are stuck on demonstrating small improvements of existing ML techniques over the same small collection of data sets. These kinds of contributions seem to be far from the discovery of new learning paradigms and/or from significant achievements on key problems of the ML discipline. Although empirical results are necessary to test an algorithm, running experiments over a small set of UCI data sets gives us limited knowledge about the domains of competence of the algorithm and the type of complexity posed by the data set. For instance, after several decades of testing algorithms on the Iris problem, it is still hard to determine (1) the maximum accuracy attainable on the problem and (2) which classifier reaches the maximum value and why. This is due to the following reasons:
Table 1. Number of data sets used in experiments.

Number of data sets                            1    2–10   11–30   >30
Papers using this number of data sets (%)   25.5    55.3    16.7   2.3
Table 2. Most popular classification problems from the UCI repository according to http://archive.ics.uci.edu/ml/ on 5 March 2013 (hits counted since 2007) and their characteristic description. #Cl is the number of classes, #Inst the number of instances, and #Att the number of attributes. #Real, #Int, and #Nom indicate the number of real-, integer-, and nominal-valued attributes respectively. %missInst, %missAtt, and %missVal correspond to the percentage of instances with missing values, the percentage of attributes with missing values, and the total percentage of missing values respectively. Finally, %Maj is the percentage of instances of the majority class and %Min the percentage of instances of the minority class.

Data set                     Hits     #Cl  #Inst      #Att   #Real  #Int  #Nom   %missInst  %missAtt  %missVal  %Maj   %Min
Iris                         412,403   3   150            4      4     0     0        0.00      0.00      0.00  33.33  33.33
Adult                        290,053   2   48,842        14      0     6     8        7.41     21.43      0.95  76.07  23.93
Wine                         253,117   3   178           13     13     0     0        0.00      0.00      0.00  39.89  26.97
Breast Cancer Wisconsin (D)  209,808   2   699            9      0     9     0        2.29     11.11      0.25  65.52  34.48
Car Evaluation               196,586   4   1728           6      0     0     6        0.00      0.00      0.00  70.02   3.76
Abalone                      161,552  29   4177           8      7     0     1        0.00      0.00      0.00  16.50   0.02
Poker Hand                   143,149  10   1,025,010     11      0     5     6        0.00      0.00      0.00  50.12  7.80e-4
Internet Advertisements      104,711   2   3279        1558      3     0  1555       28.06      0.19      0.05  86.03  13.97
Yeast                        104,315  10   1484           8      8     0     0        0.00      0.00      0.00  31.20   0.34
1. The accuracy obtained depends on the data sampling used in the experiments (test and training partitions) [8]. Thus, even when studies are based on the same problem, the results of algorithms from different papers cannot be directly compared.
2. If the same data set sampling is used in a controlled experimental framework, then all of the algorithms should be optimised for a fair comparison.
3. The intrinsic complexity of the data set is unknown. That is, the classification error can be caused by an unfitted classifier design (learner's limitation) or by intrinsic difficulties of the problem (data limitation). For instance, sparse data, i.e. data that are not representative enough of the underlying concept, or the presence of noise set a maximum reachable accuracy rate. This value is still estimated from the performance achieved by the algorithms, which necessarily involves not only the algorithm's search procedure but also the adaptation of the algorithm and its knowledge representation to the geometry of the problem.

Therefore, running enhanced algorithms on collections of data samples provides few answers to the essential problems faced in classification tasks.

2.2.4. Supported claims

The application of statistical tests to tables of results (where columns represent algorithms and rows represent data sets) has supported the claims on the superiority of a large number of learners. Nonetheless, the relevance of such claims can be disputed from two perspectives:

1. The superiority of a learner on the available set of problems is not a proof of its competence on a more general class of problems. The NFL theorem demonstrated that there cannot be a single learner performing best in all domains [38].
2. The superiority of a learner on a particular set of problems, among those available in the collection, is not properly understood yet. No relationship has been proven between the performance of learners and measures of the extrinsic characteristics of the data sets (e.g. the number of attributes or the number of instances) [24]. Some studies could only identify correlations between measures of geometrical complexity and the performance of learners, but this still requires further investigation [3].

Indeed, there are still fundamental questions regarding the design of experiments that need to be discussed within the ML community. The next section revises the role of repositories and sets the basis for the proposal of a "re-founded" repository, which would represent a step towards the design of better-founded experimental frameworks.

2.3. Towards a well-founded UCI test

In summary, the UCI test has a number of limitations that may negatively influence advances in the field. Although we do not blame the UCI repository for such limitations, we understand that the availability of the data sets bolstered the diffusion of this ill-supported and strongly data-dependent procedure. Therefore, we advocate that repositories provide a representative sample of problems, large enough to incorporate the complexities of real-world problems. Data sets should be duly characterised, so that a practitioner addressing a specific challenge of the community, e.g. unbalanced problems, could find a good set of such problems in the repository. In addition, the organisation of the repository itself should provide practitioners with guidelines for the experimental set-up.
The next sections peruse the content of the repository and analyse whether the UCI, as it is currently defined, could sustain a well-founded experimental methodology. The analysis leads us to demand its redesign towards the UCI+, a mindful repository (in the sense that it should be designed in a way that solves the issues addressed in Section 6) supporting a mindful experimental methodology for learner assessment.

3. UCI: An apparent benchmarking site

The UCI repository is composed of 199 problems divided into four categories: classification (134), regression (17), clustering (8), and other (45); the sum of these values is 204 because some problems belong to two or more categories. This section takes a close view of these data sets and lists the issues encountered while preprocessing them. The issues are organised by standards, format, definition, and others. Tables 3, 4, and 5 report the inconsistencies of the classification data sets.

3.1. Standards

The following refers to those issues regarding encoded information that does not follow the de facto conventions.
Table 3. Summary of inconsistencies and errors of the UCI data sets (I).
3.1.1. Missing values

Some data sets use integers instead of the standard notation, the question mark, to represent missing values. This usually goes unnoticed, can severely affect the model extraction, and may provide misleading patterns.
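As a hedged illustration of the pitfall (the sentinel value -1 below is an assumption; repositories use various codes), the following pandas snippet normalises integer-coded missing values to the question-mark convention:

```python
# A minimal sketch: map an assumed integer sentinel (-1) to a proper missing
# value, then write the file back using the conventional question mark.
import numpy as np
import pandas as pd

df = pd.DataFrame([[5.1, -1.0, 0],
                   [4.9, 3.0, 1]])        # -1 plays the role of the hidden missing-value code
df = df.replace({-1.0: np.nan})           # standardise to a true missing value
df.to_csv("standardised.csv", index=False, header=False, na_rep="?")
```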
3.1.2. Class attribute

The class attribute does not follow any standard. It is neither in the last column nor labelled as a class attribute. The documentation often needs to be consulted to correctly identify it. If the class is not specified, software such as Weka simply assumes that the class is the last attribute, turning the data into a very different problem. In addition, problems for classification have to be set apart from those for association-rule tasks.

3.2. Format

The following refers to those issues regarding the data file format used to represent the learning concept, also referred to as the target concept.

Table 4. Summary of inconsistencies and errors of the UCI data sets (II).
3.2.1. File format

The format of the data sets can vary. In general, the header information is in a separate file from the data; the latter is usually stored in CSV format. Some data sets come in formats that are unusual for classification tasks, such as Lisp, Prolog, Excel, or Matlab.

3.2.2. Partition

Some data sets require merging test and training files to obtain the original data set. Some require integrating the class labels as well.

3.3. Description of the target concept

The following refers to those intrinsic issues regarding the definition of the target concept.
Table 5. Summary of inconsistencies and errors of the UCI data sets (III).
3.3.1. Out of category

Some regression problems are listed in the classification category and, therefore, used to test classifier systems. For some of them, the documentation specifies that if the data set is tackled as a classification problem, the instances have to be grouped in a different way. However, these data sets are often used without properly looking at the definition of the target concept, which may lead to meaningless results.

3.3.2. Mismatch between definition and sample

The definition of the target concept sometimes contains classes that are not present in the collected data. This also happens for attributes, which never take the values defined in the header of the problem. This results in constant attributes that do not provide informative data about the problem. Hence, the dilemma arises of whether to remove these attributes from the concept definition or keep them. From the experts' point of view, the attribute should remain, since there is a theoretical reason for it.
Nevertheless, from the practical point of view, the attribute is useless for the ML technique. This should lead to the artificial generation of samples that reproduce the rare phenomena which occur in nature but are missing from the collected data.

3.3.3. Multi-label

Some problems are multi-label, i.e. one instance can belong to two different classes/objects at the same time. This is frequent in image processing. In order to adapt these data sets to classical approaches, they have to be preprocessed, for instance, by dividing the data set into as many files as there are labels.

3.3.4. Identifiers

Some data sets contain attributes that take unique identifiers, such as an ID number. These kinds of attributes need to be removed, since they do not provide informative data for the classification task.

3.3.5. Lack of information

The documentation is sometimes vague and does not explicitly indicate the purpose of the classification and the attribute to predict.

3.3.6. Multiple interpretations

According to the documentation, some problems can result in different data sets depending on the class grouping.

3.3.7. Attribute representation

Categorical attributes are sometimes represented by numerical values. This change in the representation may have a high impact on learning processes that use distance computations.

3.3.8. Multi-valued attributes

Some attributes can take more than one value at the same time. This entails instances of variable size, which have to be treated before a classical classifier is applied.

3.4. Others

Duplicates. The same problem can have different entries in the repository; just the name changes.

Availability. Some problems are only available upon request.

All the details about the above issues are reported in Appendix A. These issues entail non-deterministic preprocessing and data cleaning, which are not usually specified in contributions and may result in different test beds. Consequently, the performance obtained in different contributions may not be comparable, as previously explained in Section 2.2.3. The next section describes how we proceeded to unify the data sets in order to conduct a fair data complexity analysis.

4. Data preprocessing

There are previous attempts in the literature to provide the UCI repository data sets in a standard format. Some prominent initiatives are supported by software tools such as Weka (http://www.cs.waikato.ac.nz/ml/weka/), TunedIT (http://tunedit.org/), or KEEL (http://keel.es/). However, the work done only corresponds to a mechanical formatting, far from an understanding of each problem and its underlying issues, as previously discussed. This section details the operations carried out to provide a unified set, which was used to perform an analysis of the representativeness of the sample.

First, a formatting process was applied as follows:

1. Construct the description header.
2. Merge all the data files into a single one: test, training, and class labels provided, plus the constructed header.

Then, a process of data cleaning was performed on the resulting formatted data sets (a sketch of these steps follows the list):

1. Eliminate attributes that encode unique identifiers.
2. Label the class attribute and position it in the last column.
3. Standardise missing values.
4. Remove don't-care values.
5. Identify nominal attributes encoded by numerical values.
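A hedged sketch of such a cleaning pass follows (our illustration, not the authors' actual scripts; the column names, the missing-value sentinel, and the threshold used to flag numerically encoded nominals are assumptions):

```python
# Cleaning steps 1-3 and 5 from the list above; don't-care values (step 4)
# would be mapped in the same way as the missing-value sentinel.
import numpy as np
import pandas as pd

def clean(df, class_col, id_cols=(), sentinel=None):
    df = df.drop(columns=list(id_cols))              # 1. drop unique identifiers
    cols = [c for c in df.columns if c != class_col]
    df = df[cols + [class_col]]                      # 2. class attribute goes last
    if sentinel is not None:
        df = df.replace({sentinel: np.nan})          # 3. standardise missing values
    for c in cols:                                   # 5. few distinct integer values
        if pd.api.types.is_integer_dtype(df[c]) and df[c].nunique() <= 10:
            df[c] = df[c].astype("category")         #    suggest a hidden nominal
    return df

raw = pd.DataFrame({"id": [1, 2, 3], "a": [0.5, 1.0, -9.0],
                    "colour": [2, 2, 1], "cls": ["x", "y", "x"]})
print(clean(raw, class_col="cls", id_cols=("id",), sentinel=-9.0).dtypes)
```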
Fig. 1. Test accuracies of C4.5, IB7, Logistic (LogR), Multilayer Perceptron (MLP), Naive Bayes (NB), PART, Random Forest (RF), and SMO over the UCI repository.
From the 134 classification entries (some entries contain more than one version of the same problem), only 107 could be processed: 24 entries (marked with '*' in Tables 3, 4, and 5) were discarded because of format deficiencies or lack of information about the target concept, and three of them (marked with '**' in Tables 3, 4, and 5) were restricted problems, only available upon request (we emailed the source, but we did not receive any response). After the data preprocessing, our final data panel was composed of 166 data sets (which are available upon request). Although we propose a standardisation of the UCI, we insist that only the authors of the data can disambiguate definition issues and contribute to building a solid sample.

5. Coverage and representativeness of the UCI sample

This section analyses whether the UCI repository provides a representative sample of real-world problems. This is a desirable property for repositories to become a reference point for researchers investigating learner behaviour. Therefore, we estimate the diversity of the sample based on the performance of several classifiers, the extrinsic characteristics, and the geometrical complexity of the UCI data sets.

5.1. Classifiers' performance

To gauge the coverage of the UCI sample with respect to learners' performance, we contrasted the accuracy rates obtained by eight classifiers from different learning paradigms: C4.5, PART, IBk, Random Forest, Multilayer Perceptron, SMO, Logistic, and Naive Bayes. This selection represents some of the fundamental algorithms of ML. We used the implementations provided by Weka (which allows the replicability of our experiments) with the following configuration: (1) k = 7 for IBk, (2) a polynomial kernel of order 5 for SMO, and (3) the default parameters for the rest. (In this case, the use of default configurations does not affect the results of the experiment, since we are not trying to demonstrate the superiority of a classifier; we aim to approximate the coverage and difficulty of the UCI sample from the perspective of general classifiers.) The ten-fold cross-validation procedure [8] was used to estimate the classification accuracy.

Fig. 1 represents the spread of accuracy rates for each classifier across the 166 data sets. Note that the median value of the boxplots ranges in the interval [82.94, 90.11]. The values between the lower and upper quartiles (which correspond to the 25th and 75th percentiles respectively) range in the interval [64.96, 96.67]. Furthermore, the average performance goes from 75.83 to 81.77. These intervals, far from random classification, indicate that (1) we can easily find a learning scheme that captures the knowledge structure of these problems and (2) the sample may not be diverse enough to spot differences between the learning schemes. Classifiers easily learnt the structure of 97% of the problems; only 3% were challenging. These outliers correspond to the following problems (accuracy ranges in brackets): Abalone-29Classes [21.2, 26.6], AutoUni-au6_1000 [14.6, 22.7], AutoUni-au6_250_drift_au6_cd1_500 [14.9, 20.4], LibrasMovement-9 [6.7, 24.4], and MetaData [0, 1.5]. The accuracies obtained on these problems are equally low and indicate that these five data sets pose a certain difficulty to the different learning schemes. However, the number of challenging problems is still small, and it would be worthwhile to enlarge the repository in this range of complexity. Fig. 2 plots the results of the eight classifiers for each data set.
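For reference, this configuration can be approximated with scikit-learn analogues (a sketch only; the paper used the Weka implementations, PART has no direct scikit-learn counterpart and is omitted, and the scaling pipelines are our own addition):

```python
# Eight-paradigm panel with ten-fold cross-validation, echoing the set-up above.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

learners = {
    "C4.5-like": DecisionTreeClassifier(random_state=0),
    "IB7": KNeighborsClassifier(n_neighbors=7),          # k = 7, as in the paper
    "LogR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(random_state=0),
    "SMO-like": make_pipeline(StandardScaler(), SVC(kernel="poly", degree=5)),  # order-5 polynomial kernel
}

X, y = load_wine(return_X_y=True)    # stand-in for one of the 166 data sets
for name, clf in learners.items():
    acc = cross_val_score(clf, X, y, cv=10)              # ten-fold cross-validation [8]
    q1, q3 = np.quantile(acc, [0.25, 0.75])
    print(f"{name:10s} median={np.median(acc):.3f} IQR=[{q1:.3f}, {q3:.3f}]")
```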
Note that the accuracy of the classifiers is very similar for most of the data sets and the spread of accuracy rates is small. Interestingly, those problems that present a large variability in the test accuracy, such as Artificial Characters, Balance Scale, Chess King Rook vs King, Libras Movement 10, Monk-1, Robot Execution Failures LP1, Trains, and Wall Following Robot Navigation 24 variables, are not amongst the most used in the literature [25]. Table 2 shows the most popular classification problems as reported by the UCI repository. Observe that these data sets correspond to accuracy rates with low variance and high median. One could even deduce that these problems are popular because they typify concepts that are easily solvable with current, properly tuned machine learning techniques. Thus, testing over the popular sample may favour good results, which leads to publishable conclusions.

Fig. 2. Variability of the test accuracy of the classifiers itemised for each data set.

We value the UCI repository's service to the field of ML: it provided data sets that allowed practitioners to focus on new research directions, such as algorithms for multi-label or missing-value problems. Nonetheless, among the most popular data sets, only three out of nine contain missing values. Besides, over the last few decades, the UCI data sets have often been used to justify the excellence of a given classifier through a performance comparison. We believe that the state of maturity of learners is leading the ML community to the study of more challenging problems, such as big data and semi-supervised learning. It is precisely because of these current and upcoming needs of the ML community that we claim the repository should be enhanced to continuously support the community with further advances. Repositories should foster the reuse of results, savings in computational resources, a forum for negative results, and a framework for the study of data characterisation.
5.2. Data complexity

As previously mentioned, learners' accuracy depends on (1) the match between the learner (solving approach) and the characteristics of the problem and (2) the intrinsic complexity of the data set. In the previous section, we studied the UCI from the former aspect, i.e. the learning approach. The next sections assess the data complexity of the repository by using independent estimates.
5.2.1. Extrinsic complexity

Fig. 3 provides a description of the external characteristics of the UCI sample: the number of instances (#Inst), the number of attributes (#Att), the numbers of real-, integer-, and nominal-valued attributes (#Real, #Int, #Nom), and the percentage of instances with missing values (%MissInst). The values of these characteristics range in the following intervals: the number of instances goes from 11 to 1,025,011, and the number of attributes varies between 2 and 10,000. Although it seems that the sample provided by the UCI is wide ranging, Figs. 4 and 5 show that there are gaps in the space. In particular, 75% of the data sets contain fewer than 35 attributes and 80% of the data sets contain fewer than 10,000 instances. Thus, the distribution of the UCI data sets tends to be skewed towards small data sets. Surely, researchers seeking to perform studies on learners' scalability would benefit from a greater number of big data sets. This matches Holte's contribution [15], which criticised the UCI repository for being a limited sample. In his experimentation, he showed that most of the data sets were correctly classified, reaching high accuracies, with simple rules. So this sample, despite being composed of real-world problems, does not reflect the real conditions that arise in practice.

Fig. 3. Distribution of the extrinsic characteristics of the UCI repository data sets.

Regarding the challenge of large-scale data sets, the UCI provides few problems with these characteristics: KDDCup1999-Corrected (311,030 instances), KDDCup1999-KddcupData10PercentCorrected (494,022), and PokerHands (1,025,011). However, these problems are not often selected for experiments [25], possibly due to the high computational cost involved. In addition, some of these problems contain up to 40% duplicate instances, as shown in Fig. 6; 35% of the problems (marked in red in the figure) have duplicate instances. This opens the discussion about the presence and effect of duplicates in testing problems. Since Holte's criticism in 1993, the UCI repository has grown, but its diversity is still low in terms of intrinsic complexity, as proven in the next section.
Fig. 4. Extrinsic characteristics (number of instances and number of attributes) of the UCI data sets (logarithmic scale).

Fig. 5. Histograms of (a) the number of instances and (b) the number of attributes of the UCI data sets.
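Before moving to intrinsic complexity, here is a short sketch of how such an extrinsic profile (the columns of Table 2 and Fig. 3) can be computed for a data set held in a pandas DataFrame whose last column is the class; the example frame is made up:

```python
# Extrinsic characterisation: instance/attribute counts, attribute types,
# and the percentage of instances with missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.5, np.nan], "b": [1, 2, 3],
                   "c": pd.Categorical(["x", "y", "x"]), "cls": [0, 1, 0]})
X = df.iloc[:, :-1]                     # all attributes except the class

profile = {
    "#Inst": len(df),
    "#Att": X.shape[1],
    "#Real": int(sum(pd.api.types.is_float_dtype(t) for t in X.dtypes)),
    "#Int": int(sum(pd.api.types.is_integer_dtype(t) for t in X.dtypes)),
    "#Nom": int(sum(isinstance(t, pd.CategoricalDtype) for t in X.dtypes)),
    "%missInst": 100 * float(X.isna().any(axis=1).mean()),
}
print(profile)
```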
5.2.2. Intrinsic complexity

Data complexity has formally been studied in the literature. For instance, the Kolmogorov complexity [18] is a universal measure that deals with complexity from a descriptive point of view. Such complexity is also related to the universal probability distribution [35], which approximates any computable distribution. Given a complete sample, the Kolmogorov complexity is defined as the length of the shortest program that describes the class boundary. Nonetheless, the Kolmogorov complexity is known to be incomputable [27] or semi-computable [19]. Therefore, other estimates have been designed to analyse the class boundary complexity, which mainly extract different geometrical indicators from the data set, e.g. how
the classes are distributed in the feature space or the discriminative power of attributes. We focus then on the measures proposed by Ho and Basu [14], which are intended to serve as an alternative to the theoretical measures. The descriptors have been divided into three categories: (1) discriminative power of attributes (i.e. Fisher’s discriminant ratio), (2) the separability of classes (i.e. the degree of compactness of the classes, the length of class boundary), and (3) the sparsity (i.e. the ratio of the number of instances to the number of dimensions). Table 6 introduces the geometrical complexity descriptors and gives a brief description of their meaning. For more details, the reader is referred to [29].
Fig. 6. Percentage of duplicate instances per data set.
Table 6. Summary of the complexity measures.

Measures of overlap in the feature values from different classes
F1    Maximum Fisher's discriminant ratio
F1v   Directional-vector maximum Fisher's discriminant ratio
F2    Overlap of the per-class bounding boxes
F3    Maximum (individual) feature efficiency
F4    Collective feature efficiency

Measures of class separability
L1    Minimised sum of the error distance of a linear classifier
L2    Training error of a linear classifier
N1    Fraction of points on the class boundary
N2    Ratio of average intra/inter-class nearest neighbour distance
N3    Leave-one-out error rate of the one-nearest-neighbour classifier

Measures of geometry, topology, and density of manifolds
L3    Non-linearity of a linear classifier
N4    Non-linearity of the one-nearest-neighbour classifier
T1    Fraction of maximum covering spheres
T2    Average number of points per dimension

Some of these measures were found to be highly correlated with the maximum accuracy rate achievable by a classifier on a given problem [3,34,22]. Therefore, we measured the complexity of the UCI data sets with respect to this set of fourteen complexity descriptors. Fig. 7 depicts the distribution of the complexity measures of the UCI data sets. First, we look at the measures that evaluate the discriminative power of attributes (the F measures). Note that F1v, F2, and F4 tend to take values indicating that there are many data sets where a reduced set of features is informative enough to separate the classes. For example, F2 computes the overlap of the tails of the distributions defined by the instances of each class. It takes very low values on average (mean F2 = 0.05), which means that the attributes can easily discriminate the instances of different classes. The measures labelled N consider the shape of the class boundary and whether the classes are compact or spread in the feature space. High values of these measures mean that classes are spread and the boundary is complex, while low values indicate the opposite.
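For the reader who wants to reproduce this kind of analysis, hedged reference implementations of three of the descriptors follow (simplified two-class versions of our own; Ho and Basu [14] give the exact definitions, and N1 is approximated here with nearest neighbours rather than the original minimum spanning tree):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def F1(X, y):
    """Maximum Fisher's discriminant ratio over individual attributes."""
    a, b = X[y == 0], X[y == 1]
    r = (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0) + 1e-12)
    return float(r.max())

def N1(X, y):
    """Fraction of points on the class boundary, approximated as the fraction
    of instances whose nearest neighbour carries a different label."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1]   # skip the point itself
    return float(np.mean(y != y[idx]))

def T2(X, y):
    """Average number of points per dimension."""
    return X.shape[0] / X.shape[1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)); X[:100] += 2.0   # two shifted Gaussian classes
y = np.repeat([0, 1], 100)
print(F1(X, y), N1(X, y), T2(X, y))
```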
Fig. 7. Distribution of the intrinsic characteristics of the UCI repository data sets (with no outliers).

The results plotted in Fig. 7 show that the N measures take mostly low values (mean N1 = 0.28, N2 = 0.59, and N3 = 0.13). N1 measures the number of points that are located near the class boundary. Its low average value (0.28) indicates that it may be easy for the learner to model the class boundary accurately, since the boundary is clearly defined. Regarding the L measures, which highlight gaps in the class boundary, the maxima L2 = 0.5 and L3 = 0.5 express low and medium interleaving between classes. As for the T measures, we observe that the values of T1 are close to 1. This means that the instances are not grouped in sharp clusters. T2 returns the ratio of the number of instances in the data set to the number of attributes; on average, the data sets contain 50 times more instances than attributes.

The previous sections evaluated the complexity coverage of the UCI with respect to the performance of several classifiers and the geometrical description of the target concepts. Both analyses confirmed that the UCI data sets are not a representative sample of the universe of problems. Indeed, designing an experimental test bed formed by a small sample of UCI data sets (the average number of data sets used is eight) can be a superfluous test, worsened by the limited representativeness of the UCI problems. The design of a test bed should include data sets whose intrinsic and extrinsic characteristics range across all the possible values. Not only should the dimensionality of the problems, measured by the number of instances and the number of attributes, be spread across the measurement space, but also the intrinsic complexities.

The study of the universe of problems is close to a lost cause due to the infinity of both the space and its dimensions. Thus, we first need to bound this space in order to analyse the nature of data. In the MetaL project [1], for instance, the characterisation was grounded on simple measures (e.g. the number of attributes, the number of classes), statistical measures (e.g. the mean and variance of numerical attributes), and information-theoretic measures (e.g. the entropy of classes and attributes, as in [4]). Later, Peng et al. proposed fifteen characterisation measures based on the description of the decision tree that represented the data [30]. In our approach, the novelty is the use of intrinsic complexity estimates to characterise the problem. However, it is complicated to condense all this information into a succinct set of measures. Note that in our experimentation we limited the space to the fourteen measures described in [14]. We consider that any measurement space should meet the following properties: (1) completeness, i.e. the complexity spectrum has to be covered; (2) resolution, i.e. the dimensions used to build the space should provide sufficient granularity to reveal differences among problems; and (3) representativeness, i.e. any problem has to be able to be located in the space.

Diversity can be found in real-world problems, but it is definitely not largely found in the UCI repository. Real-world data sets usually contain noise and inconsistencies, which are useful to evaluate the robustness of learners. In this regard, the UCI repository does not pose many difficulties to the classifiers, since the presence of noise and inconsistencies is limited. Real-world problems are also affected by the presence of missing values.
However, the UCI does not provide many data sets where missing values have a strong influence on the results. The scalability of learners with increasing amounts of data can hardly be evaluated through UCI data sets. Thus, we may think of the UCI as a repository valid as a proof of concept for evaluating the results of a given classifier, but a classifier should certainly be tested on a set of problems that provides diversity in real-world complexities. The UCI should be enhanced with more challenging real-world problems and, additionally, with the inclusion of synthetic data sets covering the gaps where current UCI data sets are not present. The next section presents and evaluates a set of artificial problems that could help in this direction.
5.3. Comparison with an artificial sample

In order to verify whether the UCI repository could be enhanced with an artificial sample of problems, we used a set of 80,000 artificial data sets, whose details are explained below. These data sets were designed to analyse whether they could cover the regions of the complexity measurement space where the UCI has no representative. The extrinsic characteristics can easily be controlled in an artificial sample, as can the introduction of noise or missing values. What represents a major challenge is to obtain a collection of data sets whose target concepts are spread across a wide range of geometrical complexities. For this purpose, we relied on the aforementioned data complexity measures and designed an algorithm able to synthetically generate data satisfying a given complexity constraint. This was achieved by means of an evolutionary multi-objective algorithm, the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [6], which was configured to simultaneously optimise a set of complexity measures, such as the maximisation of N1 and the minimisation of F1. The synthesising approach finds the vector t = [z1, ..., zk]^T, with 2 ≤ k ≤ n, where n is the total number of instances and zi is instance i (t is a subset of [z1, ..., zn]^T), which optimises (minimises or maximises)

f(t) = [F1v(t), F1(t), F2(t), F3(t), F4(t), L1(t), L2(t), L3(t), N1(t), N2(t), N3(t), N4(t), T1(t), T2(t)]

subject to the following constraints: (1) k ≥ kmin, where kmin is the minimum number of instances specified by the user, (2) a class imbalance specified by the user, and (3) no duplicate instances.

The particular instances of the data set were obtained by sampling a real-world problem from the UCI, in an attempt to preserve the underlying real-world structures of the target concept. The multi-objective problem was stated as follows: given a data source, e.g. the Pima Indian Diabetes problem, we sample the data set so that the resulting data set satisfies a given set of objectives in terms of optimisation of complexity measures. By performing multiple combinations (several multi-objective formulations × several original data sources × several seeds for the random number generator) we obtained a set of 80,000 artificial data sets. For more details about the data set generation, the reader is referred to [26,23].
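A minimal sketch of this kind of generator follows, assuming the DEAP library for NSGA-II; an individual is a binary mask selecting instances from a seed data set, the two objectives are simplified stand-ins for F1 and N1 (see the sketch in Section 5.2.2), and the imbalance and duplicate constraints are omitted for brevity:

```python
# NSGA-II instance-subset selection: minimise an F1 proxy, maximise an N1 proxy,
# subject to a minimum subset size; everything here is illustrative.
import random
import numpy as np
from deap import algorithms, base, creator, tools

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                 # stand-in for a seed data set (e.g. Pima)
y = rng.integers(0, 2, size=300)
K_MIN = 50                                    # user-specified minimum number of instances

def objectives(mask):
    Xs, ys = X[mask], y[mask]
    a, b = Xs[ys == 0], Xs[ys == 1]
    f1 = np.max((a.mean(0) - b.mean(0)) ** 2 / (a.var(0) + b.var(0) + 1e-12))
    d = np.linalg.norm(Xs[:, None] - Xs[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    n1 = np.mean(ys != ys[d.argmin(axis=1)])  # boundary-fraction proxy for N1
    return float(f1), float(n1)

creator.create("Fitness", base.Fitness, weights=(-1.0, 1.0))  # min F1, max N1
creator.create("Individual", list, fitness=creator.Fitness)

def evaluate(ind):
    mask = np.asarray(ind, dtype=bool)
    if mask.sum() < K_MIN or len(np.unique(y[mask])) < 2:
        return float("inf"), float("-inf")    # infeasible subsets get dominated
    return objectives(mask)

tb = base.Toolbox()
tb.register("bit", random.randint, 0, 1)
tb.register("individual", tools.initRepeat, creator.Individual, tb.bit, len(y))
tb.register("population", tools.initRepeat, list, tb.individual)
tb.register("evaluate", evaluate)
tb.register("mate", tools.cxTwoPoint)
tb.register("mutate", tools.mutFlipBit, indpb=0.02)
tb.register("select", tools.selNSGA2)

pop, _ = algorithms.eaMuPlusLambda(tb.population(n=60), tb, mu=60, lambda_=60,
                                   cxpb=0.7, mutpb=0.3, ngen=40, verbose=False)
front = tools.sortNondominated(pop, len(pop), first_front_only=True)[0]
print(len(front), "subsets on the Pareto front")
```

Each individual on the final front yields one synthetic data set (X[mask], y[mask]) with a different complexity trade-off; repeating the run with other seeds, sources, and objective combinations is what would produce a large landscape such as the 80,000-set collection described above.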
Fig. 8. Comparison between the UCI repository (yellow boxplot) and the artificial landscape (green boxplot).
Fig. 8 depicts the complexity of the landscape collection composed of 80,000 artificial data sets. Let us compare the variability of the data complexity obtained with our artificial sample against that obtained with the UCI. Note that the values of the F measures in the artificial sample almost overlap with the values obtained with the UCI sample. This can be attributed to the fact that sampling instances from a given source does not enhance the variability regarding the discriminative power of attributes. Accordingly, data sets that result from sampling problems from the UCI cannot spread over a wider range than the original sample. However, instance sampling can change the geometry of the class distribution. A look at the N and L measures seems to confirm this hypothesis: the N measures, especially N2, N3, and N4, are spread in regions complementary to those covered by the UCI, as happens with L2 and L3 as well.

The results obtained regarding the coverage of the artificial sample across the data complexity measurement space are encouraging, since we only used five seeds. Fig. 9 shows how these seeds expand in the complexity space. Nevertheless, we acknowledge that testing a given classifier on a set of 80,000 problems may be disproportionate. In fact, we are not asserting that this procedure should be followed. Our experiments are only a proof of concept that the UCI could be enhanced with artificially generated problems. From here on, if a practitioner aims to test a given classifier on a representative sample, then a reduced number of data sets could be selected from the collection of 80,000. The criterion to select such representative data sets should be good coverage of the complexity measurement space, and the number of selected data sets would depend on the density (granularity) of such a sampling in the measurement space. A similar test was carried out in [9] with only 19 problems, with the added value that these 19 data sets were identified as representing the domains of competence of different families of classifiers. In addition, the dimensionality of the complexity space could be reduced by applying a principal component decomposition; a sampling according to the number of data sets required could then be performed in the new space defined by the two principal components (see the sketch below).

One could also argue that the whole measurement space is not completely covered by the collection of 80,000 data sets. Our hypothesis is that the empty regions could be covered by introducing a diverse set of original sources (only five seed data sets were used) and testing other multi-objective formulations (including more complexity measures and more objectives). Moreover, the inclusion of noise and missing values should be further exploited. In summary, we believe that these artificial targeted-complexity problems, which capture the essence of real-world structures, may help to define a set of benchmarks that contributes to identifying domains of competence, testing the properties of learners, and improving them.
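A sketch of this selection idea (our illustration; the measure matrix, the grid granularity, and the one-per-cell sampling rule are assumptions):

```python
# Project the 14-dimensional complexity descriptions onto two principal
# components and keep one data set per occupied cell of a coarse grid.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
complexity = rng.random((80000, 14))          # stand-in: one row of measures per data set
pc = PCA(n_components=2).fit_transform(complexity)

bins = 10                                      # granularity controls how many survive
span = np.ptp(pc, axis=0) + 1e-12
cell = np.floor((pc - pc.min(axis=0)) / span * bins).astype(int)
_, keep = np.unique(cell, axis=0, return_index=True)   # first data set in each cell
print(f"{len(keep)} representative data sets selected")
```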
Fig. 9. Coverage derived from each seed data set: (a) Checkerboard, (b) Pima, (c) Spiral, (d) Wave Boundary, and (e) Yin Yang.
6. Towards UCI+

Throughout this paper, the UCI repository has been analysed and two main shortcomings have been detected: the sample is not standardised and it is not representative enough. In parallel with this analysis, an underlying criticism of the methodology
for learner assessment has emerged: the current experimental methodology has been adopted by the research community without being questioned. The limitations of both the UCI and the methodology could be overcome by taking a meaningful perspective on learner assessment procedures, supported by UCI data sets properly classified for this purpose.
In the design of a learner evaluation procedure, one should clearly establish the goal of the assessment. Three of the most usual objectives are testing whether the learner is (1) robust to specific data characteristics, (2) better than others for a particular type of problem, and (3) best for a given problem of interest. In all these cases, the researcher would like to obtain results that are comparable with previous results published in the literature. This section elaborates on how the limitations of the learner assessment procedure could be tackled by redesigning the UCI repository.

6.1. Testing the robustness of the learner

One could consider testing the robustness of a given learner in terms of scalability, noise, or missing values. In such a case, the sample taken in the experimental design should be representative. Two aspects could be considered in the UCI+ to allow robustness tests.

6.1.1. Revision and maintenance
Data sets should be reviewed and updated. Some problems, such as Iris, should be considered for educational purposes only. Besides, problems whose definition is ambiguous or incomplete should be removed from the repository.

6.1.2. Completeness
A complete sample should incorporate a set of problems covering the whole range of complexities. Such completeness could be achieved with the inclusion of artificial problems and an established data characterisation. This approach would provide a more controlled framework to study learners' behaviours with respect to data deficiencies such as noise, sparsity, or missing values. Completeness is also necessary to support the study of the domains of competence of learners.

6.2. Testing the learner on a type of problem

Investigating the domains of competence of a given learner seems to be a more insightful approach than testing a single classifier for robustness, given that the NFL theorem reminds us of the non-existence of globally superior learners. With a varied set of problems exhibiting different types of complexities, we can run experiments to determine which type of classifier is best fitted to each type of problem. This may result in a better comprehension of which kind of classifier should be applied to a given data set, provided it is fully characterised by its complexity; a comprehension that may be shared by the community. We acknowledge that another approach taken in the literature is to train meta-learners that can automatically adapt to a particular problem, as in [36,16]. In some cases, for instance [5], the meta-classifier uses a description of the data set to adjust itself. In this latter case, the type of meta-learner evolved for a particular type of problem may also be informative and, thus, information about the classifier best suited to a given problem can be obtained. To promote similar studies, it would be useful to incorporate into the repository well-defined categories of problems, both from the point of view of application domains and from the point of view of data complexity. A complete sample of problems would make this possible.

6.2.1. Categories of problems
Data sets from specific research areas such as genetics or big data should be included in the repository. The repository should evolve towards an up-to-date testing framework.

6.2.2. Data set characterisation
Data sets should be accompanied by a data complexity description, involving extrinsic characteristics (e.g. number of attributes and number of instances, as repositories usually include) and intrinsic complexity descriptors (e.g. data complexity estimates, known presence of noise, class imbalance, etc.). With this characterisation, the results of given learners could be matched with data characteristics, contributing to the knowledge of the domains of competence of learners. Furthermore, the UCI repository should provide multiple views of categorisation.
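As an illustration, a minimal sketch of what such a characterisation record could look like is given below; the field names are our own, and the intrinsic complexity estimates (computed, e.g., with the data complexity library [29]) are assumed to be supplied by the caller.

from dataclasses import dataclass, field
from typing import Dict, Optional
import pandas as pd

@dataclass
class DataSetCharacterisation:
    # Extrinsic descriptors, directly observable from the data.
    name: str
    n_instances: int
    n_attributes: int
    n_classes: int
    imbalance_ratio: float   # majority over minority class frequency
    missing_rate: float      # fraction of missing cells
    # Intrinsic descriptors, e.g. {'F1': ..., 'N1': ..., 'L2': ...}.
    complexity: Dict[str, float] = field(default_factory=dict)

def characterise(name: str, data: pd.DataFrame, class_col: str = 'class',
                 complexity: Optional[Dict[str, float]] = None) -> DataSetCharacterisation:
    counts = data[class_col].value_counts()
    return DataSetCharacterisation(
        name=name,
        n_instances=len(data),
        n_attributes=data.shape[1] - 1,
        n_classes=counts.size,
        imbalance_ratio=float(counts.max() / counts.min()),
        missing_rate=float(data.isna().to_numpy().mean()),
        complexity=complexity or {},
    )

A record of this kind, published next to each data set, would let results be indexed by data characteristics rather than by data set name alone.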
6.3. Testing the learner on a given problem

A common situation in real-world practice is having to provide a solution to a challenging problem. Sometimes, the research community has been more concerned with finding robust, general-purpose learners than with giving solutions to particular problems. One of the difficulties that researchers may face is access to real-world data sets, either because ML researchers are separated from the disciplines where these problems arise or because data sets are subject to non-disclosure agreements and are not publicly available.
Challenging problems. The UCI+ should include a section of challenging real-world problems and foster competitions within the research community. This would promote advances in those challenging problems as well as improvements over existing methods.

6.4. Making results comparable

Even when previous results in the literature are obtained on the same data sets, the data can be preprocessed differently and folds can be constructed in different ways. Besides, researchers often tend to run baseline methods (available from software tools) and, consequently, bias the results towards the learner at hand. To avoid this, the UCI+ should provide standardised data sets that require no further pre-formatting and, possibly, a mechanism to make the results of published papers visible and available to other researchers.

6.4.1. Standardisation
All the data sets should follow the same format, e.g. arff and/or csv. Ready-to-use data sets prevent the data pre-formatting tasks that may affect the comparison of results. In addition, the repository should provide the corresponding folds to perform the error estimation. Indeed, this requires stipulating which methods are the most suitable to estimate the classification error.
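The sketch below illustrates the kind of artefact the repository could distribute: fixed stratified 10-fold partitions for a standardised csv data set whose last column is the class, so that every study estimates the error on identical folds. The file naming scheme is our own, and scikit-learn is assumed to be available.

import pandas as pd
from sklearn.model_selection import StratifiedKFold

def write_folds(csv_path: str, n_splits: int = 10, seed: int = 0) -> None:
    # Assumes a standardised csv file whose last column is the class attribute.
    data = pd.read_csv(csv_path)
    y = data.iloc[:, -1]
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for i, (tra_idx, tst_idx) in enumerate(skf.split(data, y), start=1):
        stem = csv_path[:-4]  # strip '.csv'; the naming scheme is illustrative
        data.iloc[tra_idx].to_csv(f'{stem}-{i}tra.csv', index=False)
        data.iloc[tst_idx].to_csv(f'{stem}-{i}tst.csv', index=False)

Publishing the partitions themselves, rather than the partitioning procedure, removes one degree of freedom from every subsequent comparison.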
6.4.2. Learners' results
Given a perfectly standardised set of data sets, the UCI+ could allow researchers to share the results of their learners by means of a table published in the repository itself. This table could include the reference to the original paper and the detailed results of the algorithm needed to perform statistical tests. This may provide useful information either to focus on solving a given problem or to select the appropriate reference learners when studying domains of competence.

7. Conclusions and further trends

The validation of ML techniques (the ones that should provide solutions to big, complex problems) has been carried out over either unknown problems or well-known test problems with few attributes, few instances, few classes, simple boundaries, or regular structures. No real, controlled peculiarities that strongly influence the classifiers are tested. This section summarises the risks of using repositories and benchmarks without a proper methodology for learner assessment and points out future lines towards the use of enhanced repositories that could support well-founded methodologies.
Current experimentation often results in classifiers that perform well on a toy sample. On top of that, the classifier could be accidentally biased from the very beginning, due to the narrow focus of the learner assessment methodology, which is usually based on passing the UCI test. We detected that the data sets from the UCI repository are not exploited to gain insight into the knowledge domains but wrongly exploited as mere number matrices for algorithmic testing. This work stresses that:
1. Most claims made in the ML and pattern recognition (PR) community are based on experimental comparisons of the performance of a given classifier against a set of other well-known classifiers. These claims are poorly supported as long as there is no study on the type and number of selected data sets.
2. The UCI repository is one of the most used repositories in the ML and PR community. We acknowledge the role of the UCI repository in enabling advances in the maturity of classifiers in the past and recent literature. However, as classifiers have become more mature and new challenges need to be addressed (e.g. big data), it is necessary to take the UCI repository one step further to continue enabling advances in the ML and PR fields.
3. The data sets included in the UCI come from different sources, and some of them are not documented well enough to fully understand the target concept behind them. We also detected inconsistencies and format errors that may bias the classifiers' performance and, consequently, call into question the conclusions obtained from them.
4. Artificial data sets may be a good alternative to real-world data sets. They can be (i) designed according to specific complexities that need to be tested and (ii) useful to evaluate the performance of classifiers on increasingly difficult problems, for example in scalability studies.
Having raised our concern about the mechanical (and somehow useless) current research on ML, especially on classification, we comment on the twofold ambitious goal behind our criticism: (1) to raise a new status for classification tasks, with a well-founded methodology for learner assessment and consolidated benchmarks, and (2) to revive the early interests of ML: reasoning, problem solving, and language processing. Regarding the former, we pursue a change of mentality and an update of the experimental methodology for ML techniques. The methodology would be aligned with the ideas proposed by Prechelt [31], who defended that the tuning of learners should be performed during training and that, once the optimal model is obtained, the testing performance should be measured. Inspired by his recommendations, we suggest taking Prechelt's proposal one step further by using synthetic
problems to assess the learner in testing, and then validating the learner on a different set of problems with real-world structures. This can help us gain an understanding of domains of competence and define rules for experimental set-ups. In this scenario, data characterisation plays a crucial role.
This new proposal of methodologies for learner assessment should be promoted by the design of a new repository, called UCI+, which would include the current UCI data sets enriched with a more diverse set of problems: basically, singular real-world problems that still pose challenges to current ML methods, a characterised set of problems, and artificial problems. The resulting UCI+ would become well known if endorsed by the current UCI's popularity. Unfortunately, generating artificial problems and designing, implementing, and adopting a new methodology for learner assessment is a long-term endeavour, subject first to the acceptance of the community. Our contributions may be useful to initiate this debate in the community, which goes beyond the scope of this paper. Standardising the methodologies that test the performance of learners should include a discussion on the need for repositories that contain different types of data sets, covering the range of complexities present in real-world data. Also, much more effort should be invested in allowing real-world problems and data sets from industry to be released to the scientific community. Since it is difficult for repositories to achieve a full range of complexities, a complementary alternative is to provide artificial data sets. The advantage of artificial data sets is that they can cover a specific region of the complexity space with the desired granularity.
Acknowledgements

The authors thank the Ministerio de Educación y Ciencia for its support under project TIN2008-06681-C06-05, as well as Fundació Crèdit Andorrà and Govern d'Andorra. In addition, we would like to acknowledge Dr. Pier Luca Lanzi for his inspiring comments, and Dr. Francisco Herrera and Dr. Salvador García for their engaging discussions.
Appendix A. Scrutiny of the UCI

This section provides a detailed analysis of the UCI data sets by reporting those that required major pre-processing. We suggest best practices and raise the need for authors' clarification.

Abalone. Although the UCI repository provides only a single file, two versions of Abalone exist: (1) the regression problem and (2) the classification problem. Originally, this is a regression problem, since it consists of predicting the age of an abalone from physical measurements. The age of the individuals under observation is determined by the number of rings, which ranges from 1 to 29 (by adding 1.5 to this value, we obtain the age in years). However, the documentation indicates that the problem can be treated as a classification problem as well. To this end, the authors specify three categories by grouping the number of rings as follows: 1-8, 9-10, and 11 onwards. Authors' clarification required: There are no individuals with 28 rings. Could this be an error? Best practice: The regression file should be transformed according to the authors' recommendation if used as a classification problem.

Acute Inflammation. Again, the UCI repository provides only one file. However, the data set contains two attributes to predict. This means that the data set has to be processed and divided into two classification problems, predicting (1) the inflammation of the urinary bladder and (2) nephritis of renal pelvis origin. Best practice: Learners should be run over data sets whose classification purpose is clear.

Adult. This problem is also known as Census Income and, therefore, has a duplicate entry in the UCI repository. The files adult.data and adult.test were merged to obtain the complete data set. It is important to note that adult.test does not follow the same notation as adult.data: each instance ends with a full stop. This may cause problems when matching classes if not corrected. Authors' clarification required: The attribute named HoursPerWeek goes up to 99. This value is considerably high. Should it be interpreted as a missing value? Best practice: The methodology for learner assessment requires the application of validation techniques such as 10-fold cross-validation [8]. Therefore, partitions of the data should be provided in order to perform fair comparisons.

Annealing. The documentation contains extra information that is not present in the samples. For instance, all the nominal attributes except ProductType, Shape, and Bore can take the value '-', which means not_applicable. The authors warn not to mix it up with missing values; however, none of the instances takes this value. On the contrary, the data contain a high percentage of missing values. The class definition includes class 4 (as a nominal value), but there are no instances with that value; therefore, we removed it from the definition header. The files anneal.data and anneal.test were merged to obtain the complete data set. Authors' clarification required: The attributes M, Corr, Jurofm, S, and P are constant attributes referring to missing values; that is, there are no measurements for these attributes across the whole set of samples. Should these attributes be removed?
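As an illustration of the Abalone best practice above, the documented grouping of rings could be scripted as follows; this is a minimal pandas sketch in which the attribute names follow the UCI documentation and the output file name is our own.

import pandas as pd

# Derive the classification version of Abalone from the regression file,
# following the documented grouping of rings: 1-8, 9-10, and 11 onwards.
cols = ['Sex', 'Length', 'Diameter', 'Height', 'WholeWeight',
        'ShuckedWeight', 'VisceraWeight', 'ShellWeight', 'Rings']
data = pd.read_csv('abalone.data', header=None, names=cols)
data['class'] = pd.cut(data['Rings'], bins=[0, 8, 10, 29],
                       labels=['1-8', '9-10', '11-on'])
data.drop(columns='Rings').to_csv('abalone-3class.csv', index=False)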
Arcene. The documentation indicates that the problem is composed of eight files; six of them contain the data and are described as follows:
dataname_train.data: training set (a sparse or a regular matrix, patterns in lines, features in columns).
dataname_valid.data: validation set.
dataname_test.data: test set.
dataname_train.labels: labels (truth values of the classes) for the training examples.
dataname_valid.labels: validation set labels (withheld during the benchmark).
dataname_test.labels: test set labels (withheld during the benchmark).
However, only five files are provided. Hence, the final data set is built by merging arcene_train.data, arcene_train.labels, arcene_valid.data, and arcene_valid.labels.

Arrhythmia. Authors' clarification required: There are instances whose attribute Age takes the value 0. Should this be considered a missing value?

Artificial Characters. It is defined as a multi-instance problem. Each sample is described by a variable set of instances and stored in independent files. learn.zip contains 1000 files that describe a total of 5109 instances, and test.zip contains 5000 files encoding 25,545 instances. All the instances were merged to obtain the final data set. In addition, the second and third attributes, which specify the line of the instance in the file (irrelevant and constant attributes), were removed. Finally, the class attribute was moved to the last column.6 Best practice: The attribute to predict should be named class and displayed in the last column.

Audiology (Original). The data set does not follow a standard format. The use of the standardised version provided by the repository is recommended. On a side note, this data set should be processed again, since the assumptions made in the standardised version could vary.

Audiology. The files audiology.standardized.data and audiology.standardized.test were merged to create the final classification data set. In addition, we removed the unique identifier attribute.

Australian Sign Language Signs. It is not clear how the data should be grouped. Authors' clarification required: The documentation indicates that different subjects were involved in the experimentation across different sessions. Is each session an individual data set? In addition, each file corresponds to a class; however, are alive0 and alive1 the same class? We did not have enough information to format this data set.

Australian Sign Language Signs (High Quality). The same problem as Australian Sign Language Signs (see the previous entry).

AutoUni. It is a generator of data sets. However, several data set samples are provided. We considered those already formatted in arff, eight in total.

Badges. The data, artificially generated as an activity for conference attendees, are described by only one attribute. The goal was to find out the generating function through this single attribute (people's names). It cannot be considered a classification problem.

Balance Scale. The class attribute was moved to the last column. On a side note, although the data are described by numeric values, the attributes are categorical.

Balloons. The UCI repository provides four files for this problem, which correspond to four different experiments. Best practice: The version used in the experimentation should be specified.

Breast Cancer. The data set is available upon request. We emailed the librarians but did not get any response.

Breast Cancer Wisconsin (Diagnosis). The data set contains a unique identifier, which was removed, and the class attribute was moved to the last column.

Breast Cancer Wisconsin (Original). The data set contains a unique identifier, which was removed.
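Two of the transformations above recur throughout this appendix: removing a unique identifier and moving the class attribute to the last column (see footnote 6). A minimal sketch of both steps, where id_col and class_col are placeholders for the attributes of the data set at hand:

from typing import Optional
import pandas as pd

def standardise(data: pd.DataFrame, class_col: str,
                id_col: Optional[str] = None) -> pd.DataFrame:
    data = data.copy()
    if id_col is not None:
        data = data.drop(columns=id_col)  # remove the unique identifier
    data['class'] = data.pop(class_col)   # re-append the class as the last column
    return data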
Breast Cancer Wisconsin (Prognostic). The data set contains a unique identifier, which was removed, and the class attribute was moved to the last column.

Breast Tissue. The data set is provided in Excel format, which requires conversion. It contains a unique identifier, which was removed, and the class attribute was moved to the last column. In addition, the documentation of the problem indicates that the data can be transformed into a problem of four classes by merging the following three: fibro-adenoma (fad), mastopathy (mas), and glandular (gla). Therefore, we obtained two versions of the problem.

Cardiotocography. The data set is provided in Excel format, which requires conversion. Moreover, there are two versions of the problem, with 3 classes and with 10 classes, extracted from the raw data. The attributes filename and SegFile were removed, since they are unique and do not contain relevant information. Authors' clarification required: The attribute DR is constant with a fixed value of 0. Should this attribute be removed?

Census Income KDD. To obtain the final data set, the files census-income.dat and census-income.test were merged. There is an extra attribute whose meaning needed to be identified. In fact, the problem is described by either 40 or 42 attributes.

Character Trajectories. The file is in Matlab format; we did not process it.
6 We applied this transformation because, if the class attribute is not named class, software such as Weka [37] assumes that the class (the attribute to predict) is the last attribute, turning the data into a very different classification problem.
Chess King-Rook vs. King. The definition of the problem is based on the positions of the pieces on the chessboard. Therefore, it is important to consider whether to define the attributes as nominal or as integer in order to interpret distances.

CMU Faces. We were not able to process the .pgm format.

Congressional Voting Records. The class attribute was moved to the last column.

Connectionist Bench (Sonar, Mines vs. Rocks). The UCI repository provides three files: sonar-all.data, sonar.mines, and sonar.rocks. The latter two refer to the results of the experiments; the former corresponds to the classification problem in csv format.

Connectionist Bench (Vowel Recognition - Deterding Data). The UCI repository provides three files: vowel-context.data, vowel.data, and vowel.tr-orig-order. The first refers to the classification problem, the second is the raw data from the experimentation, and the third is not documented. We observed that the first is the same as the original file but without the attributes TrainOrTest, SpeakerNumber, and Sex. Thus, we obtained two versions of the problem.

Contraceptive Method Choice. Although the data are described by numeric values, note that the documentation indicates that most of them represent symbolic attributes.

Credit Approval. The attribute A4 can take four values. However, there are no instances with value t, although we kept it in the definition.

Demospongiae. The authors provide the data in different formats (Lisp and Prolog), but none of them is standard. In addition, it is not a simple classification problem, since some attributes are multi-valued, which means that the size of the instances is variable.

Dermatology. Authors' clarification required: Only five classes are defined while the data contain six. This should be revised. On the other hand, the attribute Age has some values equal to 0. Should they be replaced with '?' (missing value)?

Dexter. It is a sparse matrix coded as attribute-value pairs and requires major processing.

Dorothea. Same case as Arcene (see the Arcene entry). The final data set is built by merging dorothea_train.data, dorothea_train.labels, dorothea_valid.data, and dorothea_valid.labels.

Echocardiogram. The attribute Name was removed, since its value is constant (the string name, used to hide patients' names). Authors' clarification required: The class attribute, named AliveAt1 and containing missing values, is derived from the first two attributes, Survival and StillAlive. It is not clear whether these two attributes should remain in the data. Furthermore, instances with unknown class should be removed.

Ecoli. The first attribute, SequenceName, was removed, since it is a unique identifier.

Flags. The first attribute, corresponding to the name of each country, was removed, as it is a unique identifier. The attribute Religion was moved to the last column, since it is the class to predict.

Gisette. Same case as Arcene (see the Arcene entry). The final data set is built by merging gisette_train.data, gisette_train.labels, gisette_valid.data, and gisette_valid.labels.

Glass Identification. The documentation reports seven different types of glass. However, there are no instances of class 4 (vehicle windows, non-float processed). Thus, this class was removed from the definition header. In addition, the first attribute was removed, as it is a unique identifier.

Hayes-Roth. The two files provided do not contain the same information. hayes-roth.data contains a name attribute, which was removed, as it is a unique identifier. To obtain the final data set, the processed hayes-roth.data and hayes-roth.test were merged.
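Merging the separate train and test files shipped by the repository is another recurring step (Adult, Annealing, Hayes-Roth, and many of the entries below). A minimal sketch, assuming plain csv files with identical columns and no header row:

import pandas as pd

def merge_splits(train_path: str, test_path: str, out_path: str) -> None:
    # Concatenate the two partitions back into a single, complete data set.
    train = pd.read_csv(train_path, header=None)
    test = pd.read_csv(test_path, header=None)
    pd.concat([train, test], ignore_index=True).to_csv(
        out_path, index=False, header=False)

merge_splits('anneal.data', 'anneal.test', 'anneal-full.csv')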
Heart Disease. There are different versions of this problem. We used the files that had already been processed or reprocessed. We detected that the pre-processing did not respect the types of the attributes, so some values were interpreted as real or integer when they were categorical. For instance, the attribute Thal, which can take only the three values 3, 6, and 7 with categorical meaning, was coded as a real value in ProcessedCleveland. ProcessedHungarian and ProcessedVa have the attribute Ca constant with a fixed value of 0, and ProcessedSwitzerland has the attribute Chol constant with a fixed value of 0. Such preprocessing is not well documented. Authors' clarification required: ReprocessedHungarian has no explanation of the meaning of 9, which may symbolise missing values.

Hepatitis. The class attribute was moved to the last column.

Hill Valley. There are two versions of the problem: (1) with noise and (2) without noise. To obtain both versions, the files Hill_Valley_with_noise_Training.data and Hill_Valley_with_noise_Testing.data were merged, and the files Hill_Valley_without_noise_Training.data and Hill_Valley_without_noise_Testing.data were merged.

Horse Colic. The documentation indicates that the class target can be defined by attribute 23, 24, 25, 26, or 27. This means that there are five possible versions of the problem. Thus, we obtained HorseColic-Outcome.arff and HorseColic-SurgicalLesion.arff from attributes 23 and 24. Attribute 25 leads to HorseColic-TypeOfLesion-1.arff, but this is rather a regression problem, since the class is a coded number. The other possibilities were discarded, since the class is constant. On a side note, attribute 28 was removed, since the authors point out that this variable is of no significance and the pathology data were not included or collected for these cases. Finally, we detected some mismatches between the definition of the attributes and the actual values. For instance, the attribute Age is defined as a binary attribute with values 1 and 2, but in the data the values are 1 and 9. The same happens with the attribute CapillaryRefillTime. The attribute NasogastricReflux does not span the specified interval [0,14]; the actual interval is shorter: [1,3].
Image Segmentation. The attribute RegionPixelCount is constant with a fixed value of 9. The class attribute was moved to the last column. Authors' clarification required: Should the attribute RegionPixelCount be removed, as it is constant?

Ionosphere. Attribute 2 is constant with a fixed value of 0.

Iris. The UCI repository provides two versions of the problem: bezdekIris.data and iris.data. Actually, bezdekIris.data is a correction of the data set released by [10]. The 35th sample is now 4.9,3.1,1.5,0.2,Iris-setosa (the error was in the fourth feature), and the 38th sample is now 4.9,3.6,1.4,0.1,Iris-setosa (the errors were in the second and third features). Best practice: The version used in the experimentation should be specified (or, in this case, the obsolete original file removed).

ISOLET. To obtain the final data set, isolet1+2+3+4.data and isolet5.data were merged.

Japanese Credit Screening. It is a Lisp-format version of the Credit Approval problem.

Japanese Vowels. The class attribute is not specified. We did not understand the purpose of the classification problem.

KDD Cup 1999. There are three versions of the problem, and not all of them contain the same classes. For instance, KDDCup1999-Corrected does not contain the class spy. This is one of the largest problems in the repository (4,000,000 instances); yet, around 40% of the instances are duplicates.

Letter Recognition. The class attribute was moved to the last column.

Libras Movement. There are six versions of the problem; five of them are subsets.

Low Resolution Spectrometer. The first attribute, referring to the LRS-name, was removed, as it is a unique identifier. The class attribute was moved to the last column. The LRS-class values range from 0 to 99, coding the class in the tens digit and the subclass in the units digit. However, the instances represent only 48 classes: {2, 3, 4, 5, 12, 13, 14, 15, 16, 17, 18, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 50, 69, 71, 72, 73, 79, 80, 81, 82, 85, 91, 92, 95, 96}.

Lung Cancer. The problem was updated by inserting some missing values into the data set; thus, the results obtained in previous works are no longer comparable. All the data are described by numeric values, despite referring to nominal attributes. The class attribute was moved to the last column.

Lymphography. The data set is available upon request. We emailed the librarians but did not get any response.

Madelon. Same case as Arcene (see the Arcene entry). The final data set is built by merging madelon_train.data, madelon_train.labels, madelon_valid.data, and madelon_valid.labels.

Mammographic Mass. The documentation differentiates between ordinal and nominal values.

Mechanical Analysis. The data do not follow a standard format, and it is a multi-label classification problem.

Meta-data. This is a regression problem, since the class attribute is continuous. The problem could be transformed into a classification problem by taking the attribute AlgName (the name of the algorithm) as the class. Moreover, there is a conceptual error in the definition, since the question mark is used as the 'don't care' value. Possibly, the instances containing this notation should be removed.

MiniBooNE Particle Identification. It seems to be a regression problem, since there are no symbolic or numeric labels; the data set is only composed of real values.

Molecular Biology (Promoter Gene Sequences). The second attribute, referring to the instance name, was removed, as it is a unique identifier. The class attribute was moved to the last column.
Molecular Biology (Protein Secondary Structure). The data only contain one attribute and the class. It cannot be considered a classification problem.

Molecular Biology (Splice-junction Gene Sequences). The second attribute, referring to the instance name, was removed, as it is a unique identifier. The class attribute was moved to the last column. The definition of the problem should be revised because of the meaning of some values; that is, D, N, S, and R symbolise the following: D = A, G, or T; N = A, G, C, or T; S = C or G; and R = A or G. This representation actually encodes combinations of instances that should be explicitly represented.

Monks. The three versions of the problem contain an attribute corresponding to an identifier. Although the documentation indicates that it is a unique symbol for each instance, we did not remove it, since different instances share the same ID value. On a side note, the class attribute was moved to the last column.

Multiple Features. To obtain the final data set, the files mfeat-fac, mfeat-fou, mfeat-kar, mfeat-mor, mfeat-pix, and mfeat-zer were concatenated, and the class value was added.

Mushroom. There are two versions of the problem: (1) normal and (2) expanded. In both, the attribute VeilType is constant with a fixed value of p, the instances do not span the intervals defined, the class attribute was moved to the last column, and the header declares the attributes as real values.

Musk (Version 1). Attribute 2 was removed, as it is a unique identifier.

Musk (Version 2). Attribute 2 was removed, as it is a unique identifier.

Optical Recognition of Handwritten Digits. To obtain the final data set, the files optdigits.tra and optdigits.tes were merged. All the attributes are integers ranging in the interval [0,16]. However, we observed that attributes 1 and 40 are constant with a fixed value of 0, and other attributes have restricted intervals.

Ozone Level Detection. There are two versions of the problem. For both, the attribute Date was removed.

P53 Mutants. There are eight versions of the problem. The documentation indicates that there are 5,409 attributes per instance. However, the data only contain one plus the class.
Page Blocks Classification. The file provided is not formatted and requires processing.

Parkinsons. There are two versions of the problem: (1) Parkinsons and (2) Parkinsons-Updrs. For the former, the attribute status, which is the class attribute, was moved to the last column. The latter is a two-variable regression problem.

Pen-Based Recognition of Handwritten Digits. To obtain the final data set, the files pendigits.tra and pendigits.tes were merged.

Pima Indian Diabetes. The documentation indicates that the data set contains missing values. However, no values in the data are marked with the standard notation, so this could not be verified.

Pittsburgh Bridges. It is not a typical classification problem. The authors indicate that there are no classes in the domain. The domain DESIGN, composed of five properties, has to be predicted from seven attributes. It is a multi-label problem.

Poker Hands. The attributes for the ranks of the cards (C1, C2, C3, C4, and C5) represent Ace, 2, 3, ..., Queen, and King by numerical values. However, the distance between Ace and King should be revised in the definition of the problem. On the other hand, to obtain the final data set, the files poker-hand-training-true.data and poker-hand-testing.data were merged.

Post-Operative Patient. The attribute COMFORT can range in the interval [0,20]. However, in the set of instances it ranges in the interval [5,15].

Primary Tumor. The data set is available upon request. We emailed the librarians but did not get any response.

Reuters Transcribed Subset. The information needed to process the format is missing.

Reuters-21578 Text Categorization Collection. The information needed to process the format is missing.

Robot Execution Failures. There are five versions of the learning problem. The class attribute was moved to the last column. The files required some processing to obtain a csv format.

SECOM. The data contain NaN values.

Semeion Handwritten Digit. The attributes are binary, although real values are used for their codification. We processed the class attribute, since each digit was coded in a separate binary attribute (one attribute per digit).

Shuttle Landing Control. The description of the attributes indicates that all of them are nominal. However, in the file the values are coded numerically, and we were not able to determine the equivalence. On a side note, the class attribute was moved to the last column.

Soybean (Large). The class attribute was moved to the last column. The definition of the attributes differs from the real data: the nominal values are coded with integers.

Soybean (Small). There are some constant attributes. The files fisher-order and stepp-order contain the same data with the instances sorted differently.

Spambase. The attributes' ranges are generally more restricted than specified.

SPECT Heart. To obtain the final data set, the files SPECT.train and SPECT.test were merged. The class attribute was moved to the last column. The definition indicates that the values range in the interval [0,3], but not all the instances cover the entire range.

SPECTF Heart. To obtain the final data set, the files SPECTF.train and SPECTF.test were merged. The class attribute was moved to the last column. The definition indicates that the values range in the interval [0,100], but not all the instances cover the entire range.

Spoken Arabic Digits. The train and test files require different treatments in order to add the class attribute. In the documentation, the authors differentiate the experiments involving males and females.
However, we did not consider this information as an attribute, since it is not explicitly indicated in the description. We could not build the final data set because there is confusion between blocks and instances.

Statlog (Australian Credit Approval). This is the same problem as Credit Approval. However, the description of the data has been slightly changed to numeric values, although some of them describe categorical attributes. In addition, the class is defined by the set {1,2}, but in the data the class is codified by the binary set {0,1}.

Statlog (German Credit Card). There are two versions of the problem; the second appears as a numerical transformation of the attributes. Again, this kind of transformation calls into question the effect of computing distances between Boolean and numeric attributes.

Statlog (Image Segmentation). The problem definition is the same as Image Segmentation, but the data differ in 4,623 instances. The attribute RegionPixelCount is constant with a fixed value of 9 here too.

Statlog (Landsat Satellite). All the attributes are continuous and their values should range in the interval [0,255]; nevertheless, not all the attributes cover the defined space. To obtain the final data set, the files sat.trn and sat.tst were merged.

Statlog (Shuttle). To obtain the final data set, the files shuttle.trn and shuttle.tst were merged.

Statlog (Vehicle Silhouettes). To obtain the final data set, the files xaa.dat, xab.dat, xac.dat, xad.dat, xae.dat, xaf.dat, xag.dat, xah.dat, and xai.dat were merged.

Steel Plates Faults. The file Faults27x7.var generates an internal error when downloading. The data are presented as a multi-label problem, with each class coded in a binary attribute. We transformed it into a single-class problem by merging the seven attributes and assigning a different value to each one. In addition, there is no documentation about the types of the attributes.

Synthetic Control Chart Time Series. We added the class to the provided file.
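Both Semeion and Steel Plates Faults required collapsing a one-binary-attribute-per-class encoding into a single class attribute. A minimal sketch of that transformation (the Steel Plates column names in the usage comment follow the data set's documentation and are otherwise illustrative):

import pandas as pd

def collapse_one_hot(data: pd.DataFrame, class_cols: list) -> pd.DataFrame:
    features = data.drop(columns=class_cols)
    # idxmax returns, for each row, the name of the column holding the 1.
    features['class'] = data[class_cols].idxmax(axis=1)
    return features

# e.g. Steel Plates Faults, where the seven fault types are binary columns:
# tidy = collapse_one_hot(raw, ['Pastry', 'Z_Scratch', 'K_Scatch', 'Stains',
#                               'Dirtiness', 'Bumps', 'Other_Faults'])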
Syskill and Webert Web Page Ratings. The purpose of the data set does not seem to be well established. It is not a classic classification problem; it is a study about web pages. The information does not follow a standard format, and it is not possible to extract information from the files representing the summaries of the experiments, named index and stored in the four folders of the zip provided. The main issue is that, of the five attributes that describe the problem (file-name, rating, url, date-rated, and title), file-name is a unique identifier, rating is the class, and the authors point out that url, date-rated, and title are not used for the learning process; they were collected for other purposes. In addition, we detected an inconsistent methodology, since the authors defined three classes but, in some experiments, decided to merge two of them (medium and cold) because medium has a low frequency.

Teaching Assistant Evaluation. It is worth noting that, again, the data are described by numeric attributes while they represent categorical concepts.

Thyroid Disease. There are eleven different databases.

ThyroidsDisease-Allbp. To obtain the final data set, the files allbp.data and allbp.test were merged.

ThyroidsDisease-Allhyper. To obtain the final data set, the files allhyper.data and allhyper.test were merged.

ThyroidsDisease-Allhypo. To obtain the final data set, the files allhypo.data and allhypo.test were merged.

ThyroidsDisease-Allrep. To obtain the final data set, the files allrep.data and allrep.test were merged.

ThyroidsDisease-Dis. To obtain the final data set, the files dis.data and dis.test were merged.

ThyroidsDisease-Sick. To obtain the final data set, the files sick.data and sick.test were merged. The attribute TBGMeasured is always f, which means that this parameter is never measured and, as a consequence, the attribute TBG takes missing values across all the instances. We considered that the problem is not well defined, since binary attributes provide no information when they are false and the corresponding measurements are replaced with missing values. The same happens for the attributes TSH, T3, TT4, T4U, and FTI. Authors' clarification required: There is an instance whose value for the attribute Age is 455. This mistake can only be fixed by the authors of the problem; the value may be 45 or 55.

ThyroidsDisease-Ann. To obtain the final data set, the files ann-train and ann-test were merged.

ThyroidsDisease-NewThyroid. The attribute T3-ResinUptakeTest refers to a percentage, yet its value goes beyond 100 (the interval is [65,144]). Moreover, there is a note saying: ''there is a slight possibility of having the attribute numbers mixed up, see [2a] if it matters.'' This comment is not appropriate for a scientific document; the meaning of attributes always matters, and it is crucial to interpret and validate the model extracted by the learners. The file thyroid0387.data lacks the class and contains a unique identifier in the last column. There are also problems with Hypothyroid and SickEuthyroid.

Trains. There are two versions of the problem, but only one is in the one-instance-per-line format. On a side note, some files correspond to documentation of other files that are not available. The data do not follow the standard format; we replaced ' ' with '?'. The attributes NumWheelsC5, NumLoadsC5, LengthC5, RectangleNextToHexagon, HexagonNextToHexagon, and CircleNextToCircle are constant.

UJI Pen Characters. The data do not follow a standard format.
UJI Pen Characters (Version 2). The data do not follow a standard format either.

University. The data are provided in Lisp format, which allows multi-valued attributes (the size of the instances is variable). It is not a classic classification problem.

URL Reputation. It is not possible to code the data in arff format due to the high dimensionality.

Volcanoes on Venus. There are twenty versions of the problem. For each experiment ({A1, A2, A3, A4, B1, B2, B3, B4, B5, B6, C1, D1, D2, D3, D4, E1, E2, E3, E4, E5}), we merged all the .lxyv files, which gather the class and three attributes.

Wall-Following Robot Data. There are three versions of the problem: (1) 24 attributes, (2) 4 attributes, and (3) 2 attributes. The three data sets were updated three times (in July, August, and September); however, there is no difference between the files. We processed the latest version.

Wine. The class attribute was moved to the last column.

Wine Quality. There are two versions of the problem: (1) red and (2) white wine. The documentation indicates that the problems can be tackled as classification problems, since the quality is measured between 0 and 10; however, it is a regression problem. We observed that for the red wine the quality ranges in the interval [3,8], and for the white wine it ranges in the interval [3,9].

Yeast. Although the sequence name is an identifier, it is not unique.

Zoo. The documentation indicates that the first attribute is unique: a different animal name for each instance. However, we found two instances with the name frog. In addition, there is the species girl. Although these observations are written down in the documentation with no explanation, we considered this problem badly defined.
References

[1] Metal: A meta-learning assistant for providing user support in machine learning and data mining, 1998.
[2] J. Alcalá-Fdez, L. Sánchez, S. García, M.J. del Jesus, S. Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit, V. Rivas, J. Fernández, F. Herrera, KEEL: A software tool to assess evolutionary algorithms for data mining problems, Soft Computing 13 (2009) 307–318.
[3] E. Bernadó-Mansilla, T.K. Ho, Domain of competence of XCS classifier system in complexity measurement space, IEEE Transactions on Evolutionary Computation 9 (2005) 82–104.
[4] P. Brazdil, J. Gama, B. Henery, Characterizing the applicability of classification algorithms using meta-level learning, in: Proceedings of the European Conference on Machine Learning, 1994, pp. 83–102.
[5] C. Castiello, G. Castellano, A.M. Fanelli, MINDFUL: A framework for meta-inductive neuro-fuzzy learning, Information Sciences 178 (2008) 3253–3274.
[6] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Transactions on Evolutionary Computation 6 (2002) 182–197.
[7] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
[8] T.G. Dietterich, Approximate statistical tests for comparing supervised classification learning algorithms, Neural Computation 10 (1998) 1895–1924.
[9] R.P.W. Duin, M. Loog, E. Pekalska, D.M.J. Tax, Feature-based dissimilarity space classification, in: ICPR 2010, Lecture Notes in Computer Science, vol. 6388, Springer, 2010, pp. 46–55.
[10] R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936) 179–188.
[11] A. Frank, A. Asuncion, UCI machine learning repository, 2010.
[12] S. García, F. Herrera, An extension on ''Statistical comparisons of classifiers over multiple data sets'' for all pairwise comparisons, Journal of Machine Learning Research 9 (2008) 2677–2694.
[13] F. van der Heijden, R.P.W. Duin, D. de Ridder, D.M.J. Tax, Classification, Parameter Estimation and State Estimation – An Engineering Approach Using Matlab, John Wiley & Sons, 2004.
[14] T.K. Ho, M. Basu, Complexity measures of supervised classification problems, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 289–300.
[15] R.C. Holte, Very simple classification rules perform well on most commonly used datasets, Machine Learning 11 (1993) 63–90.
[16] N. Jankowski, K. Grąbczewski, Universal meta-learning architecture and algorithms, in: Meta-Learning in Computational Intelligence, Studies in Computational Intelligence, vol. 358, Springer, Berlin Heidelberg, 2011, pp. 1–76.
[17] R. Kohavi, D. Sommerfield, J. Dougherty, Data mining using MLC++, a machine learning library in C++, International Journal on Artificial Intelligence Tools 6 (1997) 537–566.
[18] A.N. Kolmogorov, Three approaches to the quantitative definition of information, Problems in Information Transmission 1 (1965) 1–7.
[19] N. Krasnogor, D.A. Pelta, Measuring the similarity of protein structures by means of the universal similarity metric, Bioinformatics 20 (2004) 1015–1021.
[20] P. Langley, The changing science of machine learning, Machine Learning 82 (2011) 275–279.
[21] T.S. Lim, W.Y. Loh, Y.S. Shih, A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms, Machine Learning 40 (2000) 203–229.
[22] J. Luengo, A. Fernández, S. García, F. Herrera, Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling, Soft Computing (2010).
[23] N. Macià, Data complexity in supervised learning: a far-reaching implication, Ph.D. thesis, La Salle – Universitat Ramon Llull, 2011.
[24] N. Macià, E. Bernadó-Mansilla, A. Orriols-Puig, On the dimensions of data complexity through synthetic data sets, in: Frontiers in Artificial Intelligence and Applications, vol. 184, IOS Press, 2008, pp. 244–252.
[25] N. Macià, E. Bernadó-Mansilla, A. Orriols-Puig, T.K. Ho, Learner excellence biased by data set selection: a case for data characterisation and artificial data sets, Pattern Recognition 46 (2013) 1054–1066.
[26] N. Macià, A. Orriols-Puig, E. Bernadó-Mansilla, In search of targeted-complexity problems, in: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, ACM, 2010, pp. 1055–1062.
[27] J.M. Maciejowski, Model discrimination using an algorithmic information criterion, Automatica 15 (1979) 579–593.
[28] D. Michie, D.J. Spiegelhalter, C. Taylor, J. Campbell (Eds.), Machine Learning, Neural and Statistical Classification, Ellis Horwood, Upper Saddle River, NJ, USA, 1994.
[29] A. Orriols-Puig, N. Macià, T.K. Ho, Documentation for the data complexity library in C++, Technical Report, La Salle – Universitat Ramon Llull, 2010.
[30] Y. Peng, P.A. Flach, C. Soares, P. Brazdil, Improved dataset characterisation for meta-learning, in: Proceedings of the 5th International Conference on Discovery Science, 2002, pp. 141–152.
[31] L. Prechelt, PROBEN1 – A set of benchmarks and benchmarking rules for neural network training algorithms, Technical Report, Universität Karlsruhe, Fakultät für Informatik, 1994.
[32] L. Saitta, F. Neri, Learning in the ''real world'', Machine Learning 30 (1998) 133–163.
[33] S.L. Salzberg, On comparing classifiers: pitfalls to avoid and a recommended approach, Data Mining and Knowledge Discovery 1 (1997) 317–328.
[34] J.S. Sánchez, R.A. Mollineda, J.M. Sotoca, An analysis of how training data complexity affects the nearest neighbor classifiers, Pattern Analysis and Applications 10 (2007) 189–201.
[35] R.J. Solomonoff, The Kolmogorov lecture: the universal distribution and machine learning, The Computer Journal 46 (2003) 598–601.
[36] R. Vilalta, C.G. Giraud-Carrier, P. Brazdil, Meta-learning – concepts and techniques, in: Data Mining and Knowledge Discovery Handbook, 2010, pp. 717–731.
[37] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, second ed., Morgan Kaufmann, San Francisco, 2005.
[38] D.H. Wolpert, The lack of a priori distinctions between learning algorithms, Neural Computation 8 (1996) 1341–1390.