Pattern Recognition Letters 19 (1998) 537–544
DataGen: a generator of datasets for evaluation of classification algorithms ¹
Dmitri A. Rachkovskij *, Ernst M. Kussul
Cybernetics Center, National Ukrainian Academy of Sciences, Prospect Glushkova 40, Kiev 252650 GSP, Ukraine
Received 23 October 1997; revised 13 March 1998
Abstract
Dataset generators are useful for evaluating an algorithm's performance because they allow control of the characteristics and amount of data used for benchmarking. We propose a dataset generator called DataGen that allows varying the number of input features and output classes, the complexity and realizations of class regions, the distributions of data samples, the noise level, and the number of data samples. A C language listing of the basic DataGen version is provided. © 1998 Published by Elsevier Science B.V. All rights reserved.
Keywords: Benchmarking; Evaluation; Classification; Supervised learning; Datasets; Data generator; Synthetic data
1. Why a dataset generator?
Proper experimental evaluation of supervised learning algorithms for classification tasks is considered very important for progress in pattern recognition, machine learning, neural networks, and related fields. The number of publicly accessible benchmark collections that can be used for such testing has been growing in recent years (Merz and Murphy, 1996; Murphy and Aha, 1995; Fahlman and White, 1993; Prechelt, 1994; and references therein). Most of the datasets in those repositories contain real-world data. This is very valuable, since such data are usually difficult to collect for reasons including effort, cost, and confidentiality. For the same reasons, real-world datasets usually contain a limited amount of data. It is therefore a non-trivial problem to use such a limited amount of data for the proper configuring, training, and testing of a classifier (Prechelt, 1996; Dietterich, 1997; Flexer, 1996). The limited size also precludes investigating how a particular algorithm behaves as the amount of training data increases, or how much the training set should grow to achieve a desired classification accuracy. Besides, datasets possess peculiarities (complexity, noise, irrelevant attributes, etc.) that strongly influence the test results for a particular classifier. Regrettably, for natural data these characteristics are usually unknown. Even if they were known, they are fixed. It is therefore impossible to explore the behavior of a learning algorithm under variations in these characteristics.
* Corresponding author. E-mail: [email protected].
¹ Electronic annexes available. See http://www.elsevier.nl/locate/patrec.
0167-8655/98/$19.00 © 1998 Published by Elsevier Science B.V. All rights reserved. PII: S0167-8655(98)00053-1
Artificial datasets with controlled characteristics afford a better understanding of benchmarking results and allow estimation of the domain parameters for which one classifier or another has an advantage. There are well-known artificial benchmark problems such as XOR and the generalized XOR (n-bit parity) problem, the n-bit encoder, the symmetry problem, and the T–C problem (Rumelhart and McClelland, 1986), as well as more sophisticated ones, e.g., the two spirals problem (Fahlman and White, 1993) or superpositions of normal distributions. However, most well-known test problems that are widely used today are still very simple. They usually have few attributes, few classes, simple class boundaries, or a regular structure, or can be solved with 100% accuracy. Therefore they are not adequate models of complicated real-world problems. Besides, it is impossible to prepare in advance all the test problems with all the characteristics that a developer may need for testing his or her algorithms. These limitations can be overcome by dataset generators.
In this paper we consider a dataset generator that is able to generate data with a varied number of features (attributes) and classes, complexity of class regions, noise level, number of samples (instances), and some other data parameters. The original version of this generator was proposed by Ernst Kussul in 1992, implemented in 1993–1994 together with Tatyana Baidyk and Vladimir Lukovich, and used for benchmarking of neural classifiers developed by us (Kussul et al., 1994, 1993). However, the generator itself was not described in those references, so we decided to elaborate it more comprehensively and to present it in this paper.
2. Idea of the dataset generator
The basic idea of the proposed dataset generator is as follows. A dataset is built in two stages. At the first stage, the feature space is partitioned into regions corresponding to different classes. At the second stage, data instances with class labels are generated in accordance with the partitioning of the first stage. The partitioning of the feature space is done by a built-in (internal) classifier. The internal classifier can be of any type that allows building decision regions with the desired characteristics. To generate complex non-linear, disconnected, and concave regions, well-known universal classifiers can be used, e.g., the potential function classifier (PFC) or the nearest neighbor classifier (NNC) (Duda and Hart, 1973). The input parameters controlling the desired characteristics of the data to be generated are user-specified parameters that also define the structure of the internal classifier. They include the number of input features (attributes) A, the number of classes C, and the parameter R controlling the complexity of class regions. The classifier parameters that define a specific realization of the partitioning of the feature space into class regions are taken from a pseudo-random number generator. At the second stage, the data instances are generated pseudo-randomly with controlled distribution parameters and are classified by the built-in classifier, which provides the class labels.
3. Generation of class regions
3.1. Construction of the built-in classifier
For convenience, let us choose the nearest neighbor classifier with direct storage as the built-in classifier. Direct storage means that all the samples of the training set are used as the reference points for calculating the distance to a pattern to be classified. The training set is generated as follows. For one of the classes, choose R pseudo-random A-dimensional points in the feature space. Write these points with the class label into the training set. Choose R new random points for another class and add them to the training set. Repeat this for all C classes. As a result we get C*R labeled points in the A-dimensional feature space. These points are used for training the internal classifier; in our case, they are stored directly as the table of reference points of the internal classifier. Alternatively, these points could be used for training any internal classifier using some version of the training algorithm for that particular classifier. Thus the class regions in the feature space are formed by specifying the built-in classifier algorithm, generating the training set for it, and training the built-in classifier on this set.
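The construction above can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the listing of Fig. 7: the names RefTable, unit_rand, make_ref_points, and nn_classify and the fixed sizes are hypothetical, and the standard rand() stands in for the paper's rnd().

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sizes, not taken from the paper's listing. */
#define N_FEAT 2   /* A: number of features */
#define N_CLS  4   /* C: number of classes */
#define N_REF  4   /* R: reference points per class */

typedef struct {
    double pt[N_CLS * N_REF][N_FEAT];  /* stored reference points */
    int    cls[N_CLS * N_REF];         /* class label of each point */
} RefTable;

/* Uniform pseudo-random value in [0, 1]; rand() stands in for rnd(). */
static double unit_rand(void)
{
    return (double)rand() / (double)RAND_MAX;
}

/* Draw R random points per class and store them with their labels. */
static void make_ref_points(RefTable *t, unsigned seed)
{
    int c, r, a;
    assert(t != NULL);
    srand(seed);
    for (c = 0; c < N_CLS; c++)
        for (r = 0; r < N_REF; r++) {
            int i = c * N_REF + r;
            for (a = 0; a < N_FEAT; a++)
                t->pt[i][a] = unit_rand();
            t->cls[i] = c;
        }
}

/* Nearest neighbor rule: return the label of the closest stored point. */
static int nn_classify(const RefTable *t, const double *x)
{
    int i, a, best = 0;
    double best_d = -1.0;
    for (i = 0; i < N_CLS * N_REF; i++) {
        double d = 0.0;
        for (a = 0; a < N_FEAT; a++) {
            double diff = x[a] - t->pt[i][a];
            d += diff * diff;  /* squared distance preserves the ranking */
        }
        if (best_d < 0.0 || d < best_d) {
            best_d = d;
            best = i;
        }
    }
    return t->cls[best];
}
```

A caller would fill a RefTable once with make_ref_points(&t, seed) and then label any point x of N_FEAT values with nn_classify(&t, x). A different seed gives a different realization of the class regions for the same A, C, and R.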
Fig. 1. Example of generated class regions of various complexity: (a) A = 2, C = 4, R = 1; (b) A = 2, C = 4, R = 16. Reference points of the training set are shown for each class.
Fig. 3. Example of generated class regions at different random realizations of reference points for the same input parameters: (a,b) A = 2, C = 4, R = 4.
3.2. The shape of generated class regions
The number R of reference points per class determines the maximum number of non-connected regions occupied by that class. In practice, the number of non-connected regions per class is less than R because the regions around nearby reference points of the same class tend to merge. The complexity parameter R also influences the shape of the generated class regions. The greater R*C is, the greater the number of local peculiarities of class boundaries (e.g., bends, convex and concave segments). Examples of class regions generated by the built-in classifier at various R are shown in Fig. 1. The number of generated non-connected regions depends essentially on the dimensionality of the feature space. For given values of R and C, increasing the number of dimensions A decreases the number of non-connected class regions and the complexity of class boundary shapes, since the regions can be connected through more dimensions. To get approximately the same complexity in cross-sections of class regions, the number of reference points should increase exponentially with A (see Fig. 2). Since the reference points for the internal classifier are generated at random, various realizations of class regions are possible for the same input parameters (Fig. 3).
Fig. 2. Example of generated class regions at various dimensionality of the feature space: (a) A = 2, C = 4, R = 4; (b) A = 8, C = 4, R = 4; (c) A = 8, C = 4, R = 512. 2D cross-sections are shown for (b) and (c).
4. Generation of data instances
To get the labeled data samples at the output of DataGen, we generate them and then classify them by the internal classifier. The distribution of generated data samples in the feature space may be chosen in various ways. We will consider two random distributions.
4.1. Uniform random distribution
Let us generate random points uniformly distributed in the feature space. Each point is input to the trained built-in classifier, which determines the class of the point. This procedure continues until we get the required number of samples S per class (Fig. 4). We then consider these data as the dataset generated by our data generator. This set can then be subdivided into training and test subsets to be used for benchmarking of classifiers.
Fig. 4. Example of different numbers of randomly distributed samples generated for the same class regions. A = 2, C = 4, R = 4. (a) S = 10; (b) S = 50.
4.2. Normal random distribution
As an example of a non-uniform random distribution, let us consider the Gaussian distribution. Let us place the centroids of the Gaussians at the reference points of the built-in classifier. The reference points can then be regarded as the class prototypes, and the data points normally distributed around them as their instances. The standard deviation is an input parameter. If a generated point falls into a class region different from that of its center, or falls outside the feature space, it is discarded. Examples of generated samples are shown in Fig. 5. When the number of reference points or the dimensionality of the feature space is varied, one may wish to change the value of the standard deviation. The number of generated samples that were discarded because they fell into different class regions may be used as feedback for such a change.
Fig. 5. Example of generated samples normally distributed around the reference points. A = 2, C = 4, R = 4, S = 200, different standard deviation values: (a) 0.09; (b) 0.03.
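Both sampling schemes of Section 4 can be sketched as follows. This is an illustration under stated assumptions rather than the code of Fig. 7: all function names are hypothetical, rand() stands in for rnd(), and the normal deviate is approximated by a sum of 12 uniform values, which is our substitute for norm() chosen to avoid the math library.

```c
#include <assert.h>
#include <stdlib.h>

#define N_FEAT 2  /* A: number of features (illustrative value) */

/* Uniform pseudo-random value in [0, 1); rand() stands in for rnd(). */
static double unit_rand(void)
{
    return (double)rand() / ((double)RAND_MAX + 1.0);
}

/* Approximate N(0, 1) deviate: sum of 12 uniforms minus 6.
   This is a stand-in for norm(), not the method of the listing. */
static double approx_normal(void)
{
    double s = 0.0;
    int i;
    for (i = 0; i < 12; i++)
        s += unit_rand();
    return s - 6.0;
}

/* Label of the reference point nearest to x; ref is a flat
   n_ref x N_FEAT array. */
static int nn_classify(const double *ref, const int *cls, int n_ref,
                       const double *x)
{
    int i, a, best = 0;
    double best_d = -1.0;
    for (i = 0; i < n_ref; i++) {
        double d = 0.0;
        for (a = 0; a < N_FEAT; a++) {
            double diff = x[a] - ref[i * N_FEAT + a];
            d += diff * diff;
        }
        if (best_d < 0.0 || d < best_d) { best_d = d; best = i; }
    }
    return cls[best];
}

/* Section 4.1: draw uniform points until each of n_cls classes has s
   samples. count[c] receives the samples kept for class c. Returns the
   number of draws, or -1 if max_draws was reached first. */
static long gen_uniform(const double *ref, const int *cls, int n_ref,
                        int n_cls, int s, int *count, long max_draws)
{
    double x[N_FEAT];
    long draws = 0;
    int a, c, filled = 0;
    assert(s > 0 && n_cls > 0);
    for (c = 0; c < n_cls; c++)
        count[c] = 0;
    while (filled < n_cls) {
        if (draws >= max_draws)
            return -1;
        for (a = 0; a < N_FEAT; a++)
            x[a] = unit_rand();
        c = nn_classify(ref, cls, n_ref, x);
        draws++;
        /* A full generator would store (x, c) here while count[c] < s. */
        if (count[c] < s && ++count[c] == s)
            filled++;
    }
    return draws;
}

/* Section 4.2: draw one sample around reference point k with standard
   deviation sd. Returns 1 if kept, 0 if discarded (outside [0, 1] or
   inside the region of another class). */
static int gen_normal_sample(const double *ref, const int *cls, int n_ref,
                             int k, double sd, double *out)
{
    int a;
    for (a = 0; a < N_FEAT; a++) {
        out[a] = ref[k * N_FEAT + a] + sd * approx_normal();
        if (out[a] < 0.0 || out[a] > 1.0)
            return 0;
    }
    return nn_classify(ref, cls, n_ref, out) == cls[k];
}

/* Fraction of discarded draws over n_try attempts, cycling through the
   reference points; usable as feedback when tuning sd. */
static double discard_ratio(const double *ref, const int *cls, int n_ref,
                            double sd, int n_try)
{
    double x[N_FEAT];
    int i, discarded = 0;
    assert(n_ref > 0 && n_try > 0);
    for (i = 0; i < n_try; i++)
        if (!gen_normal_sample(ref, cls, n_ref, i % n_ref, sd, x))
            discarded++;
    return (double)discarded / (double)n_try;
}
```

The max_draws cap in gen_uniform is our addition: a class whose region is very small may take a long time to reach S samples under uniform sampling. A discard ratio that grows too large suggests lowering the standard deviation, as the text proposes.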
5. The noise
To simulate the "measurement" noise, a random value with Gaussian distribution, zero mean, and user-specified standard deviation is added to each feature value. The number of samples that are classified as belonging to a different class after adding noise may be used as feedback for the choice of the standard deviation. If the resulting feature value falls outside the feature range, the noise component is regenerated. Examples are shown in Fig. 6.
Fig. 6. Example of uniformly generated data instances with Gaussian noise added to feature values. A = 2, C = 4, R = 4, S = 500, the standard deviation value of noise: (a) 0.09; (b) 0.03. Only the samples falling into different classes after adding noise are shown.
Fig. 7. The C listing of the basic DataGen version.
The noise due to "typing" errors in a feature value or class name during the input process can be simulated by randomly replacing the correct values with random values at a specified (low) probability.
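The two kinds of noise can be sketched as follows. This is an illustrative sketch, not the add_noise( ) and ch_val( ) functions of the listing: the function names and signatures are hypothetical, rand() stands in for rnd(), and the normal deviate is again approximated by a sum of 12 uniform values.

```c
#include <assert.h>
#include <stdlib.h>

/* Uniform pseudo-random value in [0, 1); rand() stands in for rnd(). */
static double unit_rand(void)
{
    return (double)rand() / ((double)RAND_MAX + 1.0);
}

/* Approximate N(0, 1) deviate: sum of 12 uniforms minus 6 (our stand-in
   for norm(), chosen to avoid the math library). */
static double approx_normal(void)
{
    double s = 0.0;
    int i;
    for (i = 0; i < 12; i++)
        s += unit_rand();
    return s - 6.0;
}

/* Measurement noise: add zero-mean noise with standard deviation sd to
   each of the n features, redrawing the noise component until the value
   stays inside [0, 1]. Assumes every x[a] starts inside [0, 1]. */
static void add_measurement_noise(double *x, int n, double sd)
{
    int a;
    assert(sd >= 0.0);
    for (a = 0; a < n; a++) {
        double v;
        do {
            v = x[a] + sd * approx_normal();
        } while (v < 0.0 || v > 1.0);
        x[a] = v;
    }
}

/* "Typing" errors: with probability p_attr replace a feature by a random
   value, and with probability p_cls replace the label by a random class
   (which may coincide with the correct one). Returns the label. */
static int add_typing_errors(double *x, int n, int label, int n_cls,
                             double p_attr, double p_cls)
{
    int a;
    assert(n_cls > 0);
    for (a = 0; a < n; a++)
        if (unit_rand() < p_attr)
            x[a] = unit_rand();
    if (unit_rand() < p_cls)
        label = rand() % n_cls;
    return label;
}
```

Reclassifying each sample with the internal classifier after add_measurement_noise and counting the label changes would give the feedback quantity mentioned in the text.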
6. The implementation
The algorithm of DataGen has been implemented in the C programming language. The listing of its simplified basic version is shown in Fig. 7. An attempt has been made to make the code as portable, short, and simple as possible. For this purpose only the ANSI C language and libraries have been used. The feature values are generated on the interval [0.0, 1.0] with 2^16 different gradations. The function rnd( ) generates random numbers on the interval [0, 2^16]. Different argument arrays of rnd( ) provide independent sources of random numbers for the various needs of the program. The function norm( ) generates random numbers with Gaussian distribution, zero mean, and unit standard deviation. The function gen_refp( ) generates random points that are used as the training set or reference points for the built-in classifier. The nearest neighbor classifier NNC( ) is used as the built-in classifier in this version of DataGen. The function gen_norm( ) is used to generate labeled data samples normally distributed around the reference points with the user-specified standard deviation SD_NRM. The number of generated samples discarded because they fall out of their class regions is counted for each class in the disc[ ] array. The ratio rat_dis of the number of discarded samples to the total number of generated samples may be used to adjust SD_NRM. The function gen_rand( ) generates and labels uniformly distributed random data samples. The choice of normally or uniformly distributed data samples is defined by NRM. The function add_noise( ) adds random numbers with normal distribution, zero mean, and user-specified standard deviation SD_NOIS to the sample attributes, thus simulating the measurement noise. The number of samples of each class that entered another class region after adding measurement noise is counted in misc[ ], and their ratio rat_mis to the total number of samples S*C can be used to adjust SD_NOIS.
The function ch_val( ) randomly changes both the attribute values, with the probability defined by E_A, and the class labels, with the probability defined by E_C. These changes simulate erroneously entered attribute values and teacher errors, respectively. This function also writes the data samples obtained to the file. Two data files named train.nnn and test.nnn are produced for the same class regions. As suggested by their names, these files may be used for training and testing of classifiers. Different nnn correspond to different realizations of class boundaries for the same user-specified input parameters. The number of file pairs is defined by NoF. The function FP_STAT writes some of the statistics mentioned above at the end of each data file. We intend to place the DataGen code (as well as its possible further versions) in the UCI Repository (Merz and Murphy, 1996).
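One way to obtain independent random sources and 2^16 gradations of feature values is sketched below. It is a guess at the general approach, not a reproduction of rnd( ): the Stream type and both function names are hypothetical, and the generator is a linear congruential one using the multiplier and increment of the sample rand() given in the C standard.

```c
#include <assert.h>
#include <stddef.h>

/* One generator state per purpose, so that, for example, reference
   points and noise draw from separate sequences (hypothetical layout). */
typedef struct {
    unsigned long state;
} Stream;

/* Next 16-bit value in 0..65535 from a linear congruential generator.
   The state is kept to 32 bits so that results do not depend on the
   width of unsigned long. */
static unsigned long rnd16(Stream *s)
{
    assert(s != NULL);
    s->state = (s->state * 1103515245UL + 12345UL) & 0xFFFFFFFFUL;
    return (s->state >> 16) & 0xFFFFUL;
}

/* Map a 16-bit value to one of 2^16 feature values in [0.0, 1.0]. */
static double feature_value(Stream *s)
{
    return (double)rnd16(s) / 65535.0;
}
```

Giving each program demand its own Stream means that changing, say, the number of noise draws does not alter the sequence of reference points, so the same class regions can be reproduced under different noise settings.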
7. Past usage
Previous versions of this dataset generator have been developed and used for comparison of the potential function, the nearest neighbor, and the back propagation classifiers against our neural network classification algorithms (Kussul et al., 1994, 1993). We have generated datasets with varied complexity (up to 4), number of classes (up to 16), number of features (up to 16), and number of samples in the training set (from 125 to 16,000). This allowed us to carry out a rather thorough comparison and to conclude that for large training sets our neural network classifiers approached the accuracy of the best (potential function) classifier, and that in the speed of classification and especially of training they outperformed all evaluated classifiers. We were not able to obtain such results using the available artificial test tasks because of their simplicity.
8. Related work
Let us consider two known generators capable of producing rather complicated datasets: the Second Data Generation Program (DGP/2) (Benedict, 1990; Merz and Murphy, 1996), and the Synthetic Classification Data Sets (SCDS) (Melli, 1997).
DGP/2 generates data distributed as a superposition of multidimensional Gaussians with randomly chosen centroids. Only two classes are available: one lies inside regions centered at the centers of the Gaussians, and the other contains the remaining instances. Therefore the density of the first class samples is much higher than that of the second class. The complexity of class regions generated by DGP/2 is comparable with the special case of DataGen for two classes, normally distributed instances, and no noise. SCDS generates datasets based on a high-level symbolic model representation. It automatically generates classification rules that define classes. Each rule consists of conjunctions and disjunctions of certain attribute values and generates one class. Some real-world characteristics such as irrelevant or missing attributes, noisy or missing attribute values, and nominal data types are supported. SCDS is most appropriate for testing rule-based systems. It has difficulties in dealing with continuous features and generates only axis-parallel class boundaries.
9. Possible modifications of DataGen
To extend the capabilities of this dataset generator, various modifications of the described basic version are possible. Some of the modifications are straightforward and easy to make and were not included in the program listing of Fig. 7 solely because of limited space. They include:
• Missing features. Can be implemented by ignoring some of the features of the generated dataset.
• Irrelevant features. Irrelevant features are generated at random and are not used by the internal classifier for labeling of the samples.
• Missing values. Introduced by ignoring some values of some features.
• Different complexity of class regions for different classes. Introduced by varying the number of reference points R for different classes.
• Different number of patterns per class. The S parameter should be specified for each class.
• Binary features. To convert a random number to a binary feature with a given probability of 1, compare it with a threshold proportional to the probability of 1.
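Three of the straightforward modifications listed above might be sketched as follows. The function names, the MISSING marker, and the use of rand() are all hypothetical choices made for illustration; none of them comes from the DataGen listing.

```c
#include <assert.h>
#include <stdlib.h>

#define MISSING (-1.0)  /* hypothetical marker for a missing value */

/* Uniform pseudo-random value in [0, 1); rand() stands in for rnd(). */
static double unit_rand(void)
{
    return (double)rand() / ((double)RAND_MAX + 1.0);
}

/* Binary feature: returns 1 when the uniform value u in [0, 1) falls
   below the threshold p1, which happens with probability p1. */
static int binary_feature(double u, double p1)
{
    assert(p1 >= 0.0 && p1 <= 1.0);
    return u < p1;
}

/* Missing values: blank each of the n features with probability p_miss.
   Returns the number of blanked values. */
static int drop_values(double *x, int n, double p_miss)
{
    int a, dropped = 0;
    for (a = 0; a < n; a++)
        if (unit_rand() < p_miss) {
            x[a] = MISSING;
            dropped++;
        }
    return dropped;
}

/* Irrelevant features: copy the n labeled features and append n_irr
   random values that the internal classifier never sees. The array out
   must hold n + n_irr entries. */
static void append_irrelevant(const double *x, int n, int n_irr, double *out)
{
    int a;
    for (a = 0; a < n; a++)
        out[a] = x[a];
    for (a = 0; a < n_irr; a++)
        out[n + a] = unit_rand();
}
```

In each case the sample is labeled by the internal classifier first and modified afterwards, so the class regions themselves are unaffected.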
The following modifications may require further exploration:
• Nominal features. Will require changes in the Euclidean metric of the built-in classifier or in the feature representation.
• Dependent features. The types of dependencies require further discussion.
• Regression datasets. Gradual functions of the input features should be specified.
• Sequences. Dependency of the function values on its previous values should be introduced.
• Unsupervised learning. For a non-uniform (e.g., normal) distribution, unsupervised learning can be used to approximate its density.
10. Conclusions
The dataset generator described in this paper allows generation of datasets in which the following characteristics can be varied: the number of input features and output classes, the complexity and realizations of class regions, the data sample distributions, the noise level, and the number of data samples. We hope that DataGen will be useful for comparing the accuracy and speed of supervised classification algorithms. We also hope that feedback from those who find DataGen useful will help us to overcome its drawbacks and to evaluate which of its possible modifications are worthwhile.
Acknowledgements
This work was supported in part by the International Science Foundation Grants U4M000 and U4M200, as well as by the INTAS Project 93-0560. The authors would like to thank Yuri Ostapchuk for compiling DataGen on a number of platforms, Fred Runyan and Tanya Olar for their comments, and anonymous referees for their valuable suggestions.
References
Benedict, P.A., 1990. The use of synthetic data in dynamic bias selection. In: Proc. 6th Aerospace Applications of Artificial Intelligence Conference, Dayton, OH, October.
Dietterich, T.G., 1997. Statistical tests for comparing supervised classification learning algorithms. Submitted for publication.
Duda, R.O., Hart, P.E., 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Fahlman, S.E., White, M., 1993. CMU neural network learning benchmark database [http://www.cs.cmu.edu/afs/cs/project/connect/bench/INDEX]. Carnegie Mellon University, School of Computer Science, Pittsburgh, PA.
Flexer, A., 1996. Statistical evaluation of neural network experiments: minimum requirements and current practice. In: Trappl, R. (Ed.), Proc. 13th European Meeting in Cybernetics and Systems Research, Austrian Society for Cybernetic Studies, Vienna, pp. 1005–1008.
Kussul, E.M., Baidyk, T.N., Lukovich, V.V., Rachkovskij, D.A., 1993. Adaptive neural network classifier with multifloat input coding. In: Proc. 6th Intl. Conf. NeuroNimes'93, Nimes, Oct. 25–29.
Kussul, E.M., Baidyk, T.N., Lukovich, V.V., Rachkovskij, D.A., 1994. Adaptive high performance classifier based on random threshold neurons. In: Trappl, R. (Ed.), Cybernetics and Systems'94. World Scientific Publishing, Singapore, pp. 1687–1695.
Melli, G., 1997. SCDS – A Synthetic Classification Data Set generator [http://fas.sfu.ca/cs/people/GradStudents/melli/SCDS/]. Simon Fraser University, School of Computing Science, Burnaby, BC.
Merz, C.J., Murphy, P.M., 1996. UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MlRepository.html]. University of California, Department of Information and Computer Science, Irvine, CA.
Murphy, P.M., Aha, D.W., 1995. UCI Repository of machine learning databases [ftp://ics.uci.edu/pub/machine-learning-databases]. University of California, Department of Information and Computer Science, Irvine, CA.
Prechelt, L., 1994. PROBEN1 – a set of benchmarks and benchmarking rules for neural network training algorithms. Technical Report 21/94. Universität Karlsruhe, Fakultät für Informatik.
Prechelt, L., 1996. A quantitative study of experimental evaluations of neural network learning algorithms: current research practice. Neural Networks 9 (3), 457–462.
Rumelhart, D., McClelland, J. (Eds.), 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA.