INFORMATION SCIENCES 10, 1-16 (1976)
Artificial Data for Pattern Recognition

B. G. BATCHELOR
Department of Electronics, University of Southampton, Southampton S09 5NH, England

Communicated by A. M. Andrew
ABSTRACT

The paper describes the generation of three types of artificial data and their use as test material in pattern recognition research.

Type A data: The user defines the perfect decision surface. The classes are separable and the pdf's flat. This type is useful in two ways: (i) to investigate whether a learning procedure can achieve a minimal-cost solution; (ii) to compare the powers of two classifiers.

Type B data: The user defines the optimal decision surface. The classes are not separable; the degree of overlap between the classes can be controlled by the user. The pdf's are approximately flat, except in regions close to this optimal decision boundary. This type is useful in the following ways: (i) to study the effect upon a learning procedure of varying the overlap between classes; (ii) to compare the powers of two classifiers on a random problem.

Type C data: This type is a model of natural, clustered data. The user specifies the location, height, and spread of a number of "hills" in the pdf (for each class). These parameters allow us to calculate the pdf's, and hence the Bayes' classification, at any given point. This provides a powerful tool for the objective evaluation of a learning classifier operating on a realistic problem.
1. INTRODUCTION

Artificial, or synthetic, data has been widely used for testing and evaluating procedures in pattern recognition and related areas of data analysis. There are several obvious advantages resulting from the use of artificial rather than natural data: (a) It is usually cheaper to generate artificial data than to collect natural data. (b) The experimenter has complete control over the structure of the data. (c) The run-store requirements of a software data generator are small. (This allows us to perform useful research on computers which have a small main store.) To date, the use of artificial data has almost invariably been accompanied by
informal (subjective) methods of assessing the performance of the procedure under investigation. This is often based on the visual inspection of an artificial measurement space; the experimenter places a number of points in a two-dimensional space and demonstrates how his learning procedure responds, by drawing decision surfaces before and after learning. This paper discusses how artificial data may be used to form more objective criteria than this. As we shall see, there are many ways in which artificial data can be used with good effect. We shall concentrate upon the generation of artificial data for use in the study of learning procedures for pattern recognition. We shall generate data for use in a 2-class discrimination problem. Multiclass data can be produced by a straightforward extension of the techniques described below. We can generate data for use in other, related subjects by simply ignoring the fact that our artificial data generators produce a signal to indicate the correct classification of each pattern. In order to generate artificial data, we must use a number of good-quality random, or pseudorandom, number sources [1, 2]. In the appendix we list the desired properties of these random number sources. ALGOL-like notation is used in the flow charts. The following Glossary lists those variables having a global usage. Locally defined variables are not listed for the sake of brevity.
GLOSSARY
(Types: Real, R; Integer, I; Vector, V. For a vector, the range quoted is that of its elements. The index j, where present, runs over clusters, with j in [1, NCT].)

C_j: I; range ±1; index j in [1, NCT]. Indicates class membership of cluster No. j.

G_j: V; elements any finite real number; q elements; index j in [1, NCT]. Centre of the jth mode (cluster) used in generating Type C data; G_j = (G_{j1}, ..., G_{jq}).

H_j: R; range > 0; index j in [1, NCT]. Spread of cluster No. j.

n: R; range (-∞, +∞). Random variate, having normal distribution; mean = 0, standard deviation = σ.

NC1, NC2: I; range ≥ 1. NC1 is the No. of clusters in class 1; similarly for NC2. NCT = NC1 + NC2.

P_j: R; range [0, 1]; index j in [1, NCT]. Probability that a given pattern is from cluster No. j.

PDF1, PDF2: R; range ≥ 0. Pdf of class 1 (class 2) calculated at a given point Y, using known values for G_j, H_j, P_j, C_j.

q: I; range 2, 3, .... No. of elements of X.

S(X): R; range (-∞, +∞). See Equation (1).

T, T(X), T(X, n): I; range ±1. Artificial teacher decisions.

u: R; range [0, 1]. Random, uniformly distributed in [0, 1].

V, W: V; elements in [0, 1]; q elements. Random vectors, uniformly distributed; V = (V_1, ..., V_q), W = (W_1, ..., W_q).

X: V; elements any finite real number; q elements. Artificial pattern vector; X = (X_1, ..., X_q).
2. GENERATION OF TYPE A DATA

Figure 1(a) shows a scattergram for Type A data, together with the decision surface of the corresponding artificial teacher. Figures 1(b) and 1(c) show sections of the probability density functions (pdf's), taken along the line AB. Notice the sharp transitions from one class to another (points C, D, and E) and that the distributions are otherwise perfectly flat. These are the two characteristics which identify Type A data. The method of generating Type A data is quite straightforward (Fig. 2): (a) The artificial pattern vector, X, is generated by equating it with a random vector (V_1, ..., V_q). The desired properties of the V_i are listed in the glossary. (b) We choose a classifier which is to simulate the teacher. (The type of classifier selected for this purpose is discussed in the following section.) The vector X is presented to this classifier, whose output defines the artificial teacher decision, T(X). Notice that T(X) is uniquely defined by X. It is worthwhile pointing out that Type A data is a special case of Type B.
Fig. 1. Type A data. (a) Artificial measurement space. Circles and crosses denote samples from the two classes. The broken square indicates the range of allowed points. (b) Pdf for class 1, along the line AB. (c) Pdf for class 2, along the line AB.
Fig. 2. Method of generating Type A data. V_1, ..., V_q have uniform pdf's.
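To make the procedure concrete, a minimal sketch in Python follows. The linear form of the teacher, its weights, and the sample size are all assumptions made for illustration; the paper allows any classifier to play the teacher's role.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher(x, w, b):
    """Simulated teacher: a fixed classifier whose output defines T(X).
    A linear classifier is assumed here purely for illustration."""
    return 1 if np.dot(w, x) + b >= 0 else -1

def generate_type_a(n_samples, q, w, b):
    """Type A data: X is uniform (flat pdf) over [0, 1]^q, and the teacher's
    decision surface splits the space into perfectly separable classes."""
    X = rng.random((n_samples, q))
    T = np.array([teacher(x, w, b) for x in X])
    return X, T

# Example: a 2-dimensional problem with a hypothetical teacher surface.
X, T = generate_type_a(500, q=2, w=np.array([1.0, -1.0]), b=0.0)
```

Since T(X) is uniquely defined by X, the generated classes cannot overlap, which is the defining property of Type A data.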
3. USES OF TYPE A DATA

(a) Type A data is useful because it allows us to define the optimal decision surface in terms of any classifier that we may choose. Suppose that we are developing learning rules for the nearest neighbor classifier (NNC). We should like to know whether a given learning procedure is likely to achieve an optimal solution, using a minimum amount of storage. Using an NNC to simulate the teacher, in a Type A generator, we can define what the optimal solution is. Moreover, we know exactly how many locates are required to achieve this solution.* During learning we should "hide" the parameters of the artificial teacher from the learning NNC. However, there is no reason why we should not compare the parameters of the two classifiers.† This method of defining an ideal solution has been useful in studies upon a number of classifiers [3-7].

(b) Suppose that we possess two classifiers, C1 and C2, each of which has an effective learning rule. We can use Type A data to test the hypothesis that C1 is more/less powerful than C2. To do this we use classifier C2 to simulate the teacher and C1 in the learning role. If C1 is more powerful than C2, we shall find that this learning experiment results in a low error rate. On the other hand, when the roles of C1 and C2 are reversed, we should find a relatively large error rate. This technique was of great value to the author in helping him to verify that the nearest neighbor and potential function classifiers are similar in performance [5, 8, 9].

* This number may be smaller than the number of locates in the artificial teacher, which may contain redundant locates [3].
† This is not possible using an NNC, because the same decision surface can be produced by different configurations of locates [3].

4. GENERATION OF TYPE B DATA

Figure 3(a) shows a scattergram of Type B data. This diagram also shows the optimal decision surface. It is not possible to draw the decision surface of the artificial teacher in this space, because the classes are not separable in Type B data. Figures 3(b) and 3(c) show sections of the pdf's along the line AB. Notice that the transitions between classes are not sharply defined, as they are in Type A data (compare Figs. 1 and 3). Notice that away from the edges of the classes, the pdf's for Type B data are nearly flat. Type B data is generated as follows (see Fig. 4): (a) The artificial pattern vector, X, is generated in just the same way as for Type A data. (b) The artificial teacher is again simulated by a classifier. However, a random
term (n in Fig. 4) is added to the classifier just before its final threshold. The effect of n is sometimes to reverse the decision of the classifier. The probability that the decision is reversed by n is a function of: (i) σ, the standard deviation of n. (ii) The "distance" of X from the optimal decision surface. [This "distance" is measured by |S(X)| in Fig. 4.] Two special cases are of interest: (i) σ = 0; we obtain Type A data. (ii) σ >> |S(X)| for all X; T(X, n) is independent of X. (The artificial teacher assigns points to their classes at random.) For values between these two extremes, σ controls the extent of the class overlap. Increasing σ increases the overlap.
Fig. 3. Type B data. Comments for Fig. 1 apply here. Those points which would be incorrectly classified using the optimal decision surface are encircled. (a) Artificial measurement space. (b) Pdf for class 1, along the line AB. (c) Pdf for class 2, along the line AB.
Fig. 4. Method of generating Type B data. Without the input from n, the classifier would produce the optimal decision surface.
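Continuing the sketch above (same assumed linear teacher), the only change needed for Type B is the Gaussian term n added to the classifier's output S(X) before the threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_type_b(n_samples, q, w, b, sigma):
    """Type B data: Gaussian noise n (mean 0, standard deviation sigma) is
    added to S(X) before thresholding, so points close to the optimal
    decision surface are sometimes given the 'wrong' class."""
    X = rng.random((n_samples, q))
    S = X @ w + b                        # classifier output before threshold
    n = rng.normal(0.0, sigma, n_samples)
    T = np.where(S + n >= 0, 1, -1)      # T(X, n)
    return X, T

# sigma = 0 recovers Type A; sigma >> |S(X)| gives purely random labels.
X, T = generate_type_b(500, q=2, w=np.array([1.0, -1.0]), b=0.0, sigma=0.1)
```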
5. USES OF TYPE B DATA

(a) Type B data is useful in studying the effects of varying the level of class overlap upon the stability of a learning procedure. An experiment, similar to that described in Section 3, paragraph (a), allows us to do this quite simply. We perform a number of such experiments, which are identical except that we use different values of σ in our data generator. We then compare the classifiers obtained by learning with the optimal classifier (Fig. 5).

(b) In this paragraph we describe a comparative test for two, or more, learning procedures. We begin by generating a small set of data, in which the points have been assigned to their respective classes at random [Section 4, special case (ii)]. This data is then used to train the classifiers that we wish to compare. After learning is complete, we test these classifiers on the same data. This will yield two error rates for each classifier (one for each of the two pattern classes). These error rates form the basis for our choosing the "best" classifier. This procedure tests the ability of the classifiers to cope with a random problem. Since small data sets are used, we cannot expect them to represent their parent populations adequately. In particular, the distribution of points in the artificial measurement space will not appear to be uniform; we shall see clusters containing just a few points. This clustering arises as a result of our using a small sample. This experiment uses even more sparsely populated spaces than we should normally expect to find in a natural problem.

Fig. 5. Illustrating the use of Type B data for comparing the effects of class overlap upon the performance of two learning procedures. In this example, a linear classifier was trained to model an artificial teacher, which was also based on a linear classifier. Full details are given in [4].
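A sketch of the test in paragraph (b). The least-squares linear classifier is an arbitrary stand-in for the learning procedures being compared, and the sample size is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(2)

# A small set of points labelled at random [Section 4, special case (ii)].
X = rng.random((20, 2))
T = rng.choice([-1, 1], size=20)

# Train a least-squares linear classifier on the random problem ...
A = np.hstack([X, np.ones((len(X), 1))])        # augmented patterns
w, *_ = np.linalg.lstsq(A, T.astype(float), rcond=None)

# ... then test it on the same data, one error rate per pattern class.
pred = np.where(A @ w >= 0, 1, -1)
for cls in (1, -1):
    mask = T == cls
    print("class", cls, "error rate:", np.mean(pred[mask] != cls))
```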
6. CLUSTERS

The literature on statistical pattern recognition contains many references to the idea of clustering, and numerous examples are quoted. A cluster may be pictured, in 3 dimensions, as being like a galaxy, or a swarm of bees frozen in flight. No particular shape is envisaged for a cluster; it might be spherically symmetrical,* elongated into an ellipsoid (possibly curved), or have an irregular shape. A cluster is a collection of points associated with a "hill" in the probability density function. A class may contain several clusters; the pdf of such a class has several hills. A cluster does not usually have a well-defined boundary. Furthermore, it is often difficult, even in 2 dimensions, to detect the difference between an elongated cluster and two, or more, overlapping clusters. It seems reasonable to assume that clustering is a frequently occurring phenomenon in natural data, in view of the numerous examples known. We feel justified, therefore, in taking clustering as an established fact, and designing artificial data generators to model natural, clustered data.
* We shall use the names of familiar 3-dimensional objects rather than their multidimensional counterparts.
We have already discussed some of the possible shapes for clusters. However, it is probably most convenient to generate artificial data which contains only spherical clusters. This is not unduly restrictive, since we can always approximate a complex shape by compounding several spherical clusters together. Spherical clusters are simple enough for us to retain an intuitive feel for the situation we are contriving, even when we are working with multidimensional data. We cannot do this nearly so easily with more complex shapes. For purposes of illustration, we shall describe the generation of Type C data using Gaussian clusters.

7. GENERATION OF TYPE C DATA
Fig. 6. Summary of Type C data. (a) Artificial measurement space; the contours shown are those of the components of the pdf, which is a mixture of Gaussian distributions. (b) Pdf for class 1, along the line AB. (c) Pdf for class 2, along the line AB. (d) Bayes' optimal decision is obtained by comparing the pdf's; the Bayes' decision corresponds to the class giving the larger pdf.
Figure 6 summarises the properties of Type C data, while Fig. 7 presents the flow chart of an algorithm for generating data of this type. The procedure of Fig. 7 runs as follows. For each element of X, generate V_i and W_i and set (Box A)

X_i := sqrt(-2 log_e V_i) × sin(2π W_i),

giving a temporary value of X_i. Next, generate u and use it as a random cluster selector: the range of u is [0, 1], and cluster i is selected when A_{i-1} < u ≤ A_i, where A_i = Σ_{j=1}^{i} P_j and A_0 = 0. Finally, set T := C_i (C_i is the class membership, ±1, of cluster No. i) and X := H_i X + G_i (a vector operation: the spherical cluster is scaled by H_i and translated to its centre G_i).

Fig. 7. Method of generating Type C data: the flow chart yields an artificial pattern vector X and an artificial teacher decision T.
This procedure yields: (a) NC1 clusters in class 1 (T = +1). (b) NC2 clusters in class 2 (T = -1). The jth cluster has the following properties (j = 1, ..., NCT, where NCT = NC1 + NC2):
(c) G_j determines its center. (d) H_j determines its size. (e) P_j determines the probability that a point from it will be selected. (f) C_j indicates to which class its members belong.

Box A in Fig. 7 generates a single Gaussian variate from two uniform variates, using the Box-Muller method [1]. To generate clusters with non-Gaussian distributions, we simply change Box A.
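A minimal Python sketch of the Fig. 7 procedure, assuming spherical Gaussian clusters; parameter names follow the Glossary, and the Box-Muller step corresponds to Box A:

```python
import numpy as np

def generate_type_c(G, H, P, C, rng):
    """One Type C sample (Fig. 7). G: (NCT, q) cluster centres; H: (NCT,)
    spreads; P: (NCT,) selection probabilities summing to 1; C: (NCT,)
    class labels, each +1 or -1. Returns (X, T)."""
    q = G.shape[1]
    # Box A: Box-Muller transform; each X_i is a temporary Gaussian variate.
    V = 1.0 - rng.random(q)              # in (0, 1], avoiding log(0)
    W = rng.random(q)
    X = np.sqrt(-2.0 * np.log(V)) * np.sin(2.0 * np.pi * W)
    # Random cluster selector: u is uniform in [0, 1]; cluster i is chosen
    # when A_{i-1} < u <= A_i, with A_i the cumulative sums of the P_j.
    u = rng.random()
    i = int(np.searchsorted(np.cumsum(P), u))
    # Vector operation: scale by the spread and translate to the centre.
    return H[i] * X + G[i], C[i]
```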
The procedure of Fig. 8 computes the pdf's at a given point Y by summing the contributions of the NCT clusters. For each cluster i, a weight WV is formed as the product, over the q elements, of one-dimensional Gaussian densities:

for j := 1 step 1 until q do
begin Z := (Y_j - G_{ij}) / H_i; R := exp(-Z²/2) / (H_i sqrt(2π)); WV := WV × R end;

and the weighted term is then accumulated into the pdf of the appropriate class:

PDF1 := PDF1 + WV × P_i (if C_i = +1),
PDF2 := PDF2 + WV × P_i (if C_i = -1).
Fig. 8. Method for calculating the pdf's, PDF1 and PDF2, at a given point Y. Notice that the parameters G_i, H_i, P_i (i = 1, ..., NCT) are identical with those used in Fig. 7. The larger of these pdf's indicates the Bayes' optimal decision.
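The Fig. 8 calculation translates directly; this sketch again assumes the Gaussian clusters used above:

```python
import numpy as np

def class_pdfs(Y, G, H, P, C):
    """PDF1 and PDF2 at a point Y (Fig. 8). Each cluster i contributes
    P_i times a product of q one-dimensional Gaussian densities."""
    pdf1 = pdf2 = 0.0
    for Gi, Hi, Pi, Ci in zip(G, H, P, C):
        WV = np.prod(np.exp(-0.5 * ((np.asarray(Y) - Gi) / Hi) ** 2)
                     / (Hi * np.sqrt(2.0 * np.pi)))
        if Ci == 1:
            pdf1 += WV * Pi
        else:
            pdf2 += WV * Pi
    return pdf1, pdf2   # the larger gives the Bayes' decision at Y
```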
8. USES OF TYPE C DATA

The artificial data generator defined in Fig. 7 produces data according to precisely defined rules. We can use our knowledge of these rules to calculate the pdf's at any point in the space. Figure 8 presents the flow chart of a procedure for doing this. This ability to calculate the pdf's is very important, since it allows us to calculate the Bayes' optimal decision at any point in the space. Hence we can: (i) Estimate the Bayes' error probability, by applying the output of our Type C generator to the Bayes' classifier (Fig. 9). (ii) Draw the Bayes' optimal decision surface. (This is only possible in 2-dimensional spaces.) This provides us with a method of visually comparing a practical classifier with the best possible solution (Fig. 9).
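For instance, use (i) can be sketched as a Monte Carlo loop over the generator's output. The cluster parameters below are invented for illustration; generate_type_c and class_pdfs are the sketches given earlier:

```python
import numpy as np

rng = np.random.default_rng(3)
G = np.array([[0.0, 0.0], [3.0, 3.0]])   # one cluster per class (invented)
H = np.array([1.0, 1.0])
P = np.array([0.5, 0.5])
C = np.array([1, -1])

errors, trials = 0, 10000
for _ in range(trials):
    X, T = generate_type_c(G, H, P, C, rng)
    pdf1, pdf2 = class_pdfs(X, G, H, P, C)
    bayes = 1 if pdf1 >= pdf2 else -1     # Bayes' optimal decision at X
    errors += (bayes != T)
print("Estimated Bayes' error probability:", errors / trials)
```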
Fig. 9. Comparison of the Bayes' optimal decision surface (solid curve) with a decision surface obtained by growing a compound classifier [6] (broken curve). The artificial data generator used in this experiment differed from Fig. 7 in using the following transformation in Box A: X_i := 4 (V_i - 0.5) (W_i - 0.5). This produces points lying within a finite range (X_i ∈ [-1, +1], i = 1, ..., q). Stippling indicates regions where there are no points (Bayes' decision is "don't know"). One of the classes is indicated by crosshatching. The square border indicates the limits of our plot of the decision surfaces.
We can generate data with known characteristics for use in other areas of data analysis. Again, the ability to calculate the pdf's is of great importance. For example, in order to develop pdf estimation techniques, it is valuable to be able to test our suggestions on data which has a precisely known pdf.

9. OTHER DATA STRUCTURES

We have discussed the generation of clustered artificial data in some depth. It seems appropriate to add a few comments here about some of the other types of structure found in natural data. We shall describe three types of natural data which cannot be modeled properly by Type C data without some modification.

(A) POWER NORMALIZATION
In many problems we find that the pattern vector X has been normalized so that either

Σ_{i=1}^{q} X_i    (1)

or

Σ_{i=1}^{q} X_i²    (2)

is constant. In such cases the point X is constrained to lie on a hyperplane or hypersphere, respectively. We may speculate that points will form clusters on these surfaces. However, we do not yet possess any strong evidence to support this. We may readily produce clustered artificial data of this type, by normalizing the output of a Type C generator, as sketched below.
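A sketch of the normalization for Equation (2); dividing instead by the element sum gives the constraint of Equation (1):

```python
import numpy as np

def normalize_power(X):
    """Project a pattern onto the unit hypersphere, so that the sum of
    squares of its elements [Equation (2)] is constant (equal to 1)."""
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X)
```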
(B) MISSING DESCRIPTORS

We often find natural data with some of the descriptors missing in some of the vectors. A Type C generator may be modified to simulate this kind of problem, simply by removing descriptors at random.

(C) NONSTATIONARY DISTRIBUTIONS
The generation of Type C data with time-varying statistics is straightforward, provided that we know how the clusters should move. It is possible to arrange for:
(i) addition or removal of clusters,
(ii) changing the cluster-selection probabilities (P_j in Fig. 7),
(iii) changing the sizes of clusters (H_j in Fig. 7),
(iv) movement of clusters (G_j in Fig. 7); see the sketch below.

These changes may be either abrupt or continuous. We do not yet have any clear idea about the dynamics of cluster movement in natural data.
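As one hedged illustration of item (iv), cluster centres can be drifted over time; the linear drift and its rates are assumptions, since the paper leaves the dynamics of natural clusters open:

```python
import numpy as np

def drift_centres(G, velocity, t):
    """Continuous movement of clusters [item (iv)]: the centre of cluster j
    at time t is G_j + t * velocity_j. An abrupt change would instead
    reassign G_j directly at some chosen time."""
    return G + t * velocity

G0 = np.array([[0.0, 0.0], [3.0, 3.0]])
vel = np.array([[0.01, 0.0], [0.0, -0.01]])   # invented drift rates
G_t = drift_centres(G0, vel, t=100)           # centres after 100 time steps
```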
10. USE OF NATURAL DATA AS A TEST MEDIUM

Any simulation is subject to the risk that it is based upon incomplete or even incorrect concepts. The use of artificial data in pattern recognition is liable to the same danger. To avoid this, we feel that it is important to apply our pattern recognition procedure to natural data, as a check that our conclusions about it are valid. The natural data used in this way may have no intrinsic value. It is important that any natural data used for testing be both cheap to collect and "adjustable," so that problems of varying complexity can be produced. We have found the discrimination between natural languages to be an excellent problem for testing learning procedures [4, 7]. The data on Irises [10] has also been widely used for this purpose.

APPENDIX. CRITERIA FOR PSEUDORANDOM NUMBER GENERATORS

If we are to produce good-quality artificial data, then we must design our pseudorandom number generators carefully. If we are to calculate pdf's as we have described in Section 8, then we must be sure that these pdf's accurately reflect the properties of the data being generated. It is impossible to test the output of a Type C generator properly, so we must devise suitable tests for its "inputs." These are the variables

V_i (i = 1, ..., q),   W_i (i = 1, ..., q),   and u.

Our purpose in this appendix is to provide guidelines for these tests, so that we can be reasonably confident that the Type C data we produce is representative of populations with the calculated pdf's.

(a) The variables listed above should be uniformly distributed. The following
statistical tests are commonly used to investigate this property [1, 11, 12] and are perfectly adequate for our purposes: the χ², ω², and Kolmogorov-Smirnov tests.

(b) The variables listed above should all satisfy the usual tests for randomness [1, 11, 12], viz: poker, gap, lagged product, runs up/down, and runs above/below the mean.

(c) Let S denote the set of variables (V_1, ..., V_q, W_1, ..., W_q, u), and let f and g be any two distinct members of S. In addition, we shall write per(f) to represent the period of the pseudorandom number generator whose output is f. Then we should choose the random number generators so that

HCF(per(f), per(g)) = 1    (f, g ∈ S, f ≠ g).

Then the joint period of these variables is

Π_{f∈S} per(f).

We should make sure that this joint period is greater than 10⁸, which is larger than the maximum size of data set that we are ever likely to use in practice.

(d) Members of S should be independent. While it is possible to devise sensitive statistical tests for vectors with 2 or even 3 elements, the author feels that the most effective method for detecting intervariate dependence is by visual inspection of scattergrams. The concept of randomness is so ill-defined that the greatest possible use should be made of the human ability to detect structure. Chambers [2] demonstrates the power of this approach in his diagrams.
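Two of these checks are easy to sketch in Python: a Kolmogorov-Smirnov test of uniformity [criterion (a)], and the coprimality condition on generator periods [criterion (c)]. The periods used below are invented Mersenne-prime examples:

```python
import numpy as np
from math import gcd
from scipy.stats import kstest

# (a) Uniformity: K-S test of a generator's output against U[0, 1].
samples = np.random.default_rng(4).random(10000)
stat, p_value = kstest(samples, "uniform")
print("K-S statistic:", stat, "p-value:", p_value)

# (c) If the periods are pairwise coprime (HCF = 1), the joint period is
# their product; it should exceed any data set size used in practice.
periods = [2**19 - 1, 2**17 - 1, 2**13 - 1]   # invented example periods
assert all(gcd(a, b) == 1 for i, a in enumerate(periods)
           for b in periods[i + 1:])
print("Joint period:", int(np.prod(periods)))
```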
REFERENCES

1. T. G. Newman and P. L. Odell, The Generation of Random Variates, Griffin, London, 1971.
2. R. P. Chambers, Random number generation on digital computers, IEEE Spectrum (Feb. 1967).
3. N. L. Ford, B. G. Batchelor, and B. R. Wilkins, Learning scheme for the nearest neighbour classifier, Information Sciences 2, 139 (1970).
4. B. G. Batchelor, Learning machines for pattern recognition, Ph.D. Thesis, University of Southampton, 1969.
5. B. G. Batchelor, N. L. Ford, and B. R. Wilkins, Learning in a potential function classifier, Electronics Letters 6, 826 (1970).
6. B. G. Batchelor, Growing and pruning a pattern classifier, Information Sciences 6, 97 (1973).
7. B. R. Wilkins and B. G. Batchelor, Evolution of a descriptor set for pattern recognition, I.F.A.C. Symposium, Yerevan, 1968; proceedings published by Instrument Society of America, 1970.
8. B. G. Batchelor, N. L. Ford, and B. R. Wilkins, Family of pattern classifiers, Electronics Letters 6, 368 (1970).
9. B. G. Batchelor, Automatic procedures for varying the complexity of a pattern classifier, in Machine Perception of Patterns and Pictures, Institute of Physics Conference Series No. 13, 317, 1972.
10. R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7, 179 (1936).
11. T. H. Naylor, J. L. Balintfy, D. S. Burdick, and Kong Chu, Computer Simulation Techniques, Wiley, New York, 1968.
12. W. J. Conover, Practical Non-Parametric Statistics, Wiley, New York, 1971.
Received August 1974