Neurocomputing 13 (1996) 347-357
Classification by balanced binary representation
Yoram Baram *
Department of Computer Science, Technion, Israel Institute of Technology, Haifa 32000, Israel
NASA Ames Research Center, Moffett Field, CA 94035, USA
* Email: [email protected]
This work was supported in part by the Director's Discretionary Fund, NASA Ames Research Center, and in part by the Fund for the Promotion of Research at the Technion.
Received 11 October 1994; accepted 10 April 1995
Abstract

Classifiers for binary and for real-valued data, consisting of a single internal layer of spherical threshold cells, are defined by two fundamental requirements: asymptotic linear separability of the internal representations, which defines the cells' activation threshold, and input-space covering, which defines the minimal number of cells required. Both parameters depend solely on the input dimension. Class assignments are learnt by applying Rosenblatt's learning rule to the internal representations which, by choice of the activation threshold, are balanced, having equally probable bit values. Balancing guarantees that the asymptotic separation capacity of the proposed classifiers is equal to the size of the internal layer. Generalization is achieved when the data points are clustered. The advantage of balancing is demonstrated by application to two 'real-world' problems, one involving binary data and the other real-valued data.

Keywords: Classification; Pattern recognition; Balancing; Learning capacity
1. Introduction

The classification problem is one of assigning a member u of a set U to a class U^- ⊂ U, or to its complement U^+ ⊂ U. A device that performs this task, a classifier,
should be able to adapt to any such dichotomy (U^+, U^-) on U. Since u will form an input to such a device, we call U the input space.

The problem has long been of interest in science and engineering. Rosenblatt [1] showed that a single linear threshold unit is capable of classifying linearly separable binary vectors. He proposed a simple learning algorithm that, under linear separability, converges to a solution in a finite number of steps. While the approach is particularly tailored to the separation of two classes, it readily extends to multi-class problems with some additional logic. The separation capacity of a single hyperplane in n-dimensional real space with respect to random, generally positioned vectors has been shown by Cover [2] to be 2n. This implies that the classification capacity of a single-cell perceptron is twice the input dimension, which is very low in any practical context. It was shown by Baum [3] that every set of N generally positioned points can be separated by some network, specified with respect to the data, consisting of N/n linear threshold cells. Baum's construction, which is suitable for proving the existence of a solution, cannot be used, however, in actual classifier design, as it will, with high probability, produce a zero response to new input points. The actual construction of multi-cell classifiers appears to have been limited to data-dependent architectures (see e.g. [4]). Alternatively, the back-propagation method [5] employs a fixed architecture, but provides no clues as to the choice of one. In addition, the back-propagation method often presents a heavy computational load and does not lend itself easily to capacity and performance analysis.

In this paper we present an approach to the design of classifiers for binary and for real-valued data. The given vectors are transformed into (0, 1) vectors, employing an internal layer of threshold cells that represent randomly placed spheres in the input space. The sparsity of the internal representations is determined by the activation threshold of the internal cells. For a given sparsity, requiring that each point in the input space be covered by the spheres with high probability yields a lower bound on the size of the internal layer, linear in the input dimension. Applying Rosenblatt's learning method to the internal representations of the training inputs, the learning capacity becomes equal to the size of the internal layer, provided that the representations are linearly independent. Results on the singularity of random (0, 1) matrices [6,7] imply that the internal representations become linearly independent with high probability for a large input dimension, if their number does not exceed their dimension, provided that their elements are mutually independent and the probability p of each of these elements being 1 is 1/2. Symmetry suggests that if the elements of the internal representations are mutually statistically independent, then the probability of linear independence between the internal representations is maximal when p = 1/2. On more analytic grounds, we show that for p = 1/2, and only for this value, the elements of the internal representations become mutually asymptotically independent as n becomes large, making the results of [6] and [7] relevant. This, in turn, implies that the classifier's asymptotic separation capacity is equal to its size. When p = 1/2, the internal representations will be said to be balanced.
Balancing the internal representations by choice of the network's parameters, which is advocated in this paper for the purpose of classification, stands in contrast to sparse internal coding, which has been proposed in the past ([8,9]) for the purpose of associative memory.
2. Binary vector classification
We consider the binary input space U = {±1}^n first. A given input u ∈ U will be transformed into the space {0, 1}^N by a layer of N internal cells performing the function

$$x_i = \begin{cases} 1 & \text{if } \langle v^{(i)}, u\rangle \ge t \\ 0 & \text{if } \langle v^{(i)}, u\rangle < t \end{cases} \tag{1}$$

where v^(i) ∈ {±1}^n is a binary input weights vector connecting the input to the ith cell, ⟨v^(i), u⟩ denotes the inner product between the two vectors and t, the threshold, is a positive scalar. The resulting vector x is the internal representation of the input vector u. The input weights of the cells define the centers of discretizing spheres of Hamming radius r = 0.5(n - t) in {±1}^n, so that the cell's output is 1 when the input falls within the sphere and 0 otherwise. These spheres divide the input space into mutually exclusive domains, called grains, consisting of intersections of groups of spheres. The transformation (1) assigns different vectors of {0, 1}^N to different grains. A given set of input vectors u^(i), i = 1, ..., M, together with their class assignments, induces a dichotomy (X^-, X^+) on the corresponding set of internal representations, x^(i), i = 1, ..., m, with m ≤ M, in an obvious manner. It should be noted, however, that the induction is not unique in the sense that two different sets of input points, producing the same set of representations, may induce different dichotomies. The reason is that there may be grains containing points from both U^- and U^+. Let the class assignment of u, produced by the classifier, be based on the value of the output function
$$y = \begin{cases} 1 & \text{if } \langle w, x\rangle \ge 0 \\ -1 & \text{if } \langle w, x\rangle < 0 \end{cases} \tag{2}$$
where w_j, the jth element of w, is the weight of the connection between the jth internal cell and the output. If, by convention, y^(i) = 1 implies that u^(i) belongs to U^+, then u will be assigned to U^+ by the classifier if y = 1.

Given a training set A of pairs (u^(i), y^(i)), i = 1, ..., M, where u^(i) is an input and y^(i) is the correct output for u^(i), classification is learnt by applying Rosenblatt's perceptron learning rule to the corresponding pairs of internal representations and outputs: Start with w_j = 0, j = 1, ..., N. Pick a pair (u^(i), y^(i)) from A and present u^(i) as an input. If the output is y^(i), pick another pair and repeat the operation. If the output is not y^(i), change the weights as follows:

$$w_j = w_j + x_j^{(i)} y^{(i)}, \qquad j = 1, \ldots, N \tag{3}$$

The perceptron convergence theorem [10] implies that this learning process will converge to a final value of the weights vector w, so that the internal representations of the training vectors are classified correctly, provided that their assignment is not ambiguous and that such a weights vector exists.
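As a concrete illustration of (1)-(3), the internal layer and the learning rule can be sketched as follows. This is a minimal sketch with invented sizes and an arbitrary toy dichotomy; it is not the configuration used in the experiments reported later in the paper.

```python
import numpy as np

def binary_internal_layer(U, V, t):
    """Transformation (1): x_i = 1 if <v^(i), u> >= t and 0 otherwise.

    U : (M, n) array of inputs in {+1, -1}^n
    V : (N, n) array of binary input weight vectors (sphere centers) in {+1, -1}^n
    t : positive activation threshold; the Hamming radius is r = 0.5 * (n - t).
    """
    return (U @ V.T >= t).astype(int)

def train_perceptron(X, y, max_epochs=1000):
    """Rosenblatt's rule (3) applied to the internal representations X (0/1 valued)
    with class assignments y in {+1, -1}. Returns the weights and the epochs used."""
    w = np.zeros(X.shape[1])
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            out = 1 if w @ x_i >= 0 else -1        # output function (2)
            if out != y_i:
                w = w + y_i * x_i                  # update (3)
                mistakes += 1
        if mistakes == 0:                          # training set fully separated
            return w, epoch
    return w, max_epochs

# Illustrative use with invented sizes and an arbitrary toy dichotomy.
rng = np.random.default_rng(0)
n, N, M = 21, 40, 25
U = rng.choice([-1, 1], size=(M, n))
V = rng.choice([-1, 1], size=(N, n))
X = binary_internal_layer(U, V, t=1)
y = np.where(U @ rng.choice([-1, 1], size=n) >= 0, 1, -1)
w, epochs = train_perceptron(X, y)
print(X.mean(), epochs, np.all(np.where(X @ w >= 0, 1, -1) == y))
```

For odd n, the threshold t = 1 activates each cell on exactly half of the input space, so the representations above are balanced; with M not exceeding N they are then linearly independent with high probability for large n, and the training loop terminates with the training set separated.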
2.1. Input-space covering and separation capacity
Since the underlying assumption is that the input vectors may take, both in training and in operation, any values in the input space, and that the class assignment of the same
input may be different in different applications, the least we should expect from the classifier is that every point in the input space have a nonzero internal representation. We assume in this paper that the centers of the discretizing spheres are chosen randomly. (Placement may also be based, in principle, on deterministic code constructions [11]. However, it can be shown that such placement is less efficient than a random one for a large input dimension.) The first design objective is, then, to find a value for the network size, N, so that the input space is covered by the spheres with high probability. The probability that a random point in the input space is not covered by the discretizing spheres is
$$P_c = (1 - p)^N \tag{4}$$

where p is the probability that a random point falls within a sphere of radius r, when both the point and the sphere center are randomly placed in {±1}^n. It follows immediately that

$$P_c \le e^{-\varepsilon n} \tag{5}$$

when

$$N \ge \frac{\varepsilon n}{\ln\bigl[1/(1-p)\bigr]} \tag{6}$$
for any ε > 0. The number of spheres required, so that a random point is covered with probability 1 - e^{-εn}, can be obtained, then, from (6). It can be seen that coverage is achievable with a high degree of confidence by a reasonable number of cells. For instance, if n = 20, p = 1/2 and ε = 1, then N ≥ 30 guarantees that, on the average, one in 485 million input points will not be covered.

Convergence of the learning process to correct values of the weights is guaranteed by the existence of a separating hyperplane in the space of internal representations. It is known (see [12], p. 70) that a separating hyperplane exists with high probability if and only if the number of training vectors, M, satisfies M ≤ N, provided that these vectors are in general position. To find a bound on the learning capacity of the classifier, we look for the largest value of M for which general position of the training vectors can be guaranteed with high probability. Restricting, then, the training set size to be no greater than N, a necessary and sufficient condition for the training internal representations to be in general position is that they are mutually linearly independent. It was shown by Komlos [6] that the probability that a random (0, 1) matrix of dimension N × N is singular is of order 1/√N. A more recent paper by Kahn et al. [7] shows that this probability is smaller than (1 - ε)^N for some 0 < ε < 1. The matrix elements are assumed to be statistically independent, and it is also assumed that p = 1/2. Symmetry considerations suggest that p = 1/2 minimizes the probability of singularity, although this is not formally proven (note that a random (0, 1) matrix is singular with high probability for p values near 0 or near 1, and certainly singular for p = 0 and for p = 1). The elements of a given internal representation in the proposed classifier are not mutually statistically independent, as required by [6] and [7]. However, they become
independent asymptotically, as the input dimension, n, becomes large. This may be seen employing the spherical analogy of the space {±1}^n [13,8], which implies that, for large n, the bulk of this space is contained within a thin spherical shell of radii n/2 ± ε, centered at some point, say, v^(i). The larger n is, the smaller is ε. The main implications of this geometry on our considerations are that the points ±v^(j), j ≠ i, are all placed at approximately the same distance, n/2, from v^(i) and that any two of these points are approximately at distance n/2 from each other. Obviously, only information concerning the distance between u and v^(i) can have any bearing on whether the ith cell is activated. It will be shown shortly that p = 1/2 implies that r → n/2 as n → ∞. The activation of a certain subset of the other cells means that u falls within the intersection of the corresponding subset of the spheres. Since

$$p = \frac{S_n(r)}{2^n} \tag{7}$$

where

$$S_n(r) = \sum_{i=0}^{r} \binom{n}{i} \tag{8}$$

is the volume of a sphere of radius r in {0, 1}^n, the activation radius, r, should be numerically calculable. However, this is not very simple for n greater than, say, 20. Further noting that ([11], p. 310)

$$\frac{2^{n h_2(\rho)}}{\sqrt{8n\rho(1-\rho)}} \le S_n(r) \le 2^{n h_2(\rho)} \tag{9}$$

where ρ = r/n and h_2 denotes the binary entropy function, we have

$$n\, h_2^{-1}\!\left(1 - \frac{1}{n}\log_2\frac{1}{p}\right) \le r \le n\, h_2^{-1}\!\left(1 - \frac{1}{n}\log_2\frac{1}{p} + \frac{1}{2n}\log_2\bigl(8np(1-p)\bigr)\right) \tag{10}$$

It is not difficult to see that (1/2n) log_2(8np(1 - p)) is much smaller than 1 for n values greater than, say, 10. For large n, the intermediate value

$$r = n\, h_2^{-1}\!\left(1 - \frac{1}{2n}\log_2\frac{1}{p}\right) \tag{11}$$

seems to be a reasonable estimate of r. For the balancing value p = 1/2, the minimal number of cells is N = εn/ln 2 and their activation radius is n h_2^{-1}[1 - 1/(2n)]. It can also be seen that r → n/2 as n → ∞.
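A small numerical sketch (illustrative, not part of the paper) ties the two design quantities together: the minimal layer size from the covering condition (6) and the activation radius from the estimate (11) as given above, the latter obtained by inverting the binary entropy function by bisection.

```python
import math

def min_cells(n, p, eps):
    """Smallest integer N satisfying the covering condition (6)."""
    return math.ceil(eps * n / math.log(1.0 / (1.0 - p)))

def h2(x):
    """Binary entropy function (in bits)."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def h2_inv(y, tol=1e-12):
    """Inverse of h2 restricted to [0, 1/2], computed by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h2(mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

def activation_radius(n, p):
    """Estimate (11): r = n * h2^{-1}(1 - log2(1/p) / (2n))."""
    return n * h2_inv(1.0 - math.log2(1.0 / p) / (2.0 * n))

p, eps = 0.5, 1.0
for n in (20, 40, 100):
    print(n, min_cells(n, p, eps), round(activation_radius(n, p) / n, 3))
    # for p = 1/2 the minimal N is ceil(eps * n / ln 2), and r/n approaches 1/2 as n grows

# For n = 20 and eps = 1, bound (5) gives P_c <= e^{-20}, i.e. roughly one
# uncovered input point in 485 million, as in the example quoted earlier.
print(math.exp(-20))
```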
We have seen that in order to satisfy the covering requirement, the network need only be linear in the input dimension. This is quite different from the case of associative memory [8,9], where effective operation requires a network size exponential in the input dimension. Indeed, the operation of associative memories is conditioned on the sparsity of the internal representations, while that of the proposed classifier requires, for large n, non-sparsity, represented by p = 1/2.

The input-space covering requirement imposes a lower bound on the size of the internal layer. Increasing the latter will increase the separation capacity. Implementation constraints will impose an upper bound.

If a given training set is fully separated, further training may be attempted to improve performance on new, unlearnt data. Extension of the classifier's performance on the training set to new data has been termed generalization. Cover [2] defined generalization as convergence to zero of the probability that a new data point is classified ambiguously by two different hyperplanes that separate the training set, as the size of the latter grows to infinity. Vapnik and Chervonenkis [14] defined generalization as convergence of the classifier's rate of success to the probability of success, uniformly with respect to the weights. Both definitions assume that, as the training set is increased, the classifier keeps separating it. In the case of linearly separable data, which is the case assumed for the single-cell perceptron, this is a reasonable assumption. However, in our case it cannot be assumed a priori that the internal representations will be linearly separable beyond the classifier's capacity. Generalization can be guaranteed in our case only if the internal representations corresponding to one of the classes are confined to a certain region in their space that can be separated from the rest of the space by a hyperplane. Then additional points from that class can be expected to fall in the same region as the training points and be classified correctly. Since close points in the internal representation space correspond to close ones in the input space, this implies that generalization can be guaranteed if it can be guaranteed that one of the two classes forms a single convex cluster in the input space.

Example 1. It has been shown that obstacle detection, which is a basic task in such applications as robot navigation, can be performed using optical flow information [15,16]. The time to collision, τ, is related to the optical flow by

$$\frac{\partial v_x}{\partial x} + \frac{\partial v_y}{\partial y} = \frac{2}{\tau}$$
where v = [v_x v_y]' is the local velocity of an edge in the image plane. The latter is related to the local brightness, or gray level, I, by the brightness constancy equation

$$\frac{\partial I}{\partial x}v_x + \frac{\partial I}{\partial y}v_y = -\frac{\partial I}{\partial t}$$

which does not have a unique solution (a problem known as the "aperture problem") [17]. It was shown in a recent paper [18] that effective classification of objects as being
"safe" or "dangerous" can be performed by a sparsely encoded classifier, using the signs of the first-order spatial and temporal derivatives of the local brightness alone. These three binary measurements, taken at 13 local grid points, constitute a 39-dimensional binary input vector. The activation radius of the internal cells was set in the reported experiments to r = 0.3, corresponding, approximately, to p = 0.35, and there were 5600 internal cells. In the worst case considered, combining longitudinal and lateral motions, there were 3000 training pairs and 909 testing pairs. The testing rate of success for this case was 91 percent and the training process converged in 32 epochs. In view of the analytical observations of the present paper, we modified p to the balancing value 0.5, corresponding, approximately, to r = 0.45, and reduced the number of internal cells to 3900. The rate of success in the worst case rose to 94 percent, and the training process converged in 18 epochs. Applying a single-cell perceptron to this problem, the training process did not converge in 100 epochs and the testing rate of success was 62 percent, which shows that the classes are highly linearly inseparable. The performance of the proposed classifier on this 'real world' problem is particularly impressive in view of the fact that the high-dimensional data, corresponding to local edge velocities pointing in many different directions, appear to be highly disordered. Yet, a close examination will reveal that, since the velocities normal to the edges all point away from the focus of expansion, the data points corresponding to the two classes are clustered, which explains the high generalization capability.
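For concreteness, one way such a sign vector could be assembled is sketched below; the synthetic frames, the grid and the derivative estimator are invented here for illustration and are not those of [18].

```python
import numpy as np

def binary_brightness_signs(frames, grid_points):
    """Signs of the first-order spatial and temporal brightness derivatives,
    evaluated at the given grid points of the middle frame and packed into a
    {+1, -1} vector (13 grid points x 3 signs = 39 components)."""
    I = np.asarray(frames, dtype=float)      # shape (T, H, W)
    dIdt, dIdy, dIdx = np.gradient(I)        # derivatives along t, y and x
    mid = I.shape[0] // 2
    signs = []
    for (row, col) in grid_points:
        for d in (dIdx, dIdy, dIdt):
            signs.append(1 if d[mid, row, col] >= 0 else -1)
    return np.array(signs)

# Toy usage: three frames of an expanding Gaussian blob (a looming object) and
# an arbitrary 13-point grid, both invented for illustration only.
H, W = 32, 32
yy, xx = np.mgrid[0:H, 0:W]
frames = [np.exp(-((xx - 16.0) ** 2 + (yy - 16.0) ** 2) / (2.0 * s ** 2)) for s in (4.0, 5.0, 6.0)]
grid = [(16 + dy, 16 + dx) for dy in (-8, 0, 8) for dx in (-8, 0, 8)] + [(4, 16), (28, 16), (16, 4), (16, 28)]
u = binary_brightness_signs(frames, grid)
print(len(grid), u.shape)                    # 13 grid points -> a 39-dimensional binary vector
```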
3. Real-valued data classification

We now turn to the classification of points in a bounded domain in the n-dimensional Euclidean space. These points are assumed to be reduced into the unit cube U = [0, 1]^n (this can be done by placing the origin so that the corresponding vector elements are all positive, and dividing each of them by the maximal length of any of these vectors). An input vector u ∈ U is transformed into a binary one by a layer of cells, each performing the spherical threshold function
$$x_i = \begin{cases} 1 & \text{if } \|u - v^{(i)}\| \le r \\ 0 & \text{if } \|u - v^{(i)}\| > r \end{cases} \tag{12}$$
where v^(i), i = 1, ..., N are n-dimensional input weights vectors, whose components are randomly and independently selected from the interval [0, 1], and ‖·‖ denotes the vector norm. As in the binary case, classification is based on the value of the function
$$y = \begin{cases} 1 & \text{if } \langle w, x\rangle \ge 0 \\ -1 & \text{if } \langle w, x\rangle < 0 \end{cases} \tag{13}$$
where w is the vector of connection weights, updated in the training stage according to Rosenblatt's rule

$$w = \begin{cases} w + x^{(i)} y^{(i)} & \text{if } y \ne y^{(i)} \\ w & \text{if } y = y^{(i)} \end{cases} \tag{14}$$
where x^(i) and y^(i) are the internal representation and the correct class assignment corresponding to a training input u^(i). The internal cells, performing the function (12), represent spheres, whose centers are randomly placed in the unit cube at the points defined by the weights vectors v^(i) of the corresponding cells and whose radius is the activation threshold of the cells. It should be noted that while in binary space the spherical function is identical to a linear threshold function, it is not in real space. As in the binary case, deterministic placement of these spheres [19] is not as efficient as random placement for large n. A bound on the minimal cardinality of a randomly placed covering code, identical to the one obtained for the binary case, follows from the same arguments. For a given p (which determines r) we have, then, P_c ≤ e^{-εn}, provided that

$$N \ge \frac{\varepsilon n}{\ln\bigl[1/(1-p)\bigr]}$$
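A minimal end-to-end sketch of the real-valued construction (12)-(14) is given below, with invented sizes and toy data; it is not the configuration of the experiments in [21], and the radius is an ad hoc choice that leaves roughly half the cells active for these data.

```python
import numpy as np

def spherical_layer(U, V, r):
    """Spherical threshold cells (12): x_i = 1 iff ||u - v^(i)|| <= r.

    U : (M, n) inputs in the unit cube [0, 1]^n
    V : (N, n) randomly placed sphere centers
    """
    d = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=2)
    return (d <= r).astype(int)

def train(X, y, max_epochs=1000):
    """Rosenblatt's rule (14) applied to the internal representations."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if (1 if w @ x_i >= 0 else -1) != y_i:   # output function (13)
                w = w + y_i * x_i                    # update (14)
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Toy usage: two clusters in [0, 1]^6 and invented sizes; r = 0.85 keeps the
# average activation of the cells close to one half for these data.
rng = np.random.default_rng(1)
n, N, M = 6, 200, 60
V = rng.uniform(size=(N, n))
U = np.clip(np.vstack([rng.normal(0.3, 0.05, (M // 2, n)),
                       rng.normal(0.7, 0.05, (M // 2, n))]), 0.0, 1.0)
y = np.array([-1] * (M // 2) + [1] * (M // 2))
X = spherical_layer(U, V, r=0.85)
w = train(X, y)
print(X.mean(), np.mean(np.where(X @ w >= 0, 1, -1) == y))   # balance and training accuracy
```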
Since the exact relationship between r and p is difficult to derive (numerical values can be calculated by Monte Carlo simulation), we seek bounds on the value of r in terms of p. The volume of a sphere of radius r in R^n is known to be ([19], p. 9)

$$V_n(r) = \begin{cases} \dfrac{\pi^{n/2}\, r^n}{(n/2)!} & (n \text{ even}) \\[2ex] \dfrac{(2r)^n\, \pi^{(n-1)/2}\, ((n-1)/2)!}{n!} & (n \text{ odd}) \end{cases} \tag{15}$$

Clearly,

$$p = \bar{v}_n(r) \tag{16}$$

where v̄_n(r) is the expected volume of the intersection of the unit cube with a sphere of radius r, whose center is randomly placed within the unit cube. Let us denote by r(p) the sphere radius that satisfies (16) and by r_p the sphere radius such that

$$V_n(r_p) = p \tag{17}$$

Then, clearly,

$$r_p \le r(p) \le n^{1/2} \tag{18}$$

(note that n^{1/2} is the length of the main diagonal of the cube and r ≥ n^{1/2} implies p = 1). Employing (15), we obtain (assuming even n and noting that the result for odd n will be practically the same)

$$\frac{\pi^{n/2}\, r_p^n}{(n/2)!} = p \tag{19}$$

yielding, by the Stirling approximation (n! ≈ n^n e^{-n} √(2πn)),

$$r_p = (2\pi e)^{-1/2}\, \bigl(\pi n p^2\bigr)^{1/(2n)}\, n^{1/2} \tag{20}$$
Hence, approximately for large n,

$$r_p = (2\pi e)^{-1/2}\, n^{1/2} \approx 0.242\, n^{1/2} \tag{21}$$

The lower bound on r(p) becomes, then, for large n, independent of p. We found in simulation that, for large n, performance is rather invariant to the value of p, as long as it is not very close to the extreme values 0 and 1. An empirical value for r which causes, on the average, the activation of about half the internal cells can be found for a given data set or by Monte Carlo simulation in a 'trial and error' fashion (a sketch of such a procedure is given at the end of this section).

In deriving the asymptotic separation (or learning) capacity of the classifier, the arguments presented in the binary case apply here too, as the internal representations are, again, (0, 1) vectors which are, for p = 0.5, linearly independent with high probability for large n. This guarantees, with high probability for large n, the existence of a separating hyperplane in the space of internal representations and, consequently, convergence of the learning algorithm, provided that M < N. We emphasize once again that generalization is a useful property only when the training set is separable. The advantage of the proposed method over other, more parsimonious, methods is that it starts with a high separation capability. Generalization will then depend on the data structure. If the classes are highly mixed, the proposed method will separate the training set but it will generalize poorly. On the other hand, a very parsimonious method will likely fail to separate the training set, but it may nicely generalize this poor performance to new data. Both approaches are likely to produce good results when the classes are convexly clustered.

Example 2. Diamond evaluation can be formulated as a classification problem in which a diamond is to be assigned one of a number of color grades [21]. The intensities of the red, green and blue components of the reflection light and the same components of the illumination light constitute a six-dimensional real-valued input vector. In [21], a set of 225 diamonds, pre-classified by a certain widely recognized institute, was randomly divided into several pairs of training and testing sets. Several network architectures and training methods were tried. A single-cell perceptron assigned to each of the grades performed poorly, indicating that the data is not linearly separable. Several variants of the back-propagation method, including momentum and varying learning rates [20], were then applied to several network architectures, ranging from a single internal layer of 10 to 500 cells, to two and three internal layers of 10 to 40 cells each. In all cases, the algorithm rapidly reached a local minimum, failing to produce meaningful results. Clustering the data so as to reduce measurement fluctuation effects, and applying a single-cell perceptron per grade to each cluster, resulted in a testing success rate of 81 percent. Applying, instead, a multi-cell perceptron, consisting of a single internal layer of 2400 spherical cells, produced the same success rate of 81 percent with and without prior clustering, indicating a certain inherent clustering capability. The number of internal cells and their threshold value used in [21] were chosen so as to satisfy a minimal covering requirement on the input space. Using instead the balancing value for the threshold r, found empirically so as to produce p = 0.5, increased the success rate to 82 percent without prior clustering of the data, and to 86 percent with prior clustering. The difference is explained by the fact that while in the former case the sub-networks
assigned to the different clusters by the internal layer are weakly joint, the latter method assigns a separate network to each predetermined cluster.
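Finally, the 'trial and error' Monte-Carlo tuning of the radius mentioned above can be sketched as follows (an illustrative procedure, not the one used in [21]); the resulting balancing radius may be compared with the large-n lower bound (21).

```python
import numpy as np

def mean_activation(r, n, trials=20000, rng=np.random.default_rng(2)):
    """Monte-Carlo estimate of p: the probability that a random point of the unit
    cube falls within distance r of a randomly placed sphere center."""
    u = rng.uniform(size=(trials, n))
    v = rng.uniform(size=(trials, n))
    return float(np.mean(np.linalg.norm(u - v, axis=1) <= r))

def balance_radius(n, target=0.5, iters=40):
    """Bisection on r until the mean activation is close to `target`."""
    lo, hi = 0.0, float(np.sqrt(n))      # r = sqrt(n) covers the whole cube
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_activation(mid, n) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for n in (6, 20, 50):
    print(n, round(balance_radius(n), 3), round(0.242 * float(np.sqrt(n)), 3))
    # empirical balancing radius vs. the large-n lower bound (21)
```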
4. Conclusion

A method for classifier design for binary and real-valued data has been presented. Rosenblatt's learning method is applied to (0, 1) internal representations, obtained by a layer of threshold elements. A linear lower bound on the network size was derived from input-space covering considerations. For balanced internal representations and for large input dimension, the separation capacity was shown to be, with high probability, equal to the size of the internal layer.
References

[1] F. Rosenblatt, The perceptron: A probabilistic model for information storage and organization in the brain, Psychol. Rev. 65 (1958) 386-408.
[2] T.M. Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Trans. Electronic Comput. (June 1965) 326-334.
[3] E.B. Baum, On the capabilities of multilayer perceptrons, J. Complexity 4 (1988) 193-215.
[4] M. Mezard and J. Nadal, Learning in feedforward layered networks: the tiling algorithm, J. Phys. A 22 (12) (1989) 2191-2203.
[5] P.J. Werbos, Beyond regression: New tools for prediction and analysis in the behavioral sciences, thesis in Applied Mathematics, Harvard University, 1974.
[6] J. Komlos, circulated manuscript; see B. Bollobas, Random Graphs (Academic Press, New York, 1985) 341-352.
[7] J. Kahn, J. Komlos and E. Szemeredi, On the probability that a random matrix is singular, to be published.
[8] P. Kanerva, Sparse Distributed Memory (MIT Press, 1988).
[9] Y. Baram, Corrective memory by a symmetric sparsely encoded network, IEEE Trans. Information Theory 40 (2) (Mar. 1994) 429-438.
[10] M.L. Minsky and S.A. Papert, Perceptrons (MIT Press, 1988).
[11] F.J. MacWilliams and N.J.A. Sloane, The Theory of Error-Correcting Codes (North-Holland, 1988).
[12] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973).
[13] R.W. Hamming, Coding and Information Theory (Prentice-Hall, 1986).
[14] V.N. Vapnik and A.Ya. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory of Probability and its Applications 16 (2) (1971) 264-280 (first published in Russian, May 1969).
[15] R.C. Nelson and J. Aloimonos, Obstacle avoidance using flow field divergence, IEEE Trans. on Pattern Analysis and Machine Intelligence 11 (10) (1989) 1102-1106.
[16] D.L. Ringach and Y. Baram, A diffusion mechanism for obstacle detection from size-change information, IEEE Trans. on Pattern Analysis and Machine Intelligence 16 (1) (1994).
[17] B.K.P. Horn and B.G. Schunck, Determining optical flow, Artificial Intelligence 17 (3) (Aug. 1981) 185-203.
[18] Y. Baram and Y. Barniv, Obstacle detection by learning binary expansion patterns, to appear, IEEE Trans. on Aerospace and Electronic Systems (Jan. 1996).
[19] J.H. Conway and N.J.A. Sloane, Sphere Packings, Lattices and Groups (Springer-Verlag, 1988).
[20] D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning internal representations by error propagation, in: Parallel Distributed Processing, D.E. Rumelhart and J.L. McClelland, eds. (MIT Press, 1986).
[21] Y. Baram and S. Wasserkrug, Clustering and classification by neural networks applied to diamond evaluation, CIS Report No. 9410, Department of Computer Science, Technion, Israel Institute of Technology, June 1994.
Yoram Baram received the B.Sc. degree in aeronautical engineering from the Technion - Israel Institute of Technology, Haifa, the M.Sc. degree in aeronautics and astronautics, and the Ph.D. degree in electrical engineering and computer science, both from the Massachusetts Institute of Technology, Cambridge, in 1972, 1974, and 1976, respectively. In 1974-1975 he was with the Charles Stark Draper Laboratory, Cambridge, MA. In 1977-1978 he was with the Analytic Sciences Corporation, Reading, MA. In 1978-1983 he was a faculty member at the Department of Electronic Systems, School of Engineering, Tel-Aviv University, and a consultant to the Israel Aircraft Industry. Since 1983 he has been with the Technion - Israel Institute of Technology, where he is an Associate Professor of Computer Science. In 1986-1988 he was a Senior Research Associate of the National Research Council at the NASA-Ames Research Center, Moffett Field, CA, where he has also spent the following summers. His research interests are in pattern recognition and neural networks.