JOURNAL OF MATHEMATICAL ANALYSIS AND APPLICATIONS 7, 475-481 (1963)

The Geometry of Convergence of Simple Perceptrons*†

A. CHARNES

The Technological Institute, Northwestern University, Evanston, Illinois

Submitted by Richard Bellman
INTRODUCTION
The concept of the perceptron, due to F. Rosenblatt [1], and various of its attributes, developments, and literature are reviewed in a particularly cogent manner in the paper [2] by his co-worker, H. D. Block. A still more recent coverage of the literature is to be found in the 1962 Proceedings of the ONR Symposium on Self-Organizing Systems. Although the fundamental theorems on the convergence of learning procedures for a simple perceptron are developed in [2] in a highly ingenious manner (a distillation of a succession of proofs by Rosenblatt, Joseph, Kesten, and Block), it is not clear from their “hard analysis” what lies behind this convergence in any intuitive manner, and thereby what general classes of procedures may be expected to converge or diverge, or when indeed any “learning” can make the differentiations required. For instance, it should be expressly noted that the perceptron convergence theorems are false if more general cases of a system of linear inequalities are considered. The iterative procedures involved are particular types of relaxation or cyclic projection methods which one knows (cf. [3]) converge then, at best, in an infinite number of steps and are also subject to well-known difficulties of numerical entrapment, etc. Further, since more elaborate perceptrons (see [4]) involve more elaborate (and less complete) hard analytic arguments, what kinds of extensions are likely to work, and how these may be characterized, is not at all evident, or intuitive, mathematically.

The results which follow were developed shortly after a lecture visit to Northwestern University by H. D. Block in September 1962. I present the problems in a completely geometric setting and develop an intuitive (but rigorous) proof of convergence (restricting myself here to the case of a simple perceptron and the error correction procedure) whose essential argument requires only two-dimensional visualization, although it encompasses the general case of variable correction step magnitudes. Also the question of existence of a solution, i.e., of the in-principle possibility of learning, hardly touched on in [2], but elsewhere designated as the “linear separability” problem (see [5]; this work became available to me only months after my results, which overlap little if at all, were accomplished), is placed in a geometric linear programming framework developed earlier by the writer and others [3, 6, 7]. En passant, a computational solution is suggested.

* Based on a presentation at the COINS Symposium on Learning, Adaptation, and Control in Information Systems, June 17, 1963, The Technological Institute, Northwestern University, Evanston, Illinois.

† Research underlying this paper was partly undertaken for the project Temporal Planning and Management Decision under Risk and Uncertainty at Northwestern University, under contract with the U.S. Office of Naval Research, contract Nonr-1228(10), project NR 047-021. Reproduction of this paper in whole or in part is permitted for any purpose of the United States Government.
GEOMETRIC FRAMEWORK
Employing a matrix paraphrase of Block's notation in [2, pp. 128-130], the possibility or linear separability question is: does there or does there not exist a solution to

$$y^T B > \theta e^T, \qquad \theta > 0. \tag{1}$$

In case the set Y of points y satisfying (1) is not vacuous, it forms a very special type of open convex body, e.g., a truncated polyhedral cone, of which an arbitrary two-dimensional section which contains the line through the origin and a point d of Y has the typical form shown. To see this, note that the point d is at a nonzero distance from each of the finite number of bounding hyperplanes to the open half-spaces designated by the system of inequalities (1). There is thus a least distance from d to the boundary of Y, and therefore any sphere of radius $\rho$ less than this minimum is completely contained within Y. Thus Y is a convex body, since trivially it is also convex. It is polyhedral since its boundary is composed of portions of hyperplanes. It is truncated because the origin is cut off from Y, $\theta$ being strictly greater than zero. Clearly, then, any two-dimensional slice through the origin and d must be of the form indicated in Fig. 1. For later use we note also that if $S_{d,\rho}$ designates the solid sphere (really hypersphere) of center d and radius $\rho$, as before, then the sphere $S_{\mu d,\,\mu\rho}$ of radius $\mu\rho$ about the point $\mu d$ is contained in Y for all $\mu \ge 1$. Thus the minimum distance from the point $\mu d$ to the boundary of Y tends to infinity with $\mu$. As we shall see later, this is a special characteristic of the set Y which will cause any procedure involving successive steps, the sum of whose sizes becomes large enough and which remains within a fixed finite cylinder about the line through d and the origin, to become a finitely convergent
process. Graphically, the situation is as in Fig. 2. As the cylinder becomes swallowed up in Y a finite distance out from the origin, so, a fortiori, must any point which describes a path lying within the cylinder and traveling out toward infinity in the direction Od. This is the basis for my geometric proof of convergence, which will be taken up in detail after first providing a geometric characterization and computational note for the possibility or linear separability question.
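In symbols, the scaling property just invoked is immediate from (1): since $\theta > 0$,

$$y \in Y,\ \mu \ge 1 \;\Longrightarrow\; (\mu y)^T B = \mu\,(y^T B) > \mu\,\theta e^T \ge \theta e^T,$$

so $\mu Y \subseteq Y$, and hence $S_{\mu d,\,\mu\rho} = \mu\, S_{d,\rho} \subseteq Y$ whenever $S_{d,\rho} \subseteq Y$.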
FIG. 1
FIG. 2
It may be noted additionally that my proof applies a fortiori to any system of decision boundaries in which a polyhedral set like Y may be inserted and for which the correction steps based on these boundaries project into valid steps for Y. In particular, the proof holds for decision boundaries which contain convex bodies “radially” increasing to infinity.
“IN PRINCIPLE” LEARNING
To make contact with the mainstream of linear programming theory we must change the question of existence of solutions to (1), which involves an “open” system of linear inequalities, into a question of existence for a “closed” system of inequalities. Evidently the system (1) has a solution if and only if the system

$$y^T B \ge \theta' e^T \tag{1.1}$$

has a solution, where $0 < \theta' < \theta$. Now then we have (see Theorem (2), p. 211 of the Charnes-Cooper monograph [7]):
THEOREM. There exists no y satisfying (1.1) if and only if there exists $x \ge 0$ with $\theta' e^T x > 0$ such that $Bx = 0$.

An obvious equivalent form is the following:

THEOREM. There exists y such that $y^T B \ge \theta' e^T$, $\theta' > 0$, if and only if for all $x \ge 0$ with $e^T x = 1$ the condition $Bx \ne 0$ must hold.

Note: The condition $e^T x = 1$ means merely that $x \ne 0$, since $x \ge 0$ and the theorem statements involving $x$ and $B$ are positively homogeneous in $x$.

As an immediate corollary we have the preliminary lemma (on p. 129 of [2]) in Block's convergence proof, that

$$0 < \inf \frac{|Bx|}{|x|}, \qquad x \ge 0, \quad x \ne 0, \tag{2}$$

since by positive homogeneity the infimum may be taken over the set $x \ge 0$, $e^T x = 1$, and the minimum of the left member is taken on for some such $x$, at which $Bx \ne 0$. Here the absolute value signs mean Euclidean norm (= Euclidean length) when a vector is contained within them, and we minimize by linear programming. Incidentally, this argument does not require topological considerations as in [2], since the linear programming theory utilized holds for finite-dimensional vector spaces over arbitrary ordered fields. See [7].

An immediate consequence of the above is the possibility of computing whether or not a y exists satisfying (1), since the linear programming problem

$$\min e^T(z^+ + z^-) \quad \text{subject to} \quad z^+ - z^- - Bx = 0, \quad e^T x = 1, \quad z^+,\, z^-,\, x \ge 0 \tag{3}$$

will have minimum value greater than zero if and only if such a y exists.
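As a minimal computational sketch of test (3) (assuming NumPy and SciPy's linprog routine, a modern stand-in for the simplex codes contemplated here; the function name and tolerance are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def separable_by_problem_3(B, tol=1e-9):
    """True iff some y satisfies y^T B > theta e^T for a theta > 0.

    Solves  min e^T (z+ + z-)  s.t.  z+ - z- - Bx = 0,  e^T x = 1,
    z+, z-, x >= 0;  a strictly positive minimum certifies such a y.
    """
    m, n = B.shape
    c = np.concatenate([np.ones(m), np.ones(m), np.zeros(n)])
    A_eq = np.block([
        [np.eye(m), -np.eye(m), -B],                            # z+ - z- - Bx = 0
        [np.zeros((1, m)), np.zeros((1, m)), np.ones((1, n))],  # e^T x = 1
    ])
    b_eq = np.concatenate([np.zeros(m), [1.0]])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq)  # default bounds give z+, z-, x >= 0
    return res.fun > tol
```

A positive optimum certifies linear separability; an optimum of zero exhibits an $x \ge 0$ on the simplex with $Bx = 0$.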
Another geometric characterization of the possibility theorem can be obtained by reference to the opposite sign theorem of Charnes and Cooper [6]. This proposition is so fundamental that it can be employed (see [7]) to develop all of the major theorems of the theory of linear inequalities and convex polyhedra without reference to topology or to constructs such as the separating hyperplane. The opposite sign theorem may be stated as follows.

OPPOSITE SIGN THEOREM. The set $\Lambda = \{\lambda : \sum_{j=1}^{n} P_j \lambda_j = P_0,\ \lambda_j \ge 0\}$ is spanned by its extreme points (hence bounded) if and only if whenever $\alpha = (\alpha_1, \ldots, \alpha_n) \ne 0$ and $\sum_j P_j \alpha_j = 0$, some $\alpha_i$ and $\alpha_j$ must be of opposite sign.
Hence it follows that

THEOREM. There exists y satisfying (1) if and only if the set X of all x such that $Bx = b$ and $x \ge 0$ is a bounded polyhedron, for $b \ne 0$.
This equivalence yields another linear programming computational test. The linear programming problem

$$\max e^T x \quad \text{subject to} \quad Bx = \tfrac{1}{n}\, Be, \quad x \ge 0 \tag{4}$$
has a finite maximum if and only if there exists y satisfying (1). It should be noted that in computing this problem by the simplex method or its variants, the presence of an infinite maximum may be automatically signaled in the course of the computation by a specific property of the vector seeking to come in, or, if preliminary “regularization” (see [7]) is employed, by obtaining an optimum solution involving an artificial variable.
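Under the same assumptions as before, test (4) may be sketched as follows; SciPy reports an unbounded objective with status code 3, playing the role of the simplex signal just described:

```python
def separable_by_problem_4(B):
    """True iff  max e^T x  s.t.  Bx = (1/n) Be, x >= 0  has a finite maximum."""
    m, n = B.shape
    b_eq = B @ np.full(n, 1.0 / n)                 # (1/n) Be; x = e/n is feasible
    res = linprog(-np.ones(n), A_eq=B, b_eq=b_eq)  # maximize e^T x
    return res.status != 3                         # SciPy status 3: unbounded
```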
CONVERGENCE PROOF
Returning now to the geometric framework, the error correction procedure (see Block [2], pp. 128-129) consists at each stage in making a step in the direction of the inward normal to the bounding (decision) hyperplane corresponding to the stimulus for which the incorrect response has just been elicited. Geometrically, recall that the point y (see Fig. 2, for example) will give an incorrect response to a stimulus whenever it is on the wrong side of the corresponding decision hyperplane. Now then, the inward normals to all of the decision hyperplanes make angles of less than ninety degrees with the line from the origin through the (previously mentioned) interior point d, for d lies strictly inside each of the half-spaces, so that its scalar product with each inward normal is positive. To verify this requires, evidently, only a two-dimensional slice passing through the line Od and the vector n, where n has the same direction as the normal to the decision hyperplane.
Thus each error correction step moves the point y in the direction Od by an amount proportional to the size of the step. Therefore any succession of error correction steps, the sum of whose sizes tends to infinity, will have a projection in the direction of Od tending to infinity at the same rate. If $\Delta_n$ is the size of the $n$th step, we shall require that $\sum_n \Delta_n \to \infty$. If we require further that $\Delta_n / \sum_{k=1}^{n} \Delta_k$ tends to zero with $n$ (as will be seen, it would be sufficient merely to require that $\Delta_n / \sum_{k=1}^{n} \Delta_k$ become and remain less than some small fraction of $\rho$), then after a certain number of steps the size of the succeeding steps will be small compared to the distance of the point to the line Od. This is of course on the assumption that the point y has not yet moved into Y, for if it has we have already achieved the objective in a finite number of steps. At the current juncture suppose an incorrect response has just been elicited and a step is to be made in the direction of the inward normal to a decision hyperplane. Let $y + \Delta$ be the position after making this step. This situation is depicted in Fig. 3, which is a three-dimensional slice containing the plane through $y$, $y + \Delta$, and $\mu d$, the point nearest to $y$ on the line Od (in scalar product notation $\mu = (y, d)/(d, d)$), and also containing Od.
FIG. 3
Since the direction of the line between $y$ and $y + \Delta$ is perpendicular to the trace of the hyperplane, which must also cut in between $y$ and $\mu d$, and since the step size is small compared to the length of the segment $(y)(\mu d)$, the side $(\mu d)(y + \Delta)$ of the triangle $y$, $y + \Delta$, $\mu d$ is smaller than the side $(y)(\mu d)$. A fortiori, $y + \Delta$ must be closer to the line Od than $y$. Thus the subsequent points attained by correction remain within a hypercylinder of fixed radius about the line Od, and therefore must enter Y in a finite number of steps. This concludes the proof.
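For concreteness, the following is a minimal sketch of the error correction procedure just analyzed (assuming NumPy; the notation is mine, with the columns of B taken as the inward normals and the step sizes $\Delta_n$ supplied by the caller):

```python
import numpy as np

def error_correction(B, theta=1.0, steps=lambda k: 1.0, max_iter=100_000):
    """Seek y with y^T B > theta e^T by stepping along inward normals."""
    m, n = B.shape
    y = np.zeros(m)
    for k in range(max_iter):
        violated = [j for j in range(n) if y @ B[:, j] <= theta]
        if not violated:
            return y                    # all responses correct: y lies in Y
        j = violated[0]                 # correct the offending response
        y = y + steps(k) * B[:, j] / np.linalg.norm(B[:, j])
    raise RuntimeError("no y found; the system (1) may have no solution")
```

Note that the default constant steps satisfy the hypotheses of the proof: $\sum_n \Delta_n \to \infty$ while $\Delta_n / \sum_{k=1}^{n} \Delta_k = 1/n \to 0$.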
ADDITIONAL REMARKS
1. It should be noted that specific numerical bounds for the maximum number of steps necessary can be obtained in terms of the minimum cosine of the angle between a decision hyperplane and the line Od (this measures the minimum rate of progress along the direction of Od), and the minimum distance between d and the decision hyperplanes which form part of the boundary of Y extending to infinity in the direction of Od; a sketch of such a bound follows these remarks. These constructs can be combined into a geometric interpretation of the scalar products involved in the analytic convergence proofs of [2] and [5].

2. It should be repeated that extensions to a wide class of nonpolyhedral decision boundaries immediately reduce to the preceding arguments by use of polyhedral inscription or polyhedral approximation in obvious ways.

3. Reduction of convergence proofs for other learning methods to the above constructs will be indicated elsewhere.
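By way of illustration (a hedged sketch in my notation, assuming unit-size steps), let $c > 0$ be the minimum rate of progress along $\hat d = d/|d|$ per correction, i.e., the minimum cosine of the angle between an inward normal and the line Od, and suppose the cylinder containing the iterates lies inside Y beyond a distance $L$ from the origin. Then

$$(y_N - y_0) \cdot \hat d \;\ge\; N c \qquad\Longrightarrow\qquad N \;\le\; \frac{L + |\,y_0 \cdot \hat d\,|}{c},$$

so the number of corrections is finite and explicitly bounded.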
REFERENCES
1. ROSENBLATT, F. “Principles of Neurodynamics.” Spartan Press, Washington, D.C., 1962.
2. BLOCK, H. D. The perceptron: A model for brain functioning. I. Rev. Mod. Phys. 34, 123-135 (1962).
3. CHARNES, A., COOPER, W. W., AND HENDERSON, A. “An Introduction to Linear Programming.” Wiley, New York, 1953.
4. BLOCK, H. D., KNIGHT, B. W., JR., AND ROSENBLATT, F. Analysis of a four-layer series-coupled perceptron. II. Rev. Mod. Phys. 34, 135-142 (1962).
5. SINGLETON, R. C. A test for linear separability as applied to self-organizing machines, in “Self-Organizing Systems-1962,” M. C. Yovits, G. T. Jacobi, and G. D. Goldstein, eds. Spartan Press, Washington, D.C., 1962.
6. CHARNES, A., AND COOPER, W. W. The strong Minkowski-Farkas-Weyl theorem for vector spaces over ordered fields. Proc. Natl. Acad. Sci. U.S. 44 (1958).
7. CHARNES, A., AND COOPER, W. W. “Management Models and Industrial Applications of Linear Programming,” Vols. I and II. Wiley, New York, 1961.