
Pergamon

Nonlinear Analysis, Theory, Methods & Applications, Vol. 30, No. 3, pp. 1335-1341, 1997. Proc. 2nd World Congress of Nonlinear Analysts.

PII: S0362-546X(96)00145-9

© 1997 Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0362-546X/97 $17.00 + 0.00

CONTINUOUS FUNCTIONS AND NEURAL NETWORK SEMANTICS

MICHAEL J. HEALY
Research and Technology, Information & Support Services, The Boeing Company, P.O. Box 3707 MS 7L-66, Seattle, Washington 98124-2207
[email protected]

Key words and phrases: Formal semantics, logic, model theory, neural networks, rule-learning, topological spaces, topological systems, continuous functions

1. INTRODUCTION

This paper introduces a model-theoretic approach to the formal semantics of neural networks. We examine the relationships between objects in a network's environment that satisfy rules expressible in formal logic. The network is to learn the rules from input data examples. For this to happen, the rules must be implicit in the input data patterns representing the examples, and the network must appropriately process these through activation of its nodes and long-term adaptation in its synaptic connections. Establishing that these conditions exist, and formalizing the learned rules, calls for a mathematical model of the learning situation. Our model represents the inputs as objects called points that exist in abstract domains. The domains include closed systems of logical formulas that are satisfied by subsets of points. The formulas, together with a model of the theory, form a topological system, a generalization of the concept of a topological space. We apply this formal semantic model to the semantic analysis of rule-learning, and show that successfully-learned rules implement a continuous function.

The lateral priming ART (LAPART) neural network [1] provides a convenient example of rule-learning neural networks. LAPART was originally developed for the study of the learning of logical implications from data. It is based upon the adaptive resonance theory version 1 (ART 1) neural network [2], which performs binary pattern classification. It is not necessary to understand ART 1 in detail for this discussion; detailed descriptions are contained in [1], [2], [3] and [4]. Briefly, an ART 1 network $A$ consists of three layers of nodes: an input layer $I$ and two layers $F_1$ and $F_2$ which are interconnected by adaptive connections, together with other nodes and connections which control the processing between the layers. The $n_A$ input nodes $I_1, I_2, \ldots, I_{n_A}$ in $I$ collectively sample each input pattern $i(a)$, a string of binary values $i_1(a), i_2(a), \ldots, i_{n_A}(a)$ representing an object $a$ in the input environment of $A$. Node $I_k$ becomes activated if and only if the corresponding pattern component is 1 ($i_k(a) = 1$); otherwise (when $i_k(a) = 0$) it is inactive for the duration of the presentation of the current input pattern. Through one-to-one connections with fixed, unit strength, layer $I$ transmits the input pattern to the $F_1$ layer, which serves as a register for comparing the input with a number of binary template patterns. In a given application, network $A$ is presented with a sequence of binary patterns, and it partitions these into a collection of disjoint classes based upon the template comparisons (the disjointness is not essential in this paper, but is mentioned for completeness). An example of an application is the classification of tissue sample images generated by a medical diagnostic imaging system such as MRI or ultrasound; the binary pattern $i(a)$ for image $a$ encodes features derived from it. The $i$-th class has a two-part representation: (1) a unique $F_2$ node $F_{2,i}$, and (2) a binary template pattern $T_i$, consisting of the weights in the adaptive connections from $F_{2,i}$ to the $F_1$ nodes. Template weight $T_{i,k}$ has the binary value 1 only if all input patterns in the $i$-th class have a 1 in position $k$. An input pattern $i(a)$ must be sufficiently similar to $T_i$ for $a$ to be assigned to the $i$-th class as prescribed by the ART 1 algorithm. If the current input object $a$ is assigned to its class, then $T_i$ is modified to adapt to $i(a)$. The template-based classification process is autonomous, occurring without the aid of an external "teaching" signal.

A LAPART network [1] consists of two "laterally-connected" ART 1 networks, denoted $A$ and $B$ (see Fig. 1). Interconnections from the $F_2^A$ nodes to the $F_2^B$ nodes are adaptive; the only remaining interconnection, from the network $B$ vigilance node $\mathrm{VIG}_B$ to the network $A$ vigilance node $\mathrm{VIG}_A$, is fixed. It serves to enforce learned inferences implemented by strong (binary value 1) $F_2^A \rightarrow F_2^B$ connections, which are formed adaptively. As the coupled ART 1 networks $A$ and $B$ are presented with pairs $(i_A(a), i_B(b))$ of binary input patterns, the adaptive interconnections acquire a pattern of binary weights that express the class-to-class dependency implied by the pairing of input examples. This "learned" pattern is such that each $F_2^A$ node has a strong connection to a single $F_2^B$ node, and very weak (binary 0) connections to all others. Once such a connection, say $F_{2,i}^A \rightarrow F_{2,j}^B$, is formed, then if an input pattern $i_A(a)$ results in object $a$ being classified in the class controlled by node $F_{2,i}^A$, the $F_{2,i}^A \rightarrow F_{2,j}^B$ connection causes activation of $F_{2,j}^B$. The ensuing $F_{2,j}^B \rightarrow F_1^B$ signalling activity causes the template pattern $T_{B,j}$ to appear at the $F_1^B$ matching layer. If the paired input pattern $i_B(b)$ does not sufficiently match $T_{B,j}$, $\mathrm{VIG}_B$ becomes active and triggers $\mathrm{VIG}_A$ through the $\mathrm{VIG}_B \rightarrow \mathrm{VIG}_A$ connection. The two active vigilance nodes force both subnetworks to undergo an ART 1 reset. This in turn forces subnetwork $A$ to produce a new class, with an associated node $F_{2,i'}^A$, to represent $a$, and the process continues until an $F_{2,i'}^A \rightarrow F_{2,j'}^B$ association proves acceptable for the input pair $(i_A(a), i_B(b))$ for some pair of indices $i', j'$. When this happens, the corresponding class templates $T_{A,i'}$ and $T_{B,j'}$ are appropriately modified.

When a sufficiently compatible sequence of pattern pairs is presented to a LAPART network, then, it forms two sets of classes for the network $A$ and $B$ inputs, along with a mapping between classes in the two sets via exclusive connections of the form $F_{2,i}^A \rightarrow F_{2,j}^B$. Each such connection assigns a unique index $j$ in network $B$'s $F_2$ layer to each index $i$ in network $A$'s $F_2$ layer. We call each class-to-class association a learned rule, because it associates each class of network $A$ input instances with a class of network $B$ input instances. Under what conditions does the learning of a class-to-class mapping actually take place? And how can the rules be formalized as symbolic expressions that express properties of the data?
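To make the classification step concrete, the following is a minimal Python sketch of an ART 1-style template comparison and update. It is an illustration only, not the ART 1 algorithm of [2]: the F2 competition, choice function, and reset cycle are collapsed into a direct search over stored templates, and the names (`Art1Sketch`, `vigilance`, `classify`) are ours rather than the paper's.

import numpy as np


class Art1Sketch:
    """Simplified ART 1-style classifier: binary templates plus a vigilance test."""

    def __init__(self, vigilance=0.75):
        self.rho = vigilance      # match threshold relative to |i(a)|
        self.templates = []       # one binary template T_i per class

    def classify(self, pattern):
        """Assign a binary pattern to a class, creating a new class if needed."""
        i = np.asarray(pattern, dtype=bool)
        # Try existing classes in order of decreasing overlap with the input.
        order = sorted(range(len(self.templates)),
                       key=lambda k: -int((i & self.templates[k]).sum()))
        for k in order:
            T = self.templates[k]
            # Vigilance test: the template must cover enough of the input.
            if (i & T).sum() >= self.rho * i.sum():
                self.templates[k] = i & T    # learning: intersect template with input
                return k
        self.templates.append(i.copy())      # no acceptable class: create a new one
        return len(self.templates) - 1


net = Art1Sketch(vigilance=0.7)
for a in ([1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 1]):
    print(a, "-> class", net.classify(a))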

2. TOPOLOGICAL SYSTEMS

The objects, or points, in the input space of a neural network exist in a universe called a domain. The semantics of the network is represented in a model-theoretic structure on this domain, a means of relating points to each other and to the logic of the network's processing algorithm. The logic used here is geometric logic, and the model-theoretic structure for this logic is that of topological systems, to be described in set-theoretic terms. (The model-theoretic structure is better represented using topologies generalized to category-theoretic structures [5], [6]; the present discussion using set-theoretic topologies requires less explanation.) A domain $D$ consists of a set $\mathrm{pt}\,D$ of objects called points together with a closed system $\Omega D$ of formulas defined on the points. Formulas are constructed from primitive predicates that state properties of points relative to neural network nodes. The primitive predicates have the form $r(x)$, where $r$ is the label of a network node and $x$ is a variable whose instances are the points in $\mathrm{pt}\,D$. Informally, $r(x)$ means "$x$ causes excitation of node $r$". For example, the input, $F_1$, and $F_2$ nodes of an ART 1 network are represented by predicates $I_k(x)$ (for input node $k$), $F_{1,k}(x)$ and $F_{2,i}(x)$ (for class $i$), respectively. Predicates with variable $x$ can be thought of as properties possessed by the instances of $x$ for which they are valid. Formulas are formed from the primitive predicates by the usual operations of conjunction $\wedge$ and disjunction $\vee$.

Figure 1. The LAPART network. If subnetwork $A$ attempts to classify the current input in class $A_i$, the inferencing connection to network $B$ class $B_j$ will cause the template $T_{B,j}$ to be read out over the $F_1^B$ layer through the top-down connections shown. If this is not a close enough match for the input, $\mathrm{VIG}_B$, acting through a feedback connection to $\mathrm{VIG}_A$, will force subnetwork $A$ to try a different class for its input. If the match is favorable, the templates $T_{A,i}$ and $T_{B,j}$ will be suitably modified to refine the class representations.
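The lateral check described in this caption can be sketched procedurally. The following Python fragment is a hedged illustration, not the network dynamics of [1]: the templates and the `link` table standing in for the strong $F_2^A \rightarrow F_2^B$ connections are invented, and the reset loop serializes what the $\mathrm{VIG}_B \rightarrow \mathrm{VIG}_A$ feedback does in parallel hardware.

import numpy as np


def lapart_check(i_a, i_b, a_templates, b_templates, link, rho_a=0.7, rho_b=0.7):
    """Return an acceptable (A-class, B-class) pair for the input pair, or None."""
    i_a = np.asarray(i_a, dtype=bool)
    i_b = np.asarray(i_b, dtype=bool)
    rejected = set()                           # A-classes vetoed via VIG_B -> VIG_A
    while True:
        # Subnetwork A proposes its best remaining class for i_a (vigilance rho_a).
        candidates = [i for i in range(len(a_templates)) if i not in rejected
                      and (i_a & a_templates[i]).sum() >= rho_a * i_a.sum()]
        if not candidates:
            return None                        # would trigger creation of a new A-class
        i = max(candidates, key=lambda k: (i_a & a_templates[k]).sum())
        j = link[i]                            # inferred B-class via the learned connection
        # Subnetwork B reads out T_{B,j} and tests i_b against it (vigilance rho_b).
        if (i_b & b_templates[j]).sum() >= rho_b * i_b.sum():
            return i, j                        # acceptable: the rule A_i => B_j stands
        rejected.add(i)                        # lateral reset forces A to try another class


A = [np.array([1, 1, 0, 0], bool), np.array([0, 0, 1, 1], bool)]
B = [np.array([1, 0, 1], bool), np.array([0, 1, 1], bool)]
print(lapart_check([1, 1, 0, 0], [0, 1, 1], A, B, link={0: 1, 1: 0}))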

One difference from ordinary predicate calculus is that, although a conjunction has only finitely many conjuncts, a disjunction may have infinitely many disjuncts. Thus, we have not only conjunctions of the form $P(x) \wedge Q(x)$ for formulas $P(x)$ and $Q(x)$, but also disjunctions $\bigvee S$, where $S$ is an arbitrary set of formulas. The existence of binary conjunctions is equivalent to that of arbitrary finite conjunctions; the existence of infinite disjunctions is consistent with the study of observable object properties [5], which is appropriate for learning with neural networks. $\Omega D$ also contains two special formulas, called (generically) $\mathbf{true}$ (always true) and $\mathbf{false}$ (always false), defined $\mathbf{true} = \bigwedge \emptyset$ and $\mathbf{false} = \bigvee \emptyset$. Under the relation $\vdash$ of logical entailment, with mutual entailment (equivalence under $\vdash$) regarded as equality, $\Omega D$ is a lattice. Note that conjunction is the lattice meet operation and disjunction is the join (and since joins can be infinite, the lattice is upper complete). By definition, $P(x) \vdash \mathbf{true}$ and $\mathbf{false} \vdash P(x)$ for all formulas $P(x)$. Finally, finite conjunctions distribute over disjunctions: $P(x) \wedge (\bigvee S) = \bigvee \{P(x) \wedge Q(x) \mid Q(x) \in S\}$, so the lattice is distributive.

The neural network architecture, together with the pattern-encoding scheme for its input data, yields a theory with a sub-theory $\Omega D$ about its domain of possible input objects. For example, suppose that the activation of a node $p$ is always followed closely by activation of a node $q$, due, say, to an excitatory $p \rightarrow q$ connection fixed to a strong (large) value. Then a complete description of the network theory would include an axiom $p(x) \vdash q(x)$. A fixed set of values for the adaptive weights, together with $\mathrm{pt}\,D$, yields a model of the theory. The adaptive weight values are the result of synaptic learning from the processing of a sequence of inputs from $\mathrm{pt}\,D$.


The elements of $\mathrm{pt}\,D$ are instances of the variable $x$ used in the formulas which form the theory of the neural network. If $a \in \mathrm{pt}\,D$ and $a$ causes a node $p$ to become activated, we write $a \models p(x)$ ($a$ satisfies $p(x)$). In general, the relation $\models$ on $\mathrm{pt}\,D \times \Omega D$ is defined as follows for formulas $P(x)$ and $Q(x)$, arbitrary sets $S$ of formulas, and $a \in \mathrm{pt}\,D$:

S1. $a \models P(x) \wedge Q(x)$ if and only if $a \models P(x)$ and $a \models Q(x)$.

S2. $a \models \bigvee S$ if and only if, for some $P(x) \in S$, $a \models P(x)$.

In particular, $a \models \mathbf{true}$ and $a \not\models \mathbf{false}$ always hold. For a given formula $P(x)$, the subset $\mathrm{extent}(P(x))$ of $\mathrm{pt}\,D$ defined by

$\mathrm{extent}(P(x)) = \{a \in \mathrm{pt}\,D \mid a \models P(x)\}$

is called the extent of $P(x)$ in $\mathrm{pt}\,D$. If the entailment $P(x) \vdash Q(x)$ exists in $\Omega D$, then, since $P(x) \vdash P(x)$, $P(x) \vdash P(x) \wedge Q(x)$; but also, $P(x) \wedge Q(x) \vdash P(x)$, so $P(x) = P(x) \wedge Q(x)$. As a consequence, $a \models P(x)$ only if (by S1) $a \models Q(x)$ for arbitrary $a \in \mathrm{pt}\,D$. This shows that $P(x) \vdash Q(x)$ only if $\mathrm{extent}(P(x)) \subseteq \mathrm{extent}(Q(x))$, an expression of the soundness of the logic. The relationship between $\Omega D$ and the lattice of subsets of $\mathrm{pt}\,D$ consisting of the extents of the formulas, with finite intersections and arbitrary unions, can be summarized

$\mathrm{extent}(P(x) \wedge Q(x)) = \mathrm{extent}(P(x)) \cap \mathrm{extent}(Q(x))$,
$\mathrm{extent}(\bigvee S) = \bigcup \{\mathrm{extent}(P(x)) \mid P(x) \in S\}$,
$\mathrm{extent}(\mathbf{true}) = \mathrm{pt}\,D$,
$\mathrm{extent}(\mathbf{false}) = \emptyset$.
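The satisfaction clauses S1 and S2 and the extent equations can be checked directly on a small finite domain. The following Python sketch is purely illustrative; the point set and the predicate assignments are invented, and formulas are represented by the functions that decide satisfaction, so the equations hold by construction.

pts = {"a1", "a2", "a3", "a4"}                 # pt D
sat = {                                        # which points excite which node
    "I1": {"a1", "a2"},
    "I2": {"a2", "a3"},
    "F2_1": {"a1", "a2", "a3"},
}

def extent(formula):
    """extent(P(x)) = {a in pt D | a satisfies P(x)}."""
    return {a for a in pts if formula(a)}

def conj(p, q):          # S1: a |= P ^ Q  iff  a |= P and a |= Q
    return lambda a: p(a) and q(a)

def disj(formulas):      # S2: a |= \/ S   iff  a |= P for some P in S
    return lambda a: any(p(a) for p in formulas)

def atom(r):             # primitive predicate r(x): "x causes excitation of node r"
    return lambda a: a in sat[r]

true = lambda a: True    # true  = empty conjunction
false = lambda a: False  # false = empty disjunction

P, Q = atom("I1"), atom("I2")
assert extent(conj(P, Q)) == extent(P) & extent(Q)
assert extent(disj([P, Q])) == extent(P) | extent(Q)
assert extent(true) == pts and extent(false) == set()
# Soundness: I1 entails F2_1 in this invented model, and the extents agree.
assert extent(atom("I1")) <= extent(atom("F2_1"))
print("extent(I1 ^ I2) =", extent(conj(P, Q)))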

This relationship between formulas and extents implies that the extents of formulas form the open sets of a topology on $\mathrm{pt}\,D$, which we denote by $\Omega\,\mathrm{pt}\,D$. In general, a topological space $(X, \mathcal{T})$ on a set $X$ is a subset $\mathcal{T}$ of the power set of $X$ that is closed under finite intersections and arbitrary unions and contains $X$ and $\emptyset$; $\mathcal{T}$ forms a lattice with the same general properties as $\Omega D$. An input domain $D$, or $(\mathrm{pt}\,D, \Omega D)$, exemplifies a more general concept called a topological system [5], with $\wedge$ the equivalent of the binary intersection operation and $\vee$ the equivalent of union. The system $(\mathrm{pt}\,D, \Omega\,\mathrm{pt}\,D)$ is both a topological space and a topological system (thinking of the open sets as logical statements about their elements). The main structure-preserving mappings in point-set topology are the continuous functions, where $f: (X, \mathcal{T}_1) \rightarrow (Y, \mathcal{T}_2)$ is continuous if and only if $f^{-1}[U] \in \mathcal{T}_1$ for every $U \in \mathcal{T}_2$; in a sense, the structure of the space $(X, \mathcal{T}_1)$ must "contain" that of the space $(Y, \mathcal{T}_2)$. A continuous function between topological systems, $f: (\mathrm{pt}\,D, \Omega D) \rightarrow (\mathrm{pt}\,E, \Omega E)$, is a function $\mathrm{pt}f: \mathrm{pt}\,D \rightarrow \mathrm{pt}\,E$ together with a lattice homomorphism $\Omega f: \Omega E \rightarrow \Omega D$ such that for all $a \in \mathrm{pt}\,D$ and $P(x) \in \Omega E$, $\mathrm{pt}f(a) \models P(x)$ if and only if $a \models \Omega f(P(x))$; in a sense, the logic of $D$ must be capable of expressing the logic of $E$.

Besides the topological structure, $\Omega D$ induces a preorder $\sqsubseteq$ on $\mathrm{pt}\,D$ called a specialization hierarchy. We write $a \sqsubseteq b$ if and only if $b \models P(x)$ for every formula $P(x)$ such that $a \models P(x)$. Informally, this is stated "$b$ specializes $a$", since everything that is true of $a$ is also true of $b$. For example, $b$ could be a member of an ART 1 class, and $a$ an object that represents the class; in a logical context in which statements about objects are derived from observations of them, any property of $a$ must be shared by all class members.

We shall be concerned with a domain $D$ derived from the input domain $\hat{D}$ of a neural network by aggregating input objects. Here, $\mathrm{pt}\,D$ is the power set of $\mathrm{pt}\,\hat{D}$, and $\alpha \models P(x)$ if and only if $a \models P(x)$ for all $a \in \alpha$. The logic of $D$ derives from that of $\hat{D}$, except that we need to restrict disjunctions in $\Omega D$ to maintain soundness of the logic. To see this, suppose that $\alpha = \alpha_1 \cup \alpha_2$, and that $\alpha_1 \models P(x)$ and $\alpha_2 \models Q(x)$, but also that $\alpha_1 \not\models Q(x)$ and $\alpha_2 \not\models P(x)$. Since every element of $\alpha$ satisfies either $P(x)$ or $Q(x)$ in $\hat{D}$, $\alpha \in \mathrm{extent}(P(x) \vee Q(x))$ in $D$. However, it also happens that $\alpha \notin \mathrm{extent}(P(x)) \cup \mathrm{extent}(Q(x))$. To account for this, we use only disjunctions $\bigvee^{\uparrow} S$ taken over directed sets $S$ of formulas in $\Omega D$.


In place of the undirected disjunction $P(x) \vee Q(x)$, for example, we find $R(x)$ such that $P(x) \vdash R(x)$ and $Q(x) \vdash R(x)$, and form the directed disjunction $\bigvee^{\uparrow} \{P(x), Q(x), R(x)\} = \bigvee \{P(x), Q(x), R(x)\}$.
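The need for directed disjunctions can also be seen computationally. In the Python sketch below, the base extents and the aggregate $\alpha = \alpha_1 \cup \alpha_2$ are invented so as to reproduce the situation just described: every element of $\alpha$ satisfies $P(x)$ or $Q(x)$, yet $\alpha$ satisfies neither in the aggregate domain, while the upper bound $R(x)$ supplied by the directed disjunction is satisfied.

base = {"a1", "a2", "a3", "a4"}                # pt D-hat
ext = {                                        # extents in the base domain
    "P": {"a1", "a2"},
    "Q": {"a3", "a4"},
    "R": {"a1", "a2", "a3", "a4"},             # upper bound: P |- R and Q |- R
}

def sat_agg(alpha, formula_extent):
    """In the aggregate domain D, alpha |= P iff every a in alpha satisfies P."""
    return alpha <= formula_extent

alpha1, alpha2 = {"a1", "a2"}, {"a3", "a4"}
alpha = alpha1 | alpha2

# Every element of alpha satisfies P or Q in the base domain ...
assert all(a in ext["P"] or a in ext["Q"] for a in alpha)
# ... yet alpha satisfies neither P nor Q in D, so clause S2 would fail for
# the undirected disjunction P v Q.
assert not sat_agg(alpha, ext["P"]) and not sat_agg(alpha, ext["Q"])
# The directed disjunction also contains the upper bound R, and alpha |= R.
assert sat_agg(alpha, ext["R"])
print("alpha satisfies R but neither P nor Q individually")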

A specialization hierarchy exists over $\mathrm{pt}\,D$, given by $\alpha \sqsubseteq \alpha'$ if and only if for every $a' \in \alpha'$ there exists $a \in \alpha$ such that $a \sqsubseteq a'$. When the order on $\mathrm{pt}\,\hat{D}$ is discrete (e.g., $\hat{D}$ might have the topological structure of a Hausdorff space, which yields a discrete order), $\alpha \sqsubseteq \alpha'$ reduces to $\alpha' \subseteq \alpha$. The non-trivial order on $D$ is advantageous, for it allows the representation of hierarchies of neural network inputs directly in $\mathrm{pt}\,D$.

3. LEARNED INFERENCES AND CONTINUOUS FUNCTIONS

We assume that the LAPART network, with subnetworks $A$ and $B$, has been trained on a sequence of input pattern pairs $(i_A(a), i_B(b))$ drawn from a subset $M \subseteq \hat{D}_A \times \hat{D}_B$ whose projections are surjections onto the respective input domains $\hat{D}_A$ and $\hat{D}_B$. Since the networks are coupled, with adaptive interconnections that form class-to-class connections, the two subnetworks do not operate as independent ART 1 networks. For a given pair $(i_A(a), i_B(b))$, the semantics of an $A$-class is that both resonance and the absence of a lateral reset from $\mathrm{VIG}_B$ define class membership for the object $a$, which means that the unique $B$-class $F_2$ node to which its $F_2$ node is connected must accept the corresponding object $b$. Conversely, the function of network $B$ is merely to test the match of $i_B(b)$ against the template corresponding to the inferred $B$-class, so membership in a $B$-class is determined in part by the $A$-class membership of the corresponding input object $a$.

Let $D_A$ and $D_B$ denote the power set domains derived from $\hat{D}_A$ and $\hat{D}_B$, as discussed in the last section. Let $A_i(x)$ be a predicate in $\Omega D_A$ representing the choice of class node $F_{2,i}^A$ for an instance of $x$ from $\mathrm{pt}\,D_A$ (this notation is simpler than the previously-introduced $F_{2,i}(x)$ notation). Similarly, let formulas $B_j(y)$ represent classes formed by network $B$. To represent the rule that corresponds to a strong, learned connection $F_{2,i}^A \rightarrow F_{2,j}^B$, we form a two-variable predicate $\Rightarrow_{i,j}^{A,B}(x, y)$. This predicate is a primitive that is added to the theory, but does not belong to either $\Omega D_A$ or $\Omega D_B$. This newly-introduced predicate is defined to be equivalent to a formula that already exists in the logic, but is not a valid inference: it is only guaranteed to be true in the current model. We call this a learned rule. For visualization of it as a rule, we shall rewrite the predicate as $A_i(x) \Rightarrow B_j(y)$. An instance of the rule $A_i(x) \Rightarrow B_j(y)$ is a pair $(\alpha, \beta) \in \mathrm{pt}\,D_A \times \mathrm{pt}\,D_B$ such that

(1) $\alpha \models A_i(x)$,
(2) $\beta \models B_j(y)$,
(3) $b \in \beta$ only if $(a, b) \in M$ for some $a \in \alpha$.

Condition (3) establishes the connection between instances of the variable $x$ for domain $D_A$ and instances of the variable $y$ for domain $D_B$. Define a function $\mathrm{pt}f: \mathrm{pt}\,D_A \rightarrow \mathrm{pt}\,D_B$ as follows: for $\alpha \in \mathrm{pt}\,D_A$,

$\mathrm{pt}f(\alpha) = \{\, b \mid \{b\} \in \mathrm{pt}\,D_B \text{ and } (a, b) \in M \,\}$ if $\alpha = \{a\}$, $a \in \mathrm{pt}\,\hat{D}_A$;
$\mathrm{pt}f(\alpha) = \bigcup \{\, \mathrm{pt}f(\{a\}) \mid a \in \alpha \,\}$ otherwise.
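The two-case definition of $\mathrm{pt}f$ translates directly into a recursive function. The Python sketch below is illustrative only; the pairing relation `M` is an invented toy example, and singleton points are represented as frozensets so that elements of the power-set domains can be compared.

# Invented toy pairing relation M, a subset of pt D-hat_A x pt D-hat_B.
M = {("a1", "b1"), ("a2", "b1"), ("a3", "b2")}

def ptf(alpha):
    """Points part of the induced map, following the two-case definition above."""
    if len(alpha) == 1:                        # alpha = {a} for a base point a
        (a,) = tuple(alpha)
        return frozenset(b for (a2, b) in M if a2 == a)
    # Otherwise: union of the images of the singletons contained in alpha.
    return frozenset().union(*(ptf(frozenset({a})) for a in alpha))

print(ptf(frozenset({"a1"})))                  # frozenset({'b1'})
print(ptf(frozenset({"a1", "a3"})))            # frozenset({'b1', 'b2'})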

Suppose that a number of input pairs from $M$ have been processed, with the different sets of adaptive connections within subnetworks $A$ and $B$, and the adaptive interconnections between them, having acquired patterns of values. Let $n_A$ and $n_B$ be the number of $A$- and $B$-classes, respectively. If the LAPART $A$-$B$ interconnections represent a valid set of rules, the following must be true whenever $(i_A(a), i_B(b))$ is input, for arbitrary $(a, b) \in M$:


a. Network $A$ immediately resonates on an existing class $A_i$ for some $i$ ($1 \le i \le n_A$) for the object $a$; that is, $\mathrm{VIG}_A$ is never activated. Let $A_i(x) \Rightarrow B_j(y)$, for some $j$ ($1 \le j \le n_B$), be the learned rule connecting class $A_i$ with a $B$-class.

b. During network $B$'s subsequent processing of the input pattern representing object $b$, its vigilance system does not veto the choice $B_j$ inferred via the $A_i(x) \Rightarrow B_j(y)$ connection. That is, $\mathrm{VIG}_B$ is not activated.

Let $\Lambda_{B,j} = \{\mu : A_\mu(x) \Rightarrow B_j(y)\}$. Applying the fact that the projections of $M$ cover $\mathrm{pt}\,\hat{D}_A$ and $\mathrm{pt}\,\hat{D}_B$, (a) and (b) imply the following:

1. The classes $A_\mu$ ($1 \le \mu \le n_A$) and $B_\nu$ ($1 \le \nu \le n_B$) form partitions of $\mathrm{pt}\,\hat{D}_A$ and $\mathrm{pt}\,\hat{D}_B$, respectively.

2. From the definition of the function $\mathrm{pt}f$, for each class $A_i$, $\alpha \models A_i(x)$ only if $\mathrm{pt}f(\alpha) \models B_j(y)$, where the rule $A_i(x) \Rightarrow B_j(y)$ has been learned.

3. $\mathrm{pt}f(\alpha) \models B_j(y)$ if and only if $\alpha \models \bigvee \{A_\mu \mid A_\mu(x) \Rightarrow B_j(y)\}$.
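Consequence 3 states that the inverse image under $\mathrm{pt}f$ of a $B$-class extent is the union of the extents of the $A$-classes linked to it by learned rules, which is exactly the continuity condition exploited in the next paragraph. The Python sketch below checks this biconditional exhaustively on invented toy classes, rules, and a pairing relation chosen so that conditions (a) and (b) hold.

from itertools import combinations

A_classes = {"A1": {"a1", "a2"}, "A2": {"a3"}}     # partition of pt D-hat_A
B_classes = {"B1": {"b1"}, "B2": {"b2"}}           # partition of pt D-hat_B
rules = {"A1": "B1", "A2": "B2"}                   # learned rules A_i => B_j
M = {("a1", "b1"), ("a2", "b1"), ("a3", "b2")}     # pairing relation

def ptf(alpha):
    """Points part of the induced map, as defined in the previous section."""
    return {b for a in alpha for (a2, b) in M if a2 == a}

base_A = set().union(*A_classes.values())
for j, b_ext in B_classes.items():
    # Union of the extents of the A-classes whose learned rule points at B_j.
    preimage = set().union(*(A_classes[i] for i, target in rules.items() if target == j))
    # Check the biconditional of consequence 3 for every nonempty aggregate alpha.
    for r in range(1, len(base_A) + 1):
        for alpha in map(set, combinations(sorted(base_A), r)):
            assert (ptf(alpha) <= b_ext) == (alpha <= preimage), (alpha, j)
print("consequence 3 holds on the toy data")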

By restricting to conjunctions and disjunctions (not directed) formed from $\{B_\nu : 1 \le \nu \le n_B\}$, we obtain a restricted domain theory $\Omega D_{B,2}$. By then restricting to directed disjunctions in $\Omega D_{B,2}$, we obtain a continuous function $f: D_A \rightarrow (\mathrm{pt}\,D_B, \Omega D_{B,2})$ by using the function $\mathrm{pt}f$ defined above as the points part and defining $\Omega f$ as follows. For a collection $B_\nu$ ($\nu \in \Gamma$) over some index set $\Gamma \subseteq \{1, 2, \ldots, n_B\}$, let $\bigvee^{\uparrow} \{B_\nu \mid \nu \in \Gamma\}$ denote the directed disjunction

$\bigvee^{\uparrow} \{B_\nu \mid \nu \in \Gamma\} = \bigvee (\{B_\nu \mid \nu \in \Gamma\} \cup \{\bigvee \{B_\nu \mid \nu \in \Gamma\}\})$.

Then

$\Omega f(\bigvee^{\uparrow} \{B_\nu \mid \nu \in \Gamma\}) = \bigvee^{\uparrow} \{A_\mu \mid A_\mu \Rightarrow B_\nu \text{ exists for some } \nu \in \Gamma\}$.

4. CONCLUSION

The result of this analysis is that a system of rules $A_i(x) \Rightarrow B_j(y)$ expressing a dependency relation between classes formed by the two ART 1 subnetworks of a LAPART neural network has the underlying semantics of a continuous function on topological systems. The domain of the function is the input domain of subnetwork $A$, and its codomain is the input domain of subnetwork $B$ with a restriction of the logic to the closed system of formulas generated by the $F_2^B$ node predicates. This is a formalization of two intuitive notions: (1) a network having the form of LAPART, with two subnetworks capable of forming object classes and adaptive connections between class representations, can learn formally-expressible rules only if the logic of the $A$ subnetwork is capable of representing the structure defined by the classes formed by the $B$ subnetwork; and (2) the appropriate data must be supplied. Notice that our arguments do not depend upon the particular connection rules established. Another thing to notice is that the analysis is not restricted to binary connections between binary nodes [7]; for example, analog values for the level of activation of a non-binary node can be represented using predicates of the form $(t > c)$, $(t < c), \ldots$, where $t$ is a real variable for the activation level of a node and $c$ is a real constant. This suggests that the result of this paper holds for more general systems of adaptive interconnections between subnetworks: if they are to represent rules valid over some database, then they must be representable semantically in terms of continuous functions between the input domains of the appropriate subnetworks. This has significance for the formal semantic analysis of neural networks more complex than the simple inferencing network described here.

REFERENCES

1. M. J. Healy, T. P. Caudell and S. D. G. Smith, "A Neural Architecture for Pattern Sequence Verification Through Inferencing", IEEE Transactions on Neural Networks, vol. 4, pp. 9-20, 1993.


2. G. A. Carpenter and S. Grossberg, "A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine", Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1987.

3. B. Moore, "ART 1 and Pattern Clustering", Proceedings of the 1988 Connectionist Summer School, Touretzky and Hinton, eds., Morgan Kaufmann, 1989.

4. M. Georgiopoulos, G. L. Heileman, and J. Huang, "Properties of Learning Related to Pattern Diversity in ART1", Neural Networks, vol. 4, pp. 751-757, 1991.

5. S. Vickers, Topology via Logic, Cambridge University Press, 1991; 2nd ed. 1993.

6. S. Vickers, "Geometric Theories and Databases", in Applications of Categories in Computer Science, M. P. Fourman, P. T. Johnstone and A. M. Pitts (eds.), London Mathematical Society Lecture Note Series vol. 177, Cambridge University Press, 1992.

7. M. J. Healy and T. P. Caudell, "Discrete Stack Interval Representations and Fuzzy ART 1", Proceedings of the World Congress on Neural Networks, Lawrence Erlbaum Associates/INNS Press, Hillsdale, NJ, pp. II-82 to II-91, 1993.