JOURNAL OF MATHEMATICAL PSYCHOLOGY 12, 137-177 (1975)

A Mathematical Theory of Learning Transformational Grammar*

HENRY HAMBURGER AND KENNETH WEXLER

School of Social Sciences, University of California, Irvine, California 92664
How to explain the fact that children learn language is a central problem for both psychology and linguistics. Suppes says that “the linguists’ insistence that they will accept nothing less than a complete and detailed account will probably turn out to be the most important conceptual demand on psychology in this century.” This paper speaks to that demand by presenting a complete formal characterization of the learning process and the language environment in which it operates. The assumptions are in general accord with psychological and linguistic principles. It is proved that the system converges; that is, the learning process, acting on the linguistic information it receives, learns the language, according to a formal criterion.
1. INTRODUCTION
1.1. A Dual Claim

This paper presents a reasonable formal theory of language learning, together with a proof that the theory works. It is in "reasonable" conformity with what is known empirically about natural language and its acquisition. The learning procedure "works" in the sense that we prove that the learner must ultimately attain a correct representation of language. It is important to view this dual assertion of plausibility and success as a single claim. When a procedure for learning language is proposed, it is appropriate to ask in what sense it will succeed, and whether such success can be proven in a general class of circumstances or may be dependent on a restricted choice of sample situations. A language-learning theory, of the sort we present here, has four interrelated aspects: (a) the class, G, of possible grammars, (b) the kind, I, of information made available to the learner, (c) the language-learning procedure, P, and (d) the criterion, C, of success. The theory must deduce a guarantee that the correct member, g, of G can be discovered by applying procedure P to information of type I about g. The sense in which g is "discovered" must be made precise by a formal criterion C, and the proof must hold for arbitrary g in class G.
* This work was supported by the Office of Naval Research and the Advanced Research Projects Agency, ONR Contract N00014-69-A-0200-9006.
¹ We wish to thank W. H. Batchelder for an insightful comment on Section 1.
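To make the shape of such a theory concrete, here is a minimal sketch in Python of the quadruple (G, I, P, C) just described. The names and the encoding are our own illustrative assumptions, not constructs from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable

@dataclass
class LearningTheory:
    """The four interrelated aspects of a language-learning theory."""
    grammars: Iterable[Any]                      # G: class of possible grammars
    information: Callable[[Any], Iterable[Any]]  # I: data about a target grammar g
    procedure: Callable[[Iterable[Any]], Any]    # P: maps information to a guess
    criterion: Callable[[Any, Any], bool]        # C: is g "discovered" by the guess?

    def succeeds_on(self, g: Any) -> bool:
        # The theory must guarantee this for arbitrary g in G.
        return self.criterion(self.procedure(self.information(g)), g)
```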
The above claim that the theory is reasonable rests on the plausibility of G, I, P, and C. We conclude this subsection by commenting briefly on the first three. For G, we have chosen a class of transformational grammars, because of the unmistakable importance of transformational theory in current linguistics. Each element of the information sequence (I) is a sentence coupled with its underlying syntactic structure. Many linguists argue that most aspects of meaning can be defined on deep structure and here, as a starting point, we make the too strong assumption that the learner can deduce the deep structure from the situation in which the sentence is presented. One way in which the learning procedure (P) is plausible is that it does not require explicit memory for past data. It can therefore, like children, make a mistake on a datum even though that datum has been presented earlier. Ultimately, such errors disappear.

1.2. Main Features of the Theory
The learner is presented with a succession of data, each consisting of a sentence accompanied by its underlying structure. His task is ultimately to specify a transformational component² consistent with all such data, including data he has not yet seen. We investigate the learning of transformations since it is widely agreed by linguists that they are crucial and not innate. The most striking characteristic of the learning procedures posited is that only minor changes in the learner's model are made in response to each input datum. Specifically, the learner can hypothesize at most one new transformation in response to a datum or can reject at most one of those already hypothesized. Since it alters only one rule at a time, the learning procedure used here is much more plausible as a psychological model than the procedures considered in the literature of automata theory (e.g., Solomonoff, 1964), which allow repeated wholesale rejection of entire grammars (sets of rules). Whatever may be the merits of all-or-none models of learning for simpler tasks, it is clear that no child learns an entire language at a stroke. Despite the complexity of transformational systems, the constraint of making only minor changes, and the lack of explicit memory for past data, it is nevertheless possible to guarantee in advance that the learner will always succeed at his task. That is, he will ultimately guess either the component actually responsible for the data or else another component indistinguishable from it with respect to all possible data, and having made such a guess he will then stick to it. The strength of this claim can perhaps be made clearer by contrasting it with the kind of conclusion that can be drawn from computer simulation studies of language learning. (See, e.g., Klein & Kuppin, 1970.) A simulation can show that some language-learning procedure seems to be converging to the correct result for some sample input sequences from some sample grammars.
² A transformational component together with a base component comprise a transformational grammar. The base component generates or specifies a set of phrase-markers, each of which is the underlying structure of a sentence. The transformational component acts upon an underlying structure to produce a surface structure, part of which is the surface string, utterance, or sentence.
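The one-change-per-datum character of the procedure can be rendered schematically as follows. This is only a sketch under assumed helper functions (derive, hypothesize, reject_candidates), not the paper's actual procedure, which is specified formally in Section 3; the component is taken here to be a set of rules.

```python
import random

def learn_step(component, datum, derive, hypothesize, reject_candidates):
    """One step of a learner that alters at most one rule per datum: it
    either hypothesizes one new transformation or rejects one already held.
    `derive` applies the guessed component to a base phrase-marker; the
    other two helpers produce the candidate sets for this datum."""
    base, surface = datum
    if derive(component, base) == surface:            # guess fits this datum
        return component                              # no change at all
    additions = list(hypothesize(component, datum))        # candidate new rules
    removals = list(reject_candidates(component, datum))   # candidate rejections
    choice = random.choice(additions + removals)
    if choice in removals:
        return component - {choice}                   # reject exactly one rule
    return component | {choice}                       # or add exactly one rule
```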
In contrast, we prove that the procedure must converge for a specified class of grammars and inputs. This is particularly important for psychological processes like language learning, where the fundamental observation is that every normal child succeeds, as opposed, say, to some kinds of problem-solving tasks, where humans may indeed fail. To accomplish this task, the learner is provided with a stream of information of a type which is quite rich but not implausible. Formally, each datum is an ordered pair consisting of a base phrase-marker and its surface string (but not its surface structure). The base phrase-marker is close³ to the meaning, while the surface string is the uttered sentence. Thus we are assuming that the learner is given something like meaning-plus-utterance pairs. This type of data presentation scheme, in which sentences are presented with their underlying structure, is consistent with the following summarizing comment by McNeill (1970):

To observe a transformational relation not yet known, an underlying structure that comes only from the child must be made contiguous with a surface structure that comes only from an adult... In other words, something in the child's mind must be brought together with something in the adult's speech. (pp. 111-112)

Chomsky (1965) calls providing underlying structure (akin to meaning) a strong claim, adding however that it need not therefore be incorrect. He goes on to cite evidence⁴ of Miller and Norman (1964) that "semantic reference... does not, apparently, affect the manner in which acquisition of syntax proceeds." He notes that situational context may facilitate language learning in general, but distinguishes this from any effect it might have on particular rules. In this context, our results show that, if the so-called "strong claim" is made, that underlying structures akin to meaning are presented, then there is a particular learning procedure which succeeds in learning language (in a prescribed sense) and for which, moreover, the particular rules learned are directly influenced by underlying structure. In a framework for the study of language learnability devised by Gold (1967), the learner is presented with a sequence of data and must guess, after each datum, what language of a given class he is encountering. He is never told whether his guess is correct but, if his guessing sequence converges, he is said to succeed. More specifically, if his procedure in all cases leads to a correct guess sooner or later and then stays correct, then the procedure "identifies in the limit"⁵ that class of languages.
³ In some formulations it is a representation of meaning, while in others it contains all information necessary to assign meaning. For discussion of the issues involved, see Partee (1971).
⁴ McNeill (1970, p. 110) cites contrary evidence.
⁵ Actually, the learner identifies which language in the class is the correct one, but, since he must be able to do it for whichever member of the class happens to be correct, he is said to identify that class.
Our own criterion is similar to this one, but probabilistic. We shall refer to it informally as "learnability." The reason for speaking of a class of languages instead of a single language is to avoid learning procedures which are too specific. By requiring procedures to work for any language in a class, we preclude devices whose sole function is to guess a particular rule for a grammar of a particular language. We have investigated the learnability of certain classes of transformational mappings of base phrase-markers to surface strings. We speak of mappings, not languages, because we require that the learner know more than just the set of sentences of the language. In addition, he must acquire the ability to map each phrase-marker (akin to meaning) to the appropriate surface string (utterance). The learner fulfills his task if he attains any transformational component which represents the correct mapping, with respect to the base grammar. Two transformational components will thus be called "task equivalent" if they induce the same mapping, with respect to the base grammar. They will also be called "moderately equivalent" in Section 3.2 for reasons given there.

1.3. Content and Process

Chomsky (1965), in summarizing his subchapter on "Linguistic theory and language learning," asserts the following:

The real problem is that of developing a hypothesis about initial structure that is sufficiently rich to account for acquisition of language, yet not so rich as to be inconsistent with the known diversity of language. (p. 58)

Interpreting our formal work in this context, the base grammar and the learning procedure play the role of initial structure. More specifically, the learning procedure would be what Chomsky (1965) calls "a schema that is applied to data and that determines in a highly restricted way the general form... of the grammar that may emerge upon presentation of the appropriate data" (p. 53). Slobin (1966) speaks of a "process approach" as opposed to a "content approach" to specify the "innate structure of man... The [linguistic] universals may thus be a derivative consequence of, say, the application of certain inference rules rather than constitute the actual initial information in terms of which the child processes linguistic input" (pp. 87-88). In other words, what is innate is not a grammar (content) but a learning algorithm (process) for acquiring one. Derwing (1973, Chap. 3, especially pp. 53-56) discusses this problem in some detail and provides a clear statement of a crucial misconception: that the process approach makes "fewer assumptions about the child's innate capacity for language and hence is to be preferred on conceptual grounds, all other things being equal" (p. 55; italics in original).
True, this approach makes fewer assumptions about innate content, but correspondingly it requires a more complex innate learning mechanism.⁶ The content-vs-process distinction must be regarded as a continuum, not a dichotomy. The extreme positions are clearly untenable: "no content at all" presumably would mean a neonate brain of random neural nets, while "no process at all" is ruled out by the diversity of existing languages. Thus it seems clear that both some structure and a process must be innate. Rather than taking a position on this content-process continuum and trying to defend it, we consider it appropriate to make the weakest possible assumptions about both content and process, consistent with logical possibility and empirical observation. To show logical possibility, one demonstrates that if certain assumptions were true, about what content and learning algorithms are innate, then learning would occur. Such is the purpose of the proofs provided here. Although a process (a learning procedure) is central to our concerns, that process can only operate meaningfully if certain content, e.g., the base grammar and the definition of "transformation," is explicitly built into the learner. Other aspects are built in in an implicit fashion. One of these is the so-called transformational cycle (see Section 2), which Chomsky (1969, p. 67) claims must be built in. We prove that this assumption (of the transformational cycle) together with several other assumptions is sufficient for learning to occur. Anyone who thinks the assumption is not necessary is welcome to try to devise proofs corresponding to ours without depending on that assumption. Structure and process have, in a way, given rise to an academic division of labor. Mathematical psychologists in the area of learning theory have put considerable effort into models of the process of learning but far less into specifying the complex structures that people actually learn. Linguists, conversely, have been delineating in great detail the nature of the adult syntactic structure to be learned but, while endorsing the need for a theory of the learning process, have not actively pursued one. In this paper, complex (syntactic) structure is incorporated into a model of (learning) process. Such a unified treatment of these two aspects is important because it shows what a complete theory can look like and thereby provides insight into both learning and syntax. In this context, it is interesting to compare the situation in the theory of language acquisition with the situation in learning theory in psychology. Such a comparison will show why it is important for a theory of language acquisition to prove that it "works" (in the sense of Subsection 1.1). Traditional learning theory deals with situations in which there is no problem in conceiving of a mechanism which can learn.
⁶ Derwing and others (e.g., Schlesinger, 1971) seem to have no conception of the complexity of the learning problem. In this connection, Gold's (1967) proof that no processor of whatever complexity can learn an arbitrary finite state language (a too simple model for natural languages) from presentation of sentences should provide an intellectual catharsis.
There are generally many theories which would allow for learning. Thus the problem the psychologist faces is to find ways to discriminate among those theories. He does this by studying phenomena other than the simple one of whether the system is learned. These include, for example, how "correctness" changes over the course of learning, response times to various kinds of items, and a large number of other phenomena, often ingeniously arranged. The puzzle is to explain all the phenomena and, if two theories account for all the phenomena, then the problem is to find (usually by devising experiments) new phenomena. In this way a tradition of psychological investigation has been created. The situation in language acquisition is quite different, however. Here, given our (i.e., linguistic theory's) account of what language is, the problem is to find a theory which would account in any way for the fact that language is acquired. Up to the present, no theory can account for this fact. Thus, if we compare two theories of language acquisition with respect to what they say about, for example, the course of language learning, we may be comparing theories which could never account for the fact of language learning. There is no a priori way of knowing, of course, that looking at the course of language acquisition will or will not aid in the construction of a theory which can account for the fact that language is ultimately acquired, but we should not lose sight of the necessity for finding a theory which is rich enough to allow language to be acquired. The formal theory given here is presented with somewhat more detail and correspondingly less lucidity in Hamburger (1971). Critical discussion of assumptions is to be found in Hamburger and Wexler (1973). A proof that surface data alone are not adequate to insure learnability of related classes of grammar may be found in Wexler and Hamburger (1973). The implications of this work for the subject of explanatory adequacy of theory of grammar are treated extensively in Wexler, Culicover, and Hamburger (in press).
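For contrast with the probabilistic criterion used in this paper, the Gold (1967) framework discussed in Subsection 1.2 can be rendered as a short sketch. The enumeration learner below stores all past data, which the procedure of this paper deliberately avoids, and `consistent` is an assumed oracle; the sketch is illustrative only.

```python
def identify_in_the_limit(hypotheses, consistent, data_stream):
    """Gold-style learner: after each datum, guess the first hypothesis in a
    fixed enumeration that is consistent with every datum seen so far. If
    the target occurs in the enumeration and each wrong hypothesis is
    eventually contradicted by some datum, the sequence of guesses converges
    to a correct hypothesis and never changes again."""
    seen = []
    for datum in data_stream:
        seen.append(datum)
        # hypotheses() re-enumerates the class from the beginning each time
        yield next(h for h in hypotheses() if consistent(h, seen))
```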
2. THE CLASS OF TRANSFORMATIONAL GRAMMARS
In selecting a particular class of grammars, we have been guided by our twin aims of (i) conforming to the claims of linguists about an appropriate format for representing the structure of language and (ii) being able to prove learnability by a simple learning procedure (Section 3), given a plausible form of data presentation. Subsection 2.1 is a statement of the basic framework of the transformational grammar to be used. For a more general framework, embracing the constructs of Subsection 2.1 and many other transformational systems in recent use, see Ginsburg and Partee (1969). (The mathematical psychologist seeking a sophisticated introduction to mathematical linguistics in general may consult Chomsky (1963) and Hopcroft and Ullman (1969).)
Subsection 2.2 contains certain restrictions on the application of transformations. These restrictions are prominent in the proofs of learnability in Section 3. Independent linguistic evidence for them is studied in Wexler et al. (manuscript in preparation) and is touched on here. The principal result of this section, Theorem 3 in Subsection 2.3, depends only on the grammatical framework (as opposed to data format and learning procedure), but it paves the way for the learnability results in Section 3. In view of the data scheme and the learning procedure, it has roughly the following import: A learner can afford to ignore sentences of too great complexity.⁷

2.1. General Form
We take a transformational grammar to consist of (i) a base grammar which generates base phrase-markers akin to meanings of sentences, (ii) transformations, and (iii) regulations for the use of those transformations. Other grammatical devices which have been suggested in connection with transformational grammar, for example, syntactic features and output constraints, will not be considered. The transformations, used according to the regulations, associate with each base phrase-marker a surface string, which is a sequence of morphemes akin to the uttered sentence.
[Figure 1 appears here. (a) The production rules of the context-free (base) grammar B. (b) Phrase-marker p generated by the grammar B.]
FIGURE 1
⁷ More precisely, if the degree of embedding of the sentence in the input datum is greater than a certain bound, then that datum can be ignored without affecting the possibility of discovering whether the currently hypothesized set of transformation rules is consistent with all possible data.
Several terms will now be introduced by examples; a formal account is provided by Hamburger (1971). Figure 1 shows a context-free grammar B, which can serve as a base grammar, and a phrase-marker p (a rooted, ordered, labeled tree), which can be generated from it.⁸ In p, the node labeled Q, which we shall often simply call Q, dominates D since there is a sequence of connections downward through the phrase-marker from Q to D. There is recursion on the symbol P because one P dominates another. There is also recursion on S, and the grammar B can generate other phrase-markers in which there is recursion on Q. S is essential to all recursion with respect to grammar B because recursion on any other symbol must be through⁹ S. (Proof omitted.) The S's (the nodes labeled S) break up the phrase-marker into levels. The S which dominates w has a level consisting of itself and that w. The level of the top S includes itself, m, M, P, n, N, d, D, V, and Q. More generally, the level of an S consists of itself and the non-S nodes that it dominates with no other S intervening.⁹ The transformational component is a finite set of transformation rules together with specifications of the manner in which they are used. A base phrase-marker is transformed into a surface phrase-marker by applying the transformational component at a succession of levels. The labels of the end nodes of this surface phrase-marker, taken in the usual left-to-right order,¹⁰ form the surface string. The transformational component is applied at a level only after it has been applied at all dominated levels. Dominance of one level by another means dominance of the associated S nodes. Thus, in Fig. 1, application at the level of the S right above P and H must be later than application at the level of the S dominating only w. (This is in effect identical to Ginsburg and Partee's (1969) "LLS" condition.) Applying the transformational component at a level means applying a succession of individual transformations at that level. In this section, we allow only one transformation to apply at a level. (For the more general case, see Subsection 4.1.) Thus, we now define the application of a single transformation at a single level. Each transformation consists of a structural description and a structural change.¹¹
⁸ Strictly, Fig. 1(a) shows only the production rules of a context-free grammar. A complete specification would state that S is the start symbol, that m, w, h, n, and d are terminal symbols, and that B is context free because each of its rules has a left-hand side consisting of one nonterminal symbol. The phrase-marker p is generated in the following manner: Write down the start symbol S; invoke rule S → PQ to write P and Q, in order, immediately descending from S; etc.
⁹ "Through" and "intervening" are to be taken in the obvious pictorial way. That is, if A dominates B and B dominates C, then A dominates C through B or with B intervening.
¹⁰ Node A is to the left of node B if for some nodes C, D, and E, (i) C dominates (or is) A, (ii) D dominates (or is) B, (iii) E is immediately above C and D, and (iv) C is to the left of D. (This is not circular since informally (iv) is clear in the context of (iii).)
¹¹ Although this terminology is in widespread use, Ginsburg and Partee (1969) speak of "domain statement" and "change statement." Our structural descriptions are a special case of their domain statements with their D₀ required to be empty.
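The notion of a level can be made concrete with a small sketch. The tree encoding below is an assumption of ours for illustration; the labels follow Fig. 1.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def level_of(s_node: Node) -> list:
    """The level of an S node: itself plus the non-S nodes it dominates
    with no other S intervening."""
    members = [s_node]
    def collect(n: Node):
        for child in n.children:
            if child.label != "S":      # an intervening S starts its own level
                members.append(child)
                collect(child)
    collect(s_node)
    return members

# A fragment shaped like Fig. 1: the level of the top S stops at the inner S.
top = Node("S", [Node("M", [Node("m")]),
                 Node("P", [Node("S", [Node("w")])])])
print([n.label for n in level_of(top)])   # ['S', 'M', 'm', 'P']
```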
The structural description is a finite sequence of symbols from the base grammar and of X's. The symbol X is called a variable and, in effect, stands for an arbitrary string (possibly empty) of symbols from the base grammar. If the structural description fits the current phrase-marker, then the structural change is carried out; that is, if the transformation is applicable, then it must be applied. The transformations are thus said to be "obligatory." We have not studied learnability properties of components with optional transformations. Clearly, this would be an important area of future study. We have begun by using obligatory transformations for two reasons: First, determinism makes the learner's task more manageable and thus simplifies our own task; second, linguistic developments have led linguists to alter some descriptions to use obligatory instead of optional transformations (e.g., in early accounts of transformational grammar, the passive was optional, but since Chomsky (1965) it has usually been treated as obligatory). The ideas of structural description, its fit with a phrase-marker, and structural change can be seen from the example of Fig. 2. A more precise treatment may be found in Ginsburg and Partee (1969).
[Figure 2 appears here: phrase-marker p, with terminal string defghijk, before application of T, and T(p) after application of T.]
FIG. 2. Application of a transformation. T is a transformation consisting of structural description = (A, X, G, H, I, J, K) and structural change = (∅, 2, ∅, ∅, 5-7, 5, 3-m-1).
The structural description fits phrase-marker p in Fig. 2, with X fitted to F, because its symbols appear in p dominating, respectively, the terminal strings de, f, g, h, i, j, k; that is, its symbols dominate all the nodes of the terminal string, in left-to-right order (see footnote 10), without overlap. The numbers 1, 2, 3, etc., of the structural change refer to the first, second, third, etc., symbols of the structural description, that is, A, X, G, etc. The 2 in second position means the node labeled F, to which X is assigned, is not moved. The ∅ in fourth
position indicates deletion of H. The ∅ in third position together with the appearance of 3 in another position results in the shift of G. The two occurrences of 5 yield copying of I, and the appearance of m calls for insertion of the morpheme m. In p, the node C directly dominates J and K, which are sixth and seventh in the structural description; in the structural change, the sixth and seventh positions include 5, 3, m, and 1. Therefore, C directly dominates I, G, m, and A in the transformed phrase-marker.
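The fit-and-change mechanism of this example can be sketched as follows. The encoding, and the particular structural change used in the demonstration, are simplifying assumptions of ours: they approximate the deletion, shift, copying, and insertion of Fig. 2 but are not a faithful transcription of it.

```python
def fit(description, spine):
    """Try to fit a structural description (base symbols and the variable
    'X') to `spine`, the left-to-right constituents spanning the terminal
    string. Returns one assignment (a list of constituent lists, one per
    description position) or None if there is no fit."""
    def match(d, s, out):
        if not d:
            return out if not s else None
        head, rest = d[0], d[1:]
        if head == "X":                       # variable: try every split point
            for i in range(len(s) + 1):
                result = match(rest, s[i:], out + [s[:i]])
                if result is not None:
                    return result
            return None
        if s and s[0] == head:                # a literal symbol must match
            return match(rest, s[1:], out + [[s[0]]])
        return None
    return match(list(description), list(spine), [])

def apply_change(assignment, change):
    """Each slot of the structural change holds 1-based indices into the
    description (copy/move that constituent there), morpheme names (insert),
    or nothing (delete what stood there)."""
    result = []
    for slot in change:
        for item in slot:
            result.extend(assignment[item - 1] if isinstance(item, int)
                          else [item])
    return result

description = ("A", "X", "G", "H", "I", "J", "K")
change = [(), (2,), (), (), (5,), (5,), (3, "m", 1)]   # hypothetical change
spine = ["A", "F", "G", "H", "I", "J", "K"]
assignment = fit(description, spine)
print(apply_change(assignment, change))   # ['F', 'I', 'I', 'G', 'm', 'A']
```

Note that H is deleted (slot 4 is empty and 4 occurs nowhere else), I is copied (5 occurs twice), G and A are shifted, and the morpheme m is inserted, paralleling the prose account above.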
2.2. Restrictions
Uncontroversial restrictions on variables will be followed by more significant restrictions on the use of transformations. Three restrictions on the use of variables will be needed. Two are present in the general framework of Ginsburg and Partee (1969), cited earlier; any violation of the third would require a bizarre interpretation. No linguistic arguments have been raised against any of them.
(i) Variables may not be copied.
(ii) Variables may not be shifted.
(iii) There cannot be two consecutive variables in a structural description.
The shifting of variables, (ii), can cause nondeterminacy in the application of transformations unless adequate conventions are introduced; its exclusion is thus convenient and linguistically acceptable though not formally necessary for the learning results. Consecutive variables, (iii), cause inherent nondeterminacy which no conventions can remedy; that is, with two consecutive arbitrary strings, it is arbitrary where one leaves off and the other begins (unless both are empty). The restrictions on the use of transformations require the following additional terminology, which, unlike the foregoing, is not standard. A height-two transformational grammar is one in which the fit of a transformation with a phrase-marker can depend only on nodes at the level where it is being applied and the level(s) immediately below. These levels are called, respectively, the upper level and the lower level(s). A node is raised if it is moved from a lower to the upper level; a node sidesteps if it is moved from one lower level to another. Moreover, a raised node is either explicitly raised or nonexplicitly raised, depending on whether or not it is involved in establishing the fit of the transformation that raises it. A node can be marked ineligible for use in establishing the fit of transformations. Only a variable is allowed to fit with an ineligible node. If a node is ineligible, then all the nodes it dominates are also ineligible. These points are exemplified in Fig. 3.

Restriction on Base Grammar. The base grammar is context free; one of its symbols is essential to all recursion (see Subsection 2.1, first three paragraphs, for terminology).
[Figure 3 appears here: a phrase-marker with D marked as an ineligible node, and a table of structural descriptions with their fit (yes/no): AEF, yes; CXJF, yes; CGB, no.]
FIG. 3. Eligibility and fit. Ineligibility of D renders G ineligible but not A. CXJF fits with X fitting nodes D and H (variables may fit ineligible nodes).

Restrictions on the Use of Transformations.
(1) The transformational grammar is height-two.
(2) No sidestepping may occur.
(3) Nodes added by transformations are ineligible.
(4) Nonexplicitly raised nodes are ineligible.
(5) When a node is explicitly raised, some other node at the upper level having the same label must either be deleted or become ineligible.¹²
(6) Nodes added by transformations may not be deleted. (Hence nodes dominating them cannot be deleted.)
Figure 4 is a formal example showing how some of these restrictions operate; a small sketch of the eligibility bookkeeping follows the footnote below. Some of our assumptions have been motivated not by strong linguistic evidence but to enable us to prove learnability. We felt it important to demonstrate that positive results can be obtained. Our working method is to modify assumptions in the direction of linguistic descriptive adequacy while retaining learnability. Nevertheless, the following comments on descriptive syntactic adequacy seem warranted. Restriction (1), the height-two assumption, is equivalent to the "subjacency" principle proposed independently on linguistic grounds by Chomsky (1973). Although the condition is controversial (see Postal, 1972), it is intriguing that a condition which we initially posited on learnability grounds has found descriptive support. With respect to Restriction (2), we know of no proposed transformations which sidestep. It seems that at most what can be done within two lower levels is to delete part of one or possibly to raise a node from only one of them.
¹² The selection of which node becomes ineligible can be accomplished either by a meta-rule or by specific mention in the structural change part of individual transformations.
[Figure 4 appears here: the deep structure in (a) and the partially transformed phrase-marker in (b), with the eligible structure for the following level enclosed in an irregular box.]
FIG. 4. Formal example of restrictions on the use of transformations. There are five S's, hence five levels. Suppose that the deep structure, in (a), is unaltered at the bottom level and at the next level, but that at the level after that the D above F and G is raised to replace the D above d. The result is shown in (b), with the eligible structure for the following level enclosed in the irregular box.

[Figure 5 appears here: the raising of the noun phrase "the picture of Sam" out of the embedded sentence "the picture of Sam to be on the table."]
FIG. 5. Natural language example meeting the Restrictions on the Use of Transformations. Specifically, the noun phrase "the picture of Sam," once raised, is treated as a unit. The raising transformation may be denoted NP V NP VP → (1, (2, 3), ∅, 4). We follow the usual practice of omitting portions of the phrase-marker not related to the transformation under consideration.
Restrictions (3)-(6) are an interrelated set of conditions which involve the notion that certain new structures, once formed, cannot be analyzed by the structural description of a subsequent transformation. This property is shown in the example of Fig. 5. More importantly, it has led to extensive syntactic investigations and has resulted in a new formal linguistic universal, the Freezing Principle. For thorough discussion and
syntactic details, see Wexler and Culicover (1973), Culicover and Wexler (1973, 1974a), and Wexler et al. (in press). These raising restrictions also account for the ungrammaticality of sentence (1b). Postal (1972) argues that in sentences like (1a) there is a transformation which raises "the picture of Sam" from the subject position of the embedded sentence ("the picture of Sam to be on the table") to the object position of the matrix sentence (main clause), as shown in Fig. 5. Suppose that in place of Sam in the deep structure of (1a) there appeared the question element "who." Then in the absence of our restrictions, "the picture of who" could be raised and then the question transformation would yield (1b). However, Restriction (4) renders the "who" node ineligible, so that it cannot be moved to the front of the sentence. Note that, in nonraising cases, the "who" can be fronted, as in (1c).
(1a) John believed the picture of Sam to be on the table.
(1b) *Who did John believe the picture of to be on the table?
(1c) Who did John buy a picture of?
The height-two restriction (1) can be expressed in terms of eligibility. Suppose that the transformational component is to be applied at some level of a phrase-marker. Then, at that moment, all nodes not at either that level or a level immediately below are temporarily ineligible. (These and/or other nodes may be permanently ineligible anyway by Restriction (3), (4), or (5).) To summarize, a base grammar generates a base phrase-marker. The transformational component is applied at successive levels. At the time of an application, only certain nodes are eligible, namely those which are not either temporarily or permanently ineligible. These eligible nodes determine a portion of the (partially transformed) phrase-marker; that portion is called the eligible structure at that level. This eligible structure contains all the information necessary to establish whether and how any particular transformation fits. A transformation applies if and only if a fit can be established with currently eligible nodes. Each transformational component is a set of transformations to be used in the manner already specified (including adherence to Restrictions (1)-(6)). To specify a particular transformational component of this type, it now suffices merely to list the transformations in it. We shall therefore speak loosely of a transformational component as simply a set of transformations. However, many sets of transformations represent nondeterministic transformational components. C is nondeterministic with respect to the base grammar B if C cannot be applied deterministically to every phrase-marker generated by B. Nondeterminacy may arise either because some transformation in C fits an eligible structure in two or more ways or because several transformations in C fit simultaneously. The expression "transformational component" will be noncommittal as to determinacy.
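The "behavioral" notion of determinacy described here (and in footnote 13 below) suggests a simple runtime check, sketched with an assumed helper `all_fits(t, es)` that enumerates the distinct ways transformation t fits an eligible structure es.

```python
def deterministic_at(component, eligible_structure, all_fits):
    """A component is nondeterministic at this eligible structure if several
    transformations fit at once, or if one transformation fits in two or
    more distinct ways. Returns True when application here is determinate."""
    fitting = [(t, all_fits(t, eligible_structure)) for t in component]
    fitting = [(t, ways) for t, ways in fitting if ways]   # keep only fits
    if len(fitting) > 1:
        return False           # several transformations fit simultaneously
    return not (fitting and len(fitting[0][1]) > 1)  # or one fits two ways
```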
We shall require that the actual, adult component be deterministic, but we do not require the learner's guess at each point in time to be deterministic.¹³
Notation.
B: the set of base grammars that meet the Restriction on Base Grammar.
C: the set of transformational components which meet Restrictions (1)-(6).
A(B): that subset of C, each of whose members is deterministic with respect to B (where B ∈ B).
elig(B): the set of all eligible structures which arise at any level in the application of any C ∈ C to any b ∈ B (where B ∈ B).¹⁴
2.3. Properties

LEMMA 1. For any B ∈ B, elig(B) is a finite set.
Proof. For any b ∈ B, the number of branches coming down from a node of b cannot exceed the number of symbols, r, in the longest right-hand side of a rule of B. Also, since S is essential to recursion, a "vertical" sequence of nodes (each dominating its predecessor) within a level of b must have all labels different and so cannot have more nodes than n, the number of symbols in B. Therefore, a level of b can have at most r + r^2 + ... + r^n nodes. For any eligible structure, the same argument applies except that (i) the vertical node sequences may be twice as long (Restriction (1)) and (ii) previous transforming may have raised nodes into the lower levels of the eligible structure.¹⁵ However, nonexplicitly raised nodes become ineligible (Restriction (4)), while each explicit raising only adds an eligible node at a higher level while simultaneously rendering another node ineligible (Restriction (5)), so the total of eligible nodes at that level is not increased.
¹³ Indeed, it may not be possible to ascertain, simply by inspecting a set of transformations, whether that set comprises a deterministic transformational component with respect to the base grammar. Nevertheless, the learning procedure ultimately (with probability 1) achieves a deterministic, and correct, component by means of rejecting some transformation whenever nondeterminacy is discovered while processing an input datum. Thus, the definition of deterministic is, in a manner of speaking, a "behavioral" as opposed to a "structural" definition, since it is stated not in terms of forms but in terms of a property which emerges from a process.
¹⁴ The symbol B is used in two senses. If B is a base grammar, then B also represents the set of base phrase-markers generated by B. elig(B) may be regarded as the union over C ∈ C of elig(B, C), where elig(B, C) is the set of all eligible structures which arise at any level in the application of C to members of B.
¹⁵ Transformations may insert nodes but, by Restriction (3), this will not increase the number of eligible nodes. Sidestepping (even if it were not ruled out by Restriction (2)) and lowering of nodes causes them to be temporarily ineligible for any transforming that takes place at higher levels (Restriction (1)). Though not made permanently ineligible, such nodes will remain temporarily ineligible throughout the transformational process.
Thus the number of nodes in an eligible structure cannot exceed r + r^2 + ... + r^(2n) and, since an eligible structure is a rooted, ordered tree whose labels are among the n symbols of B, it follows that the number of eligible structures formed from members of B by applying any C ∈ C is less than some function of r and n, which are properties of B. The result follows. □
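The counting in Lemma 1 is easy to reproduce numerically. The function below computes the node bound used in the proof, under the stated reading that vertical sequences within a level have length at most n (or 2n inside an eligible structure).

```python
def level_node_bound(r: int, n: int, height_two: bool = False) -> int:
    """Maximum nodes in one level: branching factor at most r, vertical
    label-sequences of length at most n (2n inside an eligible structure,
    by Restriction (1)), giving r + r**2 + ... + r**depth."""
    depth = 2 * n if height_two else n
    return sum(r ** i for i in range(1, depth + 1))

print(level_node_bound(r=2, n=3))                   # 2 + 4 + 8 = 14
print(level_node_bound(r=2, n=3, height_two=True))  # sums r**i up to i = 6
```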
It is worth noting that, although the restrictions on raising make possible Lemma 1, bounding the number of possible eligible structures, they do not render finite the number of possible transformational components defined with respect to a given base grammar. More importantly, even if both base and component-to-be-guessed are specified, there is no bound on the number of "reasonable guesses" one might make as to which transformational component is the one to be guessed. This highly informal statement depends on what is meant by "reasonable," which in turn depends on the conditions under which guesses are made. The following example will make things clearer. For any integer n, consider a base phrase-marker of n levels as in Fig. 6. Suppose that for some reason¹⁶ the learner mistakenly thinks that A should always be deleted. Further suppose that he is told that the given base phrase-marker has surface string a^(2n-1)b. Then after applying his deletion transformation at all levels up to the top, he might conclude that at the top level he needs a copying transformation which creates 2n - 2 new copies of C (and hence of a), thereby yielding the correct surface string. Since for any value of n he might reach this conclusion, any transformational component consisting of deletion of A and (2n - 2)-fold copying of C is attainable in this way. There are thus infinitely many transformational components, each attainable by a learner using this rationale.
[Figure 6 appears here.]
FIGURE 6
¹⁶ Such a reason could be that A is in fact supposed to be deleted in some context and that the learner has overgeneralized, concluding that A is always deleted.
Theorem 3 was loosely characterized at the start of Section 2 as the statement that "complex data can be ignored." A direct assault on that result would be cumbersome and opaque. The source of the difficulty is the possibility of raising nodes from one level to another. Therefore, we first prove two restricted versions. In Theorem 1 raising of nodes is ruled out, while in Theorem 2 it is allowed on a limited basis. The principal ideas of the proof of Theorem 1 are contained in Lemmas 2 and 3, which in turn require the following definitions.

DEFINITION. A path of a phrase-marker is a sequence of levels in which the first dominates no other level of the phrase-marker, and each of the others immediately dominates its predecessor.¹⁷

DEFINITION. The degree of a phrase-marker is one less than the cardinality of its longest path.¹⁷

DEFINITION. Levels which are immediately dominated by the same level are siblings.

DEFINITION. For B ∈ B, for C1, C2 ∈ C, and for an ordered set of H levels, each directly above its predecessor in a path of some b ∈ B, a chunk of height H is an H-tuple of triples, the H successive triples corresponding to the H successive levels. The first member of each triple is the eligible structure at the particular level, with respect to C1. The second member is the same with respect to C2. The third member is an integer indicating the position, in left-to-right order, of that level among its siblings.
Thus, for example, a chunk of height 1 contains a single triple, corresponding to a single level, its "defining" level, so to speak. But note that the first member of the triple is an eligible structure which itself may have lower levels, in addition to its upper, or defining, level. When we speak of chunks as being disjoint, we shall mean that the sets of defining levels of the various chunks are disjoint.

LEMMA 2. For any B ∈ B, let E = |elig(B)| and F = cardinality of the largest set of siblings possible in B. For any C1, C2 ∈ C and for any b ∈ B of degree greater than HK(E^2 F)^H, b has a path containing levels corresponding to K identical chunks of height H.
Proof. By definition of degree, there must be a path containing K(E^2 F)^H (not necessarily identical) disjoint chunks of height H. Each such chunk is an H-tuple of triples, elements of the latter being chosen from sets of cardinality at most E, E, and F, respectively. The chunks themselves are thus chosen from a set of cardinality at most (E^2 F)^H but, since there are K times this number of chunks, some K of them must be identical. □
¹⁷ This definition of degree is equivalent to the usual one, though our definition of path is not standard. Since our use of paths is only internal to proofs, this situation is acceptable.
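The pigeonhole step at the heart of Lemma 2 can be checked mechanically. In this sketch the chunks are any hashable values; (E^2 F)^H plays the role of the size of the value set.

```python
from collections import Counter

def repeated_chunk(chunks, K):
    """If at least K * N disjoint chunks are drawn from at most N possible
    values, some value occurs K or more times. Returns one such value,
    or None if the count premise is not met."""
    value, freq = Counter(chunks).most_common(1)[0]
    return value if freq >= K else None

# e.g., 15 chunks over at most 3 values must repeat some value 5 times
print(repeated_chunk(["a", "b", "c"] * 5, K=5))   # prints one of a, b, c
```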
DEFINITION. In two identical eligible structures, a lower level of one corresponds to a lower level of the other if they are in the same left-to-right order among their siblings (see Fig. 7).
[Figure 7 appears here: phrase-markers b and p, their transforms C·(b) and C·(p), and the terminal strings *C·(b) and *C·(p).]
FIG. 7. The construction for Lemma 3. C·(p) is formed from C·(b) in the manner indicated in the statement of the lemma. The terminal string *C·(p) is written with some blanks so as to make clear that it is a reduction of *C·(b). Levels m′ and m″ are siblings; levels m′ and k′ correspond with respect to the eligible structures of m and k.
for Lemma 3. C . (p) 1s f ormed from C . (b) in the manner inthe lemma. The terminal string *C . (p) is written with some it is a reduction of *C . (b). Levels m’ and m” are siblings; levels m’ to the eligible structures of m and k.
Notation. P(b) denotesthe phrase-marker obtained by applying the transformational component C to the basephrase-markerb at k levelsin an appropriate succession (see Subsection 2.1). If n is the number of levels in b, then C . (b) = C”(b) is the surface phrase-marker; also, Co(b) = b. Notation. sub(b, K) denotes the sub-phrasemarker of b which includes level k and all the levels it dominates.
*b is the terminal string of b.
Notation. DEFINITION.
q is a reduction
of Yif q, Y, x1 , x2 , xa , x4 , and xg are strings such that 4 =
“%X3%
and r = x1x2x3x4x5 (seeFig. 7). LEMMA
3.
For any B E B and C E C, let b E B be such that the following
are true:
(i) Level m of b dominates level k, and these two levels have identical eligible structures with respect to C. In these identical eligible structures let level m′ and level k′ be corresponding lower levels. Let p be the base phrase-marker formed from b by replacing sub(b, m′) with sub(b, k′). Then (ii) implies (iii) and (iv).
(ii) No raising takes place at levels k and m.
(iii) C·(p) is formed from C·(b) by replacing sub(C·(b), m′) with sub(C·(b), k′).
(iv) *C·(p) is a reduction of *C·(b) (see Fig. 7).
Proof. b and p are identical at and below level k′ and so must be transformed identically up to and including that level. We now show that the level of k′ is treated identically in p as in b when k′ is a lower level. The level right above k′ in p has the same eligible structure as that in b, so the same transformation (if any) applies. Thus the level of k′ can only end up different in b and p by virtue of nodes moved (nonexplicitly) into level k′ from other lower levels (siblings) or from the upper level. But movement from siblings is ruled out by Restriction (2), while the respective upper levels, having not previously been transformed, have only eligible nodes and hence are identical (levels m and k are identical in base structure by assumption). b and p are identical at all other levels below m. Different results at level m in b and in p could only arise from nodes raised nonexplicitly from levels m′ and k′, respectively. But there is no raising, by premise. Therefore, (iii) holds and (iv) follows immediately from (iii). □
DEFINITION. When (i), (iii), and (iv) of Lemma 3 hold, the levels removed from b to form p can be cleanly removed.
DEFINITION. Two transformational components disagree with respect to a base phrase-marker if they assign it different terminal strings.
We are now ready to show, for a restricted class of components, that "disagreements" (as just defined) can be discovered without looking at sentences of high degree (complexity). In the theorem, members of A(B) play the role of the possible adult components, while C′ includes all possible guesses any learner could ever make.

THEOREM 1. Let C′ be that subclass of C for which there is no raising. Then for every B ∈ B there exists U = U(B) such that for all C ∈ C′ and all A ∈ A(B) ∩ C′, if A and C disagree on any b ∈ B, then they disagree on some b′ ∈ B of degree at most U.
Proof. Let E = |elig(B)| and F = cardinality of the largest set of siblings of B. We show that the theorem holds with U = 10(E^2 F)^2. Suppose that b ∈ B and that the degree of b exceeds U. Let C1 = A and C2 = C. By Lemma 2 with K = 5 and H = 2, b has a path, Q, containing ten levels, call them q1′, q1, q2′, q2,..., q5′, q5, in ascending
tree order, of which all qi have the same eligible structure with respect to Cj (j = 1 or 2) and the qi′ are corresponding lower levels of the qi. (Note that the five chunks are not necessarily contiguous; that is, although for 1 ≤ i ≤ 5, qi directly dominates qi′, it is not necessarily the case that q(i+1)′ directly dominates qi.) The conditions of Lemma 3 are now met by replacing levels m, m′, k, and k′ of the lemma with levels q(i+1), q(i+1)′, qi, and qi′ (i = 1, 2, 3, 4) here. Therefore, applying Lemma 3 and using the definition following it, levels q(i+1)′ down to qi′ (i = 1, 2, 3, 4) can be cleanly removed with respect to Cj (j = 1, 2). Lemma A1 (Appendix) therefore applies, so that a lower-degree base phrase-marker must also yield disagreement between A (= C1) and C (= C2). The argument can be repeated to obtain successively lower-degree base phrase-markers until one of degree less than U is obtained. □

The following lemmas will enable us to generalize the above results.

LEMMA 4. For any C ∈ C and B ∈ B, suppose that b ∈ B has a path containing levels k, k + 1,..., k + Q, each immediately dominating its predecessor, and levels m, m + 1,..., m + Q, each immediately dominating its predecessor, with level m dominating level k + Q, such that the following are true.
(1) The eligible structures are the same at levels k + q and m + q, for q = 0, 1,..., Q.
(2) With respect to the eligible structures in (1), the path includes corresponding lower levels.
(3) At levels k and m there is no raising.
Let p be the base phrase-marker obtained from b by replacing sub(b, m + Q - 1) with sub(b, k + Q - 1). Then the following are true.
(i) C·(p) is formed from C·(b) by replacing sub(C·(b), m + Q - 1) with sub(C·(b), k + Q - 1).
(ii) *C·(p) is a reduction of *C·(b).
Proof. Since p is formed by removing part of b, the remaining parts of p are identical, before application of C, to the corresponding parts of b. Thus the only place where C might act differently is at the "joining point" of p. (Thus far the argument parallels Lemma 3.) That is, a height-two transformation applied at level m + Q might raise elements of level m + Q - 1 or level k + Q - 1 into level m + Q, in processing b or p, respectively. It must be shown that such raised nodes do not dominate different nodes. Since these levels (m + Q - 1 and k + Q - 1) are identical in b by virtue of (1), they must remain identical under transformation unless nodes are raised into them from levels m + Q - 2 and k + Q - 2, to which in turn the same argument applies. Thus, ultimately it is only necessary to have no raising at levels m and k, as in (3) of the premises. □
LEMMA 5. Suppose that (i) of Lemma 3 holds but (ii) is replaced by the following.
(ii′) Some node R is raised at k and at every level from k up to m, inclusive.
Then, again, (iii) and (iv) of Lemma 3 hold.
Proof. It is again only necessary to deal with raising at level m, where the top and bottom parts of b are joined to form p. Whereas in Lemma 3 there was no raising at this point, here we do allow raising. However, here the premises imply that the only node raised at level m be exactly the same one raised at level k, still dominating the same structure, since all nodes under a raised node are impervious to transformations, by Restriction (4). Therefore level m in C·(p) will in fact be the same, even with respect to ineligible nodes, as level m in C·(b). □
Both Lemmas 4 and 5 state conditions for the clean removal of a part of a phrase-marker. This statement can be verified directly from the definition (of "clean removal," following Lemma 3) for Lemma 4, while for Lemma 5 a change of variable is all that is needed.

DEFINITION. Let X be a succession of levels, each directly above its predecessor, in some phrase-marker. Suppose that when transformational component C is applied there is a node R which is raised, successively, at every level in X. Then C raises R through X.
THEOREM 2. Let C′ be that subclass of C for which at most one node can be raised at any level. Then for every B ∈ B there exists U such that for all A ∈ A(B) ∩ C′ and all C ∈ C′, if A and C disagree on any b ∈ B then they disagree on some b′ ∈ B of degree at most U.
Proof. Lemma 2 states that, to guarantee the existence of any particular number of identical chunks of any particular size, it is only necessary to have a phrase-marker of sufficiently high degree. Thus by applying Lemma 2 three times, it follows that there exists some U such that for any phrase-marker of degree greater than U there must be a "nesting" of identical chunks, along a path of that phrase-marker, consisting of the following: 5 identical chunks, denoted chunk(1), chunk(2),..., chunk(5), each large enough to necessarily contain two smaller identical chunks, denoted chunk(1, 1), chunk(1, 2), chunk(2, 1),..., chunk(5, 2), each large enough, in turn, to necessarily contain two identical chunks of height two, these last denoted chunk(1, 1, 1),..., chunk(5, 2, 2). Note that for any i, j, and k, chunk(i, j, k) is part of chunk(i, j), which in turn is part of chunk(i), provided of course that i, j, and k are chosen so that the chunks are defined. Numbering of chunks is in order of dominance of the corresponding levels in the base phrase-marker. That is, if 1 ≤ i′ < i ≤ 5 and j = 1 or 2, then chunk(i) dominates chunk(i′), chunk(i, 2) dominates chunk(i, 1), and chunk(i, j, 2) dominates chunk(i, j, 1).
Also lying along the same path are the following chunks, defined for the same values of i and j as above: X(i, j), extending from the lower level of chunk(i, j, 2) to the upper level of chunk(i, j, 1), inclusive; W(i), including all levels between (but excluding) those of X(i, 2) and X(i, 1); Y(i), including exactly the levels included by X(i, 2) and W(i); Z(i), extending from the lower level of chunk(i + 1, 2, 2) to the upper level of chunk(i, 2, 2), inclusive. In Fig. 8 are diagrammed the levels of the various chunks, including some of the 10 X's, 5 W's, 5 Y's, and 4 Z's. All X's are identical to each other, as are the W's and the Y's, but the Z's in general need not be. Therefore we may speak of the occurrence of raising in X without reference to a specific X(i, j). Similarly for W and Y.
[Figure 8 appears here.]
FIG. 8. Configuration of identical chunks for Theorem 2. The thick center line represents the path of the phrase-marker. Other lines show where the levels of various chunks are. The W, Y, and Z in the figure are W(5), Y(5), and Z(5); the X's are X(5, 2) and X(5, 1).
Under appropriate conditions (to be described), Lemmas 4 and 5 will allow the clean removal of X, Y, or Z (without affecting the remaining parts of the surface string). Consider transformation by C (similar comments will apply for A). If, with respect to C, there is no raising out of the highest level of X, then the Z's can be removed. (In Lemma 3, replace levels m and m′ by the two levels of chunk(i, 1, 1) and replace levels k and k′ by the two levels of chunk(i + 1, 1, 1), for i = 1, 2, 3, 4. Note that the explicit mention of "corresponding levels" in Theorem 1 is not needed here because of the definition of chunk.) Suppose, on the other hand, that some node R is raised by C out of the highest level of X. Then there are several cases depending on whether C raises R through X and/or Y. By Lemma 5, if C raises R through X (Y), then X (Y) can be cleanly removed. By
Lemma 4, if C does not raise R through X, then Y or Z can be cleanly removed; also, if C does not raise R through Y, then Z can be cleanly removed. (These three sets of applications of Lemma 4 are all accomplished similarly. As an example, if C fails to raise R through X, then by definition there is no raising at some level of chunk(i, 1) and at the same relative level of chunk(i + 1, 1), for i = 1, 2, 3, 4. These two levels are substituted for levels k and m, respectively, of Lemma 4. The upper levels of chunk(i, 2, 2) and of chunk(i + 1, 2, 2) are substituted for levels k + Q and m + Q of the lemma. Substitution of intervening levels is uniquely determined by dominance relations.) These results are summarized in Fig. 9. The same results hold if C is replaced by A in the foregoing discussion. C and A each belong to exactly one of the four categories of Fig. 9, and several cases arise depending on which categories are involved. The method of completing the proof is identical for all cases except for the names of categories and cleanly removed parts, so only one case will be given here.
category | C raises R through X | C raises R through W | what is cleanly removable
1        | yes                  | yes                  | X, Y
2        | yes                  | no                   | X, Z
3        | no                   | yes                  | Y, Z
4        | no                   | no                   | Y, Z

FIG. 9. Patterns of raising for Theorem 2.
Suppose that C belongs to category 1 and A belongs to category 3. Then any of the five Y's can be cleanly removed with respect to either C or A. We now invoke the contrapositive of Lemma A2.¹⁸ The left-hand side of (5) in the lemma is replaced by *C·(b), and the right-hand side by *A·(b). The clean removal of Y's here corresponds to reductions (see definition) in (5) of the lemma, on both the left side and the right. Since *C·(b) ≠ *A·(b), it follows from the contrapositive of Lemma A2 that, corresponding to either (1), (2), (3), or (4), for some choice of subscripts, *C·(b′) ≠ *A·(b′), for some b′ formed from b by removal of the levels corresponding to one or more of the Y's. Thus a phrase-marker (namely b′) of degree lower than that of b must, like b, yield disagreement between A and C. The argument can be repeated to obtain successively lower-degree base phrase-markers until one of degree less than U is obtained. □

THEOREM 3. For every B ∈ B there exists U such that for all A ∈ A(B) and C ∈ C, if A and C disagree on any b ∈ B, then they disagree on some b′ ∈ B of degree at most U.
¹⁸ Only four Y's need be removable to satisfy Lemma A2. Lemma A1 is not satisfied because the Y's, unlike the Z's, are not vertically contiguous in Fig. 8.
Proof. This extension of Theorem 2 to all of class C requires only a demonstration that explicitly raising more than one but less than r nodes does not upset the previous argument, where r is the maximum number of eligible nodes in any eligible sub-phrase-marker. By Lemma 1, B determines a bound on r. In Theorem 2, the two components A and C could belong to any categories of Fig. 9, and there would in any case be one of X, Y, Z cleanly removable for both. Here, instead of checking the two components, we must now check n component-node pairs, where n ≤ 2r. If all n pairs happen to belong to, say, categories 2, 3, and 4 of the table, then Z's are removable, etc. But for n > 2 there is no guarantee that any of X, Y, or Z will be removable for all n pairs. However, if n-fold instead of two-fold nesting of chunks is used (replacing Z, Y, X by X0, X1,..., Xn), then there will in all cases be a removable Xi. But n ≤ 2r and by Lemma 2 there is some degree U such that for all phrase-markers of degree greater than U there will be n-fold nesting of identical chunks. □
3. LEARNABILITY
Many tasks, learning and otherwise, can usefully be regarded in terms of input and output: What kind of information is provided to the doer or performer as input, and what sort of answer does he have to construct as output in order for us to say he has succeeded? These three aspects (input, doer, output) found expression in Section 1.3 as information sequence, learning procedure, and criterion of convergence. In Subsections 3.1 and 3.2 we shall describe these aspects, especially the learning procedure, in progressively greater detail; Subsection 3.3 contains the proof of learnability.

3.1. Requirements and Outline of Procedure
The input, as noted in Subsection 1.3, is a sequence of pairs, each pair of the form (b, s), where b is a base phrase-marker and s is the corresponding surface string. Thus, for B ∈ B and C ∈ C, we have, in the notation already introduced, b ∈ B and s = *C·(b). Each successive pair is chosen according to a fixed probability distribution, independently of past events. That is, each b ∈ B has a fixed nonzero probability of being the first element in the pair to be presented. What the learner must attain is a transformational component which accounts correctly for all possible data. That is, he must construct a mapping which assigns to each base phrase-marker the appropriate surface string (sentence) of the language he is learning. It is a celebrated linguistic observation that there are infinitely many potential data, in our terms, infinitely many different (b, s) pairs. Despite this, the probabilistically expected number of input pairs required for achieving a correct guess at the mapping (and never subsequently deviating from it) must be finite.
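To fix ideas, the presentation scheme just described can be sketched in code. Everything here is a hypothetical stand-in for the formal objects of the text (the infinite set B is truncated to a finite list, and the function surface plays the role of the adult mapping b ↦ *A·(b)); it is an illustration, not part of the formal system.

    import random

    def present_data(phrase_markers, weights, surface, seed=0):
        """Yield an endless i.i.d. stream of data (b, s), as in the text:
        each base phrase-marker b has a fixed nonzero probability of being
        drawn, independently of past events, and s is its surface string
        under the adult component."""
        rng = random.Random(seed)
        while True:
            b = rng.choices(phrase_markers, weights=weights, k=1)[0]
            yield b, surface(b)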
Moreover, we seek not just any procedure which will perform the required task. As indicated in Subsection 1.1, we consider it crucial that the procedure be plausible, in particular that changes in the set of guessed rules be gradual and independent of past data. The following broad outline of the procedure incorporates these requirements. At each point in time the current guess consists of a set of transformations. This set, together with the current datum, is acted upon by the procedure to produce two other sets of transformations: a set of candidates for hypothesization and a set of candidates for rejection. From among all the members of these two sets of candidates, one is chosen at random and is hypothesized (added to the guessed set) or rejected (removed from it) according to which set of candidates it is from.

We now move to the more detailed matter of how the above-mentioned candidate sets are formed. The base phrase-marker which constitutes the first half of the datum is transformed according to the current transformational component. If the resulting surface phrase-marker has a terminal string identical to the correct surface string given as the second half of the input datum, then no change is made in the current component. This ensures that, when the correct mapping (that is, the correct transformational component or any component which agrees with it on all possible data) is found, no further changes will be made. It will then be necessary only to show that the expected value of the time at which the correct mapping is found is finite.

If, on the other hand, the current component assigns to the input base phrase-marker a wrong surface string, or if the component is nondeterminate in its treatment of that phrase-marker, then the operation of the component up to the point of error discovery is examined. Error in this connection is said to be discovered either at the lowest level in the phrase-marker where nondeterminacy is encountered or else, in the case of a wrong derived string, at the top level of the phrase-marker. Any transformation which has been applied is a candidate for rejection. For each level where no transformation was applied, create candidate hypotheses by using the hypothesizer H(p, t), where p is the sub-phrase-marker dominated by that level, t is a terminal string, and H(p, t) consists of all possible transformations which transform p into any phrase-marker with terminal string t. The terminal string t to be used is the correct surface string, s, given as part of input, whenever p is the entire surface phrase-marker (that is, when working at the top level). Otherwise, a "string-preserving" transformation is hypothesized; that is, the terminal string of p is used as t.

3.2. Details of the Learning Procedure
The ideas of the preceding subsection will now be stated more precisely.

Notation. app(b, k, C) is the subset of transformations of C applicable at level k in the application of C to b. To be applicable at level k, a member of C must fit the eligible structure of the phrase-
marker Cᵏ⁻¹(b), which results from applications of C prior to level k. Moreover, the notation Cᵏ⁻¹(b) itself will be understood to refer not simply to a phrase-marker, but to a phrase-marker in which each node is marked as eligible or not with respect to the next level to be transformed, that is, level k. Some of the following notation has already been introduced, but we shall now no longer define each symbol each time it is used. Thus, when b is used, we shall no longer bother to state that B ∈ B and b ∈ B. (Also see end of Subsection 2.2 for B, C, A(B), and elig(B).)
Notation.

B: the base grammar (B ∈ B).

b: a base phrase-marker which is part of the input datum (b ∈ B).

n: the number of levels of b.

C: the currently guessed transformational component (C ∈ C).

C·(b): if C can be applied deterministically to b, then C·(b) is the resulting surface phrase-marker. Otherwise, C·(b) is undefined.

Cⁱ(b): the result of carrying out the computation of C·(b) for only the first i levels; thus, Cⁿ(b) = C·(b).

p: p = Cⁱ(b), for some i. p might be called a partially derived phrase-marker. (In Subsection 2.3, p was used differently.)

*p: the terminal string of a phrase-marker p.

A: the correct adult component, A ∈ A(B); also, A·(b) is defined in the same way as C·(b).

s: the surface string from b; that is, s = *A·(b).

d: the input datum; that is, d = (b, s).

i*: the smallest i such that the attempt to compute Cⁱ(b) yields an error (nondeterminacy or incorrect surface string). If *C·(b) = *A·(b), then i* is undefined.
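The definition of i* suggests a small computational sketch. The helpers apply_at (transform at one level, raising Nondeterminate when more than one application is possible) and terminal (the function p ↦ *p) are hypothetical stand-ins for the formal machinery; this illustrates the bookkeeping only.

    class Nondeterminate(Exception):
        """More than one application is possible at some level."""

    def error_level(C, b, s, n, apply_at, terminal):
        """Return i*: the first level at which applying C to b yields an
        error, or None when *C.(b) = s, in which case i* is undefined."""
        p = b
        for i in range(1, n + 1):
            try:
                p = apply_at(C, p, i)    # compute the next level, C^i(b)
            except Nondeterminate:
                return i                 # nondeterminacy discovered at level i
        if terminal(p) != s:
            return n                     # wrong surface string: error at top
        return None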
DEFINITION. (Candidates for rejection, rej, and hypothesization, hyp)

(i) H(p, t) = {T | *T(p) = t}, where t is any string.

(ii) Hsp(p) = H(p, *p).

(iii) hy(i) = Hsp(Cⁱ⁻¹(b)), for i < n.

(iv) hy(n) = H(Cⁿ⁻¹(b), s).

(v) hyp(d, C) = ⋃_{i∈Z} hy(i), where Z = {i | app(b, i, C) = ∅ and i ≤ i*}.

(vi) rej(d, C) = ⋃_{i ≤ i*} app(b, i, C).
FIG. 10. An example of hypothesizer H(p, s).

FIG. 11. Example of a nontrivial H(p, s); includes a string-preserving transformation.
See Figs. 10 and 11 for examples. This definition creates the candidate sets, in (v) and (vi), of transformations which can be hypothesized or rejected by the learning
procedure to be defined shortly. The set hyp(d, C) consists of contributions from the various levels as set out in (iii) and (iv). The fundamental building block is the "hypothesizer" in (i), which defines the set of transformations H(p, s) that act upon phrase-marker p to produce a phrase-marker with terminal string s. The "string-preserving" hypothesizer in (ii) designates transformations which alter a phrase-marker without altering its terminal string. A glance at Fig. 10 reveals that the effects of deletion may be offset, for example, by copying another occurrence of that same letter. One might think that deletion should be somehow restricted to avoid such effects. But note that it is a strength, not a weakness, of our result that it holds despite the latitude of deletions allowed. Any principled restriction on deletions that linguists may make will leave our proofs intact.

DEFINITION. LP is a learning procedure which¹⁹ (i) guesses C = ∅ initially and, given a datum d = (b, s), when its preceding guess was C, (ii) guesses C unchanged if *C·(b) = s and otherwise guesses with equal probability among alterations of the form, (a) reject (remove from C) a member of rej(d, C) or (b) hypothesize (add to C) a member of hyp(d, C).
FIG. 12. Flow chart of the learning procedure. (START: set C = ∅; take a new datum d = (b, s); if *C·(b) = s, leave C unchanged; otherwise either pick some T ∈ hyp(d, C) and set C = C ∪ {T}, or pick some T ∈ rej(d, C) and set C = C − {T}; then take the next datum.)
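Figure 12 can be rendered as a one-step update function. The helpers app, H, H_sp, derive (returning Cⁱ⁻¹(b)) and the error level i* correspond to the formal objects above but are supplied from outside; treating H and Hsp as computable set-valued oracles is of course a simplification, so the sketch should be read as an illustration of LP's control structure only.

    import random

    def lp_step(C, d, n, i_star, app, H, H_sp, derive, rng=random):
        """One update of LP (cf. Fig. 12): C is the guessed component (a set
        of transformations), d = (b, s) the current datum, n the number of
        levels of b, i_star the level of error discovery (None if *C.(b) = s)."""
        b, s = d
        if i_star is None:
            return C                 # datum handled correctly: guess unchanged
        # (vi) candidates for rejection: everything applied up to level i*.
        rej = set()
        for i in range(1, i_star + 1):
            rej |= app(b, i, C)
        # (iii)-(v) candidates for hypothesization: levels up to i* at which
        # no member of C applied; string-preserving except at the top level.
        hyp = set()
        for i in range(1, i_star + 1):
            if not app(b, i, C):
                p = derive(C, b, i)  # the partially derived marker C^{i-1}(b)
                hyp |= H(p, s) if i == n else H_sp(p)
        # Equiprobable choice among all single-rule alterations; Theorem 4
        # guarantees hyp and rej are not both empty (footnote 19).
        T = rng.choice(sorted(hyp | rej, key=repr))
        return (C | {T}) if T in hyp else (C - {T})

A run of LP is then just a fold of lp_step over the stream produced by present_data above.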
The contribution to hyp(d, C) for i = n consists of any transformation which, applied at the top level, creates a surface phrase-marker with the correct terminal

¹⁹ This definition may appear not to cover the case in which s ≠ *C·(b) and both rej(d, C) and hyp(d, C) are empty. However, Theorem 4, below, shows that such a situation is impossible.
string, s. That is, the input base phrase-marker is transformed according to the currently guessed component C up to the top level, at which point there may or may not be a member of C which is applicable. If there is none, then a check is made to see if the phrase-marker, as transformed thus far, has the appropriate terminal string as provided by the second part of the current datum. If it does have the proper string, then the current component need not be altered; otherwise, any transformation which will remedy the situation becomes a candidate for addition to the guessed set.²⁰

The contribution to hyp(d, C) for levels other than the top, on the other hand, has a more intricate rationale. It is not particularly obvious whether any hypotheses should be formed at cycles other than the last cycle. Such hypothesization could not profitably be based on s, since s is not the correct terminal string at any cycle except the last. Moreover, there is no obvious easily computed candidate to fill the role played by s for the last cycle. In fact it might be argued that no hypothesization need be made at cycles other than the last since, for any i < n, the ith cycle is in fact the last cycle for the base phrase-marker b′ = sub(b, i), which occurs as part of a lower-degree datum, d′ = (b′, s′). However, if a transformation which is string-preserving with respect to the top level of b′ is what is needed, then d′ may appear to be correctly handled by C. That is, it is possible that *C·(b′) = s′. We are thus led simultaneously to the conclusions that hypothesization may be necessary at cycles other than the last and that such hypothesization ought to be string-preserving.

The set rej(d, C) simply consists of those members of the current component C which are employed in applying C to the base phrase-marker b. That is, if C acts on b to produce a wrong terminal string, then any member of C which was involved is a candidate for rejection. It might be argued that only the transformation (if any) used in the last cycle need be a candidate for rejection, since each earlier cycle i is identical to the last cycle for the base phrase-marker sub(b, i) and hence could be rejected when the latter phrase-marker appears. Such a suggestion, however, ignores the possibility that a transformation may give, for sub(b, i), the correct surface string and yet result in incorrect structure which is not detectable except when, for b itself, a transformation at a subsequent level uses that faulty structure to give an incorrect surface string.

Implicit in the above is that C could be applied in a determinate manner (see Subsection 2.2). However, nondeterminacy may arise either because T ∈ C fits an eligible structure in two or more ways or because T₁, T₂,..., Tₘ ∈ C are applicable simultaneously. In either case, LP may reject transformations applicable at the level of the nondeterminacy or at earlier levels. Hypothesization is restricted to levels earlier than the nondeterminacy. The rationale here is that nondeterminacy need not

²⁰ Even if such a transformation is added, the resulting guessed set may not handle even the current input correctly, since the new transformation may be applicable at a lower level. Convergence occurs despite this.
be the "fault" of the transformation(s) directly involved, but may instead arise from improper processing at earlier levels.
3.3. Proof of Learnability
The following theorem (Theorem 4) shows how the various aspects of the learning procedure (LP) combine to ensure that, if the currently guessed component is not acceptable, then there is at least the possibility that the next datum will induce LP to make the next guess closer to correct, by either adding a correct transformation or rejecting a wrong one. That this possibility is not of too small probability will follow from Theorem 5, in conjunction with Theorems 3 and 4. Finally, Theorem 6, the principal result of this paper, shows that the procedure is successful according to our criterion.

THEOREM 4. If *C·(b) ≠ s or C is nondeterminate on b, then at least one of the following holds:

(i) rej(d, C) ∩ (C − A) ≠ ∅;

(ii) hyp(d, C) ∩ (A − C) ≠ ∅;

(iii) for some d′ = (b′, s′), with b′ of lower degree than b, (ii) holds with d′ in place of d.

Proof.

(iv) C⁰(b) = A⁰(b), since by definition each is simply b.

(v) Cⁿ(b) ≠ Aⁿ(b), since *Cⁿ(b) = *C·(b) ≠ s = *A·(b) = *Aⁿ(b).

It is therefore meaningful to define

(vi) i = min{j | Cʲ(b) ≠ Aʲ(b) or Cʲ(b) nondeterminate}.
C and A thus act upon the phrase-marker differently at level i, so level i must involve the application of a transformation for at least one, possibly both, of C, A.

Case I: T ∈ app(b, i, C) and app(b, i, A) = ∅. If T were in A, it would be applicable at cycle i. Thus (i) holds.

Case II: T₁ ∈ app(b, i, C) and T₂ ∈ app(b, i, A). T₁ ≠ T₂ by (vi). T₁ ∉ A, for otherwise Aⁱ(b) would be undefined, with both T₁ and T₂ in app(b, i, A). Thus (i) holds.

In Cases I and II, the results hold for any element of app(b, i, C).

Case III: app(b, i, C) = ∅ and T ∈ app(b, i, A).

Subcase a: *Aⁱ(b) = *Aⁱ⁻¹(b); that is, A is string-preserving on b at level i. If i = n, then *Cⁿ(b) = *Cⁿ⁻¹(b) = *Aⁿ⁻¹(b) = *Aⁿ(b) = s (using the Case III premise,
(vi), the Subcase IIIa premise, and the definition of s), contrary to the premise of the theorem. Therefore i < n, so by the Subcase IIIa premise and the definition of hyp(d, C), T ∈ Hsp(Cⁱ⁻¹(b)) ⊆ hyp(d, C).
Thus (ii) holds.

Subcase b: *Aⁱ(b) ≠ *Aⁱ⁻¹(b) and i = n. T ∈ app(b, n, A) implies T(Aⁿ⁻¹(b)) = Aⁿ(b). If i = n in (vi), then Cⁿ⁻¹(b) = Aⁿ⁻¹(b). Therefore *T(Cⁿ⁻¹(b)) = *Aⁿ(b) = s or, equivalently, by definition of hy(n), T ∈ hy(n) ⊆ hyp(d, C). Thus (ii) holds.
Subcase c: *Aⁱ(b) ≠ *Aⁱ⁻¹(b) and i ≠ n. Let datum d′ = (b′, s′), where b′ = sub(b, i) and s′ = *Aⁱ(b). Since i is the top level of b′, conditions equivalent to Subcase b are met for the datum d′. Thus (iii) holds. ∎

Notation: Cardinality, etc. If W is a set or a sequence, |W| is the number of elements it has. If s is a string, |s| is its length. If p is a phrase-marker or eligible structure, |p| is the number of nodes it has. For ease of expression, we shall say that the mth datum in the information sequence is presented to the learner at time m.

LEMMA 6. Bound on size of guessed set. At all times, |C| ≤ |elig(B)|.
Proof. A transformation can be added to C only when it is applicable to some level in a phrase-marker where no transformation currently in C applies. Since applicability depends only upon eligible structure, there is always a mapping from C into elig(B), taking each member of C to the member of elig(B) which was involved in its hypothesization. The result follows. ∎
Note that two (or more) members of C may be applicable to the same member of elig(B), but they cannot both have been hypothesized from it. A related fact, not used here, is that two members of C cannot have the same structural description.

If T is a transformation with structural description Q and structural change R, we write T = (Q, R). R is a sequence of sequences. Thus in Fig. 2, R₅ = (5, 7) and R₅₂ = 7. Taking the elements of elements of R in order yields a single long sequence which we denote R̄. For example, in Fig. 2, R̄ would be 6, 2, 5, 7, 5, 3, m, 1. Thus R is a partition of R̄ into subsequences, some possibly empty. The number of ways an m-element sequence can be so partitioned into n subsequences is an increasing function of m and n (see Feller, Chap. 2).
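The counting fact appealed to here is standard; in "stars and bars" form (this explicit formula is ours, not the text's):

\[
\#\left\{\text{partitions of an ordered $m$-element sequence into $n$ ordered, possibly empty subsequences}\right\} \;=\; \binom{m+n-1}{n-1},
\]

which is indeed increasing in both m and n.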
The following two lemmas place a bound on the number of different possible hypothesized transformations per level, despite the unbounded size of the phrase-marker potentially involved.

LEMMA 7. Bound on |H(p, s)|. Let elig(p) be the eligible structure of p, where p = Cⁱ(b). ∀x, ∀y, ∃z, if |elig(p)| ≤ x and |s| ≤ y, then |H(p, s)| ≤ z.

Proof. Suppose T ∈ H(p, s), so that *T(p) = s (see part (i) of the definition of candidate sets). Let T = (Q, R), where R is a partition of R̄. The proof consists of showing that Q and R̄ are each of finite length and that each has elements chosen from a finite set, so that there are finitely many choices for Q, for R̄, and hence for R, and thus finally for (Q, R). Nonvariables in Q are at most x in number and are chosen from a set of cardinality at most x. Variables in Q are required to be nonconsecutive (restriction (iii) on variables, Subsection 2.2), hence at most x + 1 in number and chosen from²¹ among X⁽¹⁾,..., X⁽ˣ⁺¹⁾.

Each element of R̄ is chosen from the finite set consisting of the integers from 1 to |Q| and the elements of s. For any integer i in R̄, if Qᵢ is not a variable, then Qᵢ fits with a node of p dominating at least one terminal node of p. Therefore each occurrence of i corresponds to a terminal symbol of T(p), of which there must be at most |s| ≤ y. Since variables cannot be copied, R̄ can have at most x + 1 elements corresponding to empty substrings of s. Therefore |R̄| ≤ y + x + 1, as required. ∎
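For concreteness, one crude way to assemble an explicit constant z from these counts (our own overcount, not the authors'; any finite bound serves the proof) is

\[
|H(p,s)| \;\le\;
\underbrace{(2x+2)^{\,2x+2}}_{\text{choices of }Q}\cdot
\underbrace{(2x+y+2)^{\,x+y+2}}_{\text{choices of }\bar{R}}\cdot
\underbrace{\binom{(x+y+1)+(2x+1)-1}{(2x+1)-1}}_{\text{partitions of }\bar{R}\text{ into }|Q|\text{ subsequences}}
\;=:\; z,
\]

since Q has length at most 2x + 1 over an alphabet of at most 2x + 1 symbols, R̄ has length at most x + y + 1 over an alphabet of at most 2x + y + 1 symbols, and the partition count is bounded by the binomial formula noted above.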
The next lemma, like the last, helps bound the number of alternatives specified by LP. (The argument culminates in Theorem 5.) The proofs of the two lemmas are similar except that here there will be no bound on the length of the string, *p = *Cⁱ(b), used in hypothesization, because of possible phenomena like that of Fig. 6. Therefore a slightly more intricate argument is needed.

LEMMA 8. Bound on |Hsp(p)|. Let del(p) be the set of deletable²² terminal symbols of p, where p = Cⁱ(b). Let |elig(p)| be as in Lemma 7. ∀x, ∀u, ∃z, if |elig(p)| ≤ x and |del(p)| ≤ u, then |Hsp(p)| ≤ z.

Proof. Suppose T ∈ Hsp(p), so that *T(p) = *p. (See parts (i) and (ii) of the definition of candidate sets, hyp and rej.) Let T = (Q, R), where R is a partition of R̄. We show bounds for |Q| and |R̄|. The rest of the proof parallels that of Lemma 7.

²¹ For the use of superscript indices, see Ginsburg and Partee (1969).

²² Terminal symbols (and inserted morphemes) which have been added to a base phrase-marker by copying or insertion may not be deleted subsequently.
As in Lemma 7, |Q| ≤ 2x + 1. Let u be the number of deletions caused by T; that is, for u of the Qᵢ there is no i in R̄. When T is applied to p, nodes assigned to these deleted Qᵢ must dominate no terminals outside del(p). Thus |R̄| ≤ |Q| − u + |del(p)| ≤ 2x + 1 + |del(p)|. ∎

THEOREM 5. Bound on alternatives specified by LP. ∀x, ∃z, ∀y, if degree(b) ≤ x, then |hyp(d, C)| + |rej(d, C)| ≤ z, where d = (b, s) and C are respectively the current datum and the current component at time y.

Proof.
Let Bₓ be the finite set of base phrase-markers of degree at most x.
Let U = max_{b∈Bₓ} |*b| and let V = max_{b∈Bₓ} |*A·(b)|.
Further, in view of Lemma 1, let W be the greatest number of eligible nodes of any member of elig(B). Then |s| ≤ V, |elig(p)| ≤ W, and |del(p)| ≤ U, where p = Cⁱ(b)
for any appropriate i, and where elig(p) and del(p) are as defined in Lemmas 7 and 8. Lemmas 7 and 8 can now be invoked. From these and the definition of hyp(d, C) it follows that hyp(d, C) is a bounded union of bounded sets, hence bounded. ∎

DEFINITION. C is moderately equivalent to A, with respect to B, iff ∀b ∈ B, *C·(b) = *A·(b). This situation is denoted C ≈ A.

The word "moderate" is employed here with an eye to comparison with the terms "weakly equivalent" and "strongly equivalent" which are used in studies of formal grammar (see Chomsky, 1963). Moderate equivalence differs from strong equivalence in not requiring C·(b) = A·(b), that is, not requiring identical surface structures. On the other hand, it goes beyond weak equivalence in requiring equality of the mappings defined by A and C with respect to B, not just equality of the respective surface languages (sets of sentences).

The idea of moderate equivalence is quite important here. On one hand, weak equivalence is inappropriate since success according to such a criterion does not
imply knowledge of the relationship between meaning and utterance. On the other hand, one could not aspire to any stronger form of equivalence between C and A, since if C is moderately equivalent to A then the two are indistinguishable with respect to all possible data. In other words, any sequence of (b, s) pairs generated by base B and component A could, if C is moderately equivalent to A, have been generated using C in place of A. Therefore, once LP guesses such a C, it will never alter its guess.

We next prove "identifiability in the limit with probability one." The proof consists of showing that there exists an integer t* and a probability p > 0 such that at any time t either moderate equivalence (of C and A with respect to B) has been achieved or else the probability of achieving it at or before t + t* is at least p. It follows that at any t the probability of not achieving moderate equivalence by t + Kt* is less than (1 − p)^K, a quantity which can be made arbitrarily small by choosing K sufficiently large. (The base grammar B is chosen from B and is held fixed throughout Theorem 6.) For any t, let C_t be the current component at time t.
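In symbols, the bookkeeping behind "arbitrarily small" is the following elementary estimate (a restatement of the last two sentences, not an addition to the argument):

\[
\Pr\bigl[\text{moderate equivalence not achieved by } t + K t^{*}\bigr]
\;\le\; (1-p)^{K} \;\le\; \varepsilon
\quad\text{for}\quad
K \;\ge\; \frac{\log \varepsilon}{\log(1-p)},
\]

the division being legitimate because log(1 − p) < 0 for 0 < p < 1.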
THEOREM 6. ∀ε > 0, ∃t′, Prob(C_{t′} ≈ A) ≥ 1 − ε.

Proof. For any t and for any C ∈ C, suppose that C is the current component at time t and that C ≉ A; that is, for this C and for some b ∈ B, *C·(b) ≠ *A·(b).
Then by Theorem 3, there exists b′ ∈ B with degree(b′) ≤ U(B) such that *C·(b′) ≠ *A·(b′).
Further, by Theorem 4 there exists b″ ∈ B such that deg(b″) ≤ deg(b′) ≤ U(B) and either (i) rej(d″, C) ∩ (C − A) ≠ ∅ or (ii) hyp(d″, C) ∩ (A − C) ≠ ∅, where d″ = (b″, s″) and s″ = *A·(b″). There are finitely many (b, s) pairs with deg(b) ≤ U(B); let P₁ be the smallest probability assigned to any of them, so that Prob(d″) ≥ P₁. In the event that d″ is the next datum, there is by Theorem 5 a bound, z, on the number of possible alterations, and therefore by equiprobable selection (definition of LP) each possible alteration has conditional probability at least P₂ = 1/z of occurring, given the appearance of d″. Thus, in view of (i), (ii) above, there is probability at least P₂ that occurrence of d″ results in either
(iii) rejecting an element of C − A, or

(iv) hypothesizing an element of A − C.

Thus with probability at least P₁ the next datum is some d″ such that the probability that d″ results in (iii) or (iv) is at least P₂. The cardinalities of the sets mentioned in (iii) and (iv) are bounded independently of time, since at any time

|C − A| ≤ |C| ≤ |elig(B)|  (Lemma 6)

and |A − C| ≤ |A|.

Therefore, there exists a sequence of k ordered pairs of data d_{t_i} and resulting alterations Z_{t_i},

(v) ((d_{t_1}, Z_{t_1}), (d_{t_2}, Z_{t_2}),..., (d_{t_k}, Z_{t_k})),

such that (vi)-(ix) hold:

(vi) k ≤ M (where M is defined as |elig(B)| + |A|),

(vii) Prob(d_{t_i}) ≥ P₁, for i = 1,..., k,

(viii) Prob(Z_{t_i} | C, Z_{t_1},..., Z_{t_{i-1}}, d_{t_i}) ≥ P₂, where the left-hand side is understood to be the probability that alteration Z_{t_i} occurs at time t + i − 1, given that the current component at t was C, that Z_{t_1},..., Z_{t_{i-1}} were the next i − 1 alterations, and that d_{t_i} was presented at t + i − 1.

(ix) For some A′ ≈ A, the alterations Z_{t_1},..., Z_{t_k} comprise the rejection of all members of C − A′ and the hypothesization of all members of A′ − C, so that, if these alterations occur, the resulting current component would be A′. (Alterations of type (iii) and (iv) would make this true for A′ = A unless some other A′ ≈ A were to become the current component sooner.)

From (vi)-(viii), the probability is at least p = (P₁P₂)^M that (v) will occur, causing (according to (ix)) the current component at²³ time t + M to be some A′ ≈ A. This last statement has been shown true for any time t and whatever the current component at that time may be. Therefore, consider the probability of a succession of K sequences such as (v), where K is an integer greater than log ε / log(1 − p). The probability of reaching some component A′ ≈ A within time t′ = KM is at least

1 − (1 − p)^K > 1 − (1 − p)^{log ε / log(1 − p)} = 1 − ε.

Therefore t′ = KM satisfies the requirement in the statement of the theorem. ∎
²³ Actually, p is a lower bound on this probability for some time no later than t + M. However, since A′, once achieved, will never be left, the statement in the text is also correct.
A stronger result can be obtained. We have seen in the proof that the probability is always at least p that a correct guess will be attained within M steps. It follows that the expected time to attain a correct guess cannot exceed Σ_{k=1}^∞ kMp(1 − p)^{k−1}, which converges for 0 < p ≤ 1.
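Evaluating the sum (a standard geometric-series computation, supplied here for completeness) shows the expected learning time is at most M/p:

\[
\sum_{k=1}^{\infty} kM\,p(1-p)^{k-1}
\;=\; Mp \sum_{k=1}^{\infty} k(1-p)^{k-1}
\;=\; \frac{Mp}{\bigl(1-(1-p)\bigr)^{2}}
\;=\; \frac{M}{p}.
\]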
4. RESULTS
In the first three subsections we informally extend the theory in three important directions. Then we turn briefly to a potential objection to the complexity of the proofs. In the final subsection we permit ourselves a word on ecumenism.

4.1. Several Transformations at a Level
The requirement that only one transformation be used at each level runs counter to much linguistic evidence. This requirement is dropped in Hamburger (1971, Chap. 6). A simple one-level base phrase-marker may undergo one transformation to make it passive and then another to make it interrogative. Roughly speaking, the corresponding sentences could be the following.

(i) John hit the ball (base).

(ii) The ball was hit by John (after application of passive transformation).

(iii) Was the ball hit by John? (after application of interrogative transformation to (ii)).

Not only are two transformations used, but they are used in a particular order. What the passive transformation does, among other things, is find a noun phrase (here "John") at the beginning of a sentence and move it to the end. The interrogative transformation moves the auxiliary (here "was") to the beginning. These two operations could not be carried out in the reverse order since, if the auxiliary were fronted first, there would no longer be an initial noun phrase to move. In a case like this,²⁴ the learner does not have to learn the order in which transformations apply, because he cannot apply them out of order. Assuming that this is the typical case, it is possible to prove learnability while essentially retaining the advantages claimed for the approach used above (Sections 2 and 3) for the simpler case of one transformation per level. The proof involves a generalization of the learning procedure
²⁴ This is called intrinsic ordering of transformations.
to allow hypothesization of a transformation which, together with other transformations (which may or may not already be in the guessed component), can accomplish what in Subsection 3.2 a hypothesized transformation had to do alone, that is, yield the correct surface string at the top level or else be string-preserving at some other level. Here again the procedure seems plausible because one can see how correct transformations might be guessed, but clearly incorrect ones abound and so the proof of convergence is quite important.

This rule-hypothesization view of language learning throws an interesting light on an important empirical observation by Bellugi-Klima (1968). She found that a child may be able to use each of two particular transformations correctly separately and yet not be able to use them together. Slobin (1971) suggests that "apparently there is some performance limitation, some restriction on 'sentence programming span,' ...at this stage of development." From our viewpoint a tremendous efficiency advantage would exist for a learner who assumed, at first, that only one transformation was allowed per level, because the number of combinations of two transformations is vastly larger than the number of individual transformations. Thus we suggest the temporary limitation might be not on performance but on the class of "possible competences."

4.2. Filtering

Filtering is a capability of a transformational grammar whereby certain base phrase-markers correspond to no surface string. In this subsection we provide an example, a comment on the formal theoretical significance of the notion, and finally a brief argument that filtering can be admitted into our system without undermining the proofs of learnability.
FIG. 13. A base phrase-marker which would be "filtered out" and therefore not be part of any datum.
Consider the base phrase-marker in Fig. 13. Ignoring the # symbols for the moment, note that, if "box" had instead been "pen," then a so-called relativization
transformation could apply, replacing the lower-level "the pen" by "that" to give "the pen that fell is old" as surface string. However, the base phrase-marker as it stands in the figure does not meet the requirement that the lower-level and upper-level noun phrases be identical, and so the relativization transformation is not applicable. The resulting surface string, we would have to admit at this point, would have to be "the pen the box fell is old." Such a string would be meaningless and, to ensure that our system does not count it a valid sentence, we can "filter it out." This is accomplished by having the base phrase-marker put in²⁵ the # symbols, called boundary markers, at each level. The relativization transformation removes the #'s, which are "forbidden" in surface strings. Since the relativization transformation does not fit (meet the requirements for application to) the phrase-marker in Fig. 13, the #'s are not removed and the would-be surface string contains forbidden symbols which identify it as a nonsentence.

Filtering is an important theoretical issue because Peters and Ritchie (1973) have shown that it makes certain transformational grammars²⁶ weakly equivalent²⁷ to Turing machines. That means there is no computable set of sentences they could not be used to represent. Thus, to say that natural languages, as sets of sentences, are representable by such transformational grammars is to say nothing at all. Sampson (1973) takes the more sanguine position that transformational grammar can still be accepted provided that it enables linguists to represent natural languages relatively simply and at the same time specifically does not make possible simple representations of clearly non-natural-language sets of strings. An example of the latter is the set of all strings (over some vocabulary) with length equal to any prime number.

Without settling this argument, we turn to the significance of Peters and Ritchie's result for learnability. Gold (1967) considers a sequence of sentences he calls "text" data, as opposed to "informant" data, which consist of all possible strings over a vocabulary, with sentential and nonsentential strings distinctively labeled. He shows that even

²⁵ This is easily accomplished by the following sequence of changes in the rules of the base grammar: Change every S to S′, insert # before and after S′ on the right-hand side of every rule in which it appears, and include the rule S → # S′ #. Altering the base grammar in this way enables Ginsburg and Partee (1969) to treat level-by-level transforming without talking about levels as such. In other words, they thereby formalize that process. Note that their grammars are not height-two.

²⁶ The transformational grammar used by Peters and Ritchie uses a finite state grammar as a base. Such grammars comprise a subset of the context-free grammars. Their transformations are not subject to the restrictions introduced in Subsection 2.2. We do not know whether a result like theirs could be obtained with our form of grammar.

²⁷ Forms of equivalence are discussed in Subsection 3.3. Weak equivalence does not provide definitive conclusions about natural language but has proved mathematically tractable and has at least provoked linguistic insight. Strong equivalence has been presumed intractable and has gone virtually unstudied. We hope to have herewith raised some interest in moderate equivalence.
with this stronger form of input data, the class of languages generated by Turing machines is not learnable (not identifiable in the limit). Thus the result of Peters and Ritchie and that of Gold, taken together, show that if we are not careful we shall find ourselves with language and learning theories that predict we cannot learn to speak properly.

We turn now to the question of whether we can extend the transformational system of Section 2 to allow filtering and still demonstrate learnability. The base is modified, using S′ and # as specified in footnote 25, and only (b, s) pairs for which the string s includes no occurrence of the forbidden symbol # are allowed as data. Theorem 3 shows that, if A and C, the true and guessed components respectively, disagree on any (b, s) of any degree at all, then they must also disagree for some (b′, s′) of degree less than U, where U is a function only of the base grammar. However, if filtering is allowed, then s′ may contain #, in which case (b′, s′) would not be a datum (but would be filtered out). We must show that if (b, s) is allowed as a datum then so is (b′, s′). Contrapositively, it suffices to show that if s′ contains # then so does s, and this follows immediately from the fact that s′ is a substring of s. Therefore, the filtering property can be incorporated into our grammar without affecting the claim to learnability.
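A sketch of the base-grammar modification of footnote 25, in code. The rule encoding (pairs of left-hand symbol and right-hand tuple) is our own hypothetical convention, not the authors' formalism:

    def add_boundary_markers(rules):
        """Implement filtering via # boundary markers (footnote 25): rename
        S to S', wrap each S' occurring on a right-hand side in # ... #, and
        add the rule S -> # S' #."""
        new_rules = []
        for lhs, rhs in rules:
            new_lhs = "S'" if lhs == "S" else lhs
            new_rhs = []
            for symbol in rhs:
                new_rhs.extend(["#", "S'", "#"] if symbol == "S" else [symbol])
            new_rules.append((new_lhs, tuple(new_rhs)))
        new_rules.append(("S", ("#", "S'", "#")))
        return new_rules

    # Only #-free surface strings survive as data, so a derivation whose
    # transformations fail to remove the markers is filtered out.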
4.3. Elaborated Structural Descriptions
For various reasons, some linguists have used structural descriptions which have a basic part like those in Section 2 but which may have other parts as well. These auxiliary parts must fit²⁸ with the phrase-marker just as does the basic part, but they are not used in the structural change. There may be interrelations between the auxiliary parts and the basic part. For example, we might have (A, B, D) as basic part and (C, D) as auxiliary part, together with the specification that the node fitting with D be the same in each case. Thus C would, in effect, be required to dominate A and B. Extending the theory in this way necessitates reworking of the proofs in Section 3, particularly those of Theorem 5 and the lemmas which immediately precede it. Nevertheless, the general result, Theorem 6, can still be maintained, provided the auxiliary parts of the structural description are, like the basic part, required to refer only to the currently eligible nodes of the phrase-marker.
4.4. Insufficiency of Degree-One Data
It might be thought, at first glance, that the learner could deduce a correct transformational component simply from degree-zero and degree-one data. Degree-zero data would seem to provide evidence for any transformation acting within a single level, while degree-one data might do the same for transformations acting on a so-called

²⁸ More generally, there can be a Boolean condition on additional parts (see Ginsburg & Partee, 1969).
upper level and its lower levels. The height-two restriction, by this reasoning, would make higher-degree data unnecessary. Such an argument, if correct, would replace Theorems 3 and 4, thereby saving a lot of effort. The simple example of Fig. 14 shows that such reasoning is invalid. The only base phrase-markers of degree 0 or 1 for the base B shown are b₁, b₂, and b₃. For none of these can T₂ apply, so components C₁ and C₂ are indistinguishable from each other in terms of all possible data of degree 0 or 1. Nevertheless, C₂ acts differently upon b₄ than does C₁; specifically, *C₂·(b₄) = amefcd, as shown, whereas *C₁·(b₄) = aefcd. Thus, a learner who guessed C₁ and was exposed only to data of degree 0 or 1 would never have any basis for rejecting that guess. Nevertheless, he would be incorrect.
FIG. 14. Example showing the insufficiency of degree-one data. The arrow next to an S shows which level has been transformed. (Base B: S → aS; S → feS; S → cd. T₁: fecd → (2, 1, 3, 4); T₂: aeX → ((1, m), 2, 3). C₁ = {T₁}; C₂ = {T₁, T₂}.)
The basic problem is that, although a level cannot directly affect the level which is two levels higher than itself (its "grandparent," so to speak), it can affect the level immediately above it, which in turn affects the next level up. Such effects need not depend upon raising, as can be seen from the example just elaborated.

4.5. Language and Learning

The aspects of grammatical structure which we have incorporated here are not arbitrary but rather have had considerable play in the linguistic literature of the last fifteen years. The attempt to represent the language structure possessed by an adult, without reference to acquisition, has been the source of the notions deletion, essential recursion, height-two, filtering, nondeterminacy, and raising. Ineligibility is a new notion, though worthy of future linguistic study, as we attempt to show in Wexler et al. (manuscript in preparation).
The important point is that new insight into such linguistic devices is provided by the formal study of the logical possibility of learning the structures which they define. Moreover, such acquisition insight is independent of the structure insight which linguists base on adult utterances, native judgments, linguists' introspection, etc. The bridge that Chomsky has reerected between linguistics and psychology bears two-way traffic.

APPENDIX
LEMMA A1. If (1)-(4), then (5), where each symbol denotes a string:

(1) e x f = u w v,

(2) e cᵢ x dᵢ f = u gᵢ w hᵢ v, for each i,

(3) e cᵢ cⱼ x dⱼ dᵢ f = u gᵢ gⱼ w hⱼ hᵢ v, for i < j,

(4) e cᵢ cⱼ cₖ x dₖ dⱼ dᵢ f = u gᵢ gⱼ gₖ w hₖ hⱼ hᵢ v, for i < j < k,

(5) e c₁ c₂ c₃ c₄ x d₄ d₃ d₂ d₁ f = u g₁ g₂ g₃ g₄ w h₄ h₃ h₂ h₁ v.

LEMMA A2. If (1)-(4), then (5′), where
(4)-(1) are formed from (5′) by picking one, two, three, or four of 1, 2, 3, 4 and removing from (5′) all double-bar symbols with those subscripts.

Proofs are given in Appendix B of Hamburger (1971).

REFERENCES

BELLUGI-KLIMA, U. Linguistic mechanisms underlying child speech. In E. M. Zale (Ed.), Proceedings of the Conference on Language Behavior. New York: Appleton-Century-Crofts, 1968.

CHOMSKY, N. Formal properties of grammars. In R. D. Luce, R. R. Bush & E. Galanter (Eds.), Handbook of Mathematical Psychology, Volume II. New York: Wiley, 1963.

CHOMSKY, N. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press, 1965.

CHOMSKY, N. Linguistics and philosophy. In S. Hook (Ed.), Language and Philosophy. New York: NYU Press, 1969.

CHOMSKY, N. Conditions on transformations. In S. R. Anderson & P. Kiparsky (Eds.), A Festschrift for Morris Halle. New York: Holt, Rinehart and Winston, 1973.

CULICOVER, P., & WEXLER, K. An application of the freezing principle to the dative in English. Social Sciences Working Papers, 39. Irvine, CA: University of California, 1974a.

CULICOVER, P., & WEXLER, K. Three further applications of the freezing principle in English. Social Sciences Working Papers, 48. Irvine, CA: University of California, 1974b.
DERWING, B. L. Transformational Grammar as a Theory of Language Acquisition. London: Cambridge University Press, 1973.

FELLER, W. An Introduction to Probability Theory and Its Applications. New York: Wiley, 1957.

GINSBURG, S., & PARTEE, B. A mathematical model of transformational grammars. Information and Control, 1969, 15, 297.

GOLD, E. M. Language identification in the limit. Information and Control, 1967, 10, 447.

HAMBURGER, H. On the learning of three classes of transformational components. Doctoral dissertation, University of Michigan, 1971.

HAMBURGER, H., & WEXLER, K. Identifiability of a class of transformational grammars. In K. J. J. Hintikka, J. M. E. Moravcsik & P. Suppes (Eds.), Approaches to Natural Language. Dordrecht, Holland: Reidel, 1973.

HOPCROFT, J. E., & ULLMAN, J. D. Formal Languages and Their Relation to Automata. Reading, MA: Addison-Wesley, 1969.

KLEIN, S., & KUPPIN, M. An interactive, heuristic program for learning transformational grammars. Technical Report No. 97. Madison, WI: Computer Sciences Dept., University of Wisconsin, 1970.

MCNEILL, D. The Acquisition of Language. New York: Harper and Row, 1970.

MILLER, G. A., & NORMAN, D. A. Research on the use of formal languages in the behavioral sciences. Semiannual technical report, Department of Defense, Advanced Research Projects Agency. Cambridge, MA: Harvard University, Center for Cognitive Studies, June 1964.

PARTEE, B. H. On the requirement that transformations preserve meaning. In C. J. Fillmore & D. T. Langendoen (Eds.), Studies in Linguistic Semantics. New York: Holt, Rinehart and Winston, 1971.

PETERS, P. S., & RITCHIE, R. W. Nonfiltering and local-filtering transformational grammars. In K. J. J. Hintikka, J. M. E. Moravcsik & P. Suppes (Eds.), Approaches to Natural Language. Dordrecht, Holland: Reidel, 1973.

POSTAL, P. M. On Raising: One Rule of English and Its Theoretical Implications. Cambridge, MA: MIT Press, 1972.

SAMPSON, G. The irrelevance of transformational omnipotence. Journal of Linguistics, 1973, 9, 299-302.

SCHLESINGER, I. M. Production of utterances and language acquisition. In D. I. Slobin (Ed.), The Ontogenesis of Language. New York: Academic Press, 1971.

SLOBIN, D. I. Comments on "Developmental psycholinguistics." In F. Smith & G. A. Miller (Eds.), The Genesis of Language. Cambridge, MA: MIT Press, 1966.

SLOBIN, D. I. Psycholinguistics. Glenview, IL: Scott, Foresman, 1971.

SOLOMONOFF, R. J. A formal theory of inductive inference. Information and Control, 1964, 7, 1-22; 224-254.

SUPPES, P. The desirability of formalization in science. Journal of Philosophy, 1968, 65, 651-664.

WEXLER, K., & CULICOVER, P. Two applications of the freezing principle in English. Social Sciences Working Papers, 48. Irvine, CA: University of California, 1973.

WEXLER, K., CULICOVER, P., & HAMBURGER, H. Learning-theoretic foundations of linguistic universals. Theoretical Linguistics, in press.

WEXLER, K., CULICOVER, P., & HAMBURGER, H. Formal Principles of Language Acquisition (in preparation).

WEXLER, K., & HAMBURGER, H. On the insufficiency of surface data for the learning of transformational grammar. In K. J. J. Hintikka, J. M. E. Moravcsik & P. Suppes (Eds.), Approaches to Natural Language. Dordrecht, Holland: Reidel, 1973.
RECEIVED: December 19, 1973