The acquisition of the English past tense in children and multilayered connectionist networks

COGNITION ELSEVIER

Cognition 56 (1995) 271-279

Discussion

Gary F. Marcus*
Department of Psychology, Tobin Hall, University of Massachusetts at Amherst, Amherst, MA 01003, USA

1. Introduction

Children's overregularization errors (e.g., goed) were once the paradigm example of a mental rule. But Rumelhart and McClelland (1986; henceforth RM) showed that a connectionist model which contained no explicit linguistic rules could produce these errors, making overregularizations central to discussion about whether traditional symbolic rules could be replaced by connectionist networks. The RM model was extensively criticized (e.g., Pinker & Prince, 1988; Lachter & Bever, 1988), but McClelland (1988) argued that "a problem with the [RM model] is that it has no intervening layers of units between the input and the output. This limitation has been overcome by the development of the back-propagation learning algorithm" (p. 118).

2. Multilayer networks

The back-propagation algorithm inspired several new connectionist models of the acquisition of inflection (Cottrell & Plunkett, 1991; Daugherty & Seidenberg, 1992; Gasser & Lee, 1991; Hoeffner, 1992; MacWhinney & Leinbach, 1991; Plunkett & Marchman, 1991, 1993; cf. Egedi & Sproat, 1991). MacWhinney and Leinbach (1991, p. 143) argued that their model improves on the RM model by using "back-propagation... [and] layers of hidden units to capture nonlinearities in problem spaces". Similarly, Plunkett and Marchman (1991, p. 99) argued that "the use of a backpropagation algorithm in a network with hidden units represents a step forward in the application PDP systems to problems of language processing and acquisition". Of these new models, Plunkett and Marchman's (1993, henceforth PM) model is most comprehensive. This paper compares that model with a child.

* E-mail: [email protected]; fax: 413 545 0996.

0010-0277/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved. SSDI 0010-0277(94)00656-3

3. Developmental sequence

Fig. 1, reprinted from Plunkett and Marchman (1993, p. 46), compares the longitudinal development of a child (Adam) and of the model. The striking similarity between the model and the child motivated PM's claim that "the behavior of these networks can be seen to mimic several aspects of the type and timing of children's pattern of morphological acquisition" (1993, p. 58). But careful inspection reveals that the two graphs are not truly comparable:

• x-axis variable: Where Adam's data are plotted by age, the simulation's are plotted by vocabulary size. This discrepancy, important since age is not linearly related to vocabulary size (Marcus et al., 1992), is not discussed.
• x-axis range (Adam): Plunkett and Marchman claim that their graph of Adam's performance is "reproduced from Marcus et al., 1992", but where Marcus et al. plotted all of Adam's data, which end at age 5;2, Plunkett and Marchman, without explanation, prematurely truncate Adam's data at age 4;7. (cf. Plunkett & Marchman, 1993, p. 61; Adam's data are truncated at age 4;11.)
• x-axis range (simulation): Whereas all other graphs and tables of the simulation (e.g., their Figures 1, 2, 4, 5; Tables 2, 3) display vocabulary ranges from 0 to 500, the two graphs which are compared directly with Adam, without explanation, truncate the x-axis range to only 320 verbs.
• y-axis variable: For Adam, overregularization rate is calculated in tokens, whereas the simulation's overregularization rate is calculated in types.

Each change exaggerates the similarity of the model and Adam. Including all data and making the axes of the two figures match yields Fig. 2.¹ Although similarities remain, replotting reveals that the model overregularizes substantially less often (0.9%, in types) than Adam (5.2%, in types), a child who already overregularizes less than average (Marcus et al., 1992). Furthermore, while Adam continues overregularizing at vocabulary sizes of nearly 400, the PM model has long since stopped altogether. Although these problems may eventually be fixed, it is clear that the seemingly picture-perfect correspondence between Adam and the model stems from misleading graphing practices.

[Figure 1: panel (a) Adam, x-axis: Age in Months; panel (b) Simulation, x-axis: Vocabulary Size.]

Fig. 1. (1 - Overregularization rate) for Adam (reproduced from Marcus et al., 1992) and the simulation. Data are expressed as percentages of irregular verbs produced. The overregularization rate for Adam reflects the number of verb tokens whereas for the simulation it reflects the number of verb types.

[Figure 2: panel (a) Adam; panel (b) Simulation; y-axis: 1 - Overregularization Rate; x-axis: Vocabulary Size.]

Fig. 2. (a) Adam's rate of overregularization, calculated in types. (b) The simulation's rate of overregularization, calculated in types.
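The y-axis mismatch matters because token-based and type-based rates can diverge sharply for the same set of productions. The following minimal Python sketch illustrates this with invented counts. The function name, the data, and in particular the rule used for the type-based rate (a verb type counts as overregularized if it is overregularized at least once) are assumptions made for illustration; they are not the procedure used by Marcus et al. (1992) or by Plunkett and Marchman (1993).

```python
def overregularization_rates(productions):
    """Return (token_rate, type_rate) for irregular past-tense productions.

    productions: list of (verb, form) pairs, where form is either
    "correct" (e.g. went) or "overregularized" (e.g. goed).
    """
    productions = list(productions)
    if not productions:
        raise ValueError("no irregular past-tense productions supplied")

    # Token-based rate: every production counts once.
    over_tokens = sum(1 for _, form in productions if form == "overregularized")
    token_rate = over_tokens / len(productions)

    # Type-based rate: every verb counts once.
    # ASSUMPTION: a verb type is "overregularized" if it shows at least one error.
    all_types = {verb for verb, _ in productions}
    over_types = {verb for verb, form in productions if form == "overregularized"}
    type_rate = len(over_types) / len(all_types)

    return token_rate, type_rate


# Invented sample: one frequent verb with two errors, two verbs with none.
sample = (
    [("go", "correct")] * 18
    + [("go", "overregularized")] * 2
    + [("sing", "correct")] * 5
    + [("hit", "correct")] * 5
)
token_rate, type_rate = overregularization_rates(sample)
print(f"token-based: {token_rate:.3f}, type-based: {type_rate:.3f}")
```

With these made-up counts the same sample yields a token-based rate of 2/30 and a type-based rate of 1/3, which is why a curve plotted in tokens cannot be compared directly with one plotted in types.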

¹ The data for the Plunkett and Marchman figure are from their Table 3, because in their published graph the x-axis was inadvertently shifted by one tick mark (Plunkett, personal communication, December 9, 1993).

4. Input

The apparent similarity between the input to the model and to the child, shown in Fig. 3, is also misleading. The scales on the x-axis are truncated, but at different points than in Fig. 1, and measure different things. More importantly, whereas Plunkett and Marchman earlier compared types with tokens, here they compared the model's input measured in tokens with Adam's input measured in types. Fig. 4 replots these data, comparing types with types (overregularization rate) and tokens with tokens (input). By the time the simulation begins overregularizing, the proportion of regular tokens in the input to the PM model is roughly twice as high as the proportion in Adam's parental speech. Given more realistic input, the model might even fail to generalize since "generalization is virtually absent when regulars contribute less than 50% of the items overall" (Plunkett and Marchman, 1993, p. 55). Moreover, whereas the proportion of tokens that are regular in the input to the model rapidly increases before the onset of overregularization, the proportion of tokens that are regular in the parent's speech to Adam is roughly constant, suggesting that the overregularization is a consequence of intrinsic properties of the child's language learning mechanism, rather than external changes in input.

[Figure 3: panel (a) Adam, x-axis: Age in Months; panel (b) Simulation, x-axis: Vocabulary Size; plotted series: % Irreg Correct and % Reg Vocabulary.]

Fig. 3. Comparison of overregularization rate and proportion of regular verbs of total vocabulary tokens for simulation with same measured for Adam (taken from Marcus et al., 1992). Note that the irregularization rate for the simulation reflects the number of irregular verb types.

[Figure 4: panel (a) Adam; panel (b) Simulation; x-axis: Vocabulary Size; plotted series: 1 - Overregularization Rate and proportion of vocabulary that is regular.]

Fig. 4. (a) Input to Adam measured as percentage of tokens which are regular, compared with overregularization rate calculated in types. (b) Input to the simulation measured as percentage of tokens which are regular, compared with overregularization rate calculated in types.
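The distinction this section draws between input measured in types and input measured in tokens can be made concrete with a small calculation. The sketch below is illustrative only: the verbs, their frequencies, and the function name are hypothetical and are not taken from Adam's transcripts or from the PM training set.

```python
def regular_proportions(vocabulary):
    """Return (type_proportion, token_proportion) of regular verbs.

    vocabulary: dict mapping verb -> (is_regular, token_frequency).
    """
    n_types = len(vocabulary)
    n_tokens = sum(freq for _, freq in vocabulary.values())
    if n_types == 0 or n_tokens == 0:
        raise ValueError("vocabulary must contain at least one token")

    # Types: each verb counts once, regardless of how often it occurs.
    regular_types = sum(1 for is_regular, _ in vocabulary.values() if is_regular)
    # Tokens: each verb is weighted by its frequency in the input.
    regular_tokens = sum(freq for is_regular, freq in vocabulary.values() if is_regular)

    return regular_types / n_types, regular_tokens / n_tokens


# HYPOTHETICAL input: a few very frequent irregulars, several rarer regulars.
toy_vocabulary = {
    "go": (False, 40),
    "come": (False, 30),
    "walk": (True, 5),
    "jump": (True, 5),
    "play": (True, 10),
    "kiss": (True, 10),
}
type_prop, token_prop = regular_proportions(toy_vocabulary)
print(f"regular by types: {type_prop:.2f}, regular by tokens: {token_prop:.2f}")
```

In this toy vocabulary regulars make up two thirds of the types but only 30% of the tokens, so a types-versus-tokens comparison of two inputs can suggest a similarity, or a difference, that is not there.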

5. U-shaped development

Children's development of the English past tense follows a "U-shaped" sequence of development. Children first correctly inflect irregular past tense forms (when they mark them at all), subsequently produce occasional (about 4% of the time) overregularization errors like singed, and finally master the past tense (Ervin & Miller, 1963; Marcus et al., 1992; Marcus, in press). Rumelhart and McClelland (1986) modeled this U-shaped sequence, but only through a sudden external change in the training regimen, not available in the input provided to a real child (Pinker & Prince, 1988; Marcus et al., 1992). Connectionist advocates MacWhinney and Leinbach (1991, p. 130) concede that "this sort of fiddling with the input data is an illegitimate way of deriving the desired phenomenon".

Most subsequent connectionist models lack abrupt changes in their training regimes, but, correspondingly, those models have been unable to display U-shaped development. Rather, these networks typically show initially² poor performance on irregulars, which improves over successive epochs. Egedi and Sproat (1991) systematically compared two training regimes in a multilayered back-propagation network. If the input was divided into distinct stages, the model displayed a U-shaped developmental sequence; when the input was not presented in distinct stages the model did not display a U-shaped sequence.

Unlike many of its predecessors, the PM model exhibits a U-shaped sequence. But surprisingly, the PM model, like the RM model, contains a nonlinear change in the training regime: "Early in training, a new verb is introduced every 5 epochs until vocabulary size reaches 100. Thereafter, training is reduced to 1 epoch per new verb" (Plunkett and Marchman, 1993, p. 33).³ Fig. 5 replots the simulation's performance, with a line superimposed at 110 verbs, the point at which the training regime changes. (No data are available for vocabulary sizes of between 101 and 109, since the model is tested only once every 10 verbs.) This abrupt change in the training regime corresponds exactly with the model's first overregularization. For the first 100 words, the model is given 5 epochs per verb, apparently allowing the model ample time to learn each new stem-past pair without overgeneralization. Then the training regime suddenly switches, forcing the model to assimilate new verbs five times more rapidly, and the model, apparently lacking time to completely learn each verb, starts to overgeneralize. Thus the dip in the U-shaped curve appears to be caused not by an internal reorganization triggered by a constant increase in vocabulary size, but rather by an externally imposed discontinuity.⁴

Plunkett and Marchman (1993, pp. 33-34) claim that this external change in training "[is] intended to model non-linearities in rate of vocabulary growth that are sometimes observed in longitudinal studies of young children (Dromi, 1987)". But the vocabulary spurt Dromi describes occurs at 16 months (Dromi, 1987, p. 111), roughly a year before the onset of overregularization (about 29 months; Marcus et al., 1992). Contrary to the simulation's prediction, the vocabulary spurt cannot explain the onset of overregularization.

[Figure 5: Simulation; y-axis: 1 - Overregularization Rate; x-axis: Vocabulary Size; a vertical line marks the change in training regimen.]

Fig. 5. (1 - Overregularization rate) for Plunkett and Marchman's simulation. Vertical line indicates point of change in training regimen.

² Daugherty and Seidenberg (1992) and Gasser and Lee (1991) do not present any longitudinal data for their models. Plunkett and Marchman's (1991) model does show wiggles in the developmental curve, but their model does not display an initial period of correct performance. As Kruschke (1990) noted: "The recent work of Plunkett and Marchman [1991] does not exhibit U-shaped learning, contrary to their claims. They showed that acquisition fluctuated depending on the particular training sequence, but they failed to mention that on average their model showed monotonic, not U-shaped, acquisition" (p. 61).

³ There are actually two discontinuities: "verbs ... that are introduced after the 100 vocabulary mark are trained with a token frequency of 1 [rather than the 5 or 3 tokens assigned earlier]" (p. 40).

⁴ U-shaped development in Hoeffner's (1992) model may also depend on an external, psychologically unmotivated change in the training regimen. While Hoeffner notes on p. 863 that there is no "sudden influx of regular verbs or a rapid change in the relative proportions of regular and irregular verbs", the input to the model is "expanded incrementally ... until epoch 45" when the model suddenly stops receiving new verbs. It is probably not coincidental that "by epoch 44, the number of overregularization errors began to diminish". Moreover, the model's period of peak overregularization occurs while the input to the model, measured in types, grows from 67% to 90% regular, whereas Adam's period of peak overregularization occurs after the proportion of regular verb types in his vocabulary has stopped growing rapidly.
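The discontinuity in the training schedule quoted in this section (5 epochs per new verb up to a vocabulary of 100, then 1 epoch per new verb) can be sketched as a simple schedule generator. This is a rough reconstruction for illustration, not the PM implementation. The function names are hypothetical, the starting vocabulary of 20 is an assumption, and so is the reading that the verb bringing the vocabulary to exactly 100 is still introduced under the 5-epoch regime.

```python
def epochs_per_new_verb(vocab_size, switch_at=100, early=5, late=1):
    """Epochs of training before the next verb is added.

    ASSUMPTION: the slower regime applies while vocab_size < switch_at.
    """
    return early if vocab_size < switch_at else late


def vocabulary_schedule(start_size, final_size, switch_at=100):
    """Return a list of (epoch, vocab_size) pairs, one per vocabulary size."""
    epoch, size = 0, start_size
    schedule = [(epoch, size)]
    while size < final_size:
        epoch += epochs_per_new_verb(size, switch_at)
        size += 1
        schedule.append((epoch, size))
    return schedule


# start_size=20 is an assumed starting vocabulary, chosen only for the example.
sched = vocabulary_schedule(start_size=20, final_size=120)
epoch_at = {size: epoch for epoch, size in sched}

# Growth rate in verbs per epoch, before and after the switch.
rate_before = (100 - 20) / (epoch_at[100] - epoch_at[20])
rate_after = (120 - 100) / (epoch_at[120] - epoch_at[100])
print(rate_before, rate_after)
```

Under these assumptions the vocabulary grows at 0.2 verbs per epoch before the switch and 1.0 verb per epoch after it, the fivefold acceleration that the text identifies as coinciding with the onset of overregularization in the simulation.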

6. Types of errors

One virtue of the RM model was that it modeled differences between types of errors. For instance, children overregularize vowel-change verbs (singed for sang) far more than no-change (identity) verbs (hitted for hit); Adam overregularized vowel-change verbs (5.1%) more than no-change verbs (3.0%). Rumelhart and McClelland (1986, p. 250) noted that their "simulation results show clearly the same patterns evident in the Bybee and Slobin data. Verbs ending in t/d always show a stronger no-change response and a weaker regularized response than those not ending in t/d." Yet the Plunkett and Marchman simulation errs more with no-change verbs (mean = 1.1%) than with vowel-change verbs (mean = 0.8%). Moreover, whereas children's overregularizations are usually of the form stem + ed (singed) rather than past + ed (sanged), only 1/7 of the PM model's overregularizations of vowel-change verbs are "stem + ed".

Similarly, children produce "irregularization errors (e.g., flow → flew) ... less frequently than the standard 'add /-ed/' error" (Plunkett and Marchman, 1993, p. 25; see also Xu and Pinker, in press). Adam overregularized more (mean, typewise = 5.2%) than he irregularized (mean, typewise = 0.75%). But although RM modeled this correctly, the PM simulation overregularized less often (mean, typewise = 0.9%) than it irregularized (mean, typewise = 1.01%).

Bever (1992; Lachter & Bever, 1988) has suggested that the rule-like properties of the RM model may be due to implausible special-purpose linguistic mechanisms built into the model, rather than underlying network principles. For example, the model's phonological representation consisted of overlapping Wickelfeatures; e.g., sing is represented as [#si, sin, ing, ng#]. This representation tends to emphasize word boundaries (Bever, 1992), perhaps increasing the likelihood of "stem + ed" errors while reducing irregularizations. In response to critiques like Bever's, Plunkett and Marchman adopted a more plausible, Wickel-free phonological representation scheme, but correspondingly their model has greater difficulty explaining variation between types of errors.
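The overlapping representation in the sing example can be reproduced with a few lines of Python. This is a simplified sketch: it slides a three-character window over the spelled word with "#" as a boundary marker, as in the example in the text, whereas the RM model operated on phonological units rather than letters. The function name is hypothetical.

```python
def wickel_trigrams(word, boundary="#"):
    """Return the overlapping three-character units of a word.

    The word is padded with a boundary marker on each side, so the first
    and last units explicitly encode the word's edges.
    """
    padded = f"{boundary}{word}{boundary}"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


units = wickel_trigrams("sing")
print(units)  # ['#si', 'sin', 'ing', 'ng#']

# Units that contain a boundary marker, i.e. that encode a word edge.
edge_units = [u for u in units if "#" in u]
print(edge_units)  # ['#si', 'ng#']
```

Two of the four units for sing contain the boundary marker, which gives a concrete sense of how such a coding keeps word edges explicit in the representation.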

7. Summary

The apparent very close similarity between the learning of the past tense by Adam and the Plunkett and Marchman model is exaggerated by several misleading comparisons, including arbitrary, unexplained changes in how graphs were plotted. The model's development differs from Adam's in three important ways:

• Children show a U-shaped sequence of development which does not depend on abrupt changes in input; U-shaped development in the simulation occurs only after an abrupt change in training regimen.
• Children overregularize vowel-change verbs more than no-change verbs; the simulation overregularizes vowel-change verbs less often than no-change verbs.
• Children, including Adam, overregularize more than they irregularize; the simulation overregularized less than it irregularized.

Interestingly, the RM model, widely criticized as being inadequate, does somewhat better, correctly overregularizing vowel-change verbs more often than no-change verbs, and overregularizing more often than it irregularizes. Although Plunkett and Marchman's (1993) state-of-the-art model incorporated hidden layers and back-propagation, used a more realistic phonological coding scheme, and explored a broader range of parameters than Rumelhart and McClelland's model, their results are farther from psychological reality. It is unknown whether any connectionist model can mimic a child's performance without resorting to unrealistic exogenous changes in the training or input, but it is clear that adding a hidden layer and back-propagation does not ensure a solution.

Acknowledgements

I thank Neil Berthier, Chuck Clifton, Jacques Mehler, Neal Pearlmutter, Steve Pinker, Fei Xu, and an anonymous reviewer for helpful comments. This research was supported by a Faculty Research Grant from the University of Massachusetts and NIH HD 18381.

References

Bever, T.G. (1992). The demons and the beast: modular and nodular kinds of knowledge. In R. Reilly & N. Sharkey (Eds.), Connectionist approaches to natural language processing. Hillsdale, NJ: Erlbaum.
Cottrell, G.W., & Plunkett, K. (1991). Learning the past tense in a recurrent network: acquiring the mapping from meanings to sounds. Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Daugherty, K., & Seidenberg, M. (1992). Rules or connections? The past tense revisited. Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Dromi, E. (1987). Early lexical development. Cambridge, UK: Cambridge University Press.
Egedi, D.M., & Sproat, R.W. (1991). Connectionist networks and natural language morphology. Unpublished manuscript, AT&T Bell Laboratories, Linguistics Research Department, Murray Hill, NJ.
Ervin, S.M., & Miller, W.R. (1963). Language development. In H.W. Stevenson (Ed.), Child psychology: The Sixty-Second Yearbook of the National Society for the Study of Education, Part 1. Chicago: University of Chicago Press.
Gasser, M., & Lee, C.D. (1991). A short-term memory architecture for the learning of morphophonemic rules. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems 3. San Mateo, CA: Morgan Kaufmann.
Hoeffner, J. (1992). Are rules a thing of the past? The acquisition of verbal morphology by an attractor network. Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Kim, J.J., Marcus, G.F., Pinker, S., Hollander, M., & Coppola, M. (1994). Sensitivity of children's inflection to morphological structure. Journal of Child Language, 21, 173-209.
Kruschke, J.K. (1990). ALCOVE: a connectionist model of category learning. Research Report No. 19. Bloomington: Cognitive Science Program, Indiana University.
Lachter, J., & Bever, T.G. (1988). The relation between linguistic structure and associative theories of language learning: a constructive critique of some connectionist learning models. Cognition, 28, 195-247.
MacWhinney, B., & Leinbach, J. (1991). Implementations are not conceptualizations: revising the verb learning model. Cognition, 40, 121-157.
Marcus, G.F. (in press). Children's overregularization of English plurals: a quantitative analysis. Journal of Child Language.
Marcus, G.F., Pinker, S., Ullman, M., Hollander, M., Rosen, T.J., & Xu, F. (1992). Overregularization in language acquisition. Monographs of the Society for Research in Child Development, 57 (4, Serial No. 228).
McClelland, J.L. (1988). Connectionist models and psychological evidence. Journal of Memory and Language, 27, 107-123.
Pinker, S., & Prince, A. (1988). On language and connectionism: analysis of a Parallel Distributed Processing model of language acquisition. Cognition, 28, 73-193.
Plunkett, K., & Marchman, V. (1991). U-shaped learning and frequency effects in a multilayered perceptron: implications for child language acquisition. Cognition, 38, 43-102.
Plunkett, K., & Marchman, V. (1993). From rote learning to system building: acquiring verb morphology in children and connectionist nets. Cognition, 48, 21-69.
Rumelhart, D., & McClelland, J. (1986). On learning the past tenses of English verbs. Implicit rules or parallel distributed processing? In J. McClelland, D. Rumelhart, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press.
Xu, F., & Pinker, S. (in press). Weird past tense forms. Journal of Child Language.