Cognitive Psychology 84 (2016) 1–30
Regularization of languages by adults and children: A mathematical framework

Jacquelyn L. Rische, Natalia L. Komarova*

Department of Mathematics, University of California Irvine, Irvine, CA 92697, United States
Article history: Accepted 20 October 2015

Keywords: Mathematical modeling; Reinforcement algorithms; Frequency boosting; Frequency matching
Abstract

The fascinating ability of humans to modify the linguistic input and "create" a language has been widely discussed. In the work of Newport and colleagues, it has been demonstrated that both children and adults have some ability to process inconsistent linguistic input and "improve" it by making it more consistent. In Hudson Kam and Newport (2009), artificial miniature language acquisition from an inconsistent source was studied. It was shown that (i) children are better at language regularization than adults and that (ii) adults can also regularize, depending on the structure of the input. In this paper we create a learning algorithm of the reinforcement-learning type, which exhibits patterns reported in Hudson Kam and Newport (2009) and suggests a way to explain them. It turns out that in order to capture the differences between children's and adults' learning patterns, we need to introduce a certain asymmetry in the learning algorithm. Namely, we have to assume that the reaction of the learners differs depending on whether or not the source's input coincides with the learner's internal hypothesis. We interpret this result in the context of a different reaction of children and adults to implicit, expectation-based evidence, positive or negative. We propose that a possible mechanism that contributes to the children's ability to regularize an inconsistent input is related to their heightened sensitivity to positive evidence rather than the (implicit) negative evidence. In our model, regularization comes naturally as a consequence of a stronger reaction of the children to evidence supporting their preferred hypothesis. In adults, their ability to adequately process implicit negative evidence prevents them from regularizing the inconsistent input, resulting in a weaker degree of regularization.

© 2015 Published by Elsevier Inc.
* Corresponding author. E-mail addresses: [email protected] (J.L. Rische), [email protected] (N.L. Komarova).
http://dx.doi.org/10.1016/j.cogpsych.2015.10.001
0010-0285/© 2015 Published by Elsevier Inc.
1. Introduction

Natural languages evolve over time. Every generation of speakers introduces incremental differences in their native language. Sometimes such gradual slow change gives way to an abrupt movement, when certain patterns in the language of the parents differ significantly from those in the language of the children. The fascinating ability of humans to modify the linguistic input and "create" a language has been widely discussed. One example is the creation of the Nicaraguan Sign Language by children in the course of only several years (Senghas, 1995; Senghas & Coppola, 2001; Senghas, Coppola, Newport, & Supalla, 1997). Other examples come from the creolization of pidgin languages (Andersen, 1983; Sebba, 1997; Thomason & Kaufman, 1991). It has been documented that on the time-scale of a generation, a rapid linguistic change occurs that creates a language from something that is less than a language (a limited pidgin language (Johnson, Shenkman, Newport, & Medin, 1996), or a collection of home-signing systems in the example of the Nicaraguan Sign Language). Language regularization has been extensively studied in children; see e.g. work on the phenomenon of over-regularization in children (Marcus et al., 1992). Goldin-Meadow, Mylander, de Villiers, Bates, and Volterra (1984), Goldin-Meadow (2005), and Coppola and Newport (2005) studied deaf children who received no conventional linguistic input, and found that their personal communication systems exhibited a high degree of regularity and language-like structure. The ability of adult learners to regularize has also been discussed (Bybee & Slobin, 1982; Cochran, McDonald, & Parault, 1999; Klein & Perdue, 1993).
Much attention in the literature is paid to statistical aspects of learning, showing that learners are able to extract a number of statistics from linguistic input with probabilistic variation (Gómez & Gerken, 2000; Griffiths, Chater, Kemp, Perfors, & Tenenbaum, 2010; Saffran, 2003; Wonnacott, Newport, & Tanenhaus, 2008). Identifying statistical regularities and extracting the underlying grammatical structure both seem to contribute to human language acquisition (Seidenberg, MacDonald, & Saffran, 2002). Reali and Griffiths (2009) demonstrated that in the course of several generations of learning, speakers shift from a highly inconsistent, probabilistic language to a regularized, deterministic language. A mathematical description of this phenomenon was presented based on a Bayesian model for frequency estimation. This model demonstrated, much like the experimental studies, that while in the course of a single "generation" no bias toward regularization was observed, this bias became apparent after several generations. The same phenomenon was observed by Smith and Wonnacott (2010). It was suggested that gradual, cumulative population-level processes are responsible for language regularity. In this paper we focus on a slightly different phenomenon. The work of Elissa Newport and colleagues demonstrates that language regularization can also happen within one generation. A famous example is the deaf boy Simon (see Singleton & Newport (2004)), who received all of his linguistic input from his parents, who were not fluent in American Sign Language (ASL). Simon managed to improve on this inconsistent input and master the language nearly at the level of other children who learned ASL from a consistent source (e.g. parents, teachers, and peers fluent in ASL). Thus he managed to surpass his parents by a large margin, suggesting the existence of some innate tendency toward regularization. The work of Newport and her colleagues sheds light on this interesting phenomenon.
In a number of studies, it has been demonstrated that both children and adults have the ability to process inconsistent linguistic input and "improve" it by making it more consistent. When talking about the usage of a particular rule, this ability was termed "frequency boosting," as opposed to "frequency matching." Let us suppose that the "teacher" (or the source of the linguistic input) is inconsistent, such that it probabilistically uses several forms of a certain rule. Frequency boosting is the ability of a language learner to increase the frequency of usage of a particular form compared to the source. Frequency matching happens when the learner reproduces the same frequency of usage as the source. Hudson Kam and Newport (2005) and Hudson Kam and Newport (2009) showed that (i) children are better at frequency boosting than adults and that (ii) adults can also frequency boost, depending on the structure of the input.
In this paper we create an algorithm of the reinforcement-learning type, which is capable of reproducing the results reported by Hudson Kam and Newport (2009). It turns out that in order to capture the differences between children's and adults' learning patterns, we need to introduce a certain asymmetry in the learning algorithm. More precisely, we need to assume that the reaction of the learners differs depending on whether or not the source's input coincides with the learner's internal hypothesis. The role of such implicit, expectation-based evidence in human language learning has been emphasized in the literature, see e.g. Perruchet and Pacton (2006), Levy (2008), Ramscar, Dye, and McCauley (2013c). In the context of our model we find that children are more efficient at reacting to positive evidence, and adults to implicit negative evidence. Therefore, we propose that the differences in adults' and children's abilities to regularize are related to the differences in their ability to process positive and negative evidence.

This paper is organized as follows. In Section 2 we develop the mathematical model used in this paper. In Section 3 we report the results. We describe how our model can be fitted to the data of Hudson Kam and Newport (2009), and what parameters give rise to the observed differences between children and adults. To test the validity of our algorithm, we also performed independent fitting to two alternative datasets: in Appendix B we present fitting to the data of Hudson Kam and Newport (2005), and in Appendix C to the data of Fedzechkina, Jaeger, and Newport (2012). In Section 4 we summarize our findings and discuss them in terms of processing implicit positive and negative evidence in language acquisition and other cognitive tasks by children and adults.
2. Theory and calculations

2.1. Reinforcement learning in behavioral sciences

At the basis of our method is a mathematical model of learning, which belongs to a larger class of reinforcement-learning models (Sutton & Barto, 1998; Norman, 1972; Narendra & Thathachar, 2012). Over the years, reinforcement models have played an important role in modeling many aspects of cognitive and neurological processes, see e.g. Maia (2009) and Lee, Seo, and Jung (2012). Some of the most influential reinforcement learning algorithms were created by Rescorla and Wagner (1972a), see also Rescorla (1968) and Rescorla (1988). These works have given rise to a large number of papers in psychology and neuroscience, some of which are reviewed by Danks (2003) and Schultz (2006); see also the review by Miller, Barnet, and Grahame (1995) for a detailed list of successes and failures of the Rescorla–Wagner (RW) model. Our model is rooted in the simple algorithm first proposed by Bush and Mosteller (1955), which we modified to explain the observations of Hudson Kam and Newport (2009). Bush–Mosteller type learning algorithms have been widely used to model learning in biological and behavioral sciences, in different contexts, see e.g. Busemeyer and Pleskac (2009), Botvinick, Niv, and Barto (2009), and Niv (2009). Van de Pol and Cockburn (2011) model the response of phenological traits, such as timing of breeding, to climatic conditions. Bendor, Diermeier, and Ting (2003) and Fowler (2006) study the voting behavior of individuals. An economics perspective on this type of modeling is reviewed by Vulkan (2000). Flache and Macy (2002) use similar modeling techniques to study conflict resolution in humans. Bröder, Herwig, Teipel, and Fast (2008) used this model to explain memory patterns in adults with mild cognitive impairment. Mühlenbernd and Nick (2014) used a reinforcement learning model in combination with signaling games to study the role of innovations and change in human language.
The theory presented below is not an attempt to explain the process of language acquisition in its entirety (which would be a formidable task). Following a tradition in mathematical linguistics and learning theory (see e.g. Steels (2000), Nowak, Komarova, & Niyogi (2001), Niyogi (2006), Lieberman, Michel, Jackson, Tang, & Nowak (2007), Hsu, Chater, & Vitányi (2013)), we have deliberately simplified the task of the learner to concentrate only on certain aspects of learning. A learner is presented with an input string. By processing such an input, the learner tries to learn the underlying rule; this task is complicated by the input being "noisy", or probabilistic, with respect to the rule. This formulation, which is described in detail below, already contains an important implicit assumption. It
assumes that the learner is able to extract (segment), from all the utterances received, the correct mutually exclusive forms of the rule under investigation. This in itself is a challenge studied in the literature, see e.g. Roy and Pentland (2002), Seidl and Johnson (2006), Monaghan, White, and Merkx (2013). We will not address this pre-processing step in the present manuscript. The algorithms described below are based on the fact that learning happens more quickly if an element of surprise is present in the input. The Bush–Mosteller algorithm (and subsequently the RW model) is based on this principle. Schultz (2006) studied the neurophysiology of reward, explaining that neurons "show reward activations only when the reward occurs unpredictably and fail to respond to well-predicted rewards, and their activity is depressed when the predicted reward fails to occur". In the context of experiments with human subjects, several authors have used simple models of reinforcement learning to successfully explain and predict behavior in a wide range of experiments (Mookherjee & Sopher, 1994; Roth & Erev, 1995; Mookherjee & Sopher, 1997; Erev & Roth, 1998). Further empirical background for models of the Bush–Mosteller type is provided by Duffy (2006), Camerer (2003), Bendor, Mookherjee, and Ray (2001), see also Izquierdo, Izquierdo, Gotts, and Polhill (2007).

2.2. The basic algorithm

Let us suppose that a certain rule has n variants (forms). A learner is characterized by a set of n positive numbers, X = (x_1, x_2, ..., x_n), on a simplex: \sum_{i=1}^{n} x_i = 1. Each number x_i is the probability for the learner to use form i (which we will call the propensity of form i). We will call the numbers {x_i} the frequencies of the learner. If for some i = I, x_I = 1, then the learner is deterministic and will consistently use form I. The learning process is modeled as a sequence of steps, which are responses of the learner to the input.
The linguistic input (or source) emits a string of applications of the rule, and it is characterized by a set of constant frequencies, \nu_1, ..., \nu_n (with \sum_{i=1}^{n} \nu_i = 1). At each iteration, the learner will update its frequencies in response to the input received from the source. If the source's input is form j, then the learner will update its frequencies according to the following update rules:
$$x_k \to x_k - F_k(X), \quad k \neq j, \tag{1}$$
$$x_j \to x_j + \sum_{k \neq j} F_k(X). \tag{2}$$
This concept is illustrated schematically in Fig. 1(a). In the most general formulation, the function F_k can depend on any components of X. In the case of the linear reinforcement model, we have (Narendra & Thathachar, 2012)
$$F_k(X) = s x_k, \tag{3}$$

with s a positive constant less than 1, such that
$$x_k \to x_k - s x_k, \quad k \neq j, \tag{4}$$
$$x_j \to x_j + s(1 - x_j). \tag{5}$$
This model assumes that the increment of learning is a linear function of the "novelty" or "surprise" effect. To see this, we rewrite the above equations for the case of only two forms, in the case when form 1 is received:
$$x_1 \to x_1 + s(1 - x_1), \tag{6}$$
$$x_2 \to x_2 - s x_2, \tag{7}$$
see Fig. 1(b). The larger the gap between the association strength of form 1 and the maximum conditioning (the quantity 1 - x_1), the larger the change in the state of the learner following the input of form 1. This model was introduced by Bush and Mosteller (1955), and later became the basis for the RW model. The RW model reduces to this update rule if we assume that (1) only one stimulus is presented at a time, and further if (2) the maximum conditioning produced by each stimulus and the rate parameters are the same for all stimuli (the latter assumption is made for example by Ramscar, Yarlett, Dye, Denny, & Thorpe (2010)).
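As a concrete illustration, the two-form linear update of Eqs. (6) and (7) can be simulated in a few lines. This is our own sketch, not code from the original study; the function name, seed, and parameter values are ours. The long-run average propensity settles near the source frequency, which is the frequency-matching behavior of the linear model.

```python
import random

def linear_bm_step(x, j, s):
    """Linear Bush-Mosteller update, Eqs. (4)-(5): on hearing form j,
    every other propensity shrinks as x_k -> x_k - s*x_k, and form j
    absorbs the freed probability mass, x_j -> x_j + s*(1 - x_j)."""
    return [xi + s * (1 - xi) if i == j else xi * (1 - s)
            for i, xi in enumerate(x)]

# Frequency matching: with a two-form source emitting form 1 with
# frequency 0.6, the learner's propensity fluctuates around 0.6.
random.seed(1)
x = [0.5, 0.5]
samples = []
for t in range(20000):
    j = 0 if random.random() < 0.6 else 1   # sample the source's form
    x = linear_bm_step(x, j, s=0.01)
    if t >= 5000:                           # discard the transient
        samples.append(x[0])
print(sum(samples) / len(samples))          # close to 0.6
```

Note that each step conserves the total probability: the mass removed from the non-emitted form is exactly the mass added to the emitted one.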
Fig. 1. A schematic illustrating the learning algorithm. A two-form (n = 2) rule is assumed. (a) The state of the learner is represented by the value X, which is the propensity to use form 1 (whereas the propensity to use form 2 is 1 - X). Upon hearing form 1, the learner will increase the value of X by the amount F(1 - X), indicated by the horizontal arrow. Examples of the shape of the function F are given in (b–d). (b) F is a linear function, Eq. (3). (c) F is a power function, Eq. (8). (d) F is piecewise linear, Eq. (9), as studied in Mandelshtam and Komarova (2014).
2.3. Nonlinear learning algorithms

Here we will use a more general version of reinforcement algorithm (1) and (2). Instead of assuming that the increment of learning is a linear function of the "novelty", we allow for more general, nonlinear functions. We keep the basic assumption that the speed of learning is positively correlated with "novelty" (as was demonstrated on the molecular level, see e.g. Fiorillo, Tobler, & Schultz (2003), Schultz (2007), Ribas-Fernandes et al. (2011)). Nonlinear learning models have been widely used in various areas of behavioral sciences, see e.g. Holyoak, Koh, and Nisbett (1989), Niv and Montague (2008), and Guitart-Masip et al. (2012). A well known example is the so-called Q-learning, developed as a modification of RW algorithms. In Q-learning, the probability of action is defined by an implementation of the Luce choice rule (sometimes referred to as softmax) (Luce, 1977). This algorithm has received backing from the neurological (Niv, 2009) and cognitive (Gureckis & Love, 2009) point of view. In general, such nonlinear algorithms are widely used in economics, game theory, and social sciences, see e.g. Blume (1993) and Alós-Ferrer and Netzer (2010). General properties of nonlinear learning models in the context of algorithm (1) and (2) are studied elsewhere (Ma & Komarova, in preparation). In this paper we provide computations for two examples of such nonlinear models. In the first example (the power model), we assume that
$$F_k(X) = \begin{cases} s (x_k)^a, & x_k > s (x_k)^a, \\ x_k, & x_k < s (x_k)^a, \end{cases} \tag{8}$$
where a is a positive constant, 0 < a < 1, see Fig. 1(c). This update rule states that in response to form j produced by the source, the learner will increase its probability to use that form by a certain amount, and consequently the probabilities of all other forms will be reduced across the board. As in the RW model, the amount by which the strength of a certain rule is increased depends on how "unpredictable" the rule is for the current state of the learner. In the RW model, the increment of learning is a linear function of the difference between the current state x_j and the maximum conditioning (which is one). In model (8), it is a power function. The second line in formula (8) simply ensures that x_k stays within the bounds between 0 and 1.
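A hedged sketch of the power model in code (function name, seed, and parameter values are ours, not the paper's fitted values): the sub-linear increment makes large propensities shrink slowly and small propensities shrink quickly, so the learner's propensity for the dominant form drifts above the source frequency.

```python
import random

def power_step(x, j, s, a):
    """Power-model update, Eqs. (1)-(2) with increment (8): each
    non-emitted form k loses F_k = min(s * x_k**a, x_k), and the
    emitted form j gains the total amount lost by the others."""
    F = [0.0 if k == j else min(s * xk ** a, xk) for k, xk in enumerate(x)]
    return [xk + sum(F) if k == j else xk - F[k] for k, xk in enumerate(x)]

# Frequency boosting: with a sub-linear increment (a < 1), the learner's
# average propensity for the dominant form rises above the source's 0.8.
random.seed(2)
x = [0.5, 0.5]
samples = []
for t in range(20000):
    j = 0 if random.random() < 0.8 else 1   # source uses form 1 at 80%
    x = power_step(x, j, s=0.05, a=0.2)
    if t >= 5000:
        samples.append(x[0])
print(sum(samples) / len(samples))          # noticeably above 0.8
```

Setting a = 1 recovers the linear model (3) and, with it, frequency matching instead of boosting.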
Another model explored here is a piecewise linear model (Fig. 1(d)), whose basic form is given by
$$F_k(X) = \begin{cases} s/(n-1), & x_k > s/(n-1), \\ x_k, & x_k < s/(n-1). \end{cases} \tag{9}$$
This model was introduced and studied by Mandelshtam and Komarova (2014).

2.4. Frequency matching and frequency boosting

Learning algorithm (1) and (2) gives rise to a Markov process characterized by a stationary frequency distribution. That is, starting from any initial frequency vector, a learner will converge to a quasistationary state, where the values {x_1, ..., x_n} fluctuate around fixed means. Fig. 2(a) demonstrates this behavior (please note that in the figure, apart from model (9), a generalization of this model is also shown; it involves an extra parameter, p, which is introduced later in the text). The Bush–Mosteller algorithm (1) and (2) with linear update (3) is characterized by the frequency matching property (see the dashed line in Fig. 2(b)). This means that the mean propensity of the learner matches exactly the frequency of the source, see e.g. Feldman and Newell (1961). In contrast, algorithm (1) and (2) under nonlinear assumption (8) or (9) possesses a frequency boosting property. If the (inconsistent) source is characterized by a dominant usage of a certain form, in the long run the learner will use the same form predominantly, and the propensity of usage for that form will be higher for the learner than the frequency of the source. Let us suppose that \nu_1 > 1/2 is the highest source frequency for n = 2. Then, in the case of learning increment (9), at quasisteady state the learner will use form 1 with the expected propensity
$$\nu_1' = 1 - (1 - \nu_1)\left(\frac{s}{1+s} + \frac{2\nu_1 - 1}{(1 - \nu_1)\left(1 + (\nu_1/(1 - \nu_1))^{1/s}\right)}\right) > \nu_1, \tag{10}$$
see Appendix A and Mandelshtam and Komarova (2014) for details of the calculations. Fig. 2(b) demonstrates the frequency boosting property by plotting the quasisteady state frequency \nu_1' as a function of the source frequency, \nu_1. Very similar behavior is observed for nonlinear algorithm (8) with a < 1 (not shown). The frequency at which the dominant form is used by the source will affect the speed of convergence of the learning algorithm: higher frequencies lead to faster convergence. The strength of the
Fig. 2. The stochastic learning process. (a) A plot of the propensity of the learner to use form 1, with respect to time. The frequency of the source for form 1 is \nu_1 = 0.8 (the dashed horizontal line). The linear model oscillates around the frequency of the source. The two nonlinear asymmetric models (Eq. (11) with a = 0.4) are implemented with s = 0.006, p = 0.003 (converges to the highest value of the steady state) and also with s = 0.003, p = 0.006 (converges to an intermediate value of the steady state). (b) A plot of the quasisteady state frequency \nu_1' as a function of the source frequency, \nu_1. The dashed line corresponds to the linear algorithm (3), where we observe frequency matching, \nu_1' = \nu_1. The solid line corresponds to algorithm (9), characterized by the frequency boosting property (\nu_1' > \nu_1 for \nu_1 > 1/2). For this plot, s = p = 0.05.
boosting depends on the increment of the learning update, s: smaller values of s yield larger values of \nu_1'. The value of s also influences the speed of convergence: for higher s the algorithm converges faster, but the frequency at which the learner uses the dominant form decreases.

2.5. The asymmetric algorithm

The basic nonlinear algorithms (8) and (9) are characterized by parameter s, which controls the size of the increment of learning. Next we introduce a two-parametric generalization of this algorithm, where the update rules are different depending on whether the source's input value matches the highest-propensity value of the learner. Let us suppose that component x_m is the largest: x_m = max_i {x_i}. We define form m to be the preferred form (or the preferred hypothesis) of the learner. Below we present an asymmetric generalization of the power law, Eq. (8). In response to the source emitting form j, the update rule (1) and (2) will have the following increment function:
$$F_k(X) = \begin{cases} s (x_k)^a, & x_k > s (x_k)^a, \ j = m, \\ x_k, & x_k < s (x_k)^a, \ j = m, \\ p (x_k)^a, & x_k > p (x_k)^a, \ j \neq m, \\ x_k, & x_k < p (x_k)^a, \ j \neq m. \end{cases} \tag{11}$$
Here is another way to express the update rules. If the source emits form j = m that matches the form with the largest learner's propensity, the propensities are updated as follows:
$$x_k \to x_k - F_k, \quad x_j \to x_j + \sum_{k \neq j} F_k, \qquad F_k = \begin{cases} s (x_k)^a, & x_k > s (x_k)^a, \\ x_k, & \text{otherwise}, \end{cases} \quad k \neq j. \tag{12}$$
If the source uses form j ≠ m different from the learner's preferred form, then the update is as follows:
$$x_k \to x_k - F_k, \quad x_j \to x_j + \sum_{k \neq j} F_k, \qquad F_k = \begin{cases} p (x_k)^a, & x_k > p (x_k)^a, \\ x_k, & \text{otherwise}, \end{cases} \quad k \neq j. \tag{13}$$
The asymmetric generalization of algorithm (9) is expressed as follows:
$$F_k(X) = \begin{cases} s/(n-1), & x_k > s/(n-1), \ j = m, \\ x_k, & x_k < s/(n-1), \ j = m, \\ p/(n-1), & x_k > p/(n-1), \ j \neq m, \\ x_k, & x_k < p/(n-1), \ j \neq m. \end{cases} \tag{14}$$
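To make the asymmetric bookkeeping explicit, here is a minimal sketch of one step of the piecewise update, Eq. (14), in code (the function name is ours, and we use 0-based indices, so the paper's form 1 is index 0). The only difference from the symmetric model is that the learning rate is s when the emitted form coincides with the learner's preferred form and p otherwise.

```python
def asymmetric_update(x, j, s, p):
    """One step of the asymmetric piecewise-linear algorithm, Eq. (14).
    x: list of propensities (sums to 1); j: index of the emitted form.
    The learning rate is s when j coincides with the learner's
    preferred form (the argmax of x), and p otherwise."""
    n = len(x)
    m = max(range(n), key=lambda i: x[i])     # preferred form
    rate = s if j == m else p                 # the asymmetry
    # each non-emitted form loses min(rate/(n-1), x_k); form j gains the total
    F = [0.0 if k == j else min(rate / (n - 1), xk) for k, xk in enumerate(x)]
    return [xk + sum(F) if k == j else xk - F[k] for k, xk in enumerate(x)]

# Three-form check with state (0.1, 0.7, 0.2), s = 0.01, p = 0.02;
# the preferred form is the second one (index 1).
print(asymmetric_update([0.1, 0.7, 0.2], j=0, s=0.01, p=0.02))  # form 1 gains p
print(asymmetric_update([0.1, 0.7, 0.2], j=1, s=0.01, p=0.02))  # form 2 gains s
```

With s = p this reduces to the symmetric update, as noted in the text.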
If s = p, then this algorithm is the same as the symmetric version (8) or (9). The novel feature of the two-parametric algorithm is that it tracks whether the source matches the learner's "preferred" hypothesis. If it does (that is, if the input is the form whose propensity is the largest for the learner), then the learner increases this propensity's value by an amount defined by s. Otherwise, the increment is defined by p (see Fig. A.9). For example, in the extreme case where p ≪ s, the updates are only performed when the learner is reassured that its highest propensity form is used. Otherwise the frequencies are updated very little. Note that this algorithm does not simply use different reward signals on positive and negative trials. Instead, it tracks whether the signal emitted by the source matches the current hypothesis of the learner. For example, let us suppose that with n = 3 forms, the current state of the learner is (0.1, 0.7, 0.2). We further assume that the learner uses the piecewise algorithm, Eq. (14), and the values s = 0.01, p = 0.02. In this case, the preferred form of the learner is m = 2. If the source emits signal j = 1, which does not match the preferred form, the value of x_1 will receive an increment of p, and the other values (x_2 and x_3) will both decrease by p/2. If the source emits signal j = 2, which
matches the preferred form, then the value of x_2 will receive an increment of s, and the other values (x_1 and x_3) will both decrease by s/2. We can see that depending on whether the source's form is the same as the preferred form of the learner, the positive increment for that form will be different. Properties of the asymmetric algorithms are described in detail in Appendix A.2.

3. Results

In this paper we create a mathematical framework to describe adults' and children's learning from an inconsistent source. We will use the results of Hudson Kam and Newport (2009) to test and parametrize our model (see also Appendix B for an application of our model to the data of Hudson Kam & Newport (2005), and Appendix C for an application of the model to the data of Fedzechkina et al. (2012)). Hudson Kam and Newport (2009) have shown that children and adults can learn from an inconsistent source and "improve" it by making it more consistent. Both adults and children are taught miniature artificial languages. The participants learn the language by listening to sentences of the language, which are presented in an inconsistent fashion (allowing for a probabilistic usage of several forms). The artificial language learned by the adults contained 51 words: 36 nouns, 7 intransitive verbs, 5 transitive verbs, 1 negative (neg), and 2 determiners (det), one for each of two noun classes. The language was created in conjunction with a small world of objects and actions, whose permissible combinations restricted the number of possible sentences (but even with these semantic restrictions there were over 13,200 possible sentences in the language). The basic word order was (neg) V–S–O. This basic structure permitted four possible sentence types: intransitive, transitive, negative intransitive, and negative transitive. All aspects of the language were consistent and regular, except the appearance of determiners within the noun phrases (NPs).
When they occurred, the determiners followed the nouns within the NP. Determiners occurred probabilistically; the percentage of NPs with determiners varied across input conditions. The structure and complexity of the probabilistic input varied from experiment to experiment. In one of the experiments performed by Hudson Kam and Newport (2009) with only adult participants, the authors used five different types of inconsistent input, see Fig. 3(a). In the control case (termed 0ND), the sentences with the "correct" form of the determiner are given 60% of the time, and 40% of the time the NPs do not contain a determiner (the so-called presence–absence inconsistency). In the other four conditions, all NPs used a determiner, but several forms of the determiner were employed. In all the groups, the most frequent form of the determiner was also uttered 60% of the time, but different numbers of alternative forms were used. In the 2ND case two alternative forms are each used 20% of the time.
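These input conditions can be summarized as source frequency vectors. The following sketch is ours (the function name is our own), encoding the 60%/40% split described in the text, with the 40% shared equally by the alternative forms in the iND conditions:

```python
def source_frequencies(i):
    """Source frequency vector for condition iND: the dominant form of
    the determiner is used 60% of the time, and i alternative forms
    share the remaining 40% equally.  In the 0ND control, the remaining
    40% are noun phrases with no determiner at all."""
    if i == 0:
        return [0.6, 0.4]          # 0ND: determiner present vs. absent
    return [0.6] + [0.4 / i] * i   # iND with i alternative forms

print(source_frequencies(2))   # 2ND: [0.6, 0.2, 0.2]
print(source_frequencies(16))  # 16ND: 0.6 plus sixteen forms at 0.025 each
```

Each vector sums to 1 and can be fed directly into a stochastic learner as the source distribution.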
Fig. 3. Experiments of Hudson Kam and Newport (2009) and modeling. (a) The input structure for the experiment. (b) Comparison of the results from Hudson Kam and Newport (2009) (the points represent the mean, and the vertical bars the standard deviation) with the results from our models (the lines), adult learners only. The horizontal line corresponds to the linear model (3). The nonlinear model used is the power law (8) with a = 0.2. The dashed line is for the best fitting symmetric reinforcement model (with s = 0.29). The solid line is the best-fitting asymmetric model corresponding to s = 0.03 and p = 0.06.
Similarly, in each of the conditions iND with i = 2, 4, 8, 16, the i alternative forms were used (40/i)% of the time each. It was found that the adults in the control case did not boost the frequency but rather frequency-matched the 60% of the more frequent forms. Interestingly, as the complexity of the input increased, the learners in each of the conditions produced the most frequent form of the language more often. That is, the frequency boosting increased with the number of alternative forms used.

3.1. Regularization by adult learners

The vertical bars in Fig. 3(b) present the observations of Hudson Kam and Newport (2009). The degree of regularization increases steadily with the number of alternative forms used. We performed computer simulations to study whether our models can predict the observed patterns. For each condition, for each value of the algorithm parameters, we ran each model 10,000 times for 1500 time steps. After each run, we noted the probability that the learner uses the "correct" (most frequent) form by averaging its propensity found at each time step (starting at time step 500 to give the algorithm time to converge), and averaged these probabilities. We used these averages to calculate the least squares error compared to the results of Hudson Kam and Newport (2009), to determine the parameter values that best match the experimental results. As expected, applying the linear Bush–Mosteller algorithm (Eq. (3)) did not lead to any degree of regularization, and the frequency of the dominant form remained at 60% for all conditions, see the horizontal line in Fig. 3(b). Therefore, in order to capture the effect of frequency boosting, we turned to the nonlinear models. We performed fitting of the data by using the nonlinear algorithms (8) and (9). For the power law algorithm, the results are presented in Fig. 3(b), with a = 0.2. The dashed line corresponds to the best fit achieved in the symmetric model (Eq.
(8)), and the solid line to the best fit in the asymmetric model (Eq. (11)). We can see that the symmetric model does not adequately capture the learning behavior of the adults in this experiment. When systematically searching the two-dimensional parameter space (s, a), the best fit for the symmetric model was given by parameters s = 0.0574, a = 0.604 (not shown), but even in this case the corresponding curve was relatively flat compared to the data in Fig. 3(b). The asymmetric model (the solid line in Fig. 3(b)) yielded a much closer fit, but in order to judge the relative value of the two models objectively, we used the Akaike information criterion (AIC). This criterion includes the goodness of fit of a given model, but also penalizes for introducing extra degrees of freedom (extra parameters). The asymmetric model has one extra parameter (p) compared to the symmetric model. Which model is better can be tested by calculating the AIC statistics; the smaller the statistic, the better the model. For the case depicted in Fig. 3(b), where we fixed a = 0.2 for both models, we obtain AIC_asym = 7.68 and AIC_sym = 17.88. Clearly, AIC_asym < AIC_sym. If we now consider a as a free parameter, we have AIC_asym = 9.68, AIC_sym = 15.31. Again, AIC_asym < AIC_sym, which means that although the asymmetric model contains an extra parameter, it can be considered a better choice in this situation. We conclude that the asymmetric model with s < p should be used to describe the experimental results for the adults reported by Hudson Kam and Newport (2009). In Appendix A.3 we present a fit of the same data with the piecewise linear model, Eq. (9). Again, the symmetric model is rejected and it follows that we need to use the asymmetric model with s < p. To demonstrate the versatility of the proposed model, we also perform model fitting of two different datasets.
One is the dataset of Hudson Kam and Newport (2005), which described artificial language experiments similar in spirit to those of Fig. 3; see Appendix B for the results of the fitting. The other dataset is taken from a very different experimental study by Fedzechkina et al. (2012) on the learning of case-marking by adults; see Appendix C. In both cases, the model is successful in describing the data, and moreover, it again follows that the parameter s (the response of the learner to evidence consistent with her internal hypothesis) should be smaller than p (the response to evidence that contradicts the internal hypothesis).

3.2. Regularization by children and adults: a comparison

In order to investigate the differences in frequency boosting between children and adults, Hudson Kam and Newport (2009) performed a similar experiment, both with children and adult participants,
by using a simpler artificial language. In the first condition, the "correct" form of the language was used 100% of the time. In the other three conditions, 0ND, 2ND, and 4ND were explored. Fig. 4(a) presents the data by plotting the mean and the standard deviation of the production of the most frequent form by adults and children. It was found that children perform worse than the adults in the 100% case and similarly to the adults in the other cases. There are several differences between the experiment depicted in Fig. 4(a) (children and adults) and that depicted in Fig. 3(b) (adults). In contrast to the experiment shown in Fig. 3, in the second experiment neither children nor adults showed regularization behavior. There could be various reasons for this, one being the very small number of participants in each group (7–8 children per group and 4 adults per group). More likely, the absence of frequency boosting in the data of Fig. 4 can be explained by the presence of noise. One can see that even when the correct determiner was used 100% of the time, the children only used it about 65% of the time. This leads us to hypothesize that some noise was present, perhaps due to children's shorter attention span. In our interpretation of noisy learning, the data could be perceived "incorrectly" by the learners, such that a training example might be lost or misheard (Frank & Tenenbaum, 2011). Despite the fact that the two experiments do not appear to be directly comparable (because the adults do not exhibit the same behavior in both), some interesting patterns emerged from the second experiment that allowed us to reason about the differences between adults and children. Namely, Hudson Kam and Newport (2009) measured the number of systematic users (as opposed to "correct" users) for adults and children in each condition. Systematic users were defined as learners that always used the same form, even if it was not the right form. This information is presented in Fig. 4(b).
It turned out that although the number of "correct" usages (that is, usages of the most frequent form of the source) in children was similar to that of the adults, the children used their forms significantly more systematically (although their form did not always match the most frequent form of the source). We can see, for example, that for the 2ND and 4ND groups, the children were almost always systematic learners, while the adults were never systematic learners. Therefore, the following patterns have to be explained:

- The adults exhibit regularization behavior in the experiment of Fig. 3(b), but do not exhibit regularization behavior in the experiment of Fig. 4(a).
- Adults are almost never systematic users (that is, each individual typically uses more than one form).
- The children do not exhibit regularization behavior if the production data are pooled together from several learners (Fig. 4(a)), but individually, children exhibit a very high degree of regularization behavior (often boosting a wrong form).
[Fig. 4 comprises two bar charts: (a) Mean Correct Determiner Production (%) and (b) Percent Systematic Producers, for children and adults, over the input groups 100%, 60% + 0 ND, 60% + 2 ND, and 60% + 4 ND.]
Fig. 4. Results from the experiment with children and adults in Hudson Kam and Newport (2009). (a) Correct form production. (b) Percent of systematic users broken down by children and adults for each input group.
To describe these data, we need to include a certain level of noise in the model. The noise helps explain two aspects: (1) the adults failing to regularize in the second experiment, and (2) the children often regularizing the wrong form in the same experiment. Therefore, we explore the possibility of adding noise to the algorithm. Consider Fig. 5(a), which shows a typical simulation for a learner characterized by the parameters s = 0.03 and p = 0.06 under the power law model; see also Fig. 3(b). These parameters are the ones obtained as an optimal fit for the adults in the first experiment. Here we consider the example of only two forms, with the dominant form used by the source 60% of the time. We can see that the learner oscillates around a mean (which is approximately the frequency of the source). The histogram in Fig. 5(b) represents the statistics of the propensity of form 1. It shows that in a population of learners, most will be non-systematic users that use form 1 between 40% and 80% of the time. Fig. 5(c) and (d) demonstrate the effect of noise. This is implemented in the following way. At each learning event, with a certain probability (20% in the case of this figure), the learner does not receive the input of the source, but instead "hears" a wrong form. This introduces an element of idiosyncrasy into the model, where some form may be more appealing to certain individuals (and this varies from individual to individual, especially if the number of forms is large, as in the 4ND condition). We can see in parts (c) and (d) of the figure that, as a consequence of noise, the percentage of use of the correct form is reduced, but there is no qualitative change in the learning dynamics. Most learners are inconsistent users, as expected for adults from the results of the second experiment of Hudson Kam and Newport (2009). The same qualitative picture holds for all cases as long as s < p in the learning algorithm.
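The run structure and the mis-hearing step can be sketched together. The power law update (Eq. (8)) is defined elsewhere in the paper, so this sketch substitutes the simplest clamped additive rule (move the propensity toward the heard form, by s when that form is the learner's current preference and by p otherwise); the `eps` parameter is a hypothetical implementation of the noise, simply flipping the heard form with probability eps. It illustrates the protocol, not the paper's fits.

```python
import random

def run_learner(nu, s, p, eps=0.0, n_steps=1500, burn_in=500, rng=None):
    """One learner, two forms; x is the propensity of the dominant form,
    which the source utters with probability nu. With probability eps the
    learner mis-hears the other form instead. The propensity moves toward
    the heard form: by s if that form is currently the learner's preferred
    one, by p otherwise (clamped to [0, 1]). Returns the mean propensity
    over the post-burn-in time steps."""
    rng = rng or random.Random()
    x, total = 0.5, 0.0
    for t in range(n_steps):
        heard_1 = rng.random() < nu
        if rng.random() < eps:        # noise: the input is mis-heard
            heard_1 = not heard_1
        inc = s if heard_1 == (x >= 0.5) else p
        x = min(1.0, x + inc) if heard_1 else max(0.0, x - inc)
        if t >= burn_in:
            total += x
    return total / (n_steps - burn_in)

def mean_propensity(nu, s, p, eps=0.0, n_runs=500, seed=1):
    """Population average over independent learners."""
    rng = random.Random(seed)
    return sum(run_learner(nu, s, p, eps, rng=rng) for _ in range(n_runs)) / n_runs
```

With two forms, nu = 0.6, and eps = 0.2, the learner effectively hears the dominant form 0.6 · 0.8 + 0.4 · 0.2 = 0.56 of the time, which is why noise lowers the population mean without changing the s < p dynamics qualitatively.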
A very different pattern is observed when this inequality is reversed, such that s > p; see Fig. 6. There, we can see that adding noise (Fig. 6(c) and (d)) not only leads to a significant reduction in the average percentage of usage of the "correct" form (form 1); it also leads to a bimodal histogram

Fig. 5. The effect of adding noise, the case with p > s. (a) and (c): typical runs that show the stochastically changing propensity of form 1; in (a) the system contains no noise, and in (c), noise at the 20% level is added. (b) and (d) show the corresponding histograms for the propensity of form 1. The parameters are p = 0.06 and s = 0.03, under the power law model with a = 0.2; the total number of forms is 2, and form 1 is used by the source at frequency m = 0.6.

of the propensity, Fig. 6(d), to be compared with Fig. 5(d). The reason for this behavior is related to the tendency of the learning algorithm to push the learner toward systematic usage when s > p, and toward non-systematic usage when s < p; see Fig. A.10. The interpretation of this result is that under the assumption that s > p, some of the learners will be regularizing the "correct" form (using form 1 close to 100% of the time), and others will be regularizing the incorrect form (using form 2 close to 100% of the time). In this case, the data averaged over a number of learners will show no regularization, but when broken down by individual learners, the percentage of consistent users will be very high. The results shown in Figs. 5 and 6 are presented for a particular pair of increment parameters p and s. We have, however, performed a systematic study of the whole parameter space (s, p) and found the following:

- If p > s (as in the case of adult learners), adding noise will reduce the population-level mean propensity of the "correct" form, while each learner will remain inconsistent, using correct and incorrect forms intermittently.
- If p < s, then apart from reducing the population-level average percentage of using the "correct" form, noise will also create a bimodal distribution of learners' propensity, such that most learners will be systematic, but some will regularize "wrong" forms.

Fig. 7(a) illustrates the last statement by showing the percentage of systematic learners for all parameter combinations in the (s, p)-space. We can see that if s > p, the percentage of systematic users is significant, while for s < p it is negligible. It appears that a qualitative change happens in the model's behavior depending on whether s < p or s > p. Recall that the parameter s characterizes the increment of learning in the case when the form used by the source coincides with the internal hypothesis of the learner.
The parameter p characterizes the increment if the source's form is different from the learner's highest-propensity form.
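The split into systematic and non-systematic users can be illustrated with a small simulation. Since the power law update (Eq. (8)) is defined elsewhere in the paper, the sketch below substitutes a clamped additive stand-in (increment s when the heard form matches the learner's preferred form, p otherwise) and applies the 85%-usage criterion of Fig. 7(a); the bistability under s > p, and its absence under s < p, survive this simplification. The parameter values below are illustrative.

```python
import random

def fraction_systematic(s, p, nu=0.6, eps=0.2, n_learners=300,
                        n_steps=1500, threshold=0.85, seed=2):
    """Fraction of simulated learners that produce a single form (either
    one) more than `threshold` of the time. Two forms; x is the learner's
    propensity for form 1. Update: move x toward the heard form by s if
    that form is currently preferred, by p otherwise (clamped to [0, 1]).
    With probability eps the input is mis-heard as the other form."""
    rng = random.Random(seed)
    systematic = 0
    for _ in range(n_learners):
        x, used_1 = 0.5, 0
        for _ in range(n_steps):
            heard_1 = rng.random() < nu
            if rng.random() < eps:        # noise: flip the heard form
                heard_1 = not heard_1
            inc = s if heard_1 == (x >= 0.5) else p
            x = min(1.0, x + inc) if heard_1 else max(0.0, x - inc)
            used_1 += rng.random() < x    # the learner's own production
        share = used_1 / n_steps
        if share > threshold or share < 1 - threshold:
            systematic += 1
    return systematic / n_learners
```

With child-like parameters (s > p) nearly every simulated learner locks onto one of the two forms, some onto the "wrong" one, whereas with adult-like parameters (s < p) learners keep alternating, mirroring the contrast in Fig. 7(a).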
Fig. 6. The effect of adding noise, the case with s > p. Same as in Fig. 5, except p = 0.03 and s = 0.06.
From the above considerations, it is clear that in order to be consistent with the data, the parameters in the child learning model must satisfy s > p. For that choice of parameters, the learners demonstrate a large degree of consistency in their choices, even though the choice is not always the right one. This underlies the difference between the adult and child behavior in the experiments of Hudson Kam and Newport (2009). We have performed the fitting procedure for the adult and children's data. For the adults, we started with the same parameter values (p = 0.06, s = 0.03, and a = 0.2) as were found in Fig. 3(b). By adjusting the level of noise we were able to obtain the fit depicted in Fig. 7(b). Note that a fit of similar quality can also be obtained for other values of s and p. As long as s < p, the level of consistency of usage remains low (except in the 100% case, consistent with the experimental finding). For the children, again it is possible to find a good fit with a large number of choices for the parameters (s, p), by adjusting the noise. An example of such a fit is given in Fig. 7. We emphasize that these fits are not unique (and they cannot be, considering that there are only 4 data points each for children and adults, and the amount of noise is unknown and needs to be adjusted). Therefore, the fitting procedure implemented here can only be interpreted as a proof of principle: multiple fits are possible, as long as s < p for the adults and s > p for the children.

4. Discussion

In this paper we created a mathematical model of learning that is able to capture several findings of Hudson Kam and Newport (2009) on adult and children's learning. In particular, in our model (as in the experiments), children regularize inconsistent input more than the adults, but their behavior is "noisier", so they sometimes end up regularizing a "wrong" form. Conceptually, the learning algorithm proposed in this paper is characterized by two parameters, s and p.
The value of s characterizes the change in the learner's state following the source's input if it coincides with the learner's preferred hypothesis. The value of p characterizes the change in the learner's state following the source's input if it contradicts the learner's preferred hypothesis. The following summarizes our findings. The proposed learning algorithm is a reinforcement-type algorithm, where differences in the parameters account for the observed differences in learning patterns between adults and children. In the algorithm, children are characterized by a stronger response to positive evidence (s > p), while adults are characterized by a stronger response to implicit, expectation-based negative evidence (s < p).
Fig. 7. Modeling experimental data with adults and children. (a) The fraction of systematic users, defined as learners that use any one of the forms, "correct" or "incorrect", more than 85% of the time. (b) Fitting the experimental data of Fig. 4 by a model with noise. For adults, p = 0.06 and s = 0.03 (same as in Fig. 3(b)). For children, p = 0.01 and s = 0.05. Both use the power law model with a = 0.2.
In our algorithm, the learners can demonstrate frequency-boosting behavior, depending on the structure of the inconsistent input. The strength of frequency boosting increases as the number of alternative forms increases. In the model, the children have a heightened ability to regularize: most of the child learners become "consistent users," even if their preferred form differs from the most frequent form of the source. These patterns are consistent with the findings of Hudson Kam and Newport (2009). Below we discuss the implications of our modeling and put our results in a wider context.

4.1. Mechanisms of frequency matching and frequency boosting

Language regularization in children manifests itself in a number of ways; for example, children often have difficulties learning exceptions to rules (Marcus et al., 1992). Regularization of linguistic input by children has been related to overmatching or maximizing (Bever, 1982). It has also been reported that children invent and use their own patterns (Craig & Myers, 1963). The sources and mechanisms of regularization behavior have been extensively discussed in the literature. The Language Bioprogram Hypothesis (Bickerton, 1984) has been proposed to explain the ease with which children regularize inconsistent linguistic input when exposed to reduced communication systems. It was argued that children utilize some innate language-specific constraints that contribute to the process of language acquisition (DeGraff, 2001). Overregularization in children learning grammatical rules has been explained by means of a dual-mechanism model (Marcus, 1995), or by an alternative connectionist model (Marchman, Plunkett, & Goodman, 1997; Plunkett & Juola, 1999). Wonnacott et al. (2008) and Wonnacott (2011) proposed that both children and adult learners use distributional statistics to make inferences about when generalization is appropriate.
The "less-is-more" hypothesis of Newport (1990) suggests that the differences in adults' and children's language learning abilities can be ascribed to children's limited cognitive capacities. Hudson Kam and Newport (2009) used the less-is-more hypothesis to explain the children's remarkable ability to regularize. Interestingly, the tendency of children to "maximize" has been observed in non-linguistic contexts (see Derks & Paclisanu, 1967), where participants were asked to guess which hand a candy was in, when the two hands contained candy at a 25:75 ratio. It was observed that young children frequency boosted, and frequency-matching behavior began to emerge after the age of 4, becoming stronger in older participants. Thompson-Schill, Ramscar, and Chrysikou (2009) and Ramscar and Gitcho (2007) suggested that, while both adults and children have a natural tendency to regularize, the adults use their well-developed prefrontal-cortex-mediated control system to override it. Adults implement their highly efficient machinery for cognitive control and conflict processing, which has evolved for performance. In learning, however, this may be an impediment, as it makes regularization harder. In this paper we explore this phenomenon further and propose a possible explanation for the differences between adults' and children's abilities to regularize. By comparing simulations with the data of Hudson Kam and Newport (2009), we found that the best-fitting reinforcement models were not symmetric with respect to the learning increments. To describe the data, one has to assume that the increments following the source's utterance differ depending on whether or not the source's utterance coincides with the learner's most frequent ("preferred") form. For the children's best-fitting model, the increment following an utterance that coincides with the learner's most frequent form is the larger one (this corresponds to s > p). For adults, it is the smaller one (s < p).
We hypothesize that adults and children react differently to positive and negative expectation-based evidence.

4.2. Expectation-based evidence

The focus of the modeling approach adopted here is the way children and adults treat the information offered by violations of expectation. Following Ramscar et al. (2013c), we argue that learners can react both to evidence that coincides with their expectations and to evidence that contradicts them; in fact, weighing the expectations against the evidence of a source is an important part of the learning process.
Ramscar et al. (2013c) examined the role children's expectations play in language learning and presented an interesting model of plural noun learning, which explained how exceptions to rules could be mastered. It is argued that exposure to regular plurals can decrease children's tendency to overregularize irregular plurals. A central point of the present paper is the fact that humans rely on both positive and (implicit, expectation-based) negative evidence when learning a language. Ignoring the implicit negative evidence in our context would mean setting p = 0, which would make it impossible for our algorithm to learn. We emphasize that utilizing both positive and implicit negative evidence is essential for our model to work. In general, the role of expectation-based evidence has been examined in many contexts, both in animal (McLaren & Mackintosh, 2000; Pearce & Hall, 1980; Rescorla & Wagner, 1972b; Sutton & Barto, 1998) and in human behavioral studies (McClure, 2003; Montague, Hyman, & Cohen, 2004; Niv, 2009; Schultz, 2006). In human language learning, implicit expectation-based evidence has been considered by many authors. Rohde and Plaut (1999) suggested that "... one way the statistical structure of a language can be approximated is through the formulation and testing of implicit predictions. By comparing one's predictions to what actually occurs, feedback is immediate and negative evidence derives from incorrect predictions". In the connectionist framework, the relevant concept is error-based learning, often used in the context of simple recurrent networks (SRNs) (Chang, Dell, & Bock, 2006; Elman, 1990; Rumelhart, Hinton, & Williams, 1986). To model the learning process, the SRN uses its internal representations to predict the next utterance of the source, and the difference between its prediction and the input word (the error) is used to adjust the weights of the network connections.
The role of implicit negative evidence has been studied by Ramscar and colleagues in the context of Rescorla–Wagner (RW) modeling. Ramscar, Dye, Gustafson, and Klein (2013a) studied cognitive flexibility by looking at response-conflict resolution behavior in children. The RW model was used to describe label-feature and feature-label learning processes, and it predicted the very different testing results of children trained by the two methods. The success of the underlying theory was demonstrated by Ramscar et al. (2010), where the role of negative learning and cue competition was highlighted by the modeling of two novel empirical studies. Ramscar, Dye, Popick, and O'Donnell-McCarthy (2011) applied this theory to children's learning of small number words, where again both positive and implicit negative information was essential. Ramscar, Dye, and Klein (2013b) studied the problem of learning the meaning of words by adults and children, and found that the informativity of objects played a more important role for children's learning than for adults' learning. Again, both positive and negative information was used by children (while adults, interestingly, tended to employ a very different, more strategic approach, which proved far less successful in the context of label-object matching). The setting of the present paper and that discussed by Ramscar et al. (2013b) are quite different and do not allow a direct comparison (for example, it does not appear that the strategic approach used by the adults in Ramscar et al. (2013b) could work in the experiments of Hudson Kam & Newport (2009)). Nonetheless, it is an interesting future project to investigate whether our present model could describe the findings of Ramscar et al. (2013b). In general, the algorithms used by Ramscar and colleagues are more complex and multilayered compared to the simple algorithms developed in the present paper.
Despite the mathematical differences, they share some important common features, such as incorporating positive and implicit expectation-based negative evidence. The present algorithm has the advantage of analytical tractability. It is work in progress to find out exactly the range of its applicability to different learning scenarios.

4.3. Differences between adults' and children's learning

The main message that follows from this modeling effort is a suggestion that the difference (or, at least, one of the differences) between children's and adults' learning may be associated with a different way the two age groups process expectation-based evidence. More specifically, we propose that a possible mechanism that contributes to the children's ability to "frequency boost" and regularize an inconsistent input is related to their heightened sensitivity to positive evidence compared to the (implicit, expectation-based) negative evidence. In our model, regularization comes naturally as a consequence of a stronger reaction of the children to evidence supporting their internal hypothesis. For adults, their ability to adequately process implicit negative evidence prevents them from regularizing the inconsistent input, resulting in a higher degree of capturing the statistics of the source. Can we find any neurological evidence supporting our proposal that children and adults utilize positive and negative expectation-based evidence to different degrees? We start with very indirect evidence of the described trend, which is related to the differences in adults' and children's processing of explicit positive and negative feedback. The implicit positive and negative evidence considered in this paper (and present in the experiments studied here) is clearly different from explicit negative feedback, which has also been studied in the literature; see e.g. Marcotte (2004) for a review. The notion of "negative evidence" or "negative feedback" is often defined as "information about which structures are not allowed by the grammar" (Leeman, 2003; Marcus, 1993; Seidenberg, 1997). Chouinard and Clark (2003) defined negative evidence as "information that identifies children's errors as errors during acquisition." These factors are not present in the experiments modeled by our framework. Having said that, studying adults' and children's processing of negative feedback may provide some hints about age differences in processing implicit evidence. Differences in children's and adults' processing of explicit positive and negative feedback in language acquisition have been widely discussed in the literature.
It is recognized that children preferentially utilize positive evidence, while adults are relatively more successful in their processing of negative evidence; see Birdsong (1989), Carroll and Swain (1993), and Pinker (1989) in the language acquisition literature, and Crone, Richard Ridderinkhof, Worm, Somsen, and Van Der Molen (2004), Crone, Zanolie, Van Leijenhorst, Westenberg, and Rombouts (2008), and Huizinga, Dolan, and van der Molen (2006) in a more general context. Van Duijvenvoorde, Zanolie, Rombouts, Raijmakers, and Crone (2008) studied the neural mechanisms involved in the reaction of adults and children to positive and negative feedback. In the experiment, the participants were shown pictures and had to respond whether or not the pictures followed a rule; this was followed by explicit feedback. The participants' fMRI brain scans obtained in the course of the experiment were analyzed. The authors looked at the fMRI images for three different age groups (8–9 years, 11–13 years, and 18–25 years) and found differences in activation in specific regions of the brain. This activation was larger after negative feedback than after positive feedback for the 18–25-year-old group. However, this was reversed for the young children (8–9 years old), who had a larger activation after positive feedback. For the 11–13-year-old group, the amount of activation was about the same for positive and negative feedback. The authors concluded that the young children had a harder time learning from negative feedback than the adults, who were better able to process it. Another piece of indirect evidence comes from examining the so-called feedback-related negativity (FRN) event-related potential (ERP) (Shephard, Jackson, & Groom, 2014). FRN is electrical activity in the brain measured through electroencephalography (EEG), which is time-locked to an external event (e.g., presentation of a visual stimulus) or a response (e.g., an error).
Hämmerer, Li, Müller, and Lindenberger (2011) reported that, in the context of probabilistic reinforcement learning, children showed less enhancement of the FRN for negative feedback compared with that for positive feedback. Santesso, Segalowitz, and Schmidt (2006) specifically studied error-related negativity (ERN) and discovered that, in a visual discrimination task, children had smaller ERNs than adults. FRN has been studied in many contexts, and interestingly, it has been proposed that FRN plays a role in outcome expectancies, or in situations where the feedback is implicit (Hajcak, Moser, Holroyd, & Simons, 2007). Ferdinand, Mecklinger, and Kray (2008) write that ERN-type signals may "reflect activity of a common neural generator (the ACC) initiated by input signaling, that an event violates the participant's expectancy". In their experiments on sequence learning, they found that deviant stimuli from irregular sequences elicited an N2b component. This supports the view that deviant (expectation-breaking) events acquire the status of perceived errors during explicit and implicit learning, and thus an N2b is generated resembling the ERN to committed errors. Therefore, the above experimental evidence could be consistent with the idea that adults and children process indirect, expectation-based feedback differently, with adults reacting more strongly to negative and children to positive evidence. Finally, we would like to mention a modeling paper by Ramscar, Hendrix, Love, and Baayen (2013d), which studied lexical learning and how it changes as individuals age. It is argued by using
an RW-type model that, contrary to a common belief, the slower learning rate of adults is not necessarily a sign of decline. In fact, there is a tradeoff between a loss in speed and a gain in accuracy of learning, as shown by the learning model. Interestingly, although our model is mathematically quite different from that used by Ramscar et al. (2013d), and the models are applied to different learning phenomena, our model predicts some of the same patterns. In children, we set s > p (learning is more effective from positive evidence), and in adults, s < p (learning is more effective from implicit negative evidence). Computationally, the model for children tends to converge faster (that is, the children show a gain in the speed of learning), but the adult model reproduces the frequency of the source with better precision (a gain in the accuracy of learning). A more detailed analysis of this is the subject of future work. Further, Ramscar et al. (2013d) analyzed an extensive dataset for individuals split by age groups, and it was found that, empirically, the factors influencing paired-associate learning shift their weight across the lifespan. In particular, there was a shift (with age) in emphasis from positive to negative evidence. This offers independent empirical support for our model's suggestion that earlier learners are more influenced by positive evidence and later learners by negative evidence. Finally, a mechanism by which older learners start paying more attention to negative evidence may be related to an increased complexity of learning. In RW-type models, the number of links scales with the square of the number of outcomes. Therefore, as the complexity of the learning tasks increases, the number of links must grow rapidly.
If we assume that, most of the time, the added links have negative weights, this may account for a shift in the learning dynamics, such that the learners start paying relatively more and more attention to implicit negative evidence (Ramscar, 2015).

Appendix A. Additional model analysis

A.1. A teacher–learner pair as a Markov walk

Consider the reinforcement learning algorithm with n = 2 forms, and suppose that the source is characterized by the values m_1 = m and m_2 = 1 − m, with 0 < m < 1. Let us describe the reinforcement learner algorithm as a Markov walk on an interval [0, L], where the state space consists of the integers in [0, L], and state i corresponds to the propensity of variant 1 being x_1 = i/L. The increment of learning is denoted by D and is also an integer (in this case, s = D/L gives the increment on the interval [0, 1]). The transition matrix, P = (p_ij), is given by the following:
\[
\begin{aligned}
p_{i,\,i+D} &= m, && 0 \le i \le L-D,\\
p_{i,\,i-D} &= 1-m, && D \le i \le L,\\
p_{i,\,0} &= 1-m, && 0 \le i \le D-1,\\
p_{L-i,\,L} &= m, && 0 \le i \le D-1,
\end{aligned}
\]

with the rest of the entries of the matrix being zero. The stationary probability distribution is given by the equation qP = q, where q is a row vector of probabilities. This vector has a simple expression if the value D is a divisor of L. In this case we have (for the non-normalized eigenvector)

\[
q_i =
\begin{cases}
\left(\dfrac{1-m}{m}\right)^{L/D - i/D}, & i = kD,\quad k = 0, 1, \ldots, L/D,\\[2mm]
0, & \text{otherwise}.
\end{cases}
\]
The mean value for form 1 is given by

\[
m' = \frac{\sum_{i=0}^{L} i\, q_i}{L \sum_{i=0}^{L} q_i},
\]

which evaluates, for m \neq 1/2 and with \rho = m/(1-m), to

\[
m' = (1+s)\,\frac{\rho^{\,1+1/s}}{\rho^{\,1+1/s} - 1} \;-\; \frac{s\, m}{2m-1}.
\]

This is a monotonically increasing function of m with m'(0) = 0 and m'(1) = 1. We also have the following:
m'(m) < m for m < 1/2, and m'(m) > m for m > 1/2. In other words, if the source uses form 1 predominantly (m > 1/2), then the learner will, on average, use form 1 more often than the source. The function m' depends on s in the following way: if m < 1/2, then ∂m'/∂s > 0, and if m > 1/2, then ∂m'/∂s < 0. In other words, if the source uses form 1 predominantly, then increasing s will decrease the performance of the learner.
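These properties can be checked numerically. The sketch below builds the walk directly from the transition rules of Appendix A.1 (steps of ±D, clamped at the endpoints 0 and L) and iterates the distribution to its stationary state; the boundary treatment is an assumption read off the transition matrix.

```python
def stationary_mean(m, L, D, n_iter=5000):
    """Mean stationary propensity m' of the Markov walk of Appendix A.1:
    states i = 0..L (propensity i/L); each step moves +D with probability
    m and -D with probability 1-m, clamped to the endpoints. The
    distribution is iterated forward instead of solving qP = q directly."""
    q = [0.0] * (L + 1)
    q[0] = 1.0  # any starting state on the D-lattice works
    for _ in range(n_iter):
        new = [0.0] * (L + 1)
        for i, qi in enumerate(q):
            if qi:
                new[min(i + D, L)] += qi * m
                new[max(i - D, 0)] += qi * (1 - m)
        q = new
    return sum(i * qi for i, qi in enumerate(q)) / L
```

For m = 0.6, L = 20, D = 2 (i.e., s = 0.1), stationary_mean returns about 0.81, well above the source frequency 0.6 (the boosting property m' > m for m > 1/2), while doubling the increment to D = 4 (s = 0.2) lowers m', matching ∂m'/∂s < 0 for m > 1/2.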
A.2. Properties of the asymmetric algorithm

Fig. A.8 demonstrates the properties of the asymmetric learning algorithm depending on the increment values s and p, which are varied between 0.01 and 1 with step 0.01. For the case n = 2, m_1 = 0.6, it presents the contour plot of the value m'_1 − m_1, which is the difference between the expected propensity of the learner at steady state (Eq. (10)) and the frequency of the source. All positive values correspond to the existence of the boosting property, and the larger the value (denoted by the red colors), the stronger the boosting effect. We can see that the strongest boosting is observed when the non-preferred increment, p, is the smallest. On the other hand, the boosting effect disappears entirely if p is significantly larger than s (the dark blue regions). Fig. A.9 provides a graphical illustration of algorithms (9) and (8) with the example of two alternative forms. If p = s, the increment is the same regardless of whether or not the source's form coincides with the learner's preferred form, Fig. A.9(a). In an asymmetric generalization of this model, the increments are different depending on whether the form uttered by the source matches the most frequent form of the learner. If the form uttered by the source is the same as the most frequent form of the learner, the corresponding propensity receives a "preferred boost," s. Otherwise, a "non-preferred boost," p, is used. The two cases, s > p and s < p, are illustrated in Fig. A.9(b) and (c). As follows from Fig. A.8, learners with s > p possess a stronger boosting property. This is explained graphically in Fig. A.10, which assumes that there are n = 2 forms and that form 1 is the more frequent one (m_1 > m_2 for the source). The value of x_1, the learner's propensity for form 1, is represented by a small circle in a unit interval, similar to Fig. A.9. If this circle is closer to the right border of the interval, form 1 is the learner's preferred form.
Otherwise form 2 is the learner’s preferred form. The arrows on the diagram show the direction and the relative size of the update followed by the source’s utterance. The black arrows are the updates if the source utters form 1, and the light green arrows are the updates if the source utters form 2.
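As a concrete illustration, the asymmetric update for two forms can be simulated directly. The sketch below is ours, not the authors' code: it assumes propensities are clamped to [0, 1], starts each learner at x1 = 0.5, breaks the tie at x1 = 0.5 in favor of form 1, and uses illustrative parameter values.

```python
import random

def simulate(m1, s, p, steps, rng):
    """One learner hearing a source that utters form 1 with frequency m1.

    x1 is the learner's propensity for form 1.  Each utterance moves x1
    toward the uttered form: by s when that form is currently the
    learner's preferred form, by p otherwise.  Clamping to [0, 1] and
    breaking the tie at x1 = 0.5 toward form 1 are our assumptions.
    """
    x1 = 0.5
    for _ in range(steps):
        source_form1 = rng.random() < m1
        prefers_form1 = x1 >= 0.5
        if source_form1:
            x1 += s if prefers_form1 else p
        else:
            x1 -= p if prefers_form1 else s
        x1 = min(1.0, max(0.0, x1))
    return x1

rng = random.Random(1)
m1 = 0.6
boosted = sum(simulate(m1, s=0.3, p=0.05, steps=1000, rng=rng)
              for _ in range(500)) / 500   # s > p: boosting expected
damped = sum(simulate(m1, s=0.05, p=0.3, steps=1000, rng=rng)
             for _ in range(500)) / 500    # s < p: no boosting expected
print(boosted, damped)
```

In runs like this, the s > p population overshoots the source frequency m1 = 0.6 (frequency boosting), while the s < p population stays near or below it.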
Fig. A.8. Contour plot of the value m'1 − m1, the difference between the expected steady-state propensity of the learner and the frequency of the source. Here n = 2, and s and p are varied between 0.01 and 1 with step 0.01. (Axes: p, the non-preferred increment, versus s, the preferred increment.)
J.L. Rische, N.L. Komarova / Cognitive Psychology 84 (2016) 1–30
Fig. A.9. A schematic illustration of the reinforcement learning algorithm proposed here, in the case of two forms of the rule. The vertical bar represents the state of the learner: the small circle splits the bar into two parts, x1 and x2, with x1 + x2 = 1, such that x1 is the probability of the learner to use form 1, and x2 is the probability of the learner to use form 2. In this example, x1 > x2, that is, form 1 is the preferred form of the learner. The black arrows show the change in the state of the learner following an utterance of the source. If the source produces form 1, the value of x1 increases by amount s. If the source produces form 2, the value of x1 decreases by amount p. Three cases are presented: (a) s = p (the symmetric learner), (b) s > p, and (c) s < p.
Fig. A.10. Boosting tendency occurs when s > p. The value of x1 is the learner's propensity for form 1. The arrows on the diagram show the direction and the relative size of the update following the source's utterance. The black arrows are the updates if the source utters form 1, and the light green arrows are the updates if the source utters form 2. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
We can see that if s > p, then the arrows pushing the dot towards the edges (zero and one) are larger than the ones pushing it toward the middle (for s < p, this is reversed). This means that if the learner responds more strongly to the source when the utterance coincides with its preferred form, then the boosting tendency is observed.

A.3. Fitting of the piecewise-linear model

Applying the symmetric reinforcement model (9) to describe the data of Hudson Kam and Newport (2009), we found that no parameter s gives a satisfactory fit. The best fit was obtained for the value s = 1 and is depicted by the dashed line in Fig. A.11(b). Although at first glance this fit is not too different from the data, the model has a very serious conceptual drawback. The best fitting
Fig. A.11. Fitting the piecewise linear model, (9). (a) Heat plot of the least squares error of the asymmetric model compared to the results of Hudson Kam and Newport (2009); s and p vary between 0.01 and 1 with a step size of 0.01. The dark blue areas indicate where the smallest error occurs. (b) Comparison of the results from Hudson Kam and Newport (2009) (the vertical intervals) with the results from our models (the lines), adult learners only. The dashed line is for the best fitting symmetric reinforcement model with s = 1. The solid line is the best-fitting asymmetric model corresponding to s = 0.19 and p = 0.50. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
parameter value for the symmetric model is s = 1. This means that the behavior of the learner predicted by the model is extremely unstable and unrealistic: the state of the learner is characterized by only two points, 0 and 1, with frequent switches between the two. This forces us to reject this model and seek a better fit among the generalized, asymmetric algorithms (Eq. (14)). Fig. A.11 gives a heat plot of the least squares error computed for all pairs (s, p). The dark blue areas give the smallest overall error. We found that the parameters s = 0.19 and p = 0.50 give the best match, represented by the solid line in Fig. A.11. Therefore, for the adults, the best match comes when the "non-preferred increment" is larger than the "preferred increment." A similar result was obtained when we used the data of Hudson Kam and Newport (2005), see Appendix B.

Appendix B. Application of the model to the data of Hudson Kam and Newport (2005)

In Hudson Kam and Newport (2005), Newport and her colleagues conduct experiments similar to those from Hudson Kam and Newport (2009). In the first experiment, they create an artificial language that is divided into two conditions. In the "count/mass nouns" condition, the nouns of the language are divided into two classes on the basis of meaning. In the "gender" condition, the nouns are divided into two classes in an arbitrary fashion. Each of these is then divided into four groups, where the determiner is used k% of the time (with k = 45 (the low group), k = 60 (the mid group), k = 75 (the high group), and k = 100 (the perfect group)), and one incorrect form is used otherwise. Altogether, there are eight groups of adults, with five learners in each group. In this experiment, the adults were not able to regularize the language: their mean determiner production was almost always less than the input. We applied model (12) and (13) to fit the data presented in Hudson Kam and Newport (2005). As seen in Fig.
B.12, for the count/mass groups, the best fit occurs when s = 0.17 and p = 0.50. For the gender groups, the best fit occurs when s = 0.05 and p = 0.50. These results are similar to what we found for experiment 1 of Hudson Kam and Newport (2009) (see Fig. 3): we again find that s < p.

In the second experiment of Hudson Kam and Newport (2005), both children and adults were taught a simpler artificial language. In this experiment, there were 15 children and 8 adults. The children and the adults were each divided into two groups. In one group, the determiner was used 100% of the time; in the other, it was used 60% of the time. These groups correspond to the 100% and 60% + 0ND groups in the second experiment of Hudson Kam and Newport (2009), and the results were the same in both studies: the adults performed better than the children in the 100% group, and the adults and children performed about the same in the 60% group. However, the children were much more likely to be
Fig. B.12. The best fit of the model compared to the results of Hudson Kam and Newport (2005). The solid lines give the frequency of production observed in the paper (blue with circles for the count/mass groups, and red with squares for the gender groups). The dashed lines show the best fit for each group by using model (12) and (13). The dashed black line gives the frequency of the input for each group. For the count/mass groups, the best fit occurs when s = 0.17 and p = 0.50. For the gender groups, the best fit occurs when s = 0.05 and p = 0.50. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
systematic learners. Although it is not possible to fit these data, because there are only two data points for each group, the experimental results reported in Hudson Kam and Newport (2005) correspond to our findings in Section 3.2, where we modeled the second experiment of Hudson Kam and Newport (2009).

Appendix C. Application of the model to the data of Fedzechkina et al. (2012)

C.1. The language structure of the experiments in Fedzechkina et al. (2012)

In Fedzechkina et al. (2012), adult learners were taught an artificial language with inefficient case marking. The artificial language consisted of simple sentences with a subject, an object, and a verb, with objects case-marked in a certain subset of sentences. All learners were native English speakers and had not previously been exposed to case marking. The objective was to assess the learners' efficiency of case marking in different types of sentences, as explained below.

C.1.1. Experiment 1

In the first experiment, the subjects were always animate, and the objects could be animate or inanimate. The order of the sentences could be Subject–Object–Verb (SOV) or Object–Subject–Verb (OSV). The sentences were considered harder to understand when both the subject and the object were animate, and also when the OSV order was used. The language input was broken down in two ways: (i) 50% of the sentences had an animate object and 50% had an inanimate object, and (ii) 60% of the sentences were SOV and 40% were OSV. The objects were case-marked 60% of the time, with animate and inanimate objects equally likely to be case-marked. Fig. C.13 presents diagrams that show the input structure of the language. It follows that the language had four types of sentences: (i) animate and SOV, (ii) animate and OSV, (iii) inanimate and SOV, and (iv) inanimate and OSV. In each of these, the object was either case-marked or not.
The way the language was presented, the case marking was not efficient: a sentence that was harder to understand, such as an animate, OSV sentence, might not be case-marked, while an easier-to-understand sentence (inanimate, SOV, for example) might be case-marked.
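For concreteness, the input mixture just described can be sampled as follows. This is our own sketch: it treats animacy, word order, and case marking as independent draws, which matches the marginal percentages above but is an assumption about their joint distribution.

```python
import random

def sample_sentence(rng):
    """Draw one input sentence for experiment 1 of Fedzechkina et al. (2012).

    Animacy, word order, and case marking are drawn independently here,
    a simplifying assumption consistent with the breakdown described above.
    """
    animacy = "an" if rng.random() < 0.5 else "in"  # 50% animate objects
    order = "so" if rng.random() < 0.6 else "os"    # 60% SOV, 40% OSV
    marked = rng.random() < 0.6                     # objects case-marked 60% of the time
    return animacy, order, marked

rng = random.Random(0)
draws = [sample_sentence(rng) for _ in range(20000)]
frac_an = sum(1 for a, _, _ in draws if a == "an") / len(draws)
frac_so = sum(1 for _, o, _ in draws if o == "so") / len(draws)
frac_marked = sum(1 for _, _, m in draws if m) / len(draws)
print(frac_an, frac_so, frac_marked)  # ≈ 0.5, 0.6, 0.6
```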
Fig. C.13. The breakdown of the sentences in the experiments of Fedzechkina et al. (2012). Top: breakdown in terms of their object: either animate or inanimate. Bottom: breakdown in terms of their word order: either Subject–Object–Verb (SOV) or Object–Subject–Verb (OSV).
The experiment went on for 4 days and had 20 participants. Each day, participants were taught 80 sentences and then tested, with the exception that they were not tested at the end of the first day. The percentage of case-marked objects produced by each participant was recorded; the paper only presents data on the proportion of object case-marking from days 2 through 4. Fig. C.14 shows the results of experiment 1 (the solid lines). The following trends are immediately observed. When the objects are animate, the sentences are harder to understand, so one expects the learners to case-mark the animate objects more than the inanimate objects, which they do (see Fig. C.14(a)). Also, since the OSV sentences are harder to understand, one expects the learners to case-mark the objects more often in these than in the SOV sentences. As we can see in Fig. C.14(b), they do.

C.1.2. Experiment 2

In the second experiment, the input language was in a sense the complement of the language in the first experiment. The objects were always inanimate, and the subjects could be either animate or inanimate (50% of the subjects were animate and 50% were inanimate). The other aspects of the language were all the same. The authors were testing two hypotheses in the second experiment. First, they wanted to know if, in the first experiment, the preference to case-mark animate objects came from a bias to case-mark
Fig. C.14. Experiment 1. The solid lines represent the outcome of the experiments, with the standard deviation shown by vertical bars. The dashed lines show the best fit of the model, see Appendix C.4. It occurs when Δ+_AA = 0.080, Δ−_AA = 0.089, Δ+_UU = 0.001, Δ−_UU = 0.017, d1 = 0.01, and d2 = 0.97. (a) Animate versus inanimate. (b) SOV versus OSV.
the atypical. If this was the case, then in the second experiment they would expect to see the inanimate subjects being case-marked more often. The second possibility the authors were testing was that the preference to mark animate objects came from a bias toward animate things. If this was the case, then they would expect to see the animate sentences case-marked the most in both experiments. Interestingly, it was found that both biases were at play. At first, the learners case-marked animate sentences more often, but by day three they were case-marking the inanimate sentences more (see Fig. C.15(a), the solid lines).

With regard to the SOV and OSV sentences, again two hypotheses were being tested. First, the authors wanted to see if the bias was to case-mark in order to avoid miscommunication. If this was the case, then we should see OSV sentences case-marked more often again. Alternatively, the bias to case-mark could be "a bias to mention disambiguating information as early as possible in the sentence". If this was the case, then SOV sentences should be case-marked more often. Again, it was found that both biases seemed to be at play. SOV sentences are case-marked more often than OSV, but by day 4 the rate of case-marking SOV sentences was decreasing, while the rate of case-marking OSV sentences was increasing (see Fig. C.15(b), the solid lines).

C.2. Parameterizing the learning algorithm

In order to adapt the algorithm to this particular setting, we notice that the variations in the input only concerned whether or not the sentence's object was case-marked. This means that there were only two variants present: to case-mark or not to case-mark. Therefore, we have n = 2, and the update rule (14) simplifies. Each binary rule can be characterized by the probability to case-mark, P, 0 ≤ P ≤ 1. Then, if a case-marked sentence of the appropriate kind is received, the following update applies:
P → [P + Δ+]_1 if P > 1/2,   P → [P + Δ−]_1 if P < 1/2.   (C.1)

Similarly, if a non-case-marked sentence of this kind is received, then

P → [P − Δ−]_1 if P > 1/2,   P → [P − Δ+]_1 if P < 1/2.   (C.2)

Above, we use the following operation denoted by the square brackets with the subscript 1:

[Z]_1 = Z if Z ≤ 1,   [Z]_1 = 1 if Z > 1.
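A minimal sketch of this update rule in code (ours, not the authors'); clamping at 0 as well as at the cap of 1 is an added assumption, since the text only spells out the cap at 1:

```python
def update(P, marked, d_plus, d_minus):
    """One step of the binary case-marking rule (Eqs. (C.1)-(C.2)).

    P is the learner's probability of case-marking this sentence type.
    The increment is d_plus when the input agrees with the learner's
    current hypothesis (P > 1/2 and marked, or P < 1/2 and unmarked),
    and d_minus otherwise.  Clamping to [0, 1] is our assumption; the
    text only defines the cap at 1.
    """
    agrees = (P > 0.5) == marked
    delta = d_plus if agrees else d_minus
    P = P + delta if marked else P - delta
    return min(1.0, max(0.0, P))

P = 0.7                          # learner currently prefers to case-mark
P = update(P, True, 0.05, 0.1)   # matching input: "+" increment, P -> 0.75
P = update(P, False, 0.05, 0.1)  # conflicting input: "-" increment, P -> 0.65
print(P)
```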
Fig. C.15. Experiment 2. The solid lines represent the outcome of the experiments, with the standard deviation shown by vertical bars. The dashed lines show the best fit of the model, see Appendix C.4. It occurs when Δ+_AA = 0.013, Δ−_AA = 0.049, Δ+_UU = 0.100, Δ−_UU = 0.200, d1 = 0.10, and d2 = 1.00. (a) Animate versus inanimate. (b) SOV versus OSV.
In the first case (Eq. (C.1)), the input sentence case-marks the object. Then, if the learner is more likely than not to case-mark this type of sentence, her tendency to case-mark it increases by the quantity Δ+ (or becomes one); otherwise it increases by the quantity Δ− (or becomes one). Eq. (C.2) describes the situation when the source's sentence does not mark the object. In this case, if the learner is more likely not to case-mark this type of sentence, her tendency not to case-mark it increases by the quantity Δ+ (or becomes one); otherwise it increases by the quantity Δ− (or becomes one).

In the experiments, each sentence was either animate or inanimate (we will shorten these to an and in), and either SOV or OSV (shortened to so and os). This means that the learners could hear four types of sentences: (an, so), (an, os), (in, so), and (in, os). Therefore, not one but four different rules were learned, each corresponding to its sentence type. In the setting of the experiments, each individual can be characterized by the probabilities
P_an,so, P_an,os, P_in,so, P_in,os,

where P_an,so is the probability of the learner to case-mark an animate, SOV sentence; P_an,os is the probability of the learner to case-mark an animate, OSV sentence; P_in,so is the probability of the learner to case-mark an inanimate, SOV sentence; and P_in,os is the probability of the learner to case-mark an inanimate, OSV sentence.

The rules described above apply to all four sentence types. Therefore, we have eight update parameters, which we denote by Δ+_m and Δ−_m, where m ∈ {(an, so), (an, os), (in, so), (in, os)}. For each rule, we have two different increments that are used depending on whether or not the learner's internal hypothesis matches the source. For example, if the source gives a case-marked (an, so) sentence, and the learner prefers to case-mark (P_an,so > 1/2), we update with the "+" increment. If the sentence from the source does not match the learner's hypothesis, we use the "−" increment.

Upon receiving an input sentence, say an (an, so) sentence, the learner is assumed to update only the probability corresponding to that type of sentence; in other words, the four different rules are assumed to be independent. This is a simplification: in reality, the probabilities to case-mark other types of sentences could also be updated. For example, if the source's sentence used case marking in an (an, so) sentence, the quantities P_an,os and P_in,so could also increase, although perhaps to a lesser extent than the quantity P_an,so. Including such interactions among the four rules would increase the number of parameters significantly and make any parameter fitting impossible. Therefore, here we use the simplifying assumption of rule independence. Even in this simplest case, however, the model contains eight independent increment parameters. In the next section we show how to reduce the number of increment parameters to four by making further assumptions.
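The four-rule structure can be sketched as a table of probabilities, one per sentence type, with only the matching rule updated on each input. This is our illustration, not the authors' code; the increment values are illustrative, not fitted, and the clamp to [0, 1] is an assumption.

```python
def update(P, marked, d_plus, d_minus):
    """Asymmetric increment rule of Eqs. (C.1)-(C.2): d_plus when the
    input agrees with the learner's current hypothesis, d_minus otherwise."""
    agrees = (P > 0.5) == marked
    delta = d_plus if agrees else d_minus
    P = P + delta if marked else P - delta
    return min(1.0, max(0.0, P))

# One probability per sentence type, all starting at 0.5.
P = {("an", "so"): 0.5, ("an", "os"): 0.5,
     ("in", "so"): 0.5, ("in", "os"): 0.5}
increments = {m: (0.05, 0.1) for m in P}  # (d_plus, d_minus) per rule

def hear(sentence_type, marked):
    """Rule independence: only the rule for this sentence type is updated."""
    d_plus, d_minus = increments[sentence_type]
    P[sentence_type] = update(P[sentence_type], marked, d_plus, d_minus)

hear(("an", "so"), True)  # only P[("an", "so")] changes
print(P)
```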
C.3. Ambiguity and reduction rules for the increments

The sentences in the experiment differ by their degree of ambiguity. Denoting "ambiguous" by A and "unambiguous" by U, we can classify each sentence as type AA, AU, UA, or UU. In the first experiment, animate and OSV are ambiguous. In the second experiment, inanimate and OSV are ambiguous. Therefore, all sentences can be classified into four classes, as summarized in Table C.1. In this way, we can consider four increment types: Δ_AA, Δ_AU, Δ_UA, and Δ_UU (with each type having a "+" and a "−" increment).

Based on ambiguity considerations, there are several theoretical possibilities for reductions, where we assume that some of the variables are equal to each other. Possible hypotheses that reduce the number of parameters to four are:

(i) All "+" increments are equal to the corresponding "−" increments. This gives 4 independent increments in total and corresponds to a symmetric learning algorithm, which was introduced in Mandelshtam and Komarova (2014).

(ii) We can group the increments by ambiguity classes. For example, the increments for AA, AU, and UA could be the same, while the most unambiguous sentences, UU, could have different increments. The "+" and "−" increments are kept different, and therefore we have 4 independent increments.

(iii) Similarly, we could use one type of increments for the most ambiguous sentences, AA, and another type for AU, UA, and UU. Again, this leads to 4 independent increments.

(iv) We can assume that the increments of "mixed" ambiguity are given by the arithmetic means of the AA and UU types:
Δ+_UA = Δ+_AU = (Δ+_AA + Δ+_UU)/2   and   Δ−_UA = Δ−_AU = (Δ−_AA + Δ−_UU)/2.   (C.3)

Again, this corresponds to 4 independent increment parameters: Δ+_AA, Δ−_AA, Δ+_UU, and Δ−_UU.
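Under the averaging rule (Eq. (C.3)), all eight increments follow from the four free parameters. A small helper (ours, for illustration) makes this explicit, shown here with the experiment-1 best-fit values reported in Appendix C.4:

```python
def expand_increments(d_plus_AA, d_minus_AA, d_plus_UU, d_minus_UU):
    """Derive all eight increments from the four free parameters,
    using the averaging rule of Eq. (C.3) for the mixed types AU and UA."""
    d_plus_mixed = (d_plus_AA + d_plus_UU) / 2
    d_minus_mixed = (d_minus_AA + d_minus_UU) / 2
    return {
        "AA": (d_plus_AA, d_minus_AA),
        "UU": (d_plus_UU, d_minus_UU),
        "AU": (d_plus_mixed, d_minus_mixed),
        "UA": (d_plus_mixed, d_minus_mixed),
    }

# Experiment-1 best-fit values: Δ+_AA, Δ−_AA, Δ+_UU, Δ−_UU.
inc = expand_increments(0.080, 0.089, 0.001, 0.017)
print(inc["AU"])
```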
It turns out that none of these rules produces a reasonable fit except for the last one. We were not successful in reducing the number of independent increments further, so four increments was the smallest number of parameters that allowed a successful fit.

In our model, we keep track of the learners' frequencies to case-mark the four types of sentences as they learn each day. We include discounted learning in the model to simulate the learners not remembering everything that they learned the day before. At the beginning of day 2, the learners' frequencies to case-mark the four types of sentences are multiplied by d1, where d1 represents the proportion that they remember from their first day of learning. Starting with day 3, we instead multiply their frequencies by d2, where d2 represents the proportion that they remember from the day before (after they have already been learning for at least 2 days). The reason for this modeling choice is the reported trend that learning without testing has a lower retention rate than learning in the presence of testing; see e.g. Rohrer and Pashler (2007), Karpicke and Roediger (2007), Larsen, Butler, and Roediger (2009), Butler and Roediger (2007), and Roediger and Butler (2011). The learners in the work of Fedzechkina et al. (2012) were not tested at the end of day 1, but they were tested at the end of each subsequent day. It has been suggested in the literature that immediate testing of the material improves the percentage of information retained by the learners.
Table C.1
Sentence types classified by their ambiguity, for the two experiments.

Type    Experiment 1    Experiment 2
AA      (an, os)        (in, os)
UU      (in, so)        (an, so)
AU      (an, so)        (in, so)
UA      (in, os)        (an, os)
C.4. Fitting procedure and the results

We used the following procedure to perform parameter fitting of the model against the data of Fedzechkina et al. (2012). We simulated the learning process of 100 subjects over four days. In the simulation of the first experiment, on each day the learners were presented with 8 sentences of 4 types, according to the linguistic breakdown in Fig. C.13. The learners updated their probabilities P_m, with m ∈ {(an, so), (an, os), (in, so), (in, os)}, according to the learning algorithm described above, with fixed parameters Δ+_m and Δ−_m (subject to the constraints in Eq. (C.3)) and with fixed memory discounting parameters d1 and d2. All these parameters were assumed to be the same for all the learners. Then, at the end of days 2, 3, and 4, the probabilities P_m were recorded. In the data of Fedzechkina et al. (2012), these probabilities were not measured directly. Instead, the quantities P_an, P_in, P_so, and P_os were presented, which are the probabilities to case-mark animate, inanimate, SOV, and OSV sentences, respectively. Given the breakdown of the sentences (see Fig. C.13), they are connected with our variables as follows:
P_an = (3/5) P_an,so + (2/5) P_an,os,   P_in = (3/5) P_in,so + (2/5) P_in,os,   (C.4)

P_so = (1/2) P_an,so + (1/2) P_in,so,   P_os = (1/2) P_an,os + (1/2) P_in,os.   (C.5)
Note that not all of these quantities are independent:

(1/2) P_an + (1/2) P_in = (3/5) P_so + (2/5) P_os.

For each trial, we calculated the quantities P_an, P_in, etc. from our probability values and compared the mean values (over the population of learners) with the data of Fedzechkina et al. (2012) by calculating the mean square difference. Then all the parameters (that is, the increments and the memory parameters) were changed and the procedure repeated. In the end, we found the parameter combination that corresponded to the least mean square error; those were considered the best fitting parameters. A similar procedure was performed for experiment 2. The results are presented below.

For experiment 1, the best fit occurs when Δ+_AA = 0.080, Δ−_AA = 0.089, Δ+_UU = 0.001, Δ−_UU = 0.017, d1 = 0.01, and d2 = 0.97 (see Fig. C.14).

For experiment 2, the best fit occurs when Δ+_AA = 0.013, Δ−_AA = 0.049, Δ+_UU = 0.100, Δ−_UU = 0.200, d1 = 0.10, and d2 = 1.00 (see Fig. C.15).

C.5. Summary

We have adapted the asymmetric reinforcement-type learning algorithm (11) to describe the experiments of Fedzechkina et al. (2012). We were able to parameterize the model and obtain good fits to the quantitative data of Fedzechkina et al. (2012). For both experiments, the best fitting parameter values for the learning increments satisfy Δ+_m < Δ−_m, which is consistent with the model in the main part of this paper. Recall that Δ+_m is the learning increment for rule m that is applied following a source's string compatible with the learner's internal hypothesis, while Δ−_m is the learning increment used after a source's string that does not coincide with the learner's internal hypothesis. Thus, adults appear to react more strongly when the evidence is inconsistent with their internal hypothesis. Further, in both experiments, the best fitting memory discounting parameters satisfy d2 > d1, where d1 is the proportion retained after the first day of learning, and d2 is the proportion retained after any subsequent day of learning.
The inequality d2 > d1 agrees with the findings in the literature (Butler & Roediger, 2007; Karpicke & Roediger, 2007; Larsen et al., 2009; Roediger & Butler, 2011; Rohrer & Pashler, 2007), which showed that learners remember more of what they learn if they are tested on it right away.

References

Alós-Ferrer, C., & Netzer, N. (2010). The logit-response dynamics. Games and Economic Behavior, 68, 413–427.
Andersen, R. W. (1983). Pidginization and creolization as language acquisition. ERIC.
Bendor, J., Diermeier, D., & Ting, M. (2003). A behavioral model of turnout. American Political Science Review, 97, 261–280.
Bendor, J., Mookherjee, D., & Ray, D. (2001). Aspiration-based reinforcement learning in repeated interaction games: An overview. International Game Theory Review, 3, 159–174.
Bever, T. G. (1982). Regression in the service of development. In Regression in mental development: Basic properties and mechanisms (pp. 153–88).
Bickerton, D. (1984). The language bioprogram hypothesis. Behavioral and Brain Sciences, 7, 173–222.
Birdsong, D. (1989). Metalinguistic performance and interlinguistic competence. Berlin: Springer-Verlag.
Blume, L. E. (1993). The statistical mechanics of strategic interaction. Games and Economic Behavior, 5, 387–424.
Botvinick, M. M., Niv, Y., & Barto, A. C. (2009). Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113, 262–280.
Bröder, A., Herwig, A., Teipel, S., & Fast, K. (2008). Different storage and retrieval deficits in normal aging and mild cognitive impairment: A multinomial modeling analysis. Psychology and Aging, 23, 353.
Busemeyer, J. R., & Pleskac, T. J. (2009). Theoretical tools for understanding and aiding dynamic decision making. Journal of Mathematical Psychology, 53, 126–138.
Bush, R. R., & Mosteller, F. (1955). Stochastic models for learning. John Wiley & Sons, Inc.
Butler, A. C., & Roediger, H. L., III (2007). Testing improves long-term retention in a simulated classroom setting. European Journal of Cognitive Psychology, 19, 514–527.
Bybee, J. L., & Slobin, D. I. (1982). Why small children cannot change language on their own: Suggestions from the English past tense. In Papers from the 5th international conference on historical linguistics (Vol. 21).
Camerer, C. (2003). Behavioral game theory: Experiments in strategic interaction. Princeton University Press.
Carroll, S., & Swain, M. (1993). Explicit and implicit negative feedback. Studies in Second Language Acquisition, 15, 357–386.
Chang, F., Dell, G. S., & Bock, K. (2006). Becoming syntactic. Psychological Review, 113, 234.
Chouinard, M. M., & Clark, E. V. (2003). Adult reformulations of child errors as negative evidence. Journal of Child Language, 30, 637–670.
Cochran, B. P., McDonald, J. L., & Parault, S. J. (1999). Too smart for their own good: The disadvantage of a superior processing capacity for adult language learners. Journal of Memory and Language, 41, 30–58.
Coppola, M., & Newport, E. L. (2005). Grammatical subjects in home sign: Abstract linguistic structure in adult primary gesture systems without linguistic input. Proceedings of the National Academy of Sciences of the United States of America, 102, 19249–19253.
Craig, G. J., & Myers, J. L. (1963). A developmental study of sequential two-choice decision making. Child Development, 34, 483–493.
Crone, E. A., Richard Ridderinkhof, K., Worm, M., Somsen, R. J., & Van Der Molen, M. W. (2004). Switching between spatial stimulus–response mappings: A developmental study of cognitive flexibility. Developmental Science, 7, 443–455.
Crone, E. A., Zanolie, K., Van Leijenhorst, L., Westenberg, P. M., & Rombouts, S. A. (2008). Neural mechanisms supporting flexible performance adjustment during development. Cognitive, Affective, & Behavioral Neuroscience, 8, 165–177.
Danks, D. (2003). Equilibria of the Rescorla–Wagner model. Journal of Mathematical Psychology, 47, 109–121.
DeGraff, M. (2001). Language creation and language change: Creolization, diachrony, and development. The MIT Press.
Derks, P. L., & Paclisanu, M. I. (1967). Simple strategies in binary prediction by children and adults. Journal of Experimental Psychology, 73, 278.
Duffy, J. (2006). Agent-based models and human subject experiments. Handbook of Computational Economics, 2, 949–1011.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Erev, I., & Roth, A. E. (1998). Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. American Economic Review, 848–881.
Fedzechkina, M., Jaeger, T. F., & Newport, E. L. (2012). Language learners restructure their input to facilitate efficient communication. Proceedings of the National Academy of Sciences, 109, 17897–17902.
Feldman, J., & Newell, A. (1961). A note on a class of probability matching models. Psychometrika, 26, 333–337.
Ferdinand, N. K., Mecklinger, A., & Kray, J. (2008). Error and deviance processing in implicit and explicit sequence learning. Journal of Cognitive Neuroscience, 20, 629–642.
Fiorillo, C. D., Tobler, P. N., & Schultz, W. (2003). Discrete coding of reward probability and uncertainty by dopamine neurons. Science, 299, 1898–1902.
Flache, A., & Macy, M. W. (2002). Stochastic collusion and the power law of learning: A general reinforcement learning model of cooperation. Journal of Conflict Resolution, 46, 629–653.
Fowler, J. H. (2006). Habitual voting and behavioral turnout. Journal of Politics, 68, 335–344.
Frank, M. C., & Tenenbaum, J. B. (2011). Three ideal observer models for rule learning in simple languages. Cognition, 120, 360–371.
Goldin-Meadow, S. (2005). The resilience of language: What gesture creation in deaf children can tell us about how all children learn language. Psychology Press.
Goldin-Meadow, S., Mylander, C., de Villiers, J., Bates, E., & Volterra, V. (1984). Gestural communication in deaf children: The effects and noneffects of parental input on early language development. Monographs of the Society for Research in Child Development, 1–151.
Gómez, R. L., & Gerken, L. (2000). Infant artificial language learning and language acquisition. Trends in Cognitive Sciences, 4, 178–186.
Griffiths, T. L., Chater, N., Kemp, C., Perfors, A., & Tenenbaum, J. B. (2010). Probabilistic models of cognition: Exploring representations and inductive biases. Trends in Cognitive Sciences, 14, 357–364.
Guitart-Masip, M., Huys, Q. J., Fuentemilla, L., Dayan, P., Duzel, E., & Dolan, R. J. (2012). Go and no-go learning in reward and punishment: Interactions between affect and effect. Neuroimage, 62, 154–166.
Gureckis, T. M., & Love, B. C. (2009). Short-term gains, long-term pains: How cues about state aid learning in dynamic environments. Cognition, 113, 293–313.
Hajcak, G., Moser, J. S., Holroyd, C. B., & Simons, R. F. (2007). It's worse than you thought: The feedback negativity and violations of reward prediction in gambling tasks. Psychophysiology, 44, 905–912.
J.L. Rische, N.L. Komarova / Cognitive Psychology 84 (2016) 1–30
Hämmerer, D., Li, S.-C., Müller, V., & Lindenberger, U. (2011). Life span differences in electrophysiological correlates of monitoring gains and losses during probabilistic reinforcement learning. Journal of Cognitive Neuroscience, 23, 579–592.
Holyoak, K. J., Koh, K., & Nisbett, R. E. (1989). A theory of conditioning: Inductive learning within rule-based default hierarchies. Psychological Review, 96, 315.
Hsu, A. S., Chater, N., & Vitányi, P. (2013). Language learning from positive evidence, reconsidered: A simplicity-based approach. Topics in Cognitive Science, 5, 35–55.
Hudson Kam, C. L., & Newport, E. L. (2005). Regularizing unpredictable variation: The roles of adult and child learners in language formation and change. Language Learning and Development, 1, 151–195.
Hudson Kam, C. L., & Newport, E. L. (2009). Getting it right by getting it wrong: When learners change languages. Cognitive Psychology, 59, 30–66.
Huizinga, M., Dolan, C. V., & van der Molen, M. W. (2006). Age-related change in executive function: Developmental trends and a latent variable analysis. Neuropsychologia, 44, 2017–2036.
Izquierdo, L. R., Izquierdo, S. S., Gotts, N. M., & Polhill, J. G. (2007). Transient and asymptotic dynamics of reinforcement learning in games. Games and Economic Behavior, 61, 259–276.
Johnson, J. S., Shenkman, K. D., Newport, E. L., & Medin, D. L. (1996). Indeterminacy in the grammar of adult language learners. Journal of Memory and Language, 35, 335–352.
Karpicke, J. D., & Roediger, H. L., III (2007). Repeated retrieval during learning is the key to long-term retention. Journal of Memory and Language, 57, 151–162.
Klein, W., & Perdue, C. (1993). Utterance structure. In Adult language acquisition: Cross-linguistic perspectives (Vol. 2, The results).
Larsen, D. P., Butler, A. C., & Roediger, H. L., III (2009). Repeated testing improves long-term retention relative to repeated study: A randomised controlled trial. Medical Education, 43, 1174–1181.
Leeman, J. (2003). Recasts and second language development. Studies in Second Language Acquisition, 25, 37–63.
Lee, D., Seo, H., & Jung, M. W. (2012). Neural basis of reinforcement learning and decision making. Annual Review of Neuroscience, 35, 287–308.
Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106, 1126–1177.
Lieberman, E., Michel, J.-B., Jackson, J., Tang, T., & Nowak, M. A. (2007). Quantifying the evolutionary dynamics of language. Nature, 449, 713–716.
Luce, R. D. (1977). The choice axiom after twenty years. Journal of Mathematical Psychology, 15, 215–233.
Ma, T., & Komarova, N. (2015). Stochastic nonlinear models of learning (in preparation).
Maia, T. V. (2009). Reinforcement learning, conditioning, and the brain: Successes and challenges. Cognitive, Affective, & Behavioral Neuroscience, 9, 343–364.
Mandelshtam, Y., & Komarova, N. (2014). When learners surpass their sources: Mathematical modeling of learning from an inconsistent source. arXiv preprint arXiv:1402.4678.
Marchman, V. A., Plunkett, K., & Goodman, J. (1997). Overregularization in English plural and past tense inflectional morphology: A response to Marcus. Journal of Child Language, 24, 767–779.
Marcotte, J. (2004). Language acquisition, linguistic evidence, the baby, and the bathwater. Journal of Linguistics, 1–61.
Marcus, G. F. (1993). Negative evidence in language acquisition. Cognition, 46, 53–85.
Marcus, G. F. (1995). Children’s overregularization of English plurals: A quantitative analysis. Journal of Child Language, 22, 447–459.
Marcus, G. F., Pinker, S., Ullman, M., Hollander, M., Rosen, T. J., Xu, F., & Clahsen, H. (1992). Overregularization in language acquisition. Monographs of the Society for Research in Child Development, i–178.
McClure, S. M. (2003). Reward prediction errors in human brain. Ph.D. thesis, Baylor College of Medicine.
McLaren, I., & Mackintosh, N. (2000). An elemental model of associative learning: I. Latent inhibition and perceptual learning. Animal Learning & Behavior, 28, 211–246.
Miller, R. R., Barnet, R. C., & Grahame, N. J. (1995). Assessment of the Rescorla–Wagner model. Psychological Bulletin, 117, 363.
Monaghan, P., White, L., & Merkx, M. M. (2013). Disambiguating durational cues for speech segmentation. The Journal of the Acoustical Society of America, 134, EL45–EL51.
Montague, P. R., Hyman, S. E., & Cohen, J. D. (2004). Computational roles for dopamine in behavioural control. Nature, 431, 760–767.
Mookherjee, D., & Sopher, B. (1994). Learning behavior in an experimental matching pennies game. Games and Economic Behavior, 7, 62–91.
Mookherjee, D., & Sopher, B. (1997). Learning and decision costs in experimental constant sum games. Games and Economic Behavior, 19, 97–132.
Mühlenbernd, R., & Nick, J. D. (2014). Language change and the force of innovation. In Pristine perspectives on logic, language, and computation (pp. 194–213). Springer.
Narendra, K. S., & Thathachar, M. A. (2012). Learning automata: An introduction. Courier Dover Publications.
Newport, E. L. (1990). Maturational constraints on language learning. Cognitive Science, 14, 11–28.
Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53, 139–154.
Niv, Y., & Montague, P. R. (2008). Theoretical and empirical studies of learning. In Neuroeconomics: Decision making and the brain (pp. 329–350).
Niyogi, P. (2006). The computational nature of language learning and evolution. Cambridge: MIT Press.
Norman, M. (1972). Markov processes and learning models. New York: Academic Press.
Nowak, M. A., Komarova, N. L., & Niyogi, P. (2001). Evolution of universal grammar. Science, 291, 114–118.
Pearce, J. M., & Hall, G. (1980). A model for Pavlovian learning: Variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychological Review, 87, 532.
Perruchet, P., & Pacton, S. (2006). Implicit learning and statistical learning: One phenomenon, two approaches. Trends in Cognitive Sciences, 10, 233–238.
Pinker, S. (1989). Learnability and cognition: The acquisition of argument structure. Cambridge, MA: MIT Press.
Plunkett, K., & Juola, P. (1999). A connectionist model of English past tense and plural morphology. Cognitive Science, 23, 463–490.
Ramscar, M. (2015). Personal communication.
Ramscar, M., Dye, M., Gustafson, J. W., & Klein, J. (2013a). Dual routes to cognitive flexibility: Learning and response-conflict resolution in the dimensional change card sort task. Child Development, 84, 1308–1323.
Ramscar, M., Dye, M., & Klein, J. (2013b). Children value informativity over logic in word learning. Psychological Science, 24, 1017–1023.
Ramscar, M., Dye, M., & McCauley, S. M. (2013c). Error and expectation in language learning: The curious absence of mouses in adult speech. Language, 89, 760–793.
Ramscar, M., Dye, M., Popick, H. M., & O’Donnell-McCarthy, F. (2011). The enigma of number: Why children find the meanings of even small number words hard to learn and how we can help them do better. PLoS One, 6, e22501.
Ramscar, M., & Gitcho, N. (2007). Developmental change and the nature of learning in childhood. Trends in Cognitive Sciences, 11, 274–279.
Ramscar, M., Hendrix, P., Love, B., & Baayen, R. (2013d). Learning is not decline: The mental lexicon as a window into cognition across the lifespan. The Mental Lexicon, 8, 450–481.
Ramscar, M., Yarlett, D., Dye, M., Denny, K., & Thorpe, K. (2010). The effects of feature-label-order and their implications for symbolic learning. Cognitive Science, 34, 909–957.
Reali, F., & Griffiths, T. L. (2009). The evolution of frequency distributions: Relating regularization to inductive biases through iterated learning. Cognition, 111, 317–328.
Rescorla, R. A. (1968). Probability of shock in the presence and absence of CS in fear conditioning. Journal of Comparative and Physiological Psychology, 66, 1.
Rescorla, R. A. (1988). Pavlovian conditioning: It’s not what you think it is. American Psychologist, 43, 151.
Rescorla, R. A., & Wagner, A. R. (1972b). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In Classical conditioning: Current research and theory.
Rescorla, R., & Wagner, A. (1972a). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. Black & W. Prokasy (Eds.), Classical conditioning II: Current research and theory. New York: Appleton-Century-Crofts.
Ribas-Fernandes, J. J., Solway, A., Diuk, C., McGuire, J. T., Barto, A. G., Niv, Y., & Botvinick, M. M. (2011). A neural signature of hierarchical reinforcement learning. Neuron, 71, 370–379.
Roediger, H. L., III, & Butler, A. C. (2011). The critical role of retrieval practice in long-term retention. Trends in Cognitive Sciences, 15, 20–27.
Rohde, D. L., & Plaut, D. C. (1999). Language acquisition in the absence of explicit negative evidence: How important is starting small? Cognition, 72, 67–109.
Rohrer, D., & Pashler, H. (2007). Increasing retention without increasing study time. Current Directions in Psychological Science, 16, 183–186.
Roth, A. E., & Erev, I. (1995). Learning in extensive-form games: Experimental data and simple dynamic models in the intermediate term. Games and Economic Behavior, 8, 164–212.
Roy, D. K., & Pentland, A. P. (2002). Learning words from sights and sounds: A computational model. Cognitive Science, 26, 113–146.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Saffran, J. R. (2003). Statistical language learning: Mechanisms and constraints. Current Directions in Psychological Science, 12, 110–114.
Santesso, D. L., Segalowitz, S. J., & Schmidt, L. A. (2006). Error-related electrocortical responses in 10-year-old children and young adults. Developmental Science, 9, 473–481.
Schultz, W. (2006). Behavioral theories and the neurophysiology of reward. Annual Review of Psychology, 57, 87–115.
Schultz, W. (2007). Behavioral dopamine signals. Trends in Neurosciences, 30, 203–210.
Sebba, M. (1997). Contact languages: Pidgins and creoles. London: Macmillan.
Seidenberg, M. S. (1997). Language acquisition and use: Learning and applying probabilistic constraints. Science, 275, 1599–1603.
Seidenberg, M. S., MacDonald, M. C., & Saffran, J. R. (2002). Does grammar start where statistics stop? Science, 298, 553–554.
Seidl, A., & Johnson, E. K. (2006). Infant word segmentation revisited: Edge alignment facilitates target extraction. Developmental Science, 9, 565–573.
Senghas, A. (1995). The development of Nicaraguan Sign Language via the language acquisition process. In Proceedings of the 19th annual Boston University conference on language development (pp. 543–552).
Senghas, A., & Coppola, M. (2001). Children creating language: How Nicaraguan Sign Language acquired a spatial grammar. Psychological Science, 12, 323–328.
Senghas, A., Coppola, M., Newport, E. L., & Supalla, T. (1997). Argument structure in Nicaraguan Sign Language: The emergence of grammatical devices. In Proceedings of the 21st annual Boston University conference on language development (Vol. 2, pp. 550–561).
Shephard, E., Jackson, G., & Groom, M. (2014). Learning and altering behaviours by reinforcement: Neurocognitive differences between children and adults. Developmental Cognitive Neuroscience, 7, 94–105.
Singleton, J. L., & Newport, E. L. (2004). When learners surpass their models: The acquisition of American Sign Language from inconsistent input. Cognitive Psychology, 49, 370–407.
Smith, K., & Wonnacott, E. (2010). Eliminating unpredictable variation through iterated learning. Cognition, 116, 444–449.
Steels, L. (2000). Language as a complex adaptive system. In Parallel problem solving from nature PPSN VI (pp. 17–26). Springer.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1). Cambridge, MA: MIT Press.
Thomason, S. G., & Kaufman, T. (1991). Language contact, creolization, and genetic linguistics. University of California Press.
Thompson-Schill, S. L., Ramscar, M., & Chrysikou, E. G. (2009). Cognition without control: When a little frontal lobe goes a long way. Current Directions in Psychological Science, 18, 259–263.
Van de Pol, M., & Cockburn, A. (2011). Identifying the critical climatic time window that affects trait expression. The American Naturalist, 177, 698–707.
Van Duijvenvoorde, A. C., Zanolie, K., Rombouts, S. A., Raijmakers, M. E., & Crone, E. A. (2008). Evaluating the negative or valuing the positive? Neural mechanisms supporting feedback-based learning across development. The Journal of Neuroscience, 28, 9495–9503.
Vulkan, N. (2000). An economist’s perspective on probability matching. Journal of Economic Surveys, 14, 101–118.
Wonnacott, E. (2011). Balancing generalization and lexical conservatism: An artificial language study with child learners. Journal of Memory and Language, 65, 1–14.
Wonnacott, E., Newport, E. L., & Tanenhaus, M. K. (2008). Acquiring and processing verb argument structure: Distributional learning in a miniature language. Cognitive Psychology, 56, 165–209.