Stochastic Learning Models with Distributions of Parameters

JOSEPH D. OFFIR

City University of New York, Graduate School and University Center, New York, New York 10036

Journal of Mathematical Psychology, 1972, 9, 404-417
This paper investigates how the properties of stochastic learning models change when a distribution is assumed over their parameters. The parameters of the linear model and the long-short model are assumed to be independent random variables, each with beta density. Expectations of some conditional learning statistics are taken with respect to these densities. It is concluded that the descriptive and predictive power of the models increases measurably.

1. INTRODUCTION

In most applications of stochastic learning models it is assumed that the same parameter values characterize all the subjects and the items in the experiment. This homogeneity assumption is usually accepted as the rationale for pooling the data from different subjects and items for the purpose of parameter estimation. As Sternberg (1963) points out: "It must be kept in mind when this tacit assumption of individual homogeneity is made in the application of a model type, that what is tested by comparisons between data and model is the conjunction of the assumption and the model type and not the model type alone."

Statistically, the problem may be represented as follows. Let M be a model with parameter space Θ. Then a property s of M (the expectation or variance of some statistic T) depends on the parameter values, i.e., s = s(θ), where θ is a vector of parameters corresponding to a point in the parameter space Θ. If θ is allowed to vary and assume some probability distribution D, then s is appropriately interpreted as a conditional property. Thus, if E(T | θ) and V(T | θ) denote the conditional expectation and variance, respectively, then the corresponding unconditional properties with respect to the distribution D of θ are given by

$$E(T) = E_D[E(T \mid \theta)]$$
and
$$V(T) = E_D[V(T \mid \theta)] + V_D[E(T \mid \theta)] \tag{1}$$

(see, for example, Rao, 1965, p. 79). It is clear that when D is concentrated at a single point the unconditional and conditional properties are the same and the homogeneity assumption is valid. Otherwise conditional and unconditional properties may differ substantially. For example, (1) shows that V(T) exceeds the expectation of V(T | θ).

Very little work has been done in which variation in learning rate parameters is allowed. An early example appears in Bush and Mosteller's (1959) analysis of the Solomon-Wynne data: the linear model was used with a beta distribution assumed over the learning rate parameter. In mental test theory, Birnbaum (1969) modified his previous work on logistic test models (1968) by further assuming a logistic distribution on the ability parameter θ. References to a few other studies that try to cope with this problem are given in Offir (1971). Gregg and Simon's (1967) analysis of the Bower-Trabasso data (1964) is particularly relevant: the concept identification model was assumed with a uniform distribution on the conditioning parameter in a certain range [π₁, π₂], 0 < π₁ < π < π₂ < 1. Their conclusion was that for large individual differences, expressed by a large range [π₁, π₂], the increase in the variance of total number of errors is barely detectable. They go on further to say: "By similar arguments we can show that almost all the 'fine grain' statistics reflect mainly a random component... Hence the statistics are insensitive to individual differences, or, for that matter, to any other psychological aspects of the subjects' behavior that might be expected to affect the statistics." I will show that this is not actually the case: fine-grain statistics are affected greatly when a correct distribution is used to describe individual differences.

There are two main reasons for considering distribution theory for learning parameters. First, little is known about the properties of parameter estimation in the context of learning models, and there is not even a single example of statistical hypothesis testing of these parameters. Secondly, a proper distribution theory may also be quite useful in applications based on learning models. In particular, there are some examples of adaptive teaching techniques which incorporate heterogeneity mechanisms into learning models (e.g., Laubsch, 1969). Here again, no distribution theory has been developed; rather, a composite learning parameter is partitioned into additive item and subject components.

To summarize, the present study is motivated by the absence of an adequate formalization of individual and item differences. Heterogeneity of individuals and items (compounded) will be introduced by considering the parameters of the single-operator linear model and the long-short model as independent random variables, each having a beta density. The objective of this paper, therefore, is to study the effect of the heterogeneity modification on predictions from these models.

The plan of the paper is as follows. First, there is a short description of the learning models used and the statistical densities which are assumed for their learning rates. Expectations of some conditional learning statistics are then taken with respect to these densities. Next, using the resulting unconditional properties, the question of goodness of fit is examined with data from eight paired-associate learning experiments considered by Atkinson and Crothers (1964). Finally, comparisons between the heterogeneous and homogeneous predictions are made. The study will demonstrate that whenever large individual differences are present, a marked improvement in the goodness of fit results from the heterogeneity modification.

2. TOWARD A DISTRIBUTION OVER LEARNING AND GUESSING RATES

For the purpose of exposition use will be made only of the familiar learning models described by Eqs. (2) and (3) and of the beta probability density given by (7). The single-operator linear model (LM) is represented by
$$q_{n+1} = (1 - c)\, q_n, \qquad q_1 = 1 - g, \tag{2}$$

where c denotes the learning rate, 1 - g the initial incorrect guessing probability, and q_n the error probability on trial n.

The long-short model was introduced by Atkinson and Crothers (1964). A two-parameter version of it, LS-2, is represented by the following transition matrix
$$
\begin{array}{c|cccc}
 & L & S & F & U \\ \hline
L & 1 & 0 & 0 & 0 \\
S & c & (1-c)(1-f) & (1-c)f & 0 \\
F & c & (1-c)(1-f) & (1-c)f & 0 \\
U & c & (1-c)(1-f) & (1-c)f & 0
\end{array} \tag{3}
$$

where L, S, F, and U represent the learned state, the short-term state, the forgetting state, and the unlearned state, respectively. The subject starts in the unlearned state, and the transition between S and F takes into account the probability f of forgetting from the short-term state. The probability of a correct response in each of the four states is given by
$$P(\text{correct}) = [1,\ 1,\ g',\ g'] \tag{4}$$
The two states S and F may be lumped together and represented by the following transition matrix and response probability vector

$$
\begin{array}{c|ccc}
 & L & SF & U \\ \hline
L & 1 & 0 & 0 \\
SF & c & 1 - c & 0 \\
U & c & 1 - c & 0
\end{array}
\qquad
P(\text{correct}) = \begin{bmatrix} 1 \\ 1 - f + fg' \\ g' \end{bmatrix} \tag{5}
$$

Two facts should be noted. First, the guessing probability in the intermediate state SF, given by g = 1 - f + fg', is greater than g' except for the two extreme cases where f = 1 or g' = 1. Secondly, after the first trial the subject is in the learned state L, or the intermediate state SF, with probabilities c and 1 - c, respectively. Thenceforth the learning may be described by the following two-state Markov chain:
$$
\begin{array}{c|cc}
 & L & SF \\ \hline
L & 1 & 0 \\
SF & c & 1 - c
\end{array}
\qquad
P(\text{correct}) = \begin{bmatrix} 1 \\ g \end{bmatrix} \tag{6}
$$
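The mechanics of Eqs. (5)-(6) can be made concrete with a short simulation. The sketch below is an illustration only (it is not part of the original paper); the function name `simulate_ls2` and the parameter values are mine, chosen arbitrarily.

```python
import random

def simulate_ls2(c, f, g_prime, n_trials, rng=random):
    """Simulate one subject-item protocol from the lumped LS-2 chain of Eqs. (5)-(6).

    Returns a list of responses coded 0 = correct, 1 = error."""
    g = 1 - f + f * g_prime            # guessing probability in the lumped state SF
    responses, state = [], "U"         # the subject starts in the unlearned state
    for _ in range(n_trials):
        p_correct = {"L": 1.0, "SF": g, "U": g_prime}[state]
        responses.append(0 if rng.random() < p_correct else 1)
        if state != "L":               # learning transition after the response
            state = "L" if rng.random() < c else "SF"
    return responses

# The error rate on trial 2 should be close to (1 - c)(1 - g):
runs = [simulate_ls2(c=0.3, f=0.5, g_prime=0.25, n_trials=5) for _ in range(20000)]
print(sum(r[1] for r in runs) / len(runs), (1 - 0.3) * (1 - (1 - 0.5 + 0.5 * 0.25)))
```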

When forgetting is complete, i.e., f = 1, the LS-2 model reduces to Bower's all-or-none model.

Next, we would like to find a general family of bivariate distributions D for c and g defined on the unit square. The Dirichlet distribution satisfies the desiderata for a family of prior distributions described by Raiffa and Schlaifer (1961, pp. 43-44). A method which is often applied by econometricians to Markov chains with uncertain probabilities is to assume that the transition probabilities for each state of the Markov chain are r.v.'s from a Dirichlet distribution (e.g., Silver, 1963; Lee, Judge, and Zellner, 1970). When this method is applied to a modified version of Eq. (6) it can easily be established that c and g are independent r.v.'s from two beta densities f_c(x | m, n) and f_g(y | r, s), where
$$f_c(x \mid m, n) = B^{-1}(m, n)\, x^{m-1}(1 - x)^{n-1} \tag{7}$$
for m, n > 0, 0 < x < 1, and B(m, n) = Γ(m)Γ(n)/Γ(m + n).

f_g(y | r, s) is defined similarly. The beta p.d.f. defined by Eq. (7) satisfies Raiffa and Schlaifer's desiderata for Bernoulli processes. It is the best known and the most flexible p.d.f. on the unit interval (Raiffa and Schlaifer, 1961, pp. 218-219). Identifying the parameters m and n, or r and s, allows the expression of one's initial ideas concerning both the expectation and the variance of the learning rates c and g, respectively (Good, 1965). The parameters can be determined by equating the expectation and the variance to the following expressions
$$E[c] = m/(m + n), \tag{8}$$
$$V(c) = mn/[(m + n)^2(m + n + 1)], \tag{9}$$
and to similar expressions for the guessing properties.
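The moment-matching step in Eqs. (8)-(9) can be inverted in a few lines. The following Python sketch is an illustration, not part of the original paper; the function name and numerical values are mine.

```python
def beta_params_from_moments(mean, var):
    """Invert Eqs. (8)-(9): return (m, n) of the beta density with the given mean and variance.

    Requires 0 < mean < 1 and 0 < var < mean * (1 - mean)."""
    if not (0 < mean < 1 and 0 < var < mean * (1 - mean)):
        raise ValueError("no beta density has these moments")
    total = mean * (1 - mean) / var - 1          # total = m + n
    return mean * total, (1 - mean) * total      # (m, n)

# e.g. a learning rate believed to average 0.25 with standard deviation 0.1:
m, n = beta_params_from_moments(0.25, 0.1 ** 2)
print(round(m, 2), round(n, 2))   # roughly m = 4.44, n = 13.31
```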

It is important to note that the variance of the beta distribution decreases monotonically for increasing values of the parameters. The graph of the density is bell shaped, rectangular, or U shaped according to whether m and n > 1, m = n = 1, or m or n < 1, respectively. In cases where the variances are close to zero, i.e., almost no individual differences, the parameters m and n are large and therefore the measurement signal-to-noise ratio, E[c]/√V(c) = √[m(m + n + 1)/n], is quite large (Parzen, 1960, p. 378). When this happens it is reasonable to assume that the learning rate is indeed a fixed constant.

Before we proceed, let us consider briefly the question of individual differences as studied by Gregg and Simon (1967). There the conditional expectation and variance of total errors for the Bower-Trabasso concept identification model with parameter π are
$$E(T \mid \pi) = 1/\pi \qquad \text{and} \qquad V(T \mid \pi) = (1 - \pi)/\pi^2. \tag{10}$$
Gregg and Simon assume that π is an r.v. from a beta density with parameters (1, 1), i.e., from a uniform distribution. Let f_π(x) = B^{-1}(r, s) x^{r-1}(1 - x)^{s-1}. Invoking (1) again,

$$E(T) = \frac{r + s - 1}{r - 1}$$
and
$$V(T) = E\left(\frac{1 - \pi}{\pi^2}\right) + V\left(\frac{1}{\pi}\right) = E(T)\,[E(T) - 1]\,\frac{r}{r - 2}. \tag{11}$$

When r → 2 and s = 10.45, E(T) → 11.45, which is Gregg and Simon's (1967) value for E(T) (cf. their Eq. 13), but our variance approaches infinity. Consequently, it is clear that the heterogeneity assumption can cause a sizable increase in the variance of the total number of errors.
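The size of this effect is easy to check numerically. The sketch below is a Monte Carlo illustration of Eqs. (1) and (10)-(11), not a computation from the paper; it treats T | π as geometric on {1, 2, ...}, which has exactly the moments in Eq. (10), and the prior parameters r = 3, s = 10.45 are chosen only so that V(T) is finite.

```python
import random

rng = random.Random(1)
r, s = 3.0, 10.45           # assumed beta prior on the conditioning parameter pi (r > 2)

def draw_total_errors():
    pi = rng.betavariate(r, s)
    t = 1
    while rng.random() >= pi:    # geometric draw with the moments of Eq. (10)
        t += 1
    return t

sample = [draw_total_errors() for _ in range(200_000)]
mean = sum(sample) / len(sample)
var = sum((t - mean) ** 2 for t in sample) / len(sample)

# closed forms from Eq. (11)
e_t = (r + s - 1) / (r - 1)
v_t = e_t * (e_t - 1) * r / (r - 2)
print(mean, e_t)   # both near 6.2
print(var, v_t)    # both near 98, far above the fixed-pi variance (1 - pi)/pi^2 at pi = E[pi]
```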

3. PREDICTIONS FOR THE MODELS HAVING INDEPENDENT LEARNING AND GUESSING RATES WITH BETA DISTRIBUTIONS

For the remainder of this paper we will consider the rates c and g as independent r.v.'s with beta densities. The unconditional properties will be designated by asterisks, as will the unconditional models. Thus, E*(T) = E[E(T | g, c)], where T is the total error statistic; LM* and LS* will denote the LM and the LS-2 model with the beta densities over the learning and guessing rates.

Probabilities of Response Sequences over Trials n to n + 3

A four-tuple response sequence starting on trial n is the sequence
$$O_{i,n} = \langle X_n, X_{n+1}, X_{n+2}, X_{n+3} \rangle,$$
where i = 1, 2, ..., 16 and X_{n+k} = 0 or 1 denotes a correct or an incorrect response on trial n + k, respectively. In the experiments discussed in Section 4, n = 2, i.e., the sequence starts on trial 2. The derivations of the conditional probabilities P(O_{i,n}) are not presented here, since they are straightforward and involve only elementary probability theory. (Readers not familiar with the methods involved in such derivations can consult Atkinson, Bower, and Crothers, 1965.)

The unconditional probabilities of the 16 sequences will be denoted {j₂, j₃, j₄, j₅}, where jᵢ = 0 or 1 indicates a correct or an incorrect response on trial i. Thus, {1, 0, 0, 1} is the probability of errors on trials 2 and 5 and correct responses on trials 3 and 4.

The next step is to find the unconditional probabilities for the two models. The derivation here is again straightforward. The conditional probabilities P(O_{i,n}) are simply integrated with respect to the joint density
$$f_{c,g}(x, y \mid m, n, r, s) = D\, x^{m-1}(1 - x)^{n-1}\, y^{r-1}(1 - y)^{s-1},$$
where D = [B(r, s) B(m, n)]^{-1}. In order to derive the unconditional probabilities it is necessary only to evaluate expectations of some products of c's and g's. If P(O_i) = c^j(1 - c)^k g^p(1 - g)^q - S, where S is the sum of other conditional probabilities, then the unconditional probability is
$$\{O_i\} = E[P(O_i)] = D[B(m + j, n + k)\, B(r + p, s + q)] - E(S). \tag{13}$$

For example, in the LS-2 case, P(O_15) = (1 - c)^3(1 - g)^3 - P(O_16), and therefore {O_15} = D[B(m, n + 3) B(r, s + 3)] - {O_16}. The unconditional probabilities {O_i}, i = 1, 2, ..., 16, are the results of these integrations and are given in Table 1 for the LS* and in Table 2 for the LM*.

TABLE 1
LS* Probabilities of Response Sequences over Trials 2-5 in Terms of the Parameters (r, s) and (m, n)

{O_16} = {1, 1, 1, 1} = D[B(r, s + 4) B(m, n + 4)]
{O_15} = {1, 1, 1, 0} = D[B(r, s + 3) B(m, n + 3)] - {O_16}
{O_14} = {1, 1, 0, 1} = D[B(r + 1, s + 3) B(m, n + 4)]
{O_13} = {1, 1, 0, 0} = D[B(r, s + 2) B(m, n + 2)] - {O_14} - {O_15} - {O_16}
{O_12} = {1, 0, 1, 1} = D[B(r + 1, s + 3) B(m, n + 4)] = {O_14}
{O_11} = {1, 0, 1, 0} = D[B(r + 1, s + 2) B(m, n + 3)] - {O_12}
{O_10} = {1, 0, 0, 1} = D[B(r + 2, s + 2) B(m, n + 4)]
{O_9}  = {1, 0, 0, 0} = D[B(r, s + 1) B(m, n + 1)] - Σ_{i=10}^{16} {O_i}
{O_8}  = {0, 1, 1, 1} = {O_12}
{O_7}  = {0, 1, 1, 0} = {O_11}
{O_6}  = {0, 1, 0, 1} = {O_10}
{O_5}  = {0, 1, 0, 0} = D[B(r + 1, s + 1) B(m, n + 2)] - {O_6} - {O_7} - {O_8}
{O_4}  = {0, 0, 1, 1} = {O_10}
{O_3}  = {0, 0, 1, 0} = D[B(r + 2, s + 1) B(m, n + 3)] - {O_4}
{O_2}  = {0, 0, 0, 1} = D[B(r + 3, s + 1) B(m, n + 4)]
{O_1}  = {0, 0, 0, 0} = 1 - Σ_{i=2}^{16} {O_i}

TABLE 2
LM* Probabilities of Response Sequences over Trials 2-5 in Terms of the Parameters (r, s) and (m, n)

{O_16} = {1, 1, 1, 1} = D[B(r, s + 4) B(m, n + 10)]
{O_15} = {1, 1, 1, 0} = D[B(r, s + 3) B(m, n + 6)] - {O_16}
{O_14} = {1, 1, 0, 1} = D[B(r, s + 3) B(m, n + 7)] - {O_16}
{O_13} = {1, 1, 0, 0} = D[B(r, s + 2) B(m, n + 3)] - {O_14} - {O_15} - {O_16}
{O_12} = {1, 0, 1, 1} = D[B(r, s + 3) B(m, n + 8)] - {O_16}
{O_11} = {1, 0, 1, 0} = D[B(r, s + 2) B(m, n + 4)] - {O_12} - {O_15} - {O_16}
{O_10} = {1, 0, 0, 1} = D[B(r, s + 2) B(m, n + 5)] - {O_12} - {O_14} - {O_16}
{O_9}  = {1, 0, 0, 0} = D[B(r, s + 1) B(m, n + 1)] - Σ_{i=10}^{16} {O_i}
{O_8}  = {0, 1, 1, 1} = D[B(r, s + 3) B(m, n + 9)] - {O_16}
{O_7}  = {0, 1, 1, 0} = D[B(r, s + 2) B(m, n + 5)] - {O_8} - {O_15} - {O_16}
{O_6}  = {0, 1, 0, 1} = D[B(r, s + 2) B(m, n + 6)] - {O_8} - {O_14} - {O_16}
{O_5}  = {0, 1, 0, 0} = D[B(r, s + 1) B(m, n + 2)] - {O_6} - {O_7} - {O_8} - {O_13} - {O_14} - {O_15} - {O_16}
{O_4}  = {0, 0, 1, 1} = D[B(r, s + 2) B(m, n + 7)] - {O_8} - {O_12} - {O_16}
{O_3}  = {0, 0, 1, 0} = D[B(r, s + 1) B(m, n + 3)] - {O_4} - {O_7} - {O_8} - {O_11} - {O_12} - {O_15} - {O_16}
{O_2}  = {0, 0, 0, 1} = D[B(r, s + 1) B(m, n + 4)] - {O_4} - {O_6} - {O_8} - {O_10} - {O_12} - {O_14} - {O_16}
{O_1}  = {0, 0, 0, 0} = 1 - Σ_{i=2}^{16} {O_i}

In order to make predictions from Tables 1 and 2, estimates of the distribution parameters are needed. Toward this end the χ² associated with the O_i events is minimized. Let {O_i; m, n, r, s} denote the probability of the event O_i, where m, n, r, and s have been listed to make explicit the fact that the expression is a function of the four parameters. Further, let N(O_i) denote the observed frequency of outcome O_i over trials 2-5. Finally, let T = N(O_1) + N(O_2) + ... + N(O_16). Then we define the function
$$\chi^2(m, n, r, s) = \sum_{i=1}^{16} \frac{[T\{O_i; m, n, r, s\} - N(O_i)]^2}{T\{O_i; m, n, r, s\}} \tag{14}$$



The estimates of r, s, m, and n are selected so that they jointly minimize the function (14). A high-speed computer was programmed to carry out a numerical search over all possible parameters until a minimum was obtained, accurate up to one decimal place. The degrees of freedom associated with a model that requires k parameters to be estimated from the data are df = 16 - k - 1; the one is subtracted because of the restriction that the 16 probabilities sum to 1. In this case k = 4. There are other numerical estimation procedures available, e.g., numerical maximum likelihood or least-squares procedures, but since our data were analyzed originally by means of minimum χ² procedures, this method is preferred in order to facilitate later comparisons between the original analysis and ours.
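A numerical search of this kind can be set up along the following lines. The sketch is an illustration, not the author's program: it assumes scipy, reuses the ls_star_probs helper from the sketch after Table 1, and the observed frequencies shown are hypothetical, not data from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def min_chi_square(counts, seq_probs):
    """Minimize the chi-square function of Eq. (14) over (m, n, r, s).

    counts[i - 1] = N(O_i); seq_probs(m, n, r, s) must return the dict {i: {O_i}}."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()                          # T in Eq. (14)

    def chi2(params):
        m, n, r, s = params
        if min(m, n, r, s) <= 0:
            return 1e12                           # keep the search inside the parameter space
        p = seq_probs(m, n, r, s)
        expected = np.array([total * p[i + 1] for i in range(16)])
        if np.any(expected <= 0):
            return 1e12
        return float(np.sum((expected - counts) ** 2 / expected))

    result = minimize(chi2, x0=[2.0, 8.0, 2.0, 2.0], method="Nelder-Mead",
                      options={"xatol": 1e-3, "fatol": 1e-3})
    return result.x, result.fun                   # estimates and minimum chi-square (df = 16 - 4 - 1)

# hypothetical observed frequencies of O_1, ..., O_16 (not data from the paper)
example_counts = [120, 25, 20, 8, 18, 7, 6, 9, 17, 6, 5, 8, 5, 7, 9, 30]
params, chi2_min = min_chi_square(example_counts, ls_star_probs)
print(params.round(2), round(chi2_min, 2))
```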

Error Rate Conditionalized on Previous Errors

We now consider the probability of an error on trial k + 1 conditionalized on an error on trial k, P(e_{k+1} | e_k). Atkinson and Crothers used this quantity to discriminate between various learning models. For the LS-2 model this probability is constant after the first trial, whereas for the LM it decreases with increasing values of k. For the LM*, i.e., the LM with the heterogeneity modification, the conditionalized probability of an error after an error is given by
$$P^*(e_{k+1} \mid e_k) = \frac{B(r, s + 2)\, B(m, n + 2k - 1)}{B(r, s + 1)\, B(m, n + k - 1)} = \frac{s + 1}{r + s + 1} \prod_{i=1}^{k} \frac{n + k - 2 + i}{m + n + k - 2 + i}. \tag{15}$$

It can be concluded from (15) that the probability P*(e_{k+1} | e_k) is a monotone nonincreasing function of k. Also, P*(e_{k+1} | e_k) ≠ P*(e_{k+1}), i.e., the LM* is response sensitive whereas the LM is response insensitive. For the LS* we get quite a different result:
$$P^*(e_{k+1} \mid e_k) = \frac{P^*(e_{k+1} \cap e_k)}{P^*(e_k)} = \frac{E[(1 - c)^k (1 - g)^2]}{E[(1 - c)^{k-1}(1 - g)]} = \left(\frac{s + 1}{r + s + 1}\right)\left(\frac{n + k - 1}{m + n + k - 1}\right). \tag{16}$$

Specifically, the probability of an error following an error increases in k. The rate of increase, however, will be smaller for large m/n values. From Eqs. (15) and (16) it is to be expected that if the LM* and the LS* are fitted to some data with decreasing frequencies of an error following an error, the estimate of m/n in the LS* case should be larger than the estimate in the LM* case. Oddly enough, in the LS* case the Vincent curves remain invariant under the heterogeneity assumption. That is, the probability of a correct response on trial i, given that the last error occurs on trial j (j > i), is constant for all trials i and depends only on the guessing parameters r and s:
$$P^*(c_i \mid e_j) = \frac{E[(1 - c)^{j-1}\, g\, (1 - g)]}{E[(1 - c)^{j-1}(1 - g)]} = \frac{r}{r + s + 1}. \tag{17}$$
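The contrast between Eqs. (15) and (16) is easy to tabulate. The sketch below is an illustration only (scipy assumed; the parameter values are arbitrary, not estimates from the paper).

```python
from scipy.special import beta as B

def p_error_after_error_lm_star(k, m, n, r, s):
    """Eq. (15): P*(e_{k+1} | e_k) for the heterogeneous linear model."""
    return (B(r, s + 2) * B(m, n + 2 * k - 1)) / (B(r, s + 1) * B(m, n + k - 1))

def p_error_after_error_ls_star(k, m, n, r, s):
    """Eq. (16): P*(e_{k+1} | e_k) for the heterogeneous long-short model."""
    return ((s + 1) / (r + s + 1)) * ((n + k - 1) / (m + n + k - 1))

# illustrative values only: the LM* column decreases in k, the LS* column increases
m, n, r, s = 2.0, 8.0, 3.0, 3.0
for k in range(1, 6):
    print(k,
          round(p_error_after_error_lm_star(k, m, n, r, s), 3),
          round(p_error_after_error_ls_star(k, m, n, r, s), 3))
```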


Total Number of Errors

Let T be the total number of errors in an infinite number of learning trials. Then the conditional total error properties for given c and g are
$$
\begin{array}{lll}
 & \text{LS-2} & \text{LM} \\
\Pr(T = 0 \mid g, c) & g'b & \text{--} \\
\Pr(T = k \mid g, c),\ k \ge 1 & (1 - g'b)\, b\, (1 - b)^{k-1} & \text{--} \\
E(T \mid g, c) & (1 - g'b)/b & (1 - g)/c \\
V(T \mid g, c) & (1 - g'b)[1 - b(1 - g')]/b^2 & E(T \mid g, c) - (1 - g)^2/[1 - (1 - c)^2]
\end{array} \tag{18}
$$
where b = c[1 - (1 - c)g]^{-1}.

We now calculate the unconditional properties for the LS*:
$$E^*(T) = E[E(T \mid g, c)] = E[(1 - g'b)/b] = D \int_0^1 \!\! \int_0^1 (1/b - g')\, g^{r-1}(1 - g)^{s-1}\, c^{m-1}(1 - c)^{n-1}\, dg\, dc,$$
where D = [B(m, n) B(r, s)]^{-1}. It is readily found that
$$E^*(T) = (1 - g') + \frac{n}{m - 1}\cdot\frac{s}{r + s}. \tag{19}$$

To calculate the variance we apply Eq. (1):
$$V^*(T) = E[V(T \mid g, c)] + V[E(T \mid g, c)] = 2E(1/b^2) - E^2(1/b) - E(1/b) + g'(1 - g')$$
$$= 2E(1/b^2) - (E^*(T) + g')(1 + E^*(T) + g') + g'(1 - g'). \tag{20}$$

We need to derive E(1/b²):
$$E(1/b^2) = E\{[1 - (1 - c)g]^2/c^2\} = E[(1 - g)^2/c^2 + 2g(1 - g)/c + g^2]$$
$$= D[B(m - 2, n)\, B(r, s + 2) + 2B(m - 1, n)\, B(r + 1, s + 1) + B(m, n)\, B(r + 2, s)]. \tag{21}$$


The unconditional mean for the LM* is s/(r + s) + [n/(m - 1)][s/(r + s)]. Unfortunately, the unconditional variance for the LM* cannot be evaluated in closed form because of the 1 - (1 - c)² term in the denominator of the conditional variance.
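For the LS*, Eqs. (19)-(21) can be combined into a short routine. The sketch below is an illustration, not part of the paper; it assumes scipy, and the parameter values, including the guessing probability g' set to a hypothetical chance level, are mine.

```python
from scipy.special import beta as B

def ls_star_total_error_moments(m, n, r, s, g_prime):
    """Unconditional mean and variance of total errors for the LS*, Eqs. (19)-(21).

    Requires m > 2 so that E(1/b^2), and hence V*(T), is finite."""
    D = 1.0 / (B(m, n) * B(r, s))
    mean = (1 - g_prime) + (n / (m - 1)) * (s / (r + s))                    # Eq. (19)
    e_inv_b2 = D * (B(m - 2, n) * B(r, s + 2)
                    + 2 * B(m - 1, n) * B(r + 1, s + 1)
                    + B(m, n) * B(r + 2, s))                                # Eq. (21)
    var = (2 * e_inv_b2
           - (mean + g_prime) * (1 + mean + g_prime)
           + g_prime * (1 - g_prime))                                       # Eq. (20)
    return mean, var

# illustrative values only; g' is the guessing probability in the U and F states
print(ls_star_total_error_moments(m=3.0, n=8.0, r=3.0, s=3.0, g_prime=1/3))
```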

4. DATA ANALYSIS

Atkinson and Crothers (1964) carried out the analysis of data from eight paired-associate learning experiments for seven different conditional models, i.e., models for which the learning parameters were assumed to be fixed constants for the population of subject-items. They estimated the parameters by minimizing a χ² associated with the sixteen response sequences over trials 2-5. Experiments Ia, Ib, II, and IV were run with college students, whereas III, Va, Vc, and Ve were run with young children. It is instructive to use their data to compare different conditional and unconditional models.

The minimization procedure described in Eq. (14) was applied to the data of observed frequencies described in Atkinson and Crothers' Table 2. Our Table 3 presents the estimates of the four distribution parameters r, s, m, and n that minimize the χ² function (14) for the unconditional models LS* and LM*. In the LS* case the following should be noted. The experiments which were run with the college students, Ia, Ib, II, and IV, tended to have higher estimates for the distribution parameters. These estimates should be viewed with caution, since the numerical estimation procedure using Eq. (14) became increasingly unstable for high values. They also possess a high measurement signal-to-noise ratio. Therefore, in these cases we should use Atkinson and Crothers' estimates for the LS-2 parameters and their corresponding χ² values.

TABLE 3
Parameter Estimates for All Eight Experiments

                            Experiment
Model   Parameter    Ia      Ib      II      III     IV      Va      Vc      Ve
LS*     r           55.00   54.13   46.75    2.69   12.69    6.50    3.00   10.50
        s           53.62   63.12   54.04    3.00   20.75   10.50    3.00   10.50
        m           15.59   34.12   14.00    3.31   24.00    2.50    2.00    3.00
        n           27.12   76.62   41.99   16.75   67.69   21.50   12.25    8.00
LM*     r           13.59    2.69    1.82    1.13    1.75    2.22    1.00    1.48
        s           31.48   19.25    1.79    1.14   51.75    2.20    1.75    1.37
        m            1.93    3.00    2.46    2.06    3.31    1.29    1.00    2.42
        n            1.75    3.00    8.84   24.30    4.25   51.29    2.75    7.07

Table 4 summarizes the estimated values of the means and the variances of the beta densities, obtained by substituting the estimates of Table 3 in Eqs. (8) and (9). It is important to note that when a comparison is made between the expected values of the conditioning parameter for the LS* model, E(c), and the corresponding estimate for the LS-2 model, the results are nearly identical.

TABLE 4
Estimates of Distribution Means and Variances

                            Experiment
Model   Property     Ia     Ib     II     III    IV     Va     Vc     Ve
LS*     E(g)        .502   .461   .463   .472   .379   .382   .500   .500
        E(c)        .363   .308   .250   .165   .262   .104   .140   .273
        100V(g)     .230   .210   .244  3.730   .684  1.310  3.570  1.130
        100V(c)     .531   .190   .329   .654   .208   .373   .791  1.650
LM*     E(g)        .301   .122   .503   .498   .032   .502   .363   .521
        E(c)        .523   .500   .218   .078   .438   .024   .261   .255
        10V(g)      .046   .046   .542   .764   .005   .460   .617   .647
        10V(c)      .532   .357   .139   .026   .287   .004   .411   .181

TABLE 5
Minimum χ² Values

Experiment       LS*          LM*          LS-2           LM
Ia              5.10ᵃ        10.31ᵃ        6.75ᵃ         50.92
Ib             20.21ᵃ        21.75ᵃ       19.69ᵃ         95.86
II              4.45ᵃ        66.26         3.73ᵃ        251.30
III            22.96ᵃ        66.56        33.02         296.30
IV             12.54ᵃ        42.45        12.32ᵃ        146.95
Va             21.36ᵃ        65.88        24.41ᵃ        201.98
Vc              9.92ᵃ         6.80ᵃ       27.12ᵃ        236.15
Ve             17.69ᵃ        47.05        20.12ᵃ        262.56
Total         114.23        327.06       147.16        1542.02
df          11 × 8 = 88   11 × 8 = 88   13 × 8 = 104   14 × 8 = 112

ᵃ Not significant at the .01 level.

Table 5 presents the minimum χ² values for the conditional and the unconditional models. In the latter case, these values are obtained by using the parameter estimates of Table 3 in Eq. (14).


The value needed for significance at the 0.01 level is 24.7 for 11 degrees of freedom. Using our estimates, all of the values for the LS* are not significant at this level. For the LM* the values for experiments Ia, Ib, and Vc are not significant. Comparing our minimum χ² values with those derived by Atkinson and Crothers for the LS-2 model, the improvement in χ² easily outweighs the loss of two degrees of freedom for experiments III and Vc; e.g., in experiment III the probabilities corresponding to the values for the LS* and LS-2 are 0.02 and 0.002, respectively. From Table 4 it is seen that the variances of g and c for these experiments are much larger than the same variances for experiments Ia, Ib, II, and IV.

The LM* parameter estimates in Table 3 tend to be much smaller when compared with the same estimates for the LS*. This fact is reflected more clearly in Table 4, where the variances of both g and c are larger by an order of magnitude of 10 for the LM* than for the LS*. The larger variances may be a consequence of the model's inability to account for the data.

The unconditional variances for experiments Ia, Ib, Vc, and Ve were calculated by using Eq. (20) and are presented in Table 6; the conditional variances V(T) for the LS-2 are calculated from Eq. (18).

TABLE 6
Predicted Conditional and Unconditional Variance of T

Experiment     V(T)     V*(T)
Ia             1.89      2.19
Ib             2.94      3.18
Vc            19.40
Ve             5.78     17.00

Table 6 demonstrates that the variance of total errors is very sensitive indeed to individual differences. For small differences it is to be expected that the unconditional variance V*(T) will be only slightly larger than V(T). Experiments Ia and Ib were run with college students, and almost no errors were committed after the second trial. In these two experiments the V*(T) variances are very close to the conditional ones. On the other hand, experiments Vc and Ve were run with four- and five-year-old children, and there was a large difference in their performance. This difference is expressed overwhelmingly in the magnitude of the difference between the variances, i.e., V*(T) > V(T). It is clear therefore that the model is sensitive enough to detect individual differences if there are any.


5. DISCUSSION AND CONCLUSION

The intention of this paper has not been merely to compare or select models, but rather to amend some of the inadequacies which are introduced into simple learning models by ignoring the essential features of individual learning rates. Simple learning models were used to test whether or not the homogeneity assumption has in fact a sizable effect on the learning properties. It was shown that the properties of the models become sensitive to individual differences to the degree that such differences exist. This fact was demonstrated by the change in the variances of the total number of errors, and by highly improved χ² values for experiments which were run with heterogeneous subject populations.

Much work remains to be done before these techniques can be routinely used to evaluate the variance component of learning properties due to individual differences. What is needed are better estimation procedures, tried with different distribution assumptions. In addition, sequential techniques, similar to the ones used in a Bayesian analysis, could be very useful in describing student learning rates on different trials.

REFERENCES

ATKINSON, R. C., BOWER, G. H., AND CROTHERS, E. J. An introduction to mathematical learning theory. New York: Wiley, 1965.
ATKINSON, R. C., AND CROTHERS, E. J. A comparison of paired-associate learning models having different acquisition and retention axioms. Journal of Mathematical Psychology, 1964, 1, 285-315.
BIRNBAUM, A. Statistical theory of logistic mental test models with a prior distribution of ability. Journal of Mathematical Psychology, 1969, 6, 258-276.
BIRNBAUM, A. Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord and M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley, 1968. Chapters 17-20.
BOWER, G. H., AND TRABASSO, T. R. Concept identification. In R. C. Atkinson (Ed.), Studies in mathematical psychology. Stanford: Stanford University Press, 1964. Pp. 32-94.
BUSH, R. R., AND MOSTELLER, F. A comparison of eight models. In R. R. Bush and W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford: Stanford University Press, 1959. Pp. 293-307.
DEGROOT, M. H. Optimal statistical decisions. New York: McGraw-Hill, 1970.
GOOD, I. J. The estimation of probabilities. Cambridge: The MIT Press, 1965.
GREGG, L. W., AND SIMON, H. A. Process models and stochastic theories of simple concept formation. Journal of Mathematical Psychology, 1967, 4, 246-276.
LAUBSCH, S. H. An adaptive teaching system for optimal item allocation. Tech. Rep. 151, Institute for Mathematical Studies in the Social Sciences, Stanford University, 1971.
LEE, T. C., JUDGE, G. G., AND ZELLNER, A. Estimating the parameters of the Markov probability model from aggregate time series data. Amsterdam: North-Holland, 1970.
OFFIR, J. Some mathematical models of individual differences in learning and performance. Tech. Rep. 176, Institute for Mathematical Studies in the Social Sciences, Stanford University, 1971.
PARZEN, E. Modern probability theory and its applications. New York: Wiley, 1960.
RAIFFA, H., AND SCHLAIFER, R. Applied statistical decision theory. Cambridge: The MIT Press, 1961.
RAO, C. R. Linear statistical inference and its applications. New York: Wiley, 1965.
SILVER, E. Markovian decision process with uncertain transition probabilities or reward. Tech. Rep. 1, O.R. Center, MIT, 1963.
STERNBERG, S. H. Stochastic learning theory. In R. D. Luce, R. R. Bush, and E. Galanter (Eds.), Handbook of mathematical psychology, Vol. 2. New York: Wiley, 1963.