JOURNAL OF MATHEMATICAL PSYCHOLOGY 36, 185-212 (1992)

Dynamic Paired-Comparison Scaling

WILLIAM H. BATCHELDER, NEIL J. BERSHAD, AND ROBERT S. SIMPSON

University of California at Irvine
This paper describes an approach to paired-comparison scaling that may be useful when underlying scale values change with time. Sequentially revised estimates of scale values underlying a fixed set of choice objects are calculated from binary choice data collected on discrete trials. The system is defined by a reward function that controls the sequential increments and decrements to the estimates. Results of the reward system depend jointly on the difference between the current estimates and the binary choice data. Natural restrictions on reward systems are presented that imply uniqueness up to a scaling factor. Asymptotic properties are provided for a particular reward system based on a uniform distribution function, and methods for obtaining results for other distributions are discussed. One-object, two-object, and many-object estimation problems are formulated. It is shown that the two-object problem based on a uniform reward system corresponds to the Bush-Mosteller linear operator model for two-choice learning; that the one-object problem can be related to both adaptive testing and adaptive threshold estimation; and that the many-object problem provides a sequential estimation scheme for dynamic paired-comparison systems. Refinements include a method for introducing a draw or no-preference outcome into the reward system. © 1992 Academic Press, Inc.
We describe an approach to paired-comparison scaling when underlying scale values may change with time. The approach described here enables one to obtain sequentially revised estimates of scale values or underlying ratings on a fixed set of choice objects. The approach uses choice data collected on discrete trials and assumes that underlying ratings may change. Since most paired-comparison scaling methods presume that underlying scale values are fixed, our methods provide new tools that may be useful in situations where change may be expected. The approach is motivated by the problem of rating chess ability. Chess ability is rated throughout the world by a system created by Arpad Elo (1978). The Elo system is formally described, along with some extensions, in Batchelder and Bershad (1979) and in Batchelder and Simpson (1989), so this paper can be thought of as a sequel to these two papers.*

* Special thanks are due to Rogers Saxon and David Strauss for their insightful comments during various stages of this work. Preparation of the manuscript was supported in part by National Science Foundation Grants SES-8808358 to A. K. Romney and W. H. Batchelder and BNS-8910552 to W. H. Batchelder and D. M. Riefer. We also acknowledge the support of the Irvine Research Unit in Mathematical Behavioral Science. Reprint requests for this and related articles should be sent to Dr. William H. Batchelder, School of Social Sciences, University of California at Irvine, Irvine, CA 92111.

The heuristic behind the current approach is the psychological fact that the chess community regards chess ratings
in many respects as economists treat money. When chess players encounter an opponent, they risk rating points, and when they win, they win rating points from the opponent. The literature of the chess playing community expresses a concern for inflationary or deflationary trends in the rating pool, as well as in creating rating equity between geographically distinct areas. These views of the chess playing community have evolved in praxis; our goal is to capitalize on these insights by attempting to axiomatize a system of rating revision rules that is based on both economic and statistical notions. The final system we developed is very easy to implement on a personal computer, is applicable to any paired-comparison scaling situation, and has a number of desirable properties, some of which may be transferable to adaptive threshold determination and adaptive testing procedures where underlying thresholds or abilities may change. The paper is organized into several sections. Section 1 provides formal definitions of paired-comparison scaling, and Section 2 extends these notions to a dynamic setting. Section 3 axiomatizes a sequential estimation scheme for paired-comparison scaling. Sections 4-6 provide some of the estimation theory for the sequential scheme developed in Section 3; in particular, results for the two-object, one-object, and many-object estimation problems are presented. Finally, Section 7 shows how to incorporate a draw or no-preference outcome into the system.

1. PRELIMINARIES
The idea of a monotone paired-comparison system is captured in the next two definitions adapted from Suppes and Zinnes (1963). In the intended interpretation, the function p(a, b) is the probability that object a is preferred to (beats) object b in a choice experiment (contest).

DEFINITION 1. ⟨A, p⟩ is a paired-comparison system (PCS) in case A is a finite set, and p: A × A → [0, 1] such that, for all a, b ∈ A,

    p(a, b) = 1 − p(b, a).

DEFINITION 2. A paired-comparison system ⟨A, p⟩ is said to be a monotone paired-comparison system (MPCS) in case there is a pair ⟨F, μ⟩ such that

(i) F is a cumulative distribution function on the reals (Re) subject to F(x) = 1 − F(−x), for all x ∈ Re, and F is strictly increasing on its "support" (here the x with 0 < F(x) < 1).

(ii) μ: A → Re such that, for all a, b ∈ A,

    p(a, b) = F[μ(a) − μ(b)].   (1)
The contents of Definitions 1 and 2 are well known in the literature on paired-comparison scaling. For example, Baird and Noma (1978, Chap. 9) discuss several
models satisfying these definitions, including the Thurstone Case V model and the Bradley-Terry-Luce (BTL) model. For later reference, we give the F in these models in general form in the next two equations. The Thurstone Case V model employs a zero-mean Gaussian distribution given by

    F(x) = (1/B√(2π)) ∫_{−∞}^{x} exp(−t²/2B²) dt,   (2)

and the BTL model involves a zero-mean logistic distribution given by

    F(x) = [1 + exp(−x/2B)]⁻¹,   (3)
where the quantity B > 0 is a scaling factor. In addition to these examples of monotone paired-comparison systems, it is useful to define another system presented in Batchelder and Bershad (1979) called the uniform system. The uniform system involves a zero-mean uniform distribution given by

    F(x) = 1              if x > B
           (x + B)/2B     if −B ≤ x ≤ B      (4)
           0              if x < −B,
where B > 0. The uniform system is a useful approximation to many MPCSs. To see this, note that for differentiable F and small |x|, the first-order Taylor expansion about x = 0 is

    F(x) ≈ 1/2 + f(0)x,   (5)

where f is the density function corresponding to F. Equation (5) corresponds to a uniform system with B = [2f(0)]⁻¹. An important fact about monotone paired-comparison systems is that they satisfy a type of interval measurement property. For instance, suppose that ⟨F, μ⟩ makes the paired-comparison system ⟨A, p⟩ into a MPCS. Then one can pick α > 0 and β arbitrarily, and define the rescaling function τ on A by

    τ(a) = αμ(a) + β,   for all a ∈ A.   (6)

If the distribution function G is defined by

    G(x) = F(x/α),   (7)

it is easy to show that ⟨G, τ⟩ also makes ⟨A, p⟩ into a monotone PCS. In fact, Doignon and Falmagne (1974) discuss some reasonable, additional conditions that guarantee
interval level measurement in the sense that any two scalings ⟨F, μ⟩ and ⟨G, τ⟩ on the same paired-comparison system are related by Eqs. (6) and (7) for some α > 0 and β.
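The interval-scale property is easy to illustrate numerically. The sketch below is our own toy example (a logistic F with B = 1 and hypothetical ratings μ, all values arbitrary); it checks that the rescaled pair ⟨G, τ⟩ of Eqs. (6) and (7) reproduces exactly the choice probabilities of ⟨F, μ⟩.

```python
import math

B = 1.0  # scaling factor for the logistic F of Eq. (3); value is arbitrary

def F(x):
    # zero-mean logistic distribution, Eq. (3)
    return 1.0 / (1.0 + math.exp(-x / (2.0 * B)))

mu = {"a": 0.3, "b": -0.5, "c": 1.1}   # hypothetical ratings

alpha, beta = 2.5, 7.0                 # arbitrary constants, alpha > 0

def tau(obj):
    # rescaled ratings, Eq. (6): tau(a) = alpha * mu(a) + beta
    return alpha * mu[obj] + beta

def G(x):
    # rescaled distribution, Eq. (7): G(x) = F(x / alpha)
    return F(x / alpha)

# <F, mu> and <G, tau> induce identical choice probabilities
for a in mu:
    for b in mu:
        assert abs(F(mu[a] - mu[b]) - G(tau(a) - tau(b))) < 1e-12
```

The additive constant β cancels in every difference τ(a) − τ(b), and the stretch α is undone by G, which is the content of the interval-measurement claim.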
2. DYNAMIC PAIRED-COMPARISON SYSTEMS
In practice, monotone PCSs are used to estimate the scale values μ(a), hereafter called ratings, from binary choice data. To see this, note that Eq. (1) leads to the equation

    μ(a) − μ(b) = F⁻¹[p(a, b)],   (8)
since F is strictly increasing on its support and thus has an inverse F⁻¹ on (0, 1). Typically the p(a, b) are estimated by relative frequencies in a replicated round robin design, and then some iterative search routine leads to estimates of the ratings. If replicated round robin data are unavailable, one may consult a large literature on systematic, incomplete designs (e.g., see Golledge and Rayner, 1982, for discussions of sampling and incomplete designs in paired-comparison experiments). Also, Batchelder and Bershad (1979) discuss methods for obtaining estimates of the ratings from very sparse and unsystematic data structures. There is a large and still growing literature on paired-comparison scaling. Traditional sources are covered by Davidson and Farquhar (1976), which includes essentially all the work through the mid-1970s, and David (1963, 1988) remains a useful source. However, almost all of the work on paired comparisons assumes that underlying ratings do not change over time. The purpose of this section is to develop a sequential estimation scheme for handling the case where underlying ratings may not be stationary. Suppose binary choice data on A are collected on each of a series of trials. On any trial, assume that at most one choice experiment is conducted on any pair of choice objects. Accordingly, define the random variables

    T_n(a, b) =  1    if a is chosen over b on trial n
                −1    if b is chosen over a on trial n          (9)
                 0    if no choice experiment on a and b is conducted on trial n,

for all a, b ∈ A and n = 0, 1, 2, .... The data structure for dynamic estimation is captured in the next definition.

DEFINITION 3. The structure

    𝔡 = ⟨A; T_n, n = 0, 1, 2, ...⟩
is a discrete trial binary choice experiment in case A is a finite set, and, for each n = 0, 1, 2, ..., T_n is a function from A × A into random variables with space {1, 0, −1} and satisfying T_n(a, b) + T_n(b, a) = 0, for all a, b ∈ A. If we assume underlying ratings may change, we need to broaden the definition of a monotone PCS to reflect this. We want to keep the choice set A and the distribution F fixed but allow μ and consequently p to vary with trials n. This idea is captured in the next definition, where the intended interpretation links the T_n(a, b) and p_n by the equation

    Pr(T_n(a, b) = 1) = p_n(a, b),
where it is assumed that a choice experiment is conducted between a and b.

DEFINITION 4. The structure

    𝔇 = ⟨A; ⟨F, μ_n⟩, n = 0, 1, 2, ...⟩

is a dynamic monotone paired-comparison system (dynamic MPCS) in case, for each n, ⟨A, p_n⟩ is a monotone PCS with ⟨F, μ_n⟩.
We are now in a position to state the central problem of this paper. We suppose the choice set A is fixed and that we have data from a discrete trial binary choice experiment, that is, a realization of 𝔡 = ⟨A; T_n, n = 0, 1, 2, ...⟩. We assume some distribution function F subject to the conditions of Definition 2, and we assume that underlying 𝔡 is a dynamic monotone PCS with the assumed F. Now suppose we have reasonably stable initial estimates r_0(a), for the rating of each a ∈ A, relative to F; how do we use the data to obtain sensible revised estimates r_n(a) of the unknown, underlying dynamic ratings μ_n(a)? Put less formally, we are given discrete trial binary choice data on a choice set and some suitable distribution function F. If we are also given stable initial estimates of the ratings, how can we use the discrete trial binary choice data to obtain trial-to-trial sequentially revised estimates of the underlying, possibly changing ratings? There are two general approaches to our problem. First, one could postulate some learning model (stochastic process) on the underlying ratings. The model would postulate how the underlying ratings change over trials, and it would have its own learning parameters. Then one could use the discrete trial data to estimate the parameters of the learning model, and these estimates could be used to provide estimates of the ratings. One example of this approach is Luce's (1959) beta model, which nicely combines paired-comparison scaling with a suitable stochastic learning model. For a second example, we show in Observation 3 that the Bush-Mosteller (1955) linear operator model for two-choice learning can also be
looked at in this way. The learning model approach to our problem is recommended for situations where enough is known about the phenomenon underlying the sequential choice data to justify postulating a particular model. On the other hand, there are many situations involving the scaling of preferences or abilities where it is not reasonable to postulate a particular learning model. For example, food preferences may change in unpredictable and idiosyncratic ways over the life span of an individual. In the case of cognitive skills, such as chess playing, we know that ability tends to increase with study and may decrease with disuse and age. However, the data base may not include enough evidence of this sort to permit the formulation of an explicit model for changing ability, and even if it did, use of the model to rate ability would probably be resisted by many people, especially elderly ones whose strong performances would likely be discounted by the model. In this paper, we will approach our problem without imposing any model whatsoever about how underlying ratings might change over trials. In this case, a rational sequential estimation scheme must meet the following two requirements: (1) it must be responsive to changes in underlying ratings when they occur; and (2) it must behave well in cases where underlying ratings stabilize over a long sequence of trials. Thus the system we design will need to track changing ratings when they occur and converge in some sense to stable ratings when change is absent.
3. AXIOMS FOR A REWARD SYSTEM

In this section we define a reward system for sequential estimation. The idea is that if, say, object a is chosen over object b on some trial n, then the revised estimate of μ(a) should increase and the revised estimate of μ(b) should decrease. Since choice probabilities depend only on the differences between underlying ratings in Eq. (1), it is natural to let the increments and decrements described above depend only on the difference between the current estimates, regardless of where on the scale these differences occur. This scale homogeneity condition leads to the additive representation discussed next. To formalize this idea, let R_n(a) be an estimator and r_n(a) a particular estimate of the rating of object a on trial n, for all a ∈ A and n = 0, 1, .... In the simplest case, let A = {a, b} and suppose for some trial n we have estimates r_n(a) and r_n(b). Then the sequential estimation scheme can be defined by

    R_{n+1}(a) = r_n(a) + W(x_n)    if T_n(a, b) = 1
                 r_n(a)             if T_n(a, b) = 0        (10)
                 r_n(a) − L(x_n)    if T_n(a, b) = −1,

where W and L are non-negative, real-valued functions of x_n = r_n(a) − r_n(b). The analogous equation for b is
    R_{n+1}(b) = r_n(b) − L(−x_n)    if T_n(a, b) = 1
                 r_n(b)              if T_n(a, b) = 0        (11)
                 r_n(b) + W(−x_n)    if T_n(a, b) = −1.
We can think of W as a reward function for being chosen and L as a loss function for not being chosen. For arbitrary choice sets, Eq. (10) can be generalized naturally as follows for all a ∈ A:

    R_{n+1}(a) = r_n(a) + Σ_{b∈A, T_n(a,b)=1} W[r_n(a) − r_n(b)] − Σ_{b∈A, T_n(a,b)=−1} L[r_n(a) − r_n(b)].   (12)
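As an illustration (the helper `revise` and the particular W below are our own, not part of the formal development), the revision rule of Eqs. (10)-(12) can be coded directly once a reward function W and a loss function L are supplied:

```python
def revise(ratings, results, W, L):
    """One trial of the sequential scheme, Eq. (12).

    ratings : dict mapping object -> current estimate r_n(.)
    results : dict mapping ordered pair (a, b) -> +1 if a beat b on this trial
              (pairs absent from `results` had no choice experiment)
    W, L    : reward and loss functions of x = r_n(a) - r_n(b)
    """
    new = dict(ratings)
    for (a, b), t in results.items():
        if t != 1:
            continue
        x = ratings[a] - ratings[b]
        new[a] += W(x)    # winner's increment, Eq. (10)
        new[b] -= L(-x)   # loser's decrement, Eq. (11)
    return new

# toy example with a zero-sum pair W(x) = L(-x) (Axiom 1 below holds by construction)
W = lambda x: max(0.0, 1.0 - x)   # illustrative only: non-negative and decreasing
L = lambda x: W(-x)
r = {"a": 0.0, "b": 0.0}
r1 = revise(r, {("a", "b"): 1}, W, L)
assert r1["a"] > 0.0 > r1["b"]
# with a zero-sum pair, the total rating mass is conserved (cf. Eq. (13))
assert abs(sum(r1.values()) - sum(r.values())) < 1e-12
```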
To make progress, some restrictions are needed on W and L. Fortunately there are several natural conditions on W and L that can be formulated. They are captured in several axioms, which are stated after the idea of a reward system is formalized.

DEFINITION 5. ℛ = ⟨F, W, L⟩ is a reward system in case

(i) F is a distribution function subject to F(x) = 1 − F(−x), and F is strictly increasing on its support.

(ii) W and L are non-negative, continuous, real-valued functions on the reals. W is monotonically decreasing and L is monotonically increasing on the support of F.

For fixed F, Eq. (1) shows that the ratings on each trial n in Definition 4 are unique up to an additive constant. For this reason, it is useful to conserve the sum of all the estimates on any trial, i.e.,
    Σ_{a∈A} r_n(a) = Σ_{a∈A} r_0(a),   (13)

for n = 1, 2, .... The easiest way to achieve Eq. (13) is to require that the increment to the rating of a chosen object in Eq. (10) be compensated by an equal decrement to the unchosen object's rating. This idea is one of the cornerstones of the Elo system used to rate chess ability and is captured in the next axiom.

Axiom 1 (Zero Sum). A reward system ℛ = ⟨F, W, L⟩ is zero sum in case W(x) = L(−x), for all x ∈ Re.
Axiom 1 implies that L is completely determined from knowledge of W. Further, it is easy to show that if Eq. (12) holds along with Axiom 1, then Eq. (13) follows. If the current rating estimates are accurate and the underlying ratings do not change, then it is reasonable to require that the expected change in rating be zero for any choice object. This is captured in the next axiom.

Axiom 2 (Fair Game). A reward system ℛ = ⟨F, W, L⟩ is a fair game in case

    W(x) F(x) = L(x) F(−x),   (14)

for all x ∈ Re. It is easy to show that Axioms 1 and 2 imply that for any a ∈ A the expected value of R_{n+1}(a) is r_n(a) under the conditions that the r_n(b) are accurate for each b ∈ A, that is, if μ_n(b) = r_n(b). In this case, the choice probabilities as well as the reward and loss functions are all functions of the difference in the rating estimates, namely, x_n = r_n(a) − r_n(b). Thus if T_n(a, b) ≠ 0, we can write

    E[R_{n+1}(a) | μ_n(a) = r_n(a), μ_n(b) = r_n(b)]
        = r_n(a) + F(x_n) W(x_n) − F(−x_n) L(x_n)
        = r_n(a).

Axioms 1 and 2 greatly restrict the reward system. In fact, if ℛ = ⟨F, W, L⟩ satisfies Axioms 1 and 2, it is easy to show that for all x in the support of F,
    W(−x) = W(x) F(x)/F(−x).   (15)
Equation (15) is significant because it shows that if F is fixed and W is fixed for x ≥ 0, then W is completely determined for x < 0. What is desirable is some additional, rational restriction on W that allows W to be completely determined from F up to a scaling factor W(0) = C, where C > 0 controls the variability of the system, or the weight placed on a single choice outcome. There are several ways to further specify W. One way, captured in Axiom 3, is to require that all choice experiments involve the same "value."

Axiom 3 (Constant Value). There is C > 0 such that, for all x ∈ Re,

    W(x) + W(−x) = 2C.   (16)
One interpretation of Axiom 3 is that when a choice experiment between two objects, a and b, with true rating difference x = μ(a) − μ(b) is conducted, object a stands to lose W(−x) rating points and object b stands to lose W(x) points. Axiom 3 then assures that the sum of the potential losses is a constant, where C = W(0). A useful heuristic way to look at Eq. (16) is that player a "antes" W(−x) and player b antes W(x) into a pot of size 2C in a winner-take-all situation. A second way to further simplify W in Eq. (15) is to equate the variance of the points risked over choice pairs. Since the expected shift in rating points is zero from Axiom 2, it is easy to see that Eq. (17) directly expresses the concept of constant variance.

Axiom 4 (Constant Variance). Let W(0) = C; then for all x ∈ Re,

    F(x) W²(x) + F(−x) W²(−x) = C².   (17)
Axioms 3 and 4 are somewhat arbitrary and, in fact, there are probably other attractive ways to specify the reward system up to a constant C = W(0). Nevertheless, both axioms, when coupled separately with Axioms 1 and 2, have interesting implications, stated next.

OBSERVATION 1. Suppose a reward system ⟨F, W, L⟩ satisfies Axioms 1 and 2, and let W(0) = C > 0. Then

(i) If the system satisfies Axiom 3 also, then for all x ∈ Re,

    F(x) = W(−x)/2C   (18)

and

    W(x) = 2CF(−x).   (19)

(ii) If the system satisfies Axiom 4 also, then for all x in the support of F,

    W(x) W(−x) = C²,   (20)

    F(x) = C²/[C² + W²(x)],   (21)

and

    W(x) = C[F(−x)/F(x)]^{1/2}.   (22)
The proof of Observation 1 is an elementary consequence of plugging Eqs. (16) and (17), respectively, into Eq. (15). Equation (20) is interesting in that it contrasts with Eq. (16) by preserving as constant the product of W(x) and W(−x) rather than the sum of W(x) and W(−x). Both Eqs. (19) and (22) provide rational solutions to the specification of a reward system. It is easy to obtain explicit solutions of the reward functions for the three systems described earlier. For example, if the logistic system in Eq. (3) is assumed, then the reward function for constant value is given by

    W(x) = 2C[1 + e^{x/2B}]⁻¹,   (23)

and for constant variance, we get

    W(x) = Ce^{−x/4B}.   (24)
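The two logistic reward functions can be checked numerically against the axioms that generated them. In this sketch (the B and C values are arbitrary choices of ours), Eq. (23) is verified to satisfy the constant-value condition (16), and Eq. (24) the constant-variance condition (17):

```python
import math

B, C = 50.0, 8.0  # arbitrary positive constants

def F(x):
    # zero-mean logistic distribution, Eq. (3)
    return 1.0 / (1.0 + math.exp(-x / (2.0 * B)))

def W_value(x):
    # constant-value reward function, Eq. (23); note W_value(x) = 2C F(-x)
    return 2.0 * C / (1.0 + math.exp(x / (2.0 * B)))

def W_var(x):
    # constant-variance reward function, Eq. (24)
    return C * math.exp(-x / (4.0 * B))

for x in (-120.0, -5.0, 0.0, 3.7, 90.0):
    # Axiom 3, Eq. (16): the potential losses sum to 2C
    assert abs(W_value(x) + W_value(-x) - 2.0 * C) < 1e-9
    # Axiom 4, Eq. (17): F(x) W^2(x) + F(-x) W^2(-x) = C^2
    lhs = F(x) * W_var(x) ** 2 + F(-x) * W_var(-x) ** 2
    assert abs(lhs - C * C) < 1e-9
```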
On the other hand, the uniform system in Eq. (4) yields, for constant value,

    W(x) = 0              if x ≥ B
           C(B − x)/B     if −B < x < B      (25)
           2C             if x ≤ −B,
and, for constant variance,

    W(x) = 0                          if x ≥ B
           C[(B − x)/(B + x)]^{1/2}   if −B < x < B      (26)
           undefined                  if x ≤ −B.
The system based on constant value has some advantages for sequential estimation because 0 < W(x) < 2C. The other system, leading to Eqs. (24) and (26), does not bound W(x), so the system is quite volatile and not robust to "accidents" in choice data. So we focus on the case of constant value expressed in Eq. (19) for the remainder of the paper. Once an F is selected, the reward system revealed in Eqs. (10), (12), and (19) provides a tracking scheme or sequential estimation scheme for estimating rating differences. In this paper we restrict ourselves to only a few main results; and, as we will see, there are many facets of the tracking scheme that remain to be developed.
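Combining Eqs. (10), (12), and (19) gives a complete, if minimal, constant-value tracker for an arbitrary choice set. The sketch below (logistic F; the parameter values and the helper name `update` are our own) also conserves the rating sum, as Eq. (13) requires:

```python
import math

B, C = 50.0, 8.0  # arbitrary scaling factor and weight

def F(x):
    # zero-mean logistic distribution, Eq. (3)
    return 1.0 / (1.0 + math.exp(-x / (2.0 * B)))

def W(x):
    # constant-value reward, Eq. (19)
    return 2.0 * C * F(-x)

def update(r, winner, loser):
    """Revise estimates after one choice experiment, Eqs. (10)-(11),
    using the zero-sum loss L(-x) = W(x) of Axiom 1."""
    x = r[winner] - r[loser]
    delta = W(x)
    r[winner] += delta
    r[loser] -= delta
    return r

r = {"a": 0.0, "b": 0.0, "c": 0.0}
total0 = sum(r.values())
update(r, "a", "b")   # a beats b
update(r, "a", "c")   # a beats c
update(r, "b", "c")   # b beats c
assert r["a"] > r["b"] > r["c"]               # a won twice, c lost twice
assert abs(sum(r.values()) - total0) < 1e-9   # Eq. (13): sum conserved
```

Note how the second win by a earns fewer points than the first: after the first update, a is already rated above c, so W evaluates at a positive difference and pays less, exactly the behavior the monotonicity condition on W in Definition 5 is meant to produce.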
4. ESTIMATION THEORY FOR TWO OBJECTS
First, we analyze in detail the simplest case where choices are repeatedly made between two objects, a and b. In the realm of chess ratings, this case would correspond to a match (excluding draws) between two players. Let n = 0, 1, 2, ... index trials where choices are made. In this situation, it is only possible to estimate the difference in underlying (true) ratings on each trial given by y_n = μ_n(a) − μ_n(b), because the origin is arbitrary. Let x_0 be an initial estimate of the true rating difference y_0, and let X_n be random variables, with values x_n, that are estimators of the y_n, for n = 0, 1, 2, .... Then from Eqs. (10), (11), and (19), we can obtain the recursion

    X_{n+1} = X_n + 4C[1 − F(X_n)]    if T_n(a, b) = 1
              X_n − 4CF(X_n)          if T_n(a, b) = −1.      (27)

Let us define the 1-0 random variables V_n = (T_n(a, b) + 1)/2, so Pr(V_n = 1) = F(y_n). Then Eq. (27) can be rewritten as

    X_{n+1} = X_n − 4CF(X_n) + 4CV_n.   (28)
Equation (28) provides a stochastic difference equation relating the series of sequential estimators of the y_n. It shows that the new estimator, X_{n+1}, is made up of a deterministic function of the previous estimator, namely X_n − 4CF(X_n), and a stochastic component, namely 4CV_n. From Eq. (28) it is easy to see that for any given initial estimate x_0, the sequence of estimators (X_n)_{n=1}^∞ constitutes a discrete
trial, inhomogeneous Markov process (e.g., see Parzen, 1962, Chap. 6) with transitions on any trial possible to only two states given by the two limbs of Eq. (27). Without further assumptions about the y_n, little can be said about the behavior of Eq. (28). However, two useful identities are easily obtained:

    E(X_{n+1} | X_n = x) = x − 4C[F(x) − F(y_n)]   (29)

and

    Var(X_{n+1} | X_n = x) = 16C² F(y_n) F(−y_n).   (30)

Equation (29) is interesting because in the special case where x = y_n, the conditional expectation remains at x. This property directly reveals the effect of the fair game axiom (Axiom 2). In the general case, it is not possible to obtain analytic results for Eq. (28); however, several methods can be used to obtain approximate results, namely approximation by the uniform model, diffusion approximations, and Monte Carlo simulation. Because the two-object problem underlies all the other estimation problems that we examine, each of these methods is discussed in some detail in the next few subsections.

The Uniform Model

In applications that involve pairs of objects that are not too different in ratings, the uniform system in Eq. (4) provides reasonably accurate ratings. In fact, the Elo (1978) chess rating system was based on a system like the uniform system until the United States Chess Federation obtained a computer and shifted to a logistic formulation. In this case, considerable practical experience can be cited to support the accuracy of estimated ratings based on the linear approximation to the Gaussian and logistic distributions in the range of a standard deviation or so around the mean. When the middle limb of the uniform system in Eq. (4) is substituted into Eq. (27), the result is the stochastic difference equation

    X_{n+1} = X_n(1 − θ) + θB T_n(a, b),   (31)
where θ = 2C/B and C is chosen so that 0 < θ < 1. It is possible to make analytic progress with Eq. (31), as the next observation shows.

OBSERVATION 2. Assume the uniform system in Eq. (4), and assume −B ≤ y_n ≤ B, n = 0, 1, .... Then

    E(X_n | X_0 = x_0) = (1 − θ)^n x_0 + θ Σ_{j=1}^{n} (1 − θ)^{j−1} y_{n−j}   (32)

and

    Var(X_n | X_0 = x_0) = θ² Σ_{j=1}^{n} (1 − θ)^{2(j−1)} (B² − y²_{n−j}).   (33)
Proof. To obtain Eq. (32), compute E(X_{n+1} | X_n = x, X_0 = x_0) from Eq. (31) and use Eq. (4) in computing F(y_n). Then after expectations are taken with respect to X_n, the result is the difference equation

    E(X_{n+1} | X_0 = x_0) = (1 − θ) E(X_n | X_0 = x_0) + θ y_n,

and this equation yields Eq. (32). One can obtain Eq. (33) in a similar fashion by working with Var(X_{n+1} | X_0 = x_0). The result is the difference equation

    Var(X_{n+1} | X_0 = x_0) = (1 − θ)² Var(X_n | X_0 = x_0) + θ²(B² − y_n²),

which is easily seen to yield Eq. (33).  ∎
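Observation 2 can be spot-checked by Monte Carlo simulation of Eq. (31). The following sketch (our own arbitrary θ, B, and schedule y_n, chosen so that |y_n| ≤ B) compares the sample mean of X_n over many replications with the analytic mean in Eq. (32):

```python
import random

random.seed(1)
B, theta = 100.0, 0.2
x0 = 0.0
y = [20.0 + 0.5 * n for n in range(60)]   # arbitrary changing schedule, |y_n| <= B

def mean_formula(n):
    # Eq. (32): (1 - theta)^n x0 + theta * sum_{j=1}^{n} (1 - theta)^(j-1) y_{n-j}
    return (1 - theta) ** n * x0 + theta * sum(
        (1 - theta) ** (j - 1) * y[n - j] for j in range(1, n + 1))

def simulate(n_trials):
    x = x0
    for n in range(n_trials):
        p_win = (y[n] + B) / (2.0 * B)        # uniform F evaluated at y_n
        t = 1 if random.random() < p_win else -1
        x = x * (1 - theta) + theta * B * t   # Eq. (31)
    return x

n, reps = 40, 20000
avg = sum(simulate(n) for _ in range(reps)) / reps
# the sample mean should agree with the analytic mean of Eq. (32)
assert abs(avg - mean_formula(n)) < 2.0
```

The tolerance is loose because a single X_n has a standard deviation on the order of Eq. (33); averaging over 20,000 replications brings the standard error of the sample mean well inside it.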
In the case where the underlying rating differences y_n are not constant, the sequential rating scheme in Eq. (28) tends to track the y_n. In the case of the uniform system, the mean tracking error or lag e_n, assuming −B ≤ y_n ≤ B, is given from Eq. (32) by

    e_n = E(X_n | X_0 = x_0) − y_n
        = (1 − θ)^n x_0 + θ Σ_{j=1}^{n} (1 − θ)^{j−1} y_{n−j} − y_n.   (34)
To illustrate Eqs. (32) and (34), suppose the y_n undergo slow, positive linear growth given by

    y_n = an,   (35)

where n = 0, 1, ..., and a > 0. Suppose x_0 = 0, so that the initial estimate is accurate; then Eq. (32) yields

    E(X_n | X_0 = 0) = an − (a/θ)[1 − (1 − θ)^n].

Thus, in the range 0 ≤ an ≤ B, the mean tracking error is

    e_n = −(a/θ)[1 − (1 − θ)^n],   (36)

whose magnitude approaches the constant a/θ = aB/2C as n increases. The |e_n| in Eq. (36) decrease as C increases; however, on the early trials, the variance in Eq. (33) is easily seen to increase as C increases. Thus higher accuracy in the mean is achieved at the expense of higher variance in the case of slow linear growth, and this situation holds for many other situations of changing y_n. A more psychologically realistic model for change is when the y_n follow a positive growth curve such as

    y_n = y_∞ − (y_∞ − y_0) a^n,   (37)
where 0 < a < 1, and y_0 and y_∞ are the initial and asymptotic values of the rating difference. If Eq. (37) is plugged into Eq. (32), the result is

    E(X_n | X_0 = x_0) = y_∞ − (y_∞ − x_0)(1 − θ)^n − θ(y_∞ − y_0)[a^n − (1 − θ)^n]/(a − 1 + θ),   (38)

and if a + θ = 1,

    E(X_n | X_0 = x_0) = y_∞ − (y_∞ − x_0)(1 − θ)^n − (y_∞ − y_0) θn(1 − θ)^{n−1}.   (39)
It is easily seen from Eqs. (37), (38), and (39) that the mean tracking error goes to zero as n increases. Figure 1 plots Eq. (37) for the case where a = 0.95, y_0 = 0, and y_∞ = 100. The open circles in Fig. 1 plot Eq. (38) for x_0 = 100, B = 350, and C ∈ {4, 8, 16, 32}. To reveal the effect of changing the value of C, we choose x_0 = 100 to be a large initial
[FIG. 1. Open circles plot the expected value of X_n (given in Eq. (38)) against a simple growth curve (solid line) for different values of C; panels show C = 4, 8, 16, and 32, with rating difference plotted against trials.]
misrating. It is clear from Fig. 1 that as C increases, both the speed of convergence of E(X_n) to y_n and the impact of initial trials increase. In the case where the y_n are constant at some value y, Eqs. (32) and (33) yield

    E(X_n | X_0 = x_0) = y − (y − x_0)(1 − θ)^n   (40)

and

    Var(X_n | X_0 = x_0) = θ²(B² − y²)[1 − (1 − θ)^{2n}]/[1 − (1 − θ)²].   (41)
Equations (40) and (41) appear very similar to analogous equations in the Bush-Mosteller (1955) linear operator model for two-choice learning, and the next observation provides a direct connection.

OBSERVATION 3. Let (X_n)_{n=1}^∞ be governed by the conditions leading to Eq. (31), and define

    P_n = F(X_n) = (B + X_n)/2B.   (42)

Pick 0 < C < B/2, let θ = 2C/B, and assume −B ≤ y_n ≤ B. Then (P_n)_{n=1}^∞ is governed by the Bush-Mosteller linear operator model for two-choice learning given by

    P_{n+1} = (1 − θ)P_n + θ    if T_n(a, b) = 1
              (1 − θ)P_n        if T_n(a, b) = −1,      (43)

where Pr[T_n(a, b) = 1] = π_n.
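The correspondence in Observation 3 can be verified mechanically: iterating the rating recursion (31) and the operator (43) side by side, the two processes remain linked by Eq. (42) on every trial. A sketch with arbitrary parameter choices of our own:

```python
import random

random.seed(7)
B, C = 100.0, 10.0            # arbitrary, with 0 < C < B/2
theta = 2.0 * C / B
y = 30.0                      # constant true difference, |y| <= B
p_true = (y + B) / (2.0 * B)  # Pr[T_n(a, b) = 1] under the uniform system

x = 50.0                      # rating-difference estimate X_n
p = (B + x) / (2.0 * B)       # P_n = F(X_n), Eq. (42)
for _ in range(200):
    t = 1 if random.random() < p_true else -1
    x = x * (1 - theta) + theta * B * t               # Eq. (31)
    p = (1 - theta) * p + (theta if t == 1 else 0.0)  # Eq. (43)
    # the two recursions stay linked by Eq. (42) on every trial
    assert abs(p - (B + x) / (2.0 * B)) < 1e-9
```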
Equation (43) is easily obtained by splitting Eq. (31) into two limbs, adding and subtracting (1 − θ)B, and applying Eq. (4). In case y_n = y, the linear operator model for probability learning results, where π = (y + B)/2B. These results provide an interesting interpretation of the linear operator model; namely, F⁻¹(P_n) can be interpreted as a rating difference between the two choice objects. Since the first order (linearization) of a reward system is the uniform system for some choice of B, and since the linear operator model has been investigated in detail (e.g., Norman, 1972), we can use these results to determine a great deal about the approximate behavior of reward systems under conditions of fixed ratings.

Diffusion Approximation
Let θ = 4C and from Eq. (28) let

    ΔX_n^θ = X_{n+1}^θ − X_n^θ = θ[V_n − F(X_n^θ)],   (44)

where, by an abuse of notation, we omit parentheses around the θ in superscript positions. Then it is easy to see that

    X_{n+1}^θ = Σ_{j=0}^{n} ΔX_j^θ + x_0,   (45)
where we assume Pr(X_0 = x_0) = 1. Equation (45) shows that X_{n+1}^θ is a sum of random variables; however, the ΔX_j^θ are neither independent nor identically distributed. Nevertheless, as θ goes to zero, the step size ΔX_n^θ goes to zero; so the methods for slow learning by small steps discussed in Norman (1972) can be applied to Eq. (28). Norman's methods show that under certain conditions diffusion approximations for a suitably normalized form of X_n^θ yield approximate Gaussian distributions for both transient and asymptotic behavior of the Markov process ⟨X_n^θ⟩. In the case that F(x) has two bounded derivatives, the conditions leading to Theorem 1.1 in Norman (1972, Chap. 8), except the requirement of a bounded state space, are satisfied. We leave it as an open problem to obtain small step results for the transient and asymptotic cases for various choices of F when the y_n are not constant. In case the y_n are constant, the Markov process defined by Eq. (27) is homogeneous. In this case the steady-state results of Norman (1972, Chap. 10) apply as long as F(x) has two bounded derivatives. The result is presented in Observation 4.

OBSERVATION 4. Suppose y_n = y and F has two bounded derivatives. Then

    Z_n^θ = (X_n^θ − y)/√θ   (46)

is asymptotically Gaussian distributed with mean zero and variance

    V = F(y)[1 − F(y)]/[2f(y)]   (47)

as θ → 0.

Proof. The conditions of Theorem 1.1, Part (ii) (Norman, 1972, Chap. 10) are satisfied easily for θ = 4C, where

    w(x) = F(y) − F(x)

and

    s(y) = F(y)[1 − F(y)].

In particular, y is the desired interior point that satisfies the conditions

    w(x) > 0   if x < y,
    w(x) = 0   if x = y,
    w(x) < 0   if x > y,

from Norman (1972, p. 152). The result follows from the conclusions of the theorem by computing

    V = s(y)/[2 |w′(y)|] = F(y)[1 − F(y)]/[2f(y)].  ∎
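For a concrete sense of Observation 4, one can simulate the homogeneous chain of Eq. (28) at a modest θ and compare the late-trial sample variance of X_n with the diffusion value θV from Eq. (47). A sketch (logistic F; all parameter values are our own, and for the logistic it happens that V works out to exactly B):

```python
import math
import random

random.seed(3)
B, C = 55.7, 2.0
theta = 4.0 * C
y = 100.0  # constant true rating difference

def F(x):
    # zero-mean logistic distribution, Eq. (3)
    return 1.0 / (1.0 + math.exp(-x / (2.0 * B)))

def f(x):
    # density corresponding to the logistic F
    return F(x) * (1.0 - F(x)) / (2.0 * B)

# diffusion variance of X_n: theta * V, with V from Eq. (47)
V = F(y) * (1.0 - F(y)) / (2.0 * f(y))   # equals B exactly for the logistic
target = theta * V

# simulate Eq. (28) starting at x0 = y and record late-trial values
p_win = F(y)
samples = []
for _ in range(2000):
    x = y
    for n in range(400):
        v = 1 if random.random() < p_win else 0
        x = x - theta * F(x) + theta * v   # Eq. (28)
        if n >= 300:
            samples.append(x)
m = sum(samples) / len(samples)
var = sum((s - m) ** 2 for s in samples) / len(samples)
# the sample variance should be of the same order as the diffusion value
assert 0.5 * target < var < 2.0 * target
```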
Observation 4 gives us a diffusion approximation for the steady-state behavior of the rating system when no change in underlying ratings occurs. For small values of θ = 4C, Z_n^θ/√V is approximately Gaussian with mean zero and variance one, so one can obtain an approximate confidence interval for y. In general, diffusion approximations to the behavior of Eq. (27) are based on the idea that the fluctuations of the random variable X_n are small in comparison with its mean. Kushner (1984) has extended this type of approximation theory to the general case of non-linear, discrete time systems driven by wideband noise, which model various types of communication and control systems.

The Case of Constant y_n

It is important for a tracking system that it behave well under the conditions of constant y_n at some (unknown) value y_n = y in the support of F. In this case, the V_n in Eq. (28) are independent and identically distributed as a 1-0 random variable with Pr(V_n = 1) = F(y), so the corresponding Markov process is homogeneous. Because of Axiom 2, the fair game axiom, we have that y is a unique point such that

    E(X_{n+1} | X_n = y) = y;   (48)

however, unfortunately it is not in general the case that lim_{n→∞} E(X_n) = y. That is, the long-run distribution of X_n will exist; however, the distribution does not necessarily have expectation y. In such cases, the sequential estimation system has a bias,

    b(y, C) = lim_{n→∞} E(X_n) − y,   (49)

that depends on the underlying rating difference y and the weighting coefficient C. It is obvious from the symmetry of F that the bias in Eq. (49) satisfies b(y, C) = −b(−y, C), and our experience with Monte Carlo simulations has shown that, for y > 0, b(y, C) tends to be positive and to increase with both y and C. Observation 5 provides a second-order approximation to b(y, C) that rationalizes our experience.
OBSERVATION 5. Assume Eq. (28) with y_n = y, and assume F has two bounded derivatives at y. Then a diffusion approximation to b(y, C) is given by

b(y, C) ≈ −f(y)/f′(y) − √{[f(y)/f′(y)]² − θV}.  (50)
To see this, let θ = 4C, ℓ = lim E(X_n), and V_∞ = lim E(X_n − ℓ)². Next expand F in Eq. (29) quadratically about y, yielding

F(x) = F(y) + f(y)(x − y) + f′(y)(x − y)²/2.  (51)

Next, plug Eq. (51) into Eq. (29) and take expectations (with respect to X_n). The result is

E(X_{n+1}) = E(X_n)[1 − (θ/2) f(y)] + (θ/2) y f(y) − (θ/4) f′(y) E(X_n − y)².  (52)

Next write

E(X_n − y)² = E(X_n − ℓ)² + (ℓ − y)² + 2(ℓ − y) E(X_n − ℓ).  (53)
When Eq. (53) is plugged into Eq. (52) and limits are taken, a quadratic equation in b(y, C) is obtained as follows:

b(y, C)² + [2f(y)/f′(y)] b(y, C) + V_∞ = 0.  (54)

Equation (50) results as the solution to Eq. (54), where V_∞ is replaced by its diffusion approximation in Eq. (47), namely θV. From Eq. (50) it is easy to see that if f is unimodal, b(y, C) is positive for y > 0, since f′(y) < 0, and further, for C not too big, b(y, C) grows with increasing C. Finally, in Eq. (49), since b(y, C) = −b(−y, C) for all y > 0, and b is continuous in y, b(0, C) = 0.

We compared the approximation in Eq. (50) to Monte Carlo simulations using the logistic model of Eq. (3) with B = 55.7, y = 100, 200, and C in {4, 8, 16, 32}. Simulated matches of 200 trials were replicated 2500 times, where each run started at the value x_0 = y. Table 1 reports the observed mean values of the variance and bias over the last 50 trials and compares them to the approximations in Eqs. (50) and (47). Table 1 shows that the approximation to the bias is fairly accurate.

The noticeable bias in Table 1 in some of the cases just described does not pose a serious limit on the use of the sequential estimation scheme in actual paired-comparison scaling. First, the bias is not large relative to the standard deviation for small C in usual cases like the logistic or Thurstone systems. For example, in the simulations, the standard deviation of the logistic with B = 55.7 is σ = 202.06, and even in the case where C = 32 (and y = 100) the bias is only 7% of this. Second, in most applications, each choice object is compared with a variety of other choice objects rather than a single, fixed object as analyzed in this section. Because the bias is symmetric about zero, it will tend to cancel if efforts are made to conduct choice experiments with objects whose ratings are near to and on both sides of the to-be-rated object.
It is interesting to note in Table 1 that the diffusion approximation to V_∞ is quite accurate in the range of C values considered for the logistic model.
TABLE 1

Observed Mean Values of Variance and Bias from Simulated Data and Comparisons to Approximations in Equations (47) and (50)

  y    C    V(data)   V(diffusion)   b(data)   b(approx)
 100    4    898.66      891.26        1.24       1.69
 100    8   1897.72     1782.53        3.05       3.39
 100   16   3925.31     3565.06        7.69       6.83
 100   32   8642.70     7130.11       14.57      13.84
 200    4    881.94      891.20        2.16       2.89
 200    8   1810.56     1782.40        6.25       5.83
 200   16   3672.91     3564.80       11.05      11.90
 200   32   7664.70     7129.60       24.30      24.88

Note. In the table, y stands for the underlying rating difference, C is the change coefficient, V stands for V in Eq. (47), and b stands for bias in Eq. (49). Entries in data columns are obtained from Monte Carlo simulation, and entries in V(diffusion) and b(approx) are obtained from Eqs. (47) and (50), respectively.
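The V(diffusion) and b(approx) columns of Table 1 can be reproduced directly from Eqs. (47) and (50). The sketch below is ours, not the authors' program; it assumes the logistic model of Eq. (3) takes the form F(x) = 1/(1 + exp(−x/(2B))) with B = 55.704, the parameterization consistent with the tabled values, and the function names are our own.

```python
import math

B = 55.704            # logistic scale parameter used in the simulations

def F(x):
    # assumed form of the logistic model in Eq. (3)
    return 1.0 / (1.0 + math.exp(-x / (2.0 * B)))

def f(x):
    # density of F
    p = F(x)
    return p * (1.0 - p) / (2.0 * B)

def fprime(x):
    # derivative of the density
    return f(x) * (1.0 - 2.0 * F(x)) / (2.0 * B)

def V_diffusion(y, C):
    # Eq. (47): steady-state variance approximation theta*V, with theta = 4C
    theta = 4.0 * C
    V = F(y) * (1.0 - F(y)) / (2.0 * f(y))
    return theta * V

def b_approx(y, C):
    # Eq. (50): smaller root of the quadratic in Eq. (54), V_inf replaced by theta*V
    m = -f(y) / fprime(y)          # f'(y) < 0 for y > 0, so m > 0
    return m - math.sqrt(m * m - V_diffusion(y, C))

for y in (100, 200):
    for C in (4, 8, 16, 32):
        print(y, C, round(V_diffusion(y, C), 2), round(b_approx(y, C), 2))
```

Under these assumptions the y = 100, C = 4 row comes out to 891.26 and 1.69, matching the V(diffusion) and b(approx) entries in the table.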
FIG. 2. Open circles plot the empirical sample mean of X_n (with F given by the logistic distribution) in simulated trials against a simple growth curve (solid line) for different values of C.
Monte Carlo Simulation
If it is necessary to obtain accuracy beyond that provided by linear and diffusion approximation methods, Monte Carlo simulations of the process in Eq. (27) are easy to perform on a personal computer. All one needs is a way to compute random observations from F(x), and then a number of simulated matches can be obtained by using Eq. (28). To illustrate, Fig. 2 provides a simulation for the logistic model in Eq. (3) assuming the y_n satisfy the growth model in Eq. (37). The conditions for the simulation were chosen to be as close as possible to the uniform system results leading to Fig. 1. For example, B = 55.704 in Eq. (3) was picked to equate the variances in both models. Figure 2 presents the results of 2500 simulations of a series of 150 trials. The empirical sample mean of X_n is plotted separately with open circles for the same values of C used in Fig. 1. Figure 2 shows again that larger values of C increase the rate of convergence of E(X_n) to y_n. However, Fig. 2 also reveals that the choice of C influences the variability of X_n about y_n. Furthermore, for C = 32 it is evident in Fig. 2 that the asymptotic value of E(X_n) is not y_n, as is obtained for the uniform model in Eq. (40). Instead, a positive bias consistent with those shown in Table 1 is obtained.
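A minimal version of such a simulation is sketched below for the constant-y_n case of Eq. (28). The logistic form, parameter values, and replication counts are our assumptions for illustration, not the exact program behind Fig. 2.

```python
import random
import math

def F(x, B=55.704):
    # assumed logistic choice function of Eq. (3)
    return 1.0 / (1.0 + math.exp(-x / (2.0 * B)))

def simulate_match(y, C, trials=200):
    # One simulated match via Eq. (28): X_{n+1} = X_n - 2C F(X_n) + 2C V_n,
    # with Pr(V_n = 1) = F(y) held constant (constant-y_n case), started at x_0 = y.
    x = y
    path = []
    for _ in range(trials):
        v = 1 if random.random() < F(y) else 0
        x = x - 2.0 * C * F(x) + 2.0 * C * v
        path.append(x)
    return path

random.seed(1)
# average the last 50 trials over many replications, as in Table 1
runs = [simulate_match(y=100, C=8)[-50:] for _ in range(500)]
mean_x = sum(sum(r) for r in runs) / (50 * len(runs))
print(mean_x)   # close to y = 100, with a small positive bias as in Table 1
```

Raising the replication count toward the 2500 used in the paper tightens the estimates of the mean, variance, and bias.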
5. ESTIMATION THEORY FOR ONE OBJECT
In this case, we are concerned with sequential estimation of a single object when it is paired with different objects over a series of trials. In the chess setting, this is the usual case, where an individual's estimated rating depends on a series of results with different opponents. Also, from a statistical viewpoint, it provides the marginal process of the many-object situation discussed in the next section. Further, as we will see, the one-object situation is formally equivalent to some approaches in item response theory (e.g., Hambleton & Swaminathan, 1985) and to sequential methods in psychophysical threshold determination (e.g., Watson & Pelli, 1983).

To formulate the problem, select a single object, say a ∈ A. Let p_n(a) = p_n denote the underlying rating of object a on trial n, and let R_n be an estimator and r_n an estimate of p_n. Also let ξ_n be the underlying rating of the object paired with a on trial n, and p_n − ξ_n = y_n be the underlying rating difference on trial n. In this case, a recursion analogous to Eq. (28) for two objects is easily obtained from Eqs. (10) and (19). Let V_n be a 1-0 random variable with Pr(V_n = 1) = F(y_n), namely, the probability that a is chosen on trial n. Then

R_{n+1}(a) = R_n(a) − 2C F(R_n − ξ_n) + 2C V_n.  (55)
Equation (55) is more complex than Eq. (28) because ξ_n, and hence y_n, can change on each trial whether or not p_n is constant. We can obtain equations analogous to Eqs. (29) and (30) by conditionalizing on the values of R_n, ξ_n, and p_n. The results are easily derived from Eq. (55):
E(R_{n+1} | R_n = r, ξ_n = s, p_n = t) = r − 2C[F(r − s) − F(t − s)],  (56)

and

Var(R_{n+1} | R_n = r, ξ_n = s, p_n = t) = 4C² F(t − s) F(s − t).  (57)
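As a quick numerical check, the conditional mean and variance in Eqs. (56) and (57) can be compared with a direct simulation of one step of Eq. (55). The logistic F and the particular values of r, s, t, and C below are illustrative assumptions of ours.

```python
import random
import math

def F(x, B=55.704):
    # assumed logistic choice function
    return 1.0 / (1.0 + math.exp(-x / (2.0 * B)))

r, s, t, C = 0.0, 40.0, 100.0, 8.0   # current estimate, opponent rating, true rating
random.seed(2)

# simulate one step of Eq. (55): R' = r - 2C F(r - s) + 2C V, V ~ Bernoulli(F(t - s))
draws = []
for _ in range(200_000):
    v = 1 if random.random() < F(t - s) else 0
    draws.append(r - 2.0 * C * F(r - s) + 2.0 * C * v)

sim_mean = sum(draws) / len(draws)
sim_var = sum((d - sim_mean) ** 2 for d in draws) / len(draws)

mean_56 = r - 2.0 * C * (F(r - s) - F(t - s))        # Eq. (56)
var_57 = 4.0 * C ** 2 * F(t - s) * F(s - t)          # Eq. (57)
print(sim_mean, mean_56)   # agree to within Monte Carlo error
print(sim_var, var_57)
```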
Equation (57) shows that the conditional variance is independent of the current rating estimate, as was the case in Eq. (30). Equation (56) has a fixed point if F(r − s) = F(t − s), or r = t. Also, Eq. (56) shows that the conditional expected rating does depend on the current estimate as well as the true ratings of the two objects. To explore this dependence, suppose object a is misrated on some trial (r ≠ t), and it is desired to find the "optimal" object (opponent) to pair with a. In this case, optimality can be interpreted as the value of ξ_n that minimizes the difference between t and the mean of the revised estimate, namely, the conditional expectation of R_{n+1}. The result is provided in Observation 6.

OBSERVATION 6. Let object a have constant true rating p_n = t and on some trial n have rating estimate R_n = r < t. If f(x) is unimodal, then the opponent's rating ξ_n = s that minimizes

g(s) = t − E(R_{n+1} | R_n = r, ξ_n = s, p_n = t)

is given by s = (r + t)/2.

Proof. From Eq. (55),

g(s) = t − r + 2C[F(r − s) − F(t − s)],

and

g′(s) = 2C[f(t − s) − f(r − s)] = 0

yields a solution s = (r + t)/2. This is the only solution if r ≠ t and f(x) is unimodal. Since f is increasing for x < 0 and decreasing for x > 0, it is easy to see that g″[(t + r)/2] > 0, so s = (t + r)/2 minimizes g(s). ∎

Observation 6 shows that if an object's true rating is fixed, then for r < t, the optimal pairing for an object is given by the average of the current (under)estimate and the true rating. It is easy to see that the same result is given if r > t, so an optimal pairing is always an object whose underlying rating is an average of the current rating and the true rating. Of course, in the case of the uniform model in Eq. (4), which does not satisfy the unimodal requirement of Observation 6, g(s) is constant so long as t − s and r − s are both in the support of F, namely, (−B, B). Further results for the one-object case are easily obtained by the approximation and Monte Carlo methods discussed in Section 4.
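Observation 6 is easy to confirm numerically: with a unimodal f, a grid search over opponent ratings s recovers the midpoint (r + t)/2. The logistic f and the specific values of r, t, and C below are our assumed example.

```python
import math

def F(x, B=55.704):
    # assumed logistic choice function (its density f is unimodal)
    return 1.0 / (1.0 + math.exp(-x / (2.0 * B)))

def g(s, r=0.0, t=100.0, C=8.0):
    # g(s) = t - E(R_{n+1} | R_n = r, xi_n = s, p_n = t), from Eq. (56)
    return t - r + 2.0 * C * (F(r - s) - F(t - s))

# grid search over opponent ratings s for the minimizer of g
grid = [i * 0.5 for i in range(-200, 601)]   # s from -100 to 300 in steps of 0.5
best_s = min(grid, key=g)
print(best_s)   # (r + t)/2 = 50.0
```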
It is instructive to compare the one-object rating problem with psychophysical threshold determination. Psychophysical threshold methods assume there is a psychometric function p_T(S), characterized by threshold T, that gives the probability of a "yes" (or "detect") response to a (log) stimulus intensity variable S. As Watson and Pelli (1983) note, it is often the case that

p_T(S) = Ψ(S − T),

where Ψ has the properties of a distribution function, and T is taken to be the stimulus intensity that gives a particular value of p_T, such as the 75% point. Further, Watson and Pelli (1983) assume,

The psychometric function has the same shape under all conditions when expressed as a function of log intensity. From condition to condition, it differs only in position along the log intensity axis. This position is set by a parameter, T, the threshold, also expressed in units of log intensity (p. 113).

The situation described above is formally captured by the one-object rating problem by rescaling T to the 50% point. Then we can define the function

P_a(s) = F(s − T),  (58)

where T is the rating of object a (assumed constant), s is the rating of a possible opponent, and F is the cumulative distribution function of some MPCS. Thus, Eq. (58) can be interpreted as the probability that object a is not chosen over (loses to) an object (opponent) with rating s. Watson and Pelli (1983, p. 114) go on to assume that "the parameter T of the psychometric function does not vary from trial to trial." This is a standard assumption in threshold determination problems; however, in cases involving such things as learning or disease progression, there are reasons to assume that T may change, and thus our methods may apply.

The one-object problem also can be shown to be formally equivalent to some situations in item response theory, where the examinee corresponds to a particular object and the test items correspond to the other objects or opponents. For example, the item characteristic curves of the one-parameter normal ogive model and the one-parameter logistic model (see Hambleton & Swaminathan, 1985, Chap. 3) take the form of Eqs. (2) and (3), respectively, where x = θ − b_i, θ is the examinee's ability, and b_i is the difficulty of item i. As in the threshold determination problem, underlying ability usually is assumed constant, so our methods may provide a useful approach to item response theory when underlying ability is possibly changing, while item difficulty parameters do not change.

More specifically, the estimation methods provided here can be tied closely to certain adaptive estimation procedures used in psychophysical and mental testing (Lord, 1971; Owen, 1975) if a suitable scheme is used to select the rating ξ_n of the next object to be paired with object a in Eq. (55) as a function of previous ratings
and results (a method is considered adaptive if previous stimulus levels and associated responses are used to determine the current testing level). For example, simple up-and-down, or staircase, methods (see Wetherill & Glazebrook, 1986, for a review of adaptive quantal response methods) may be used to estimate the midpoint of a psychometric function. In a "yes/no" task one selects a set of evenly spaced stimulus levels. An initial stimulus of intensity s, say a tone of loudness s, is presented to the subject. A "yes" response (the subject hears the tone) is followed by the presentation of the next lower intensity level, and a stimulus of higher intensity follows a "no" response. This scheme guides the placement of stimulus levels, and it is a separate set of problems to determine a termination rule and a method of estimating the threshold from the data collected from the up-and-down procedure. Because the set of possible stimuli are evenly spaced, we may summarize the up-and-down procedure of trial placement as

S_{n+1} = S_n − c(Z_n − p),  (59)

where S_n is the stimulus level administered on trial n, p is the desired probability of a "yes" response, c gives the spacing of the stimulus intensities, and Z_n records the quantal response on trial n:

Z_n = 1 if "yes" to stimulus of level S_n,
Z_n = 0 if "no" to stimulus of level S_n.
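A small simulation of the up-and-down rule in Eq. (59) shows the stimulus levels settling around the 50% point of the psychometric function. The logistic psychometric function, step size, and starting level below are assumptions of ours for illustration.

```python
import random
import math

def p_yes(S, T=0.0, scale=10.0):
    # assumed logistic psychometric function with threshold T at the 50% point
    return 1.0 / (1.0 + math.exp(-(S - T) / scale))

random.seed(3)
c, p = 2.0, 0.5          # step size and target "yes" probability
S = 30.0                 # deliberately poor starting level
levels = []
for n in range(3000):
    Z = 1 if random.random() < p_yes(S) else 0   # quantal response on trial n
    S = S - c * (Z - p)                          # Eq. (59)
    levels.append(S)

estimate = sum(levels[-1000:]) / 1000.0
print(estimate)   # hovers near the threshold T = 0
```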
Equation (59) provides the link between staircase trial placement and the one-object rating problem. For instance, by letting p = 1/2, reversing signs in Eq. (55) (to parallel the nature of the threshold detection task), and letting c = 2C, we obtain

R_{n+1} = R_n − 2C[Z_n − F(S_n − R_n)].  (60)
The r_n in Eq. (60) may be interpreted as an estimate of the threshold on trial n. Then, if intensity level S_n = r_n is administered on trial n + 1, we see that Eq. (60) describes the up-and-down placement of levels given in Eq. (59). Equation (57) shows that the conditional variance of the threshold estimate is dependent on the spacing of the stimuli. Robbins and Monro (1951) addressed this problem by introducing the stochastic approximation method of obtaining threshold estimates,

S_{n+1} = S_n − c_n(Z_n − p),  (61)

where S_n is the stimulus level administered on trial n and the c_n are positive, decreasing constants chosen such that lim c_n = 0. In practice the decreasing sequence c_n is frequently chosen as c/n, with c > 0 arbitrary. Equations (60) and (61) are quite similar in form; however, they differ in that Eq. (60) specifies a model by selecting the function F, and it can track underlying changes in the threshold because c stays constant rather than converging to zero over trials. We think a productive area for extending our methods is to allow
the constant c in Eq. (60) to depend on trials. One open problem would be to allow the c_n to depend on past results, for example, to interrupt the convergence of c_n to zero when recent performance statistically suggests that the threshold is changing.

6. ESTIMATION THEORY FOR MANY OBJECTS
The system based on Axioms 1, 2, and 3 can be used to provide joint, sequential estimators for many objects. Let A = {a_i | i = 1, ..., M} be a set of M objects, and to simplify the discussion, suppose a complete paired-comparison experiment (round robin tournament) is conducted on A on each of a series of trials n = 0, 1, 2, ... . Finally, let p_{i,n} be the true rating of a_i on trial n, and R_{i,n} and r_{i,n} be estimators and estimates, respectively, of the p_{i,n}. It is straightforward to obtain sequential estimators for all M objects from Eq. (12), given by the system

R_{i,n+1} = R_{i,n} − 2C Σ_{j≠i} F(R_{i,n} − R_{j,n}) + 2C Σ_{j≠i} V_n(i,j),  (62)

where V_n(i,j) is a 1-0 random variable given by

V_n(i,j) = (T_n(i,j) + 1)/2,
and i = 1, 2, ..., M and n = 0, 1, 2, ... . The system in Eq. (62) is easy to implement on a computer (and even on some pocket calculators), and if initial estimates r_{i,0} and round robin results on each trial n (the V_n(i,j)) are provided, Eq. (62) provides simultaneous, updated estimates of all objects on each trial.

It is easy to show that the system in Eq. (62) "conserves" rating points in the sense that

Σ_{i=1}^M R_{i,n} = Σ_{i=1}^M r_{i,0}  (63)

for all n. To see this, note that for all i ≠ j,

F(R_{i,n} − R_{j,n}) + F(R_{j,n} − R_{i,n}) = 1 = V_n(i,j) + V_n(j,i),
so the second and third terms of the right-hand side of Eq. (12) cancel when summed over object indices. In fact, Eq. (63) reflects the zero-sum property entailed in Axiom 1, so even if the p_{i,n} are changing on various trajectories, the aggregate of all rating estimates is constant.

Equation (29), for x = y_n, reflected the fair game property in Axiom 2 for a two-object set, and it is easily generalized to the current case. Assume t_i is the underlying rating of a_i, assumed constant, and r_i is an estimate of t_i. Then the analogue to Eq. (29) is

E(R_{i,n+1} | R_{i,n} = r_i) = r_i − 2C Σ_{j≠i} [F(r_i − r_j) − F(t_i − t_j)],  (64)
i = 1, 2, ..., M. The fair game condition is reflected in Eq. (64) by letting r_i = t_i for all a_i in A. In this case the second term on the right-hand side of Eq. (64) is zero.

It is instructive to consider the uniform model of Eq. (4). When this is applied to Eq. (62), the result is

R_{i,n+1} = R_{i,n} − θ Σ_{j≠i} (R_{i,n} − R_{j,n}) + C Σ_{j≠i} (2V_n(i,j) − 1),  (65)
where θ = C/B. Equation (65) is equivalent to one version of the Elo (1978) chess rating system discussed in Batchelder and Bershad (1979) and Batchelder and Simpson (1989, Eq. 27). To see this, note that

Σ_{j≠i} [2V_n(i,j) − 1] = W_i − L_i,

where W_i and L_i are the number of times i is chosen (wins) and not chosen (loses) against the M − 1 other choice objects in A. In fact, if θ = 0.04 and C = 400 in Eq. (65), Eq. (27) of Batchelder and Simpson (1989) is obtained, which corresponds to the actual Elo (1978) formula used to rate tournament chess players in the United States during much of the 1960s and early 1970s.

Batchelder and Bershad (1979) discuss a general approach to dynamic estimation in paired-comparison scaling motivated by the Elo system. The 1979 approach differs from the current one in several respects. The main difference is that in the 1979 approach classical estimators must be developed for each particular F and each particular tournament structure. In contrast, the current system is also motivated by the Elo chess rating system; however, explicit estimators are not required at any stage of the implementation. While the two systems closely approximate each other in many cases, the uniform model leading to Eq. (65) is the only non-trivial case that we have found where the two approaches coincide exactly.
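The round robin system of Eq. (62) and the conservation property of Eq. (63) can be sketched as follows. The logistic F, the true ratings, and the value of C are assumed for illustration; on each trial all pairwise increments are computed from the same current estimates, as Eq. (62) specifies, which makes the zero-sum cancellation explicit.

```python
import random
import math

def F(x, B=55.704):
    # assumed logistic choice function
    return 1.0 / (1.0 + math.exp(-x / (2.0 * B)))

random.seed(4)
true_p = [-300.0, -100.0, 100.0, 300.0]   # constant true ratings (assumed)
r = [0.0, 0.0, 0.0, 0.0]                  # initial estimates r_{i,0}
C = 4.0

for n in range(300):                      # 300 round robin trials
    delta = [0.0] * len(r)
    for i in range(len(r)):
        for j in range(i + 1, len(r)):
            # V_n(i, j) = 1 if a_i is chosen over a_j, with Pr = F(p_i - p_j)
            v = 1 if random.random() < F(true_p[i] - true_p[j]) else 0
            d = 2.0 * C * (v - F(r[i] - r[j]))   # a_i's share from Eq. (62)
            delta[i] += d                        # zero sum: a_j loses what a_i gains
            delta[j] -= d
    for i in range(len(r)):
        r[i] += delta[i]

print(sum(r))                  # Eq. (63): rating points are conserved
print([round(x) for x in r])   # estimates track the ordering of the true ratings
```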
7. INTRODUCTION OF A DRAW OR NO-PREFERENCE OUTCOME
In the comparison of a set of choice objects, it may be useful to allow for a no-preference (or "draw" or "tie") response. For example, draws occur frequently between closely matched competitors in master-level chess play (Elo, 1978; Batchelder & Bershad, 1979; Batchelder & Simpson, 1989; also see David, 1963, for an early discussion of the tie in Thurstonian scaling). In this section we extend Axioms 1, 2, and 3 for the reward system presented in Section 3 to accommodate the draw outcome and its associated reward.

Let W = (F; W, L) be a reward system satisfying Definition 5 that we want to modify to allow for a draw. The win, draw, and loss outcomes of a choice experiment on a pair of objects a and b can be summarized by modifying the definition of the performance random variables T_n(a, b) in Eq. (10):

T_n(a, b) = 1 if a is chosen over b on trial n,
T_n(a, b) = 0 if there is no preference between a and b on trial n,
T_n(a, b) = −1 if b is chosen over a on trial n,  (66)
for all a, b ∈ A and n = 0, 1, 2, ... . Following the assumption that performance probabilities depend only on the rating difference, we introduce the functions w(x), d(x), and l(x) as the win, draw, and loss functions, respectively, for −∞ < x < ∞, where x = p(a) − p(b). Definition 5, part (i), is modified to require that w(x), the counterpart to F(x), be a continuous distribution function that is strictly increasing on its support, with w(0) < 1/2. Then win-loss symmetry and the fact that draws are shared require the identities

l(x) = w(−x)  (67)

and

d(x) = 1 − w(x) − w(−x).  (68)
Next we need to modify Definition 5, part (ii), by introducing W(x), D(x), and L(x) as the rewards (continuous functions) for win, draw, and loss, respectively. As before, W and L are positively valued and continuous; however, D is a continuous function that is negative for positive x. Then Axiom 1 (zero sum) requires that

L(x) = W(−x)  (69)

and

D(x) = −D(−x).  (70)

Next, Axiom 2 (fair game) is modified to require

W(x) w(x) + D(x) d(x) − L(x) l(x) = 0.  (71)
Before simplifying Eq. (71), it is desirable to relate W(x) and D(x). Elo (1978) bases the chess rating system on the scoring rule that two draws score the same as a win and a loss. This reasonable assumption is captured as follows.

AXIOM 5 (Draw is Half a Win).

D(x) = [W(x) − W(−x)]/2,  (72)

for all x ∈ Re.
Now Eq. (71) can be simplified easily from Eqs. (67)-(70) to yield

W(x) G(x) = [1 − G(x)] W(−x),  (73)

where G(x) = 1/2 + [w(x) − w(−x)]/2, and G(x) = 1 − G(−x) can be interpreted as the expected value of the choice experiment, i.e., G(x) = w(x) + d(x)/2. Equation (73) is analogous to Eq. (15), and, as before, some additional principle is required to simplify it further. Adopting Axiom 3 (constant value), Eq. (16) is easily seen to yield W(x) = 2CG(−x) and D(x) = W(x) − C, where C = W(0) as before.

Rather than starting with a win probability function, w(x), it is more natural to start with a distribution function F(x) as in Definition 2 and seek to generalize it to allow the possibility of a draw. The method adopted by David (1963) and Elo (1978) is to select a threshold t > 0 (draw region) and make

w(x) = Pr[X > t | p(a) − p(b) = x] = F(x − t),  (74)
where X is thought of as a "performance differential;" that is, for a to beat b (be chosen over b) its momentary performance (or differential utility) must exceed t. Then we have d(x) = F(x + t) − F(x − t) and l(x) = 1 − F(x + t).

As an example of a complete system that allows a draw, Y = (F; W, D, L), assume the uniform model given by Eq. (4). If the system satisfies Axioms 1, 2, 3, and 5, then, for all x ∈ Re and C > 0,

W(x) = 0 (x ≥ B), C(B − x)/B (−B < x < B), 2C (x ≤ −B),  (75a)

D(x) = −C (x ≥ B), −Cx/B (−B < x < B), C (x ≤ −B),  (75b)

L(x) = 2C (x ≥ B), C(B + x)/B (−B < x < B), 0 (x ≤ −B).  (75c)
It is straightforward but tedious to make models with the draw, like Eqs. (75a, b, c), into a sequential tracking scheme. We leave it as an open problem to develop the estimation theory for the one-object, two-object, and many-object cases for models that incorporate the draw.
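The uniform draw system in Eqs. (75a, b, c) can be checked directly against the zero-sum and fair game axioms. The sketch below uses our own function names, illustrative values of B, C, and t, and the threshold construction of Eq. (74).

```python
B, C, t = 1.0, 1.0, 0.2    # assumed uniform support, stake, and draw threshold

def Funi(x):
    # uniform distribution function of Eq. (4) on (-B, B)
    return min(1.0, max(0.0, (x + B) / (2.0 * B)))

def W(x):   # win reward, Eq. (75a)
    return 0.0 if x >= B else (2.0 * C if x <= -B else C * (B - x) / B)

def D(x):   # draw reward, Eq. (75b)
    return -C if x >= B else (C if x <= -B else -C * x / B)

def L(x):   # loss reward, Eq. (75c)
    return 2.0 * C if x >= B else (0.0 if x <= -B else C * (B + x) / B)

def w(x):   # win probability via the threshold construction, Eq. (74)
    return Funi(x - t)

for k in range(-7, 8):
    x = k / 10.0                        # rating differences with x +/- t inside (-B, B)
    d = 1.0 - w(x) - w(-x)              # draw probability, Eq. (68)
    assert abs(L(x) - W(-x)) < 1e-12    # Axiom 1 (zero sum), Eq. (69)
    assert abs(D(x) + D(-x)) < 1e-12    # Eq. (70)
    fair = W(x) * w(x) + D(x) * d - L(x) * w(-x)   # Axiom 2 (fair game), Eq. (71)
    assert abs(fair) < 1e-12
print("axioms hold on the grid")
```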
8. CONCLUSION
Paired-comparison methods generally assume static underlying scale values. In many situations in the social sciences, however, it is reasonable to suppose that underlying scale values may change with time. When relevant data to assess these changes come from paired-comparison choice experiments conducted on discrete trials, the methods in this paper provide rational ways to obtain sequentially revised estimates that track the underlying values. The tracker is easily implemented on a personal computer, and a number of approximate results can be obtained by linearizing the choice function or by obtaining diffusion approximations using methods for slow learning by small steps, such as those discussed in Norman (1972). The rationale for the trackers we develop comes from considerable practical experience with the Elo (1978) chess rating system. While the systems we develop are new, their point of departure was the Elo rating system discussed formally in Batchelder and Bershad (1979) and Batchelder and Simpson (1989).
REFERENCES

BAIRD, J. C., & NOMA, E. (1978). Fundamentals of scaling and psychophysics. New York: Wiley.
BATCHELDER, W. H., & BERSHAD, N. J. (1979). The statistical analysis of a Thurstonian model for rating chess players. Journal of Mathematical Psychology, 19, 39-60.
BATCHELDER, W. H., & SIMPSON, R. S. (1989). Rating systems for human abilities: The case of rating chess skill. UMAP Modules in Undergraduate Mathematics and Its Applications: Module 698. Reprinted in P. J. Campbell (Ed.), UMAP modules 1988: Tools for teaching (pp. 289-314). Arlington, MA: Consortium for Mathematics and Its Applications, Inc.
BUSH, R. R., & MOSTELLER, F. (1955). Stochastic models for learning. New York: Wiley.
DAVID, H. A. (1963). The method of paired comparisons (1st ed.). London: Griffin.
DAVID, H. A. (1988). The method of paired comparisons (2nd ed.). London: Griffin.
DAVIDSON, R. R., & FARQUHAR, P. H. (1976). A bibliography on the method of paired comparisons. Biometrics, 32, 241-252.
DOIGNON, J. P., & FALMAGNE, J. C. (1974). Difference measurement and simple scalability with restricted solvability. Journal of Mathematical Psychology, 11, 473-499.
ELO, A. E. (1978). The rating of chess players past and present. New York: Arco Publishing.
GOLLEDGE, R. G., & RAYNER, J. N. (Eds.). (1982). Proximity and preference: Problems in the multidimensional analysis of large data sets. Minneapolis, MN: University of Minnesota Press.
HAMBLETON, R. K., & SWAMINATHAN, H. (1985). Item response theory. Boston: Kluwer-Nijhoff Publishing.
KUSHNER, H. J. (1984). Approximation and weak convergence methods for random processes, with applications to stochastic systems theory. Cambridge, MA: MIT Press.
LORD, F. M. (1971). Robbins-Monro procedures for tailored testing. Educational and Psychological Measurement, 31, 3-31.
LUCE, R. D. (1959). Individual choice behavior. New York: Wiley.
NORMAN, M. F. (1972). Markov processes and learning models. New York: Academic Press.
OWEN, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70, 351-356.
PARZEN, E. (1962). Stochastic processes. San Francisco: Holden-Day.
ROBBINS, H., & MONRO, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400-407.
SUPPES, P., & ZINNES, J. L. (1963). Basic measurement theory. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. I, pp. 3-76). New York: Wiley.
WATSON, A. B., & PELLI, D. G. (1983). QUEST: A Bayesian adaptive psychometric method. Perception and Psychophysics, 33, 113-120.
WETHERILL, G. B., & GLAZEBROOK, K. D. (1986). Sequential methods in statistics. London: Chapman & Hall.

RECEIVED: October 11, 1989