Statistics & Probability Letters 2 (1984) 305-310 North-Holland
October 1984
ON A MATCHING STATISTIC
Leon PESOTCHINSKY Signetics Corporation, Sunnyvale, CA, USA Received June 1983 Revised June 1984
Abstract: A discrete time process on [0,1] considered in this paper is related to various problems involving two independent samples. In particular one may suggest a simple matching rule for the case of continuously generated samples and a goodness-of-fit test based on the number of unmatched elements. A recurrence formula for computing the exact distribution of this statistic and the asymptotic behavior of its expectation are found.
Keywords: Cramér-v. Mises statistic, goodness-of-fit test, matching.
Introduction
Let $X_{ji}$ be independent random variables ($j = 1, 2$; $i = 1, \ldots, n_j$) with (unknown) continuous distribution functions (c.d.f.) $F_j$. Suppose that the goodness-of-fit hypothesis
$$H:\; F_1 = F_2 = G$$
is to be tested. ($G$ is a specified c.d.f., and without loss of generality we may assume that $G$ is uniform on [0,1].) Rosenblatt (1952) and Kiefer (1959) suggested generalizations of the Kolmogorov-Smirnov and Cramér-v. Mises tests, and various other test statistics were proposed, including one of the latest by Ajtai, Komlós and Tusnády (1982). Most of these assume that the sample sizes $n_j$ are fixed and use some kind of distance between the empirical c.d.f.'s and $G$, or some matching of the sample elements. The matching considered in this paper arises in a setting where $X_1$ and $X_2$ are generated one at a time and at each moment the decision which random variable to generate is taken at random (e.g., with probability ½ for each). Thus we can fix only the combined sample size $n = n_1 + n_2$ (or the size of any one sample). We will investigate a matching procedure which is described by the following rules:
1. if an element of the first population ($X_1$) is generated, then its value is recorded and kept;
2. if an element of the second population ($X_2$) is generated, then it is 'matched' to the nearest larger element (say, $X^*$) from the records of the first population, and the record of $X^*$ is then erased;
3. if such an element $X^*$ does not exist, then no record of the $X_2$ value is kept.

It is intuitively clear that the numbers of unmatched elements of each population have the same distribution, and thus only the number of unmatched $X_1$'s (denoted further by $T_n$) may be needed to construct a procedure to test, say, the goodness-of-fit hypothesis. This number $T_n$ is available to us at each moment $n$ along with the record of the $X_1$'s unmatched up to this moment, and in this paper we are interested in the distribution of $T_n$ and in its asymptotic behavior under the null hypothesis. In Section 1 we formulate the problem and prove that under the null hypothesis the numbers of unmatched $X_1$'s (called 'survivors') and unmatched $X_2$'s (called 'escapees') have the same distribution. This result is used in Section 2 to derive a recurrence method of computing the distribution of $T_n$ for each $n$, and then in Section 3 to find the asymptotic behavior of the expectation
0167-7152/84/$3.00 © 1984, Elsevier Science Publishers B.V. (North-Holland)
Volume 2, Number 5
STATISTICS & PROBABILITY LETTERS
$ET_n$. Unfortunately we are unable to determine the asymptotic distribution of $T_n$, or the distribution of $T_n$ under various alternatives. What makes the latter problem especially difficult is that the numbers of survivors and escapees may not have the same distribution. The author hopes to address this problem in a later publication.
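The three matching rules of the Introduction can be sketched in a few lines of Python. This is only an illustrative sketch under the null hypothesis (uniform values, fair coin for the population label); the function names `run_matching` and `random_trials` are hypothetical and do not come from the paper. Unlike rule 3, the sketch also stores the unmatched second-population values, purely so that they can be counted.

```python
import bisect
import random


def run_matching(trials):
    """Apply the matching rules to a sequence of (value, label) trials.

    label +1 marks an element of the first population, label -1 an
    element of the second population.  Returns the sorted list of
    unmatched first-population values and the list of unmatched
    second-population values (kept here only for counting).
    """
    survivors = []  # sorted records of unmatched first-population values
    escapees = []   # unmatched second-population values
    for x, label in trials:
        if label == +1:
            # Rule 1: record and keep the value.
            bisect.insort(survivors, x)
        else:
            # Rule 2: match to the nearest strictly larger record, if any.
            pos = bisect.bisect_right(survivors, x)
            if pos < len(survivors):
                del survivors[pos]
            else:
                # Rule 3: no larger record exists, the element is unmatched.
                escapees.append(x)
    return survivors, escapees


def random_trials(n, rng):
    """Generate n trials under the null hypothesis (assumed setup)."""
    return [(rng.random(), rng.choice((+1, -1))) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1984)
    survivors, escapees = run_matching(random_trials(1000, rng))
    print(len(survivors), len(escapees))
```

For instance, the trials (0.5, +1), (0.3, -1), (0.9, -1), (0.2, +1) leave the single unmatched first-population value 0.2 and the single unmatched second-population value 0.9.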
1. Preliminaries and definitions

Consider two independent sequences of i.i.d. random variables $X_i$ and $Y_i$, where the $X_i$ are uniform on [0,1] and $P\{Y_i = 1\} = P\{Y_i = -1\} = \tfrac{1}{2}$. Let $Z_i$ be defined as $Z_i = (X_i, Y_i)$. We perform an experiment with trials $Z_1, Z_2, \ldots, Z_n, \ldots$ and define inductively the n-survivors $S(n)$ in the experiment by the following rule:

I. If
$$S(n) = \{X_{i_1(n)}, \ldots, X_{i_{T_n}(n)}\} \quad\text{with}\quad X_{i_1(n)} < X_{i_2(n)} < \cdots < X_{i_{T_n}(n)}$$
and (the number of survivors) $T_n \ge 1$, then
(1) for $Z_{n+1} = (X_{n+1}, 1)$ and $X_{i_j(n)} < X_{n+1} < X_{i_{j+1}(n)}$ for some $j = 0, 1, \ldots, T_n$,
$$S(n+1) = \{X_{i_1(n)}, \ldots, X_{i_j(n)}, X_{n+1}, X_{i_{j+1}(n)}, \ldots, X_{i_{T_n}(n)}\}$$
and $T_{n+1} = T_n + 1$;
(2) for $Z_{n+1} = (X_{n+1}, -1)$ and $X_{i_j(n)} < X_{n+1} < X_{i_{j+1}(n)}$ for some $j = 0, 1, \ldots, T_n - 1$,
$$S(n+1) = \{X_{i_1(n)}, \ldots, X_{i_j(n)}, X_{i_{j+2}(n)}, \ldots, X_{i_{T_n}(n)}\}$$
and $T_{n+1} = T_n - 1$.
(In both cases we assume that $X_{i_0(n)} = 0$ and $X_{i_{T_n+1}(n)} = 1$.)

II. If $T_n = 0$, then $S(n) = \emptyset$ and
(1) for $Z_{n+1} = (X_{n+1}, 1)$, $S(n+1) = \{X_{n+1}\}$ and $T_{n+1} = 1$;
(2) for $Z_{n+1} = (X_{n+1}, -1)$, $S(n+1) = \emptyset$ and $T_{n+1} = 0$.

A trial $Z_L = (X_L, Y_L)$ ($1 \le L \le n$) is called an escapee if $Y_L = -1$ and $X_L > X_{i_{T_{L-1}}(L-1)}$; in that case the set of survivors is unchanged, $S(L) = S(L-1)$.

For each $n$ we partition the sample space $\Omega_n$ of the $n$ trials into $2^n n!$ sets, $\Omega_n = \bigcup \Omega_{\pi,y}$, where $y = (y_1, y_2, \ldots, y_n)$ is a vector with components $\pm 1$ and $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$ is the permutation of $1, 2, \ldots, n$ defined by
$$X_{\pi_1} < X_{\pi_2} < \cdots < X_{\pi_n}.$$
Each of the sets $\Omega_{\pi,y}$ is assigned probability $1/(2^n n!)$. Let
$$M(n) = (X_{k_1(n)}, X_{k_2(n)}, \ldots, X_{k_{U_n}(n)}), \qquad X_{k_1(n)} < X_{k_2(n)} < \cdots < X_{k_{U_n}(n)},$$
denote the n-escape points, so that $U_n$ is the number of escapees among the first $n$ trials. Define the n-reversal mapping $\varphi_n$ on the sample space $\Omega_n$ by
$$\varphi_n:\; \omega \to \varphi_n(\omega),$$
where
$$\omega = ((X_1, Y_1), \ldots, (X_n, Y_n))$$
and
$$\varphi_n(\omega) = ((1 - X_n, -Y_n), \ldots, (1 - X_1, -Y_1)).$$

Theorem 1. (i) The transformation $\varphi_n$ is measure preserving.
(ii) The n-survivors and n-escape points of $\omega$ and $\varphi_n(\omega)$ are related by
$S(n) = \{X_{i_1(n)}, X_{i_2(n)}, \ldots, X_{i_{T_n}(n)}\}$, $\omega$ n-survivors,
$M(n) = (X_{k_1(n)}, X_{k_2(n)}, \ldots, X_{k_{U_n}(n)})$, $\omega$ n-escape points,
$S_\varphi(n) = \{1 - X_{k_1(n)}, 1 - X_{k_2(n)}, \ldots, 1 - X_{k_{U_n}(n)}\}$, $\varphi_n(\omega)$ n-survivors,
$M_\varphi(n) = \{1 - X_{i_1(n)}, 1 - X_{i_2(n)}, \ldots, 1 - X_{i_{T_n}(n)}\}$, $\varphi_n(\omega)$ n-escape points.
(iii) In particular,
$$T_n(\omega) = U_n(\varphi_n(\omega)), \qquad U_n(\omega) = T_n(\varphi_n(\omega)),$$
so that
$$\Pr\{T_n = t,\; U_n = u\} = \Pr\{T_n = u,\; U_n = t\}.$$

The formal proof of the theorem is omitted, but the basic argument is outlined below. Consider a process
$$Z'_j = (X'_j, Y'_j) = (1 - X_{n-j+1},\; -Y_{n-j+1}), \qquad j = 1, 2, \ldots, n.$$
It is clear that in this 'new' process a matched pair still remains matched. A trial $Z'_j$ becomes an escapee if $Y'_j = -1$ and $X'_j > X'_{i_k}$, where $S'(j) = \{X'_{i_k}\}$, $k = 1, 2, \ldots, T'_j$, is the j-survivor in the new process. Incidentally, the latter implies that $Y'_{i_k} = +1$ for all $k = 1, 2, \ldots, T'_j$. It follows now that in the original process $Y_{n-j+1} = +1$, $Y_{n-i_k+1} = -1$, and $X_{n-j+1} < X_{n-i_k+1}$, $k = 1, 2, \ldots, T'_j$. These conditions mean that either all negative trials after moment $n - j + 1$ were larger than $X_{n-j+1}$, or that they were matched to some other positive trials. This implies that $Z_{n-j+1} = (X_{n-j+1}, Y_{n-j+1})$ is a survivor at moment $n$ in the original process.

It may be interesting to notice that for the study of the asymptotics of $ET_n$ it is sufficient to use the fact that $ET_n = EU_n$. The latter immediately follows from the observation that
$$U_n = U_{n-1} + \gamma_n, \qquad T_n = T_{n-1} + \delta_n,$$
where
$$\gamma_n = \begin{cases} 1 & \text{if } Y_n = -1 \text{ and } X_n > X_{i_{T_{n-1}}(n-1)}, \\ 0 & \text{otherwise,} \end{cases}$$
and
$$\delta_n = \begin{cases} +1 & \text{if } Y_n = +1, \\ 0 & \text{if } Y_n = -1 \text{ and } X_n > X_{i_{T_{n-1}}(n-1)}, \\ -1 & \text{if } Y_n = -1 \text{ and } X_n < X_{i_{T_{n-1}}(n-1)}. \end{cases}$$
Indeed we can see now that $E\gamma_n = E\delta_n$, which implies $ET_n = EU_n$. □

The above theorem establishes a 'duality' between the processes of survivors and escapees and in particular allows us to analyze the distribution of the number of escapees $U_n$ instead of the number of survivors $T_n$. This result is the most instrumental one because $U_n$ (unlike $T_n$) is a nondecreasing function of $n$. We present below a recurrent mechanism for the computation of the probabilities $P_{n,k}$.

2. Recurrence equations for $P_{n,k}$

Define, for a fixed $s$ and $x_1 \le x_2 \le \cdots \le x_t$,
$$Q_n(u \mid x_1, \ldots, x_t) = \Pr\{U_{s+n} = u \mid T_s = t,\; X_{i_k} = x_k,\; k = 1, \ldots, t\}, \tag{2.1}$$
the escapees being counted over the $n$ trials that follow moment $s$. It is easy to see that the right-hand side in (2.1) does not depend on $s$ and therefore we can always assume that $s = 0$ and omit any reference to it. In other words, we start the process at moment 0 with $t$ survivors in (0,1).

Lemma 2.1. The function $Q_n(u \mid x_1, \ldots, x_t)$ satisfies the recurrence equation
$$Q_n(u \mid x_1, \ldots, x_t) = \frac{1 - x_t}{2}\, Q_{n-1}(u - 1 \mid x_1, \ldots, x_t) + \frac{1}{2} \sum_{j=0}^{t} \int_{x_j}^{x_{j+1}} Q_{n-1}(u \mid x_1, \ldots, x_j, v, x_{j+1}, \ldots, x_t)\, dv + \sum_{j=1}^{t} \frac{x_j - x_{j-1}}{2}\, Q_{n-1}(u \mid x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_t), \tag{2.2}$$
with $x_0 = 0$ and $x_{t+1} = 1$.

Proof. The result is obtained by conditioning on the outcome of the first trial. The first term in (2.2) corresponds to an escapee, the second to the addition of a new survivor, and the third to the elimination of one of the survivors. □

Lemma 2.2. The function $Q_n(u \mid x_1, \ldots, x_t)$ is a polynomial of degree not larger than $n$ in no more than $\min\{t,\; n - u + 1\}$ variables.

Proof. The first part of the statement follows from (2.2) and a standard induction argument. To prove the second part we should notice that a trial $Z_j = (X_j, -1)$ becomes an escapee only if all the survivors $(X_i, +1)$ with $X_i$ larger than $X_j$ were matched prior to moment $j$. Therefore none of the $u$ escapees in $n$ trials can fall to the left of $x_{t-(n-u)}$. Thus only $n - u + 1$ of the $x_j$'s, those with $j \ge t - (n - u)$, can be included in the polynomial $Q_n(u \mid x_1, \ldots, x_t)$. □

Theorem 2. For all $n \ge 2$ and $1 \le k \le n$,
$$P_{n,k} = \Pr\{T_n = k\} = \tfrac{1}{2} P_{n-1,k-1} + \tfrac{1}{2} \int_0^1 Q_{n-1}(k \mid x)\, dx = Q_n(k \mid \emptyset). \tag{2.3}$$

The proof follows immediately from (2.2). □

Theorem 2 allows for the recursive computation of the probabilities $P_{n,k}$. It may be shown that the total number of coefficients of the polynomials needed for the nth step does not exceed $q^{n-1}$, where $q$ is some number between 1 and 2.

3. Asymptotic evaluation of $ET_n$

Lemma 3.1. Let $U_n(0, x)$ denote the number of escapees from the interval $(0, x)$ (i.e., the number of trials with $Y_j = -1$ and $x \ge X_j \ge \max_k X_{i_k(j-1)}$). Then $EU_n(0, x)$ is an increasing function of $n$ and of $x$.

Proof. Clearly, $U_k(\omega)$ increases with $k$. Therefore $EU_k$ is an increasing function of $k$ and thus
$$EU_n(0, x) = \sum_{k=0}^{n} EU_k \binom{n}{k} x^k (1 - x)^{n-k}$$
increases with $n$ and $x$. □

Lemma 3.2. Let $T_n(a, b)$ denote the number of survivors in the interval $(a, b) \subset (0, 1)$ after $n$ trials. Then $ET_n(x, x + c)$ is a decreasing function of $x$, $0 \le x \le 1 - c$.

Proof. We note that the numbers of trials $n_c$ falling into the intervals $(x, x + c)$ are identically distributed binomial random variables. The expected number of survivors in the interval $(x, x + c)$ is therefore equal to the expected number of survivors in, say, the interval $(0, c)$ minus the expected number of survivors 'eliminated' by escapees from the interval $(0, x)$. By the previous lemma, the latter number is an increasing function of $x$. □

Remark. We could easily prove a stronger result, namely, that $T_n(x, x + c)$ is stochastically decreasing with $x$.

Corollary. $E(X_{i_{j+1}(n)} - X_{i_j(n)})$ increases with $j = 0, \ldots, T_n$; $X_{i_0(n)} = 0$, $X_{i_{T_n+1}(n)} = 1$ (i.e., the expected distance between two consecutive survivors is an increasing function).

Proof. The proof of the corollary follows from the observation that by the above lemma the expected average distance between the survivors in the interval $(x, x + c)$ increases with $x$. □

Lemma 3.3. The event $\{T_n = 0\}$ represents the renewal state. The probability of renewal $P_{n,0}$ admits the upper bound (3.2).

Proof. Using the duality of $T_n$ and $U_n$ we will estimate $P\{U_n = 0\}$. Let $\gamma_n$ be defined as in Theorem 1 and let $X_{(n-1)}(n-1)$ denote the largest of $X_1, \ldots, X_{n-1}$. Then
$$P\{U_n = 0\} = E\left\{ \frac{1 + X_{i_{T_{n-1}}(n-1)}}{2} \,\Big|\, U_{n-1} = 0 \right\} P\{U_{n-1} = 0\} \le E\left\{ \frac{1 + X_{(n-1)}(n-1)}{2} \,\Big|\, U_{n-1} = 0 \right\} P\{U_{n-1} = 0\} \le \frac{1 + (n-1)/n}{2}\, P\{U_{n-1} = 0\} = \frac{2n - 1}{2n}\, P\{U_{n-1} = 0\}. \tag{3.1}$$
(We use in (3.1) the fact that with probability one $X_{i_{T_{n-1}}(n-1)} \le X_{(n-1)}(n-1)$ for all $n$.) From (3.1) we immediately obtain that
$$P\{U_n = 0\} \le \prod_{k=1}^{n} \frac{2k - 1}{2k} = \frac{(2n - 1)!}{n!\,(n - 1)!\,2^{2n-1}} < \frac{1}{\sqrt{\pi n}}. \tag{3.2}$$
□
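As a small numerical illustration of Lemma 3.3, the renewal probability $P\{T_n = 0\}$ can be computed exactly for small $n$ and compared with the product obtained by iterating the one-step bound $P\{U_n = 0\} \le \frac{2n-1}{2n} P\{U_{n-1} = 0\}$. The sketch below relies on the partition of Section 1: only the rank order of the $X_i$ and the signs $Y_i$ matter, and all $2^n n!$ combinations are equally likely. The function names are hypothetical, and the brute-force enumeration is an assumption of this sketch (it is not the recurrence of Section 2) and is practical only for very small $n$.

```python
import bisect
from fractions import Fraction
from itertools import permutations, product


def survivors_count(ranks, signs):
    """Number of survivors after running the matching on one outcome.

    ranks gives the rank order of the X values, signs the labels Y.
    """
    survivors = []  # sorted ranks of currently unmatched +1 trials
    for r, y in zip(ranks, signs):
        if y == 1:
            bisect.insort(survivors, r)
        else:
            pos = bisect.bisect_right(survivors, r)
            if pos < len(survivors):
                del survivors[pos]  # matched to the nearest larger survivor
            # otherwise the trial is an escapee and nothing is stored
    return len(survivors)


def exact_renewal_probability(n):
    """Exact P{T_n = 0} by enumerating all 2^n * n! equally likely outcomes."""
    hits = 0
    total = 0
    for ranks in permutations(range(n)):
        for signs in product((1, -1), repeat=n):
            total += 1
            if survivors_count(ranks, signs) == 0:
                hits += 1
    return Fraction(hits, total)


def product_bound(n):
    """Product of (2k - 1) / (2k) for k = 1..n, from iterating the one-step bound."""
    bound = Fraction(1)
    for k in range(1, n + 1):
        bound *= Fraction(2 * k - 1, 2 * k)
    return bound


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, exact_renewal_probability(n), product_bound(n))
```

For $n = 1$ and $n = 2$ the exact value coincides with the product (1/2 and 3/8), while for $n = 3$ the exact value is 13/48 against the bound 5/16.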
Corollary. Let $R_n$ denote the number of moments $k$, $1 \le k \le n$, at which the process is in the renewal state (i.e., with no survivors). Then $ER_n = O(\sqrt{n})$.

Proof. Let
$$\Gamma_k = \begin{cases} 1 & \text{if } T_k = 0, \\ 0 & \text{otherwise.} \end{cases}$$
Then $E\Gamma_k = P\{T_k = 0\} = P_{k,0}$. Since $R_n = \sum_{k=1}^{n} \Gamma_k$ it follows from (3.2) that
$$ER_n = \sum_{k=1}^{n} P_{k,0} < \sum_{k=1}^{n} \frac{1}{\sqrt{\pi k}} < 2\sqrt{\frac{n}{\pi}}. \tag{3.3}$$
□

Lemma 3.4. For $n \ge 1$,
$$ET_{n+1} \ge ET_n + \tfrac{1}{2}(ET_n + 1)^{-1}. \tag{3.4}$$

Proof. Using the notation of Theorem 1 we may write that
$$U_{n+1} = U_n + \gamma_{n+1},$$
and, by the corollary to Lemma 3.2,
$$E\{\gamma_{n+1} \mid S(n)\} = E\{\tfrac{1}{2}(1 - X_{i_{T_n}(n)}) \mid S(n)\} \ge \{2(T_n + 1)\}^{-1}.$$
The latter inequality implies that
$$ET_{n+1} = EU_{n+1} = EU_n + E\gamma_{n+1} \ge ET_n + \tfrac{1}{2}(ET_n + 1)^{-1}.$$
□

Lemma 3.5. There exist $\underline{A}$ and $\overline{A}$ such that $0 < \underline{A} \le \overline{A}$ and
$$\underline{A} \le \liminf_{n \to \infty} \frac{ET_n}{\sqrt{n}} \le \limsup_{n \to \infty} \frac{ET_n}{\sqrt{n}} \le \overline{A}. \tag{3.5}$$

Proof. We can compute the number of escapees in $n$ trials by conditioning on the trial with the smallest $X$ in [0, 1]. This way the total number of escapees $U_n$ equals either the number of escapees in the process with $n - 1$ trials (if the smallest is a survivor), or this number plus one (if the smallest becomes an escapee), or the sum of the numbers of escapees prior to the occurrence of the smallest trial and after it (if a matching occurs). In the last case we note that by matching the smallest trial we have not decreased the expected number of matched pairs, and therefore we have not increased the expected number of survivors (or escapees). We summarize the above arguments by observing that the expected number of escapees in the last two cases does not exceed the expected number of escapees in the process with $n$ trials plus the probability that the smallest trial becomes an escapee. Thus, denoting the moment of the occurrence of the smallest trial by $k$, we can write
$$EU_n \le \frac{EU_{n-1}}{2} + \frac{EU_n}{2} + \frac{1}{2n} \sum_{k=1}^{n} P\{T_{k-1} = 0\}. \tag{3.6}$$
Assume now that
$$EU_{n-1} < A\sqrt{n - 1}$$
for some $A > 0$. Using Lemma 3.3 we derive from (3.6) that
$$ET_n = EU_n \le A\sqrt{n - 1} + 1/\sqrt{n - 1}. \tag{3.7}$$
The right-hand side in (3.7) does not exceed $A\sqrt{n}$ as long as $A \ge 2$. The rest of the proof follows now from Lemma 3.4 because
$$ET_n \ge \frac{1}{2} \sum_{k=1}^{n-1} (ET_k + 1)^{-1} \ge \frac{1}{2} \sum_{k=1}^{n-1} (A\sqrt{k} + 1)^{-1} \sim \frac{\sqrt{n}}{A}.$$
□

Remark. We were unable to prove the existence of $\lim_{n \to \infty} ET_n/\sqrt{n}$. However, we can show that, given the existence of this limit, we have, for all $x \in (0, 1)$,
$$\lim_{n \to \infty} \frac{ET_n(0, x)}{ET_n} = \sqrt{x}.$$
In other words, the expected numbers of survivors in subintervals of (0, 1) have a limiting distribution with density $(2\sqrt{x})^{-1}$. We can take $X_{i_{T_n}(n)}$ as the largest quantile of a sample of size $T_n$ from a distribution with c.d.f. $F(x) = \sqrt{x}$, $0 \le x \le 1$. This gives us
$$EX_{i_{T_n}(n)} \approx \frac{T_n}{T_n + 2}$$
and, as in Lemma 3.4,
$$ET_{n+1} = ET_n + \tfrac{1}{2} E(1 - X_{i_{T_n}(n)}) \approx ET_n + \frac{1}{ET_n + 2},$$
which can be satisfied only by $ET_n \sim \sqrt{2n}$.

Acknowledgment

The author is grateful to Professor Karp from UC Berkeley and to Professors Konheim and Robertson from UC Santa Barbara for introducing him to the problem and for valuable discussions.

References

Ajtai, M., J. Komlós and G. Tusnády (1982), On optimal matchings, to appear.
Kiefer, J. (1959), K-sample analogues of the Kolmogorov-Smirnov and Cramér-v. Mises tests, Annals Math. Stat. 30, 420-447.
Rosenblatt, M. (1952), Limit theorems associated with variants of the von Mises statistic, Annals Math. Stat. 23, 617-623.