Diversity analysis in multiple-choice questionnaires

Statistics and Probability Letters 80 (2010) 1103–1110

Statistics and Probability Letters journal homepage: www.elsevier.com/locate/stapro

Haizhen Wu, Giorgi Kvizhinadze ∗

School of Mathematics, Statistics and Operations Research, Victoria University of Wellington, Wellington 6140, New Zealand

Article history: Received 9 September 2009 Received in revised form 1 March 2010 Accepted 4 March 2010 Available online 18 March 2010

Abstract: We extend the analysis of the diversity of questionnaires with binary (yes–no) answers to the case of questionnaires with multiple answers. We demonstrate that the original techniques used by Khmaladze (2009) are universal and can be applied to prove limit theorems for more general cases. We also show that this different framework again leads to a model similar to the one obtained in Khmaladze (2009) for a large number of rare events, namely, to the so-called Karlin–Rouault law. © 2010 Elsevier B.V. All rights reserved.

Keywords: Contiguity; Edgeworth expansion; Large deviation; Large number of rare events; Karlin–Rouault law

1. Introduction

In practice there are many examples where the random variable is a $q$-dimensional vector whose coordinates take different numbers of values. For example, imagine a system which contains a large number (say, 20) of components, each of which can be in various states at any given moment; the state of the system at any given moment can then be described by a 20-dimensional vector. Biologists use a sequence of procedures for identifying bacteria. During each procedure the bacterium is placed in a certain chemical substance where it changes its color. Looking at the sequence of colors, biologists can say to which group of bacteria it belongs. The result of such a test is also an example of a multidimensional random variable. Finally, as a basic example we will take a questionnaire with $q$ multiple-choice questions; see below.

Consider $N$ disjoint events with probabilities $p_1, p_2, \dots, p_N$, $\sum_{i=1}^{N} p_i = 1$, and let $\nu_n = (\nu_{1n}, \dots, \nu_{Nn})$ be the vector of frequencies of these events in $n$ independent trials. The so-called "spectral statistics" and "empirical vocabulary" (see, e.g., Baayen, 2001; Khmaladze, 1988; Khmaladze and Chitashvili, 1989) are defined by

$$\mu_n(m) = \sum_{i=1}^{N} I\{\nu_{in} = m\}, \qquad m = 1, \dots, n,$$

$$\mu_n = \sum_{i=1}^{N} I\{\nu_{in} \ge 1\} = \sum_{m=1}^{n} \mu_n(m).$$
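As an illustration of these two definitions, the following minimal Python sketch counts $\mu_n(m)$ and $\mu_n$ from a list of observed events. The function name `spectral_statistics` and the toy sample are hypothetical and are not taken from the paper; events are represented as arbitrary hashable objects.

```python
from collections import Counter

def spectral_statistics(responses):
    """Return (mu_n, {m: mu_n(m)}) for a list of observed events."""
    freq = Counter(responses)         # nu_in for every event observed at least once
    mu_m = Counter(freq.values())     # mu_n(m) = number of events with frequency m
    mu = len(freq)                    # mu_n = number of events with frequency >= 1
    return mu, dict(mu_m)

# Hypothetical toy sample of n = 7 trials.
sample = ["a", "a", "b", "c", "b", "d", "a"]
mu, mu_m = spectral_statistics(sample)
print(mu, mu_m)
```

By construction the values $\mu_n(m)$ sum to $\mu_n$, and $\sum_m m\,\mu_n(m) = n$, which gives a quick consistency check on any sample.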

In classical statistical analysis, the number of events $N$ and the probabilities $\{p_i\}$ are usually fixed. This setting applies in many statistical problems where the possible events are well-defined and their probabilities are specified. However, in many interesting applications, such as the study of lexical statistics (see Baayen, 2001), more flexible frameworks are necessary, in which the set of possible events may grow and the probabilities may vary as the sample size $n$ increases.



Corresponding author. E-mail address: [email protected] (G. Kvizhinadze).

0167-7152/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.spl.2010.03.004

As the number of possible events increases (so that we have a large number of events), the probability of each event usually decreases (so we say that the events are rare). This new framework stimulated the development of an independent area of statistical theory, called the theory of a large number of rare events (LNRE). The pioneering formal statistical analysis of a large number of rare events was Khmaladze (1988), which introduced the notion of LNRE, studied its different forms, and investigated necessary and sufficient conditions under which LNRE schemes arise. Here we recall two of his definitions.

Definition 1 (d1). A sequence $\{\nu_n\}_{n \ge 1}$ forms a sequence of a large number of rare events if

$$\lim_{n\to\infty} \frac{E\mu_n(1)}{n} > 0.$$

Definition 2 (d2). A sequence $\{\nu_n\}_{n \ge 1}$ forms a sequence of a large number of rare events if

$$\lim_{n\to\infty} \frac{E\mu_n(1)}{E\mu_n} > 0, \qquad \lim_{n\to\infty} E\mu_n = \infty.$$

In Khmaladze (1988) a number of necessary and sufficient conditions for these two kinds of large numbers of rare events are given, and the relations between them are discussed. Recall that d1 implies d2, but not vice versa.

In this paper, we are interested in a specific framework in which the disjoint events can be indexed by $q$-dimensional vectors $\vec{x}_q = (x_1, \dots, x_i, \dots, x_q)$ with coordinates $x_i$ ranging from 1 to $k_i$, respectively. Then $p_i$ and $\nu_{in}$, $i = 1, \dots, N$, in the previous setting become $p(\vec{x}_q)$ and $\nu_n(\vec{x}_q)$, respectively, with $\vec{x}_q \in \Xi_q = \times_{i=1}^{q} \{1, \dots, k_i\}$ and $N = \prod_{i=1}^{q} k_i$ being the cardinality of $\Xi_q$. Hence $\mu_n(m)$ becomes

$$\mu_n(m) = \sum_{\vec{x}_q \in \Xi_q} I\{\nu_n(\vec{x}_q) = m\}.$$

Therefore,

$$E\mu_n(m) = \sum_{\vec{x}_q \in \Xi_q} P\{\nu_n(\vec{x}_q) = m\} = \sum_{\vec{x}_q \in \Xi_q} \binom{n}{m} p(\vec{x}_q)^m \left(1 - p(\vec{x}_q)\right)^{n-m}$$

and

$$E\mu_n = \sum_{\vec{x}_q \in \Xi_q} \left[1 - \left(1 - p(\vec{x}_q)\right)^{n}\right].$$
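The two expectations above can be evaluated directly when $\Xi_q$ is small enough to enumerate. The sketch below is only an illustration under that assumption; the helper names `opinion_space`, `expected_mu_m` and `expected_mu`, and the toy questionnaire, are hypothetical.

```python
from itertools import product
from math import comb, prod

def opinion_space(k):
    """All vectors (x_1, ..., x_q) with x_i in {1, ..., k_i}."""
    return list(product(*(range(1, ki + 1) for ki in k)))

def expected_mu_m(p, n, m):
    """E mu_n(m): sum over opinions of the Binomial(n, p(x)) probability of m."""
    return sum(comb(n, m) * px**m * (1 - px)**(n - m) for px in p.values())

def expected_mu(p, n):
    """E mu_n: sum over opinions of P{nu_n(x) >= 1}."""
    return sum(1 - (1 - px)**n for px in p.values())

# Hypothetical toy questionnaire: q = 2 questions with 2 options each, uniform p.
k = (2, 2)
N = prod(k)
p = {x: 1.0 / N for x in opinion_space(k)}
n = 2
print(expected_mu_m(p, n, 1), expected_mu_m(p, n, 2), expected_mu(p, n))
```

Two identities are useful as sanity checks: $\sum_{m=1}^{n} E\mu_n(m) = E\mu_n$ and $\sum_{m=1}^{n} m\,E\mu_n(m) = n$.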

This framework can be found in many applications. For example, $\vec{x}_q$ can be interpreted as an opinion in a questionnaire with $q$ multi-option or multi-choice questions (the $i$-th question has $k_i$ options). The ratios

$$\frac{E\mu_n(m)}{n} \quad\text{and}\quad \frac{E\mu_n(m)}{E\mu_n} \tag{1}$$

can then be interpreted as the expected number of opinions with exactly $m$ supporters relative to the number $n$ of responses, and relative to the total number of opinions with at least one supporter, respectively.

The main setting of the framework was given in Khmaladze (2009). In that paper all $x_i$ were binary. In this paper, we take advantage of the fact that the proofs given in Khmaladze (2009) are of a more general nature. We demonstrate this by extending the setting for questionnaires from binary answers to a completely general structure.

2. Preliminary analysis and discussion

2.1. The relations among n, q and N

There are three variables, $n$, $q$ and $N$, which control the asymptotic behavior of the ratios (1). Among them, $q$ and $N$ are directly associated with each other by the definition of $N$. Therefore, it is sufficient to discuss the relation between $n$ and $N$.

Since $N$ is the number of disjoint events (opinions) and $n$ is the sample size, or the number of responses, when $n = o(N)$ most frequencies are zero and the nonzero frequencies are mostly equal to 1. On the other hand, when $N = o(n)$ with $n$ large, most frequencies are nonzero and will eventually tend to infinity. It is more interesting to investigate the situation where $N$ and $n$ are comparable, in particular $n = \lambda N$ for some constant $0 < \lambda < \infty$, while $N \to \infty$. In this paper, we focus solely on this last case.

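2.2. Introduction of the likelihood ratio $M_q$

Let $P_q$ denote the probability measure on $\Xi_q$ which is defined by the probabilities $p(\vec{x}_q)$:

$$P_q\left(\vec{X}_q = \vec{x}_q\right) = p(\vec{x}_q)$$

and let $P_{0,q}$ denote the uniform measure on $\Xi_q$:

$$P_{0,q}\left(\vec{X}_q = \vec{x}_q\right) = p_0(\vec{x}_q) = \frac{1}{N}.$$

Then,

$$E\mu_n(m) = \sum_{\vec{x}_q \in \Xi_q} \binom{n}{m} p(\vec{x}_q)^m \left(1 - p(\vec{x}_q)\right)^{n-m} = N\, E_{P_{0,q}}\left[\binom{n}{m} p(\vec{X}_q)^m \left(1 - p(\vec{X}_q)\right)^{n-m}\right]$$

and

$$E\mu_n = N\, E_{P_{0,q}}\left[1 - \left(1 - p(\vec{X}_q)\right)^{n}\right].$$

Define $M_q$ as the likelihood ratio of the measure $P_q$ to the uniform measure $P_{0,q}$, i.e.

$$M_q(\vec{x}_q) = \frac{dP_q}{dP_{0,q}}(\vec{x}_q) = \frac{p(\vec{x}_q)}{p_0(\vec{x}_q)} = N p(\vec{x}_q). \tag{2}$$

Then we have

$$E\mu_n(m) = N \binom{n}{m} \frac{1}{n^m}\, E_{P_{0,q}}\left[\left(\lambda M_q(\vec{X}_q)\right)^m \left(1 - \frac{\lambda M_q(\vec{X}_q)}{n}\right)^{n-m}\right] \tag{3}$$

and

$$E\mu_n = N \left(1 - E_{P_{0,q}}\left[\left(1 - \frac{\lambda M_q(\vec{X}_q)}{n}\right)^{n}\right]\right). \tag{4}$$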

At first sight, it looks artificial to introduce such a likelihood ratio $M_q$. However, as was suggested in Khmaladze (2009), the benefit of this introduction is significant: although the "physical" measure of $\vec{X}_q$ is $P_q$, using $M_q$ we can exploit its asymptotic properties as if $\vec{X}_q$ had the uniform distribution $P_{0,q}$, and as a likelihood ratio and a martingale in $q$, $M_q(\vec{X}_q)$ possesses some good and well-known asymptotic properties, which is very convenient.

Further, according to Lemma 1, the expressions on the right hand side of (3) and (4) can be replaced by Poissonian limits. This suggests that we can lay aside the role of the sample size $n$ in the asymptotic behavior of the ratios, and focus solely on the limit behavior of the distribution of $M_q(\vec{X}_q)$, or equivalently, of $N p(\vec{X}_q)$ under the measure induced by $P_{0,q}$ (we will use the same notation later on).

Lemma 1. For $M_q(\vec{X}_q)$ defined by (2),

$$E_{P_{0,q}}\left[\left(\lambda M_q(\vec{X}_q)\right)^m \left(1 - \frac{\lambda M_q(\vec{X}_q)}{n}\right)^{n-m}\right] = E_{P_{0,q}}\left[\left(\lambda M_q(\vec{X}_q)\right)^m e^{-\lambda M_q(\vec{X}_q)}\right] + O\left(\frac{1}{n}\right).$$

Proof. Since

$$\sup_{0 \le x \le n} \left| x^m \left(1 - \frac{x}{n}\right)^{n-m} - x^m e^{-x} \right| = O\left(\frac{1}{n}\right)$$

and $0 \le \lambda M_q(\vec{X}_q) \le n$,

$$\left| E_{P_{0,q}}\left[\left(\lambda M_q(\vec{X}_q)\right)^m \left(1 - \frac{\lambda M_q(\vec{X}_q)}{n}\right)^{n-m}\right] - E_{P_{0,q}}\left[\left(\lambda M_q(\vec{X}_q)\right)^m e^{-\lambda M_q(\vec{X}_q)}\right] \right| \le \int_{-\infty}^{\infty} \left| x^m \left(1 - \frac{x}{n}\right)^{n-m} - x^m e^{-x} \right| dF_{\lambda M_q(\vec{X}_q)}(x) \le O\left(\frac{1}{n}\right). \qquad \square$$
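The supremum bound used in the proof of Lemma 1 can be checked numerically. The sketch below is an illustration only: the function name `sup_gap` and the grid resolution are hypothetical choices, and the supremum is approximated on a finite grid over $[0, n]$ rather than computed exactly.

```python
from math import exp

def sup_gap(n, m, points_per_unit=20):
    """Grid approximation of sup over 0 <= x <= n of
    |x^m (1 - x/n)^(n-m) - x^m e^(-x)|."""
    steps = n * points_per_unit
    worst = 0.0
    for i in range(steps + 1):
        x = i / points_per_unit
        gap = abs(x**m * (1 - x / n)**(n - m) - x**m * exp(-x))
        worst = max(worst, gap)
    return worst

# If the gap is O(1/n), then n * sup_gap(n, m) should stay bounded as n grows.
for n in (100, 1000):
    print(n, n * sup_gap(n, 1))
```

For $m = 1$ the product $n \cdot \sup$ stays of the same order as $n$ increases tenfold, which is consistent with the $O(1/n)$ rate.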

2.3. The structure of $p(\vec{x}_q)$

By definition, $p(\vec{x}_q)$ is the probability of $\{\vec{X}_q = \vec{x}_q\}$, and we can define

$$a_i(j) = P_q(X_i = j)$$

to be the probability of answering "$j$" to the $i$-th question. In the cases where $X_1, \dots, X_q$ are independent,

$$p(\vec{x}_q) = \prod_{i=1}^{q} a_i(x_i)$$

and

$$M_q(\vec{x}_q) = N p(\vec{x}_q) = \prod_{i=1}^{q} k_i a_i(x_i).$$

If we consider

$$\xi_i = \ln(k_i a_i(X_i))$$

then we can define

$$L_q(\vec{X}_q) = \ln M_q(\vec{X}_q) = \sum_{i=1}^{q} \ln(k_i a_i(X_i)) = \sum_{i=1}^{q} \xi_i.$$

Since $L_q$ can be expressed as a sum of $q$ random variables, it is more convenient to discuss the limit distribution of $L_q$ in these circumstances.

Let us call a questionnaire "neutral" if the distribution of each $X_i$ is uniform on its possible values. In this case $a_i(x_i) = \frac{1}{k_i}$ and there is no need to study $M_q$, as it is simply 1, and the limits of the ratios (1) are therefore

$$\frac{E\mu_n(1)}{n} \to e^{-\lambda}$$

and

$$\frac{E\mu_n(m)}{E\mu_n} = \frac{N \binom{n}{m} \frac{1}{n^m} \lambda^m \left(1 - \frac{\lambda}{n}\right)^{n-m}}{N \left(1 - \left(1 - \frac{\lambda}{n}\right)^{n}\right)} \to \frac{\lambda^m e^{-\lambda}}{m!\,(1 - e^{-\lambda})}.$$
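For the neutral case the finite-$N$ ratio can be compared with its limit, the zero-truncated Poisson probability. The sketch below assumes $n = \lambda N$ rounded to an integer; the function names `neutral_ratio` and `truncated_poisson` are hypothetical.

```python
from math import comb, exp, factorial

def neutral_ratio(N, lam, m):
    """Exact E mu_n(m) / E mu_n for the uniform measure with n = lam * N."""
    n = round(lam * N)
    p = 1.0 / N                       # p(x) = 1/N = lam/n for every opinion
    num = comb(n, m) * p**m * (1 - p)**(n - m)
    den = 1 - (1 - p)**n
    return num / den                  # the common factor N cancels

def truncated_poisson(lam, m):
    """Limit value lam^m e^(-lam) / (m! (1 - e^(-lam)))."""
    return lam**m * exp(-lam) / (factorial(m) * (1 - exp(-lam)))

for m in (1, 2, 3):
    print(m, neutral_ratio(10**5, 2.0, m), truncated_poisson(2.0, m))
```

With $N = 10^5$ and $\lambda = 2$ the two columns agree to several decimal places.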

Note that in this case $E\mu_n(1)$ is of order $n$, and hence the frequencies defined here form a sequence of a large number of rare events in the sense of both d1 and d2.

In practice, questionnaires are rarely absolutely neutral but may be not too "far" from neutral. In other words, they are "nearly neutral". In this case, we assume that the sequence of measures $P_q$, which is defined by $p(\vec{x}_q)$, is contiguous to the sequence of measures $P_{0,q}$.

In more general situations, where the $\{a_i(j)\}$ are allowed to be arbitrary distributions, the asymptotic behavior of the ratios in (1) is more complicated. We will show that, under certain conditions, limit theorems can still be established in both cases.

3. Limit theorems

3.1. The limit theorem for the contiguous neighborhood of neutral questionnaires

As mentioned above, one reason for introducing the likelihood ratio $M_q$ is that it possesses good asymptotic properties. The asymptotic normality of the log-likelihood ratio (see, e.g., Oosterhoff and Zwet, 1979; Greenwood and Shiryayev, 1992) shows that if $\{P_q\}$ is contiguous to $\{P_{0,q}\}$, denoted by $\{P_q\} \triangleleft \{P_{0,q}\}$, the distribution of $L_q$ converges to the normal distribution $N\left(-\frac{1}{2}\sigma^2, \sigma^2\right)$, i.e. the distribution of $M_q$ converges to a log-normal distribution. The limit theorem under this condition can therefore be formulated.

Define the Hellinger distance (see, e.g., Oosterhoff and Zwet, 1979) between $P_{qi}$ and $P_{0,qi}$ as follows:

$$H(P_{qi}, P_{0,qi}) = \left(2 - 2\int \left(\frac{dP_{qi}}{dP_{0,qi}}\right)^{\frac{1}{2}} dP_{0,qi}\right)^{\frac{1}{2}} = \left(2 - 2\int \sqrt{k_i a_i(x_i)}\, dP_{0,qi}\right)^{\frac{1}{2}}.$$

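Theorem 1. If

$$\lim_{q\to\infty} \sum_{i=1}^{q} H(P_{qi}, P_{0,qi})^2 = \frac{1}{4}\sigma^2 < \infty \tag{5}$$

and for every $\varepsilon > 0$,

$$\lim_{q\to\infty} \sum_{i=1}^{q} \int_{|k_i a_i(x_i) - 1| > \varepsilon} \left(\sqrt{k_i a_i(x_i)} - 1\right)^2 dP_{0,qi} = 0, \tag{6}$$

then

$$\frac{E\mu_n(m)}{n} \to \frac{1}{\lambda\, m!}\, E\left[\lambda^m e^{mL} e^{-\lambda e^{L}}\right] \tag{7}$$

and

$$\frac{E\mu_n(m)}{E\mu_n} \to \frac{E\left[\lambda^m e^{mL} e^{-\lambda e^{L}}\right]}{m!\left(1 - E\left[e^{-\lambda e^{L}}\right]\right)}, \tag{8}$$

with $L \sim N\left(-\frac{\sigma^2}{2}, \sigma^2\right)$.

Proof. Conditions (5) and (6) imply $\{P_q\} \triangleleft \{P_{0,q}\}$, and guarantee the asymptotic normality of $L_q$ (Oosterhoff and Zwet, 1979, Theorem 2),

$$L_q = \ln M_q \xrightarrow{\,d(P_{0,q})\,} N\left(-\frac{\sigma^2}{2}, \sigma^2\right),$$

and combining this with Lemma 1 yields (7) and (8). $\square$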
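The limits (7) and (8) involve expectations over a normal variable $L$ that have no elementary closed form, but they are easy to evaluate numerically. The following sketch uses a plain trapezoidal rule; the function names `lognormal_mixture_term` and `limit_ratio`, the grid size and the integration width are hypothetical choices, not part of the paper.

```python
from math import exp, factorial, pi, sqrt

def lognormal_mixture_term(lam, m, sigma, grid=4000, width=10.0):
    """Trapezoidal approximation of E[lam^m e^(mL) e^(-lam e^L)],
    with L ~ N(-sigma^2/2, sigma^2). For m = 0 this is E[e^(-lam e^L)]."""
    mean = -0.5 * sigma**2
    lo, hi = mean - width * sigma, mean + width * sigma
    h = (hi - lo) / grid
    total = 0.0
    for i in range(grid + 1):
        l = lo + i * h
        w = 0.5 if i in (0, grid) else 1.0
        dens = exp(-(l - mean)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))
        total += w * lam**m * exp(m * l) * exp(-lam * exp(l)) * dens
    return total * h

def limit_ratio(lam, m, sigma):
    """Right-hand side of (8)."""
    num = lognormal_mixture_term(lam, m, sigma)
    den = factorial(m) * (1 - lognormal_mixture_term(lam, 0, sigma))
    return num / den

print([round(limit_ratio(2.0, m, 0.5), 4) for m in (1, 2, 3)])
```

Since $\sum_{m \ge 1} \lambda^m e^{mL} e^{-\lambda e^{L}} / m! = 1 - e^{-\lambda e^{L}}$ for every value of $L$, the limits in (8) sum to one over $m$, and for small $\sigma$ they approach the zero-truncated Poisson probabilities of the neutral case.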



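In this case, both ratios are strictly greater than 0, and $E\mu_n \to \infty$. Hence the conditions of both definitions of large numbers of rare events are satisfied.

Example. Suppose we have

$$P_{qi}(j) = a_i(j) = \frac{1 + \frac{e_{ij}}{\sqrt{q}}}{k_i},$$

where $\{e_{ij}\}$ satisfies $-1 \le e_{ij} \le 1$ and

$$\lim_{q\to\infty} \frac{1}{q} \sum_{i=1}^{q} \frac{1}{k_i} \sum_{j=1}^{k_i} e_{ij}^2 = \sigma^2 < \infty$$

with the constraint $\sum_{j=1}^{k_i} e_{ij} = 0$. Then the square of the Hellinger distance between $P_{qi}$ and $P_{0,qi}$ becomes

$$H(P_{qi}, P_{0,qi})^2 = 2 - 2\int \sqrt{k_i a_i(x_i)}\, dP_{0,qi} = 2 - 2\,\frac{1}{k_i} \sum_{j=1}^{k_i} \sqrt{1 + \frac{e_{ij}}{\sqrt{q}}}.$$

Using Taylor's expansion we get

$$\sqrt{1 + \frac{e_{ij}}{\sqrt{q}}} = 1 + \frac{e_{ij}}{2\sqrt{q}} - \frac{e_{ij}^2}{8q} + \frac{e_{ij}^3}{16 q \sqrt{q}} - \cdots,$$

and hence, since the terms linear in $e_{ij}$ vanish by the constraint,

$$\sum_{i=1}^{q} H(P_{qi}, P_{0,qi})^2 = \frac{1}{4}\sigma^2 + O\left(\frac{1}{\sqrt{q}}\right) \to \frac{1}{4}\sigma^2.$$

When $q > \frac{1}{\varepsilon^2}$, we have $|k_i a_i(x_i) - 1| = |e_{i x_i}|/\sqrt{q} < \varepsilon$ for all $i$, so the region of integration in (6) is empty and (6) is satisfied. These two facts imply the asymptotic normality of $L_q$.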
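The convergence of the Hellinger sum in this example can be checked numerically. The sketch below is an illustration under a specific, hypothetical choice of perturbations ($k_i = 2$, $e_{i1} = 1$, $e_{i2} = -1$ for all $i$, so that $\sigma^2 = 1$); the function names are likewise hypothetical.

```python
from math import sqrt

def hellinger_sq(a):
    """Squared Hellinger distance between distribution a and the uniform
    distribution on len(a) points: 2 - 2 * (1/k) * sum_j sqrt(k * a_j)."""
    k = len(a)
    return 2 - 2 * sum(sqrt(k * aj) for aj in a) / k

def hellinger_sum(e):
    """Sum over questions i of H(P_qi, P_0qi)^2 for a_i(j) = (1 + e_ij/sqrt(q)) / k_i."""
    q = len(e)
    total = 0.0
    for row in e:
        k = len(row)
        a = [(1 + eij / sqrt(q)) / k for eij in row]
        total += hellinger_sq(a)
    return total

# Hypothetical perturbations: binary questions with e_i1 = 1, e_i2 = -1 (sigma^2 = 1).
for q in (10, 100, 1000):
    print(q, hellinger_sum([[1.0, -1.0]] * q))
```

The printed values approach $\sigma^2/4 = 0.25$ as $q$ grows, in line with condition (5).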

Remark 1. In our treatment in this section, we assumed that the components of $\vec{X}_q$ are independent. However, this is not a necessary condition. In the case where the components of $\vec{X}_q$ are dependent, we can simply replace $k_i a(x_i)$ by the conditional probabilities $k_i a(x_i \mid \vec{x}_{i-1})$ to achieve the same result (see Greenwood and Shiryayev, 1992).

3.2. The limit theorem for the general case

In general, if $\{a_i(j)\}$ is an arbitrary sequence of distributions, then, unlike in the contiguity case of the previous section where $L_q(\vec{X}_q)$ converges in distribution to a normal random variable, the expectation of $L_q(\vec{X}_q)$ usually tends to $-\infty$ while the variance goes to $\infty$. In this situation we can use techniques similar to those typically used in the theory of large deviations (see e.g. Feller, 1971; Kallenberg, 2001). After applying Esscher's transform (see, e.g. Feller, 1971), $Y_q = \frac{L_q(\vec{X}_q)}{\sqrt{q}}$ will converge under the adjoint measure, and the distribution of $Y_q$ can be approximated by an Edgeworth series (see, e.g., Feller, 1971; Kolassa, 2006). Under necessary conditions, we shall see that, in this case, the limit theorem can be established and the result agrees with the Karlin–Rouault law (see, e.g., Khmaladze and Chitashvili, 1989; Baayen, 2001).

For any fixed sequence $\{a_i(j)\}$ the cumulant generating function of $\xi_i$ under $P_{0,qi}$ is

$$\psi_i(u) = \ln E_{P_{0,qi}} e^{u\xi_i} = \ln\left(\sum_{j=1}^{k_i} [k_i a_i(j)]^u\right) - \ln(k_i).$$

Then the cumulant generating function of $L_q(\vec{X}_q)$ is

$$\ln E_{P_{0,q}} e^{u L_q(\vec{X}_q)} = \sum_{i=1}^{q} \psi_i(u).$$

By Esscher's transform, the distribution $Q_{u,q}$ of $L_q(\vec{X}_q)$ adjoint to $P_{0,q}$ can be defined as follows:

$$\frac{dQ_{u,q,L_q(\vec{X}_q)}}{dP_{0,q,L_q(\vec{X}_q)}}(z) = e^{uz - \sum_{i=1}^{q} \psi_i(u)}.$$

Consequently, the logarithm of the moment generating function of $Y_q = \frac{L_q(\vec{X}_q)}{\sqrt{q}}$ under $Q_{u,q}$ is

$$\ln E_{Q_{u,q}} e^{rY_q} = \sum_{i=1}^{q} \psi_i\left(u + \frac{r}{\sqrt{q}}\right) - \sum_{i=1}^{q} \psi_i(u).$$

We can choose $u = u_q$ such that $E_{Q_{u_q,q}} L_q(\vec{X}_q) = \sum_{i=1}^{q} \psi_i'(u_q) = 0$. The variance of $Y_q$ under $Q_{u_q,q}$ is

$$\sigma_q^2 = \frac{1}{q} \sum_{i=1}^{q} \psi_i''(u_q),$$

and therefore $Y_q = \frac{L_q(\vec{X}_q)}{\sqrt{q}}$ under $Q_{u_q,q}$ becomes a random variable with mean 0 and variance $\sigma_q^2$.
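The tilting parameter $u_q$ can be found numerically from the answer distributions. The sketch below is an illustration only: it uses bisection on $[0, 1]$, which relies on the observations that $\sum_i \psi_i'(u)$ is nondecreasing (each $\psi_i$ is convex), is at most 0 at $u = 0$ by Jensen's inequality, and at $u = 1$ equals a sum of Kullback–Leibler divergences from the uniform distribution, hence is nonnegative. The function names and the toy questionnaire are hypothetical, and all $a_i(j)$ are assumed strictly positive.

```python
from math import log

def psi_prime(a, u):
    """psi_i'(u) for one question with answer probabilities a = (a_i(1), ..., a_i(k_i))."""
    k = len(a)
    b = [k * aj for aj in a]                    # b_j = k_i * a_i(j)
    w = [bj**u for bj in b]                     # tilted weights b_j^u
    return sum(wj * log(bj) for wj, bj in zip(w, b)) / sum(w)

def psi_second(a, u):
    """psi_i''(u): variance of ln(b_j) under the tilted weights."""
    k = len(a)
    b = [k * aj for aj in a]
    w = [bj**u for bj in b]
    mean = sum(wj * log(bj) for wj, bj in zip(w, b)) / sum(w)
    second = sum(wj * log(bj)**2 for wj, bj in zip(w, b)) / sum(w)
    return second - mean**2

def solve_uq(questions, lo=0.0, hi=1.0, tol=1e-12):
    """Bisection for the root of sum_i psi_i'(u) = 0 on [lo, hi]."""
    f = lambda u: sum(psi_prime(a, u) for a in questions)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical questionnaire: q = 5 identical binary questions with a = (0.3, 0.7).
questions = [[0.3, 0.7]] * 5
u_q = solve_uq(questions)
sigma_q2 = sum(psi_second(a, u_q) for a in questions) / len(questions)
print(u_q, sigma_q2)
```

For identical binary questions the root has a closed form, $u_q = \ln\!\left(-\ln b_1 / \ln b_2\right) / \ln(b_2 / b_1)$ with $b_j = 2 a(j)$, which can be used to check the bisection.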

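Theorem 2. Assume $u_q$ is the solution of $\sum_{i=1}^{q} \psi_i'(u) = 0$. If $\{a_i(j)\}$ is such that

$$c < \frac{1}{q} \sum_{i=1}^{q} \psi_i''(u_q) < C, \tag{9}$$

and if there exists $\delta > 0$ such that

$$e^{\sum_{i=1}^{q} \left[\psi_i(u_q + r) - \psi_i(u_q)\right]} = o\left(\frac{1}{\sqrt{q}}\right) \quad \text{uniformly in } r > \delta > 0, \tag{10}$$

then

$$\frac{E\mu_n(m)}{E\mu_n} \to \frac{u^* \Gamma(m - u^*)}{\Gamma(m + 1)\Gamma(1 - u^*)}, \tag{11}$$

where $u^* = \lim_{q\to\infty} u_q$.

Proof. Applying Esscher's transform we get

$$E_{P_{0,q}}\left[\left(\lambda M_q(\vec{X}_q)\right)^m e^{-\lambda M_q(\vec{X}_q)}\right] = e^{\sum_{i=1}^{q} \psi_i(u_q)} \int_{-\infty}^{\infty} \lambda^m e^{(m - u_q)x} e^{-\lambda e^{x}}\, dQ_{u_q,q,L_q(\vec{X}_q)}(x), \tag{12}$$

and then we replace $L_q(\vec{X}_q)$ by $Y_q$:

$$\int_{-\infty}^{\infty} \lambda^m e^{(m - u_q)x} e^{-\lambda e^{x}}\, dQ_{u_q,q,L_q(\vec{X}_q)}(x) = \int_{-\infty}^{\infty} \lambda^m e^{(m - u_q)\sqrt{q}\,y} e^{-\lambda e^{\sqrt{q}\,y}}\, dQ_{u_q,q,Y_q}(y). \tag{13}$$

In Lemma 2, we will prove that under conditions (9) and (10),

$$\int_{-\infty}^{\infty} \lambda^m e^{(m - u_q)\sqrt{q}\,y} e^{-\lambda e^{\sqrt{q}\,y}}\, dQ_{u_q,q,Y_q}(y) = \int_{-\infty}^{\infty} \lambda^m e^{(m - u_q)\sqrt{q}\,y} e^{-\lambda e^{\sqrt{q}\,y}}\, d\Phi_{0,\sigma_q^2}(y) + o\left(\frac{1}{\sqrt{q}}\right), \tag{14}$$

where $\Phi_{0,\sigma_q^2}$ is the normal distribution function with mean 0 and variance $\sigma_q^2$. Then by Lemma 3,

$$\int_{-\infty}^{\infty} \lambda^m e^{(m - u_q)\sqrt{q}\,y} e^{-\lambda e^{\sqrt{q}\,y}}\, d\Phi_{0,\sigma_q^2}(y) \sim \frac{\lambda^{u_q}}{\sqrt{q}}\, \varphi_{0,\sigma_q^2}(0)\, \Gamma(m - u_q) = O\left(\frac{1}{\sqrt{q}}\right). \tag{15}$$

Combining (3), Lemma 1, (12)–(15), and noting that $\frac{1}{n} = o\left(\frac{1}{\sqrt{q}}\right)$, we conclude that for any $m \ge 1$,

$$E\mu_n(m) \sim N e^{\sum_{i=1}^{q} \psi_i(u_q)}\, \frac{\lambda^{u_q}}{\sqrt{q}}\, \varphi_{0,\sigma_q^2}(0)\, \frac{\Gamma(m - u_q)}{m!}.$$

Combining the last equation with

$$\sum_{m=1}^{\infty} \frac{\Gamma(m - u_q)}{m!} = \frac{\Gamma(1 - u_q)}{u_q}$$

and considering $E\mu_n(m) > 0$, we get

$$E\mu_n \sim N e^{\sum_{i=1}^{q} \psi_i(u_q)}\, \frac{\lambda^{u_q}}{\sqrt{q}}\, \varphi_{0,\sigma_q^2}(0)\, \frac{\Gamma(1 - u_q)}{u_q}$$

and consequently (11). $\square$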
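The limit in (11) defines a probability distribution on $m = 1, 2, \dots$ (the Karlin–Rouault law). As an illustration, the sketch below evaluates it through the ratio $\alpha(m+1)/\alpha(m) = (m - u)/(m + 1)$ with $\alpha(1) = u$, which follows from $\Gamma(z + 1) = z\Gamma(z)$. The function names are hypothetical, $0 < u < 1$ is assumed, and the tail formula used for checking is the standard partial-sum identity for generalized binomial coefficients rather than a result stated in the paper.

```python
from math import exp, lgamma

def karlin_rouault_pmf(u, m_max):
    """Values u*Gamma(m-u) / (Gamma(m+1)*Gamma(1-u)) for m = 1, ..., m_max."""
    probs = [u]                                   # m = 1 gives exactly u
    for m in range(1, m_max):
        probs.append(probs[-1] * (m - u) / (m + 1))
    return probs

def tail_mass(u, m_max):
    """Mass beyond m_max: Gamma(m_max+1-u) / (Gamma(m_max+1)*Gamma(1-u))."""
    return exp(lgamma(m_max + 1 - u) - lgamma(m_max + 1) - lgamma(1 - u))

probs = karlin_rouault_pmf(0.5, 1000)
print(probs[:3], sum(probs) + tail_mass(0.5, 1000))
```

The tail decays only like $m^{-u}$, so a direct numerical check of $\sum_{m \ge 1} \Gamma(m - u_q)/m! = \Gamma(1 - u_q)/u_q$ by truncation converges slowly; adding the closed-form tail mass avoids this.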



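Lemma 2. If conditions (9) and (10) are satisfied, then (14) holds.

Proof. Define $g(y, q) = \lambda^m e^{(m - u_q)\sqrt{q}\,y} e^{-\lambda e^{\sqrt{q}\,y}}$ and $g'(y, q) = \frac{\partial g(y, q)}{\partial y}$; then $\lim_{y\to\infty} g(y, q) = 0$ and $\lim_{y\to-\infty} g(y, q) = 0$ when $u_q < m$, so that integration by parts gives

$$\int_{-\infty}^{\infty} g(y, q)\, dQ_{u_q,q,Y_q}(y) - \int_{-\infty}^{\infty} g(y, q)\, d\Phi_{0,\sigma_q^2}(y) = -\int_{-\infty}^{\infty} \left(Q_{u_q,q,Y_q}(y) - \Phi_{0,\sigma_q^2}(y)\right) g'(y, q)\, dy.$$

Under the conditions (9) and (10), the Edgeworth expansion (see Feller, 1971) shows

$$Q_{u_q,q,Y_q}(y) = \Phi_{0,\sigma_q^2}(y) - \frac{\sum_{i=1}^{q} \psi_i^{(3)}(u_q)}{6\sigma_q^3 q^{\frac{3}{2}}}\, H_2(\sigma_q y)\varphi(\sigma_q y) + o\left(\frac{1}{\sqrt{q}}\right).$$

Here $H_2(y) = y^2 - 1$ is the second Hermite polynomial. Since $\int_{-\infty}^{\infty} |g'(y, q)|\, dy$ is bounded,

$$\int_{-\infty}^{\infty} \left(Q_{u_q,q,Y_q}(y) - \Phi_{0,\sigma_q^2}(y)\right) g'(y, q)\, dy = -\frac{\sum_{i=1}^{q} \psi_i^{(3)}(u_q)}{6\sigma_q^3 q^{\frac{3}{2}}} \int_{-\infty}^{\infty} H_2(\sigma_q y)\varphi(\sigma_q y) g'(y, q)\, dy + o\left(\frac{1}{\sqrt{q}}\right)$$

$$= -\frac{\frac{1}{q}\sum_{i=1}^{q} \psi_i^{(3)}(u_q)}{6\sigma_q^2 \sqrt{q}} \int_{-\infty}^{\infty} H_3(\sigma_q y)\varphi(\sigma_q y) g(y, q)\, dy + o\left(\frac{1}{\sqrt{q}}\right).$$

Here $H_3(y) = y^3 - 3y$ is the third Hermite polynomial. Now, since

$$\int_{-\infty}^{\infty} H_3(\sigma_q y)\varphi(\sigma_q y) g(y, q)\, dy = \int_{-\infty}^{\infty} \left((\sigma_q y)^3 - 3\sigma_q y\right)\varphi(\sigma_q y)\, \lambda^m e^{(m - u_q)\sqrt{q}\,y} e^{-\lambda e^{\sqrt{q}\,y}}\, dy \to 0 \tag{16}$$

and $\lim_{q\to\infty} \frac{1}{q} \sum_{i=1}^{q} \psi_i^{(3)}(u_q) < \infty$, the right side of (16) is $o\left(\frac{1}{\sqrt{q}}\right)$ and (14) holds. $\square$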

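Lemma 3. Suppose $u_q$ is a solution of $\sum_{i=1}^{q} \psi_i'(u) = 0$; then

$$\int_{-\infty}^{\infty} \lambda^m e^{(m - u_q)\sqrt{q}\,y} e^{-\lambda e^{\sqrt{q}\,y}}\, d\Phi_{0,\sigma_q^2}(y) \sim \frac{\lambda^{u_q}}{\sqrt{q}}\, \varphi_{0,\sigma_q^2}(0)\, \Gamma(m - u_q).$$

Proof. Since for any $\beta > 0$ and $m \ge 1 > u_q$, we have for large values of $q$

$$\int_{-\infty}^{-\beta q^{-\frac{1}{4}}} \lambda^m e^{(m - u_q)\sqrt{q}\,y} e^{-\lambda e^{\sqrt{q}\,y}}\, d\Phi_{0,\sigma_q^2}(y) \le \lambda^m e^{-(m - u_q)\beta q^{\frac{1}{4}}} e^{-\lambda e^{-\beta q^{\frac{1}{4}}}} \int_{-\infty}^{-\beta q^{-\frac{1}{4}}} d\Phi_{0,\sigma_q^2}(y) < o\left(\frac{1}{\sqrt{q}}\right)$$

and

$$\int_{\beta q^{-\frac{1}{4}}}^{\infty} \lambda^m e^{(m - u_q)\sqrt{q}\,y} e^{-\lambda e^{\sqrt{q}\,y}}\, d\Phi_{0,\sigma_q^2}(y) \le \lambda^m e^{(m - u_q)\beta q^{\frac{1}{4}}} e^{-\lambda e^{\beta q^{\frac{1}{4}}}} \int_{\beta q^{-\frac{1}{4}}}^{\infty} d\Phi_{0,\sigma_q^2}(y) < o\left(\frac{1}{\sqrt{q}}\right)$$

while

$$\int_{-\beta q^{-\frac{1}{4}}}^{\beta q^{-\frac{1}{4}}} \lambda^m e^{(m - u_q)\sqrt{q}\,y} e^{-\lambda e^{\sqrt{q}\,y}}\, d\Phi_{0,\sigma_q^2}(y) = \lambda^{u_q} \int_{-\beta q^{\frac{1}{4}}}^{\beta q^{\frac{1}{4}}} (\lambda e^{z})^{m - u_q} e^{-\lambda e^{z}}\, d\Phi_{0,q\sigma_q^2}(z)$$

$$= \lambda^{u_q} \int_{-\beta q^{\frac{1}{4}}}^{\beta q^{\frac{1}{4}}} (\lambda e^{z})^{m - u_q} e^{-\lambda e^{z}}\, \frac{1}{\sigma_q \sqrt{2\pi q}}\, e^{-\frac{z^2}{2q\sigma_q^2}}\, dz$$

$$\sim \frac{\lambda^{u_q}}{\sqrt{q}}\, \varphi_{0,\sigma_q^2}(0) \int_{-\infty}^{\infty} (\lambda e^{z})^{m - u_q} e^{-\lambda e^{z}}\, dz = \frac{\lambda^{u_q}}{\sqrt{q}}\, \varphi_{0,\sigma_q^2}(0)\, \Gamma(m - u_q).$$

Hence, we conclude that

$$\int_{-\infty}^{\infty} \lambda^m e^{(m - u_q)\sqrt{q}\,y} e^{-\lambda e^{\sqrt{q}\,y}}\, d\Phi_{0,\sigma_q^2}(y) \sim \frac{\lambda^{u_q}}{\sqrt{q}}\, \varphi_{0,\sigma_q^2}(0)\, \Gamma(m - u_q). \qquad \square$$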
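The last step of the proof of Lemma 3 rests on the identity $\int_{-\infty}^{\infty} (\lambda e^{z})^{s} e^{-\lambda e^{z}}\, dz = \Gamma(s)$ for $s = m - u_q > 0$, which follows from the substitution $t = \lambda e^{z}$. The sketch below checks it numerically with a trapezoidal rule; the function name `gamma_integral`, the integration limits and the step size are hypothetical choices, and the truncation at the lower limit is only accurate when $s$ is not too close to 0.

```python
from math import exp, gamma, log

def gamma_integral(lam, s, lo=-80.0, hi=12.0, h=0.01):
    """Trapezoidal approximation of the integral over z of (lam e^z)^s e^(-lam e^z)."""
    steps = int(round((hi - lo) / h))
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0
        # Combine both factors in one exponent to avoid overflow for large z.
        total += w * exp(s * (log(lam) + z) - lam * exp(z))
    return total * h

for s in (0.5, 2.7):
    print(s, gamma_integral(2.0, s), gamma(s))
```

The result does not depend on $\lambda$, which is why the factor $\lambda^{u_q}$ is the only trace of $\lambda$ left in the asymptotic expression.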



References

Baayen, R.H., 2001. Word Frequency Distributions. Kluwer Academic, Dordrecht, Boston.
Feller, W., 1971. An Introduction to Probability Theory and its Applications, vol. 2, second ed. Wiley, New York.
Greenwood, P.E., Shiryayev, A.N., 1992. Contiguity and the Statistical Invariance Principle, second printing 1992 ed. Gordon and Breach, New York.
Kallenberg, O., 2001. Foundations of modern probability. In: Probability and Its Applications, second ed. Springer, New York.
Khmaladze, E.V., 1988. The statistical analysis of a large number of rare events. CWI Report MS-R8804.
Khmaladze, E.V., 2009. Diversity of responses in questionnaires and similar objects, preprint.
Khmaladze, E.V., Chitashvili, R.J., 1989. The statistical analysis of a large number of rare events and the related problem. In: Proc. Tbilisi Mathematical Institute 92, pp. 196–245.
Kolassa, J.E., 2006. Series Approximation Methods in Statistics, third ed. Springer, New York.
Oosterhoff, J., Van Zwet, W.R., 1979. A note on contiguity and Hellinger distance. Contributions to Statistics 157–166.