Efficiency of split questionnaire surveys


Journal of Statistical Planning and Inference 141 (2011) 1925–1932


James O. Chipperfield a,*, David G. Steel b

a Methodology Division, Australian Bureau of Statistics, Australia
b Centre for Statistical and Survey Methodology, University of Wollongong, Australia

Article info

Article history: Received 1 April 2010; received in revised form 23 November 2010; accepted 3 December 2010; available online 10 December 2010.

Keywords: Split questionnaires; Sample design; Regression coefficients; Multinomial

Abstract

We consider a general design that allows information for different patterns, or sets, of data items to be collected from different sample units, which we call a Split Questionnaire Design (SQD). While SQDs have historically been used to accommodate constraints on respondent burden, this paper shows they can also be an efficient design option. The efficiency of a design can be measured by the cost required to meet constraints on the accuracy of estimates. Moreover, this paper shows how an SQD provides considerable flexibility when exploring the balance between the design's efficiency and the burden it places on respondents. The targets of interest to the design are analytic parameters, such as regression coefficients. Empirical results show that SQDs are worth considering. © 2010 Elsevier B.V. All rights reserved.

1. Introduction

With few exceptions, survey designs involve collecting information on all $K$ data items from a sample of $n$ units; the information collected from the $i$th unit is denoted by $y_i = (y_{1i}, y_{2i}, \ldots, y_{Ki})'$, where $i = 1, 2, \ldots, n$. This design, called a single phase design (SPD), leads to simplicity in the survey design and analysis and requires that only one questionnaire or collection instrument be developed, pilot tested and, perhaps, printed. We will call a sample design that allows different patterns, or sets, of data items to be collected from different sample units a Split Questionnaire Design (SQD). In a survey that collects information on $K$ data items, an SQD potentially allows the use of $J = 2^K - 1$ different combinations in which information on the $K$ data items can be collected. We consider an SQD that potentially selects $J$ non-overlapping simple random samples, where the sample size for the $j$th pattern is $n^{(j)}$, $j = 1, \ldots, J$, the allocation for the SQD is specified by $\mathbf{n} = (n^{(1)}, n^{(2)}, \ldots, n^{(j)}, \ldots, n^{(J)})'$, and the total sample size is $n = \sum_j n^{(j)}$. The data patterns for the case $K = 3$ are illustrated in Table 1. For example, $j = 3$ denotes the data pattern where only $y_1$ and $y_2$ are collected.

A special case of an SQD, referred to here as a restricted SQD (also referred to as a multi-phase design; see Cochran, 1977, p. 327; Särndal et al., 1992, p. 343), arises when the allowable set of data patterns is restricted to follow a monotone pattern: when information on $y_k$ is collected, information on $y_{k-1}, y_{k-2}, \ldots, y_1$ is always collected (e.g. the set of patterns $j = 1$, 3 and 7 in Table 1 follows a monotone pattern). SQDs have historically been used when an SPD is impractical or would result in concerns about the quality of responses due to respondent fatigue (see for example Shoemaker, 1973; Munger and Lloyd, 1988).
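The pattern count and the monotone restriction described above can be sketched in a few lines of Python. This is only an illustration: the function names `data_patterns` and `is_monotone` are hypothetical, and patterns are represented as sets of item numbers rather than by the labels $j$ used in Table 1.

```python
from itertools import combinations

def data_patterns(K):
    """Return every non-empty subset of the K data items (items numbered 1..K)."""
    items = range(1, K + 1)
    return [frozenset(c) for size in items for c in combinations(items, size)]

def is_monotone(pattern):
    """A pattern is monotone if collecting y_k implies y_{k-1}, ..., y_1 are collected."""
    return pattern == frozenset(range(1, max(pattern) + 1))

patterns = data_patterns(3)                      # J = 2**3 - 1 patterns
monotone = [sorted(p) for p in patterns if is_monotone(p)]
```

For $K = 3$ the monotone patterns are $\{y_1\}$, $\{y_1, y_2\}$ and $\{y_1, y_2, y_3\}$, which is the restricted SQD of the text.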
For example, literacy at the school level could be measured by asking each student to spell 20 randomly selected words from a list of 500 words. Asking each student to spell all 500 words in order to measure a school's literacy would lead to respondent fatigue. Also, for a given level of respondent burden, an SQD allows more data items to be collected than an SPD. For example, for its Census the Australian Bureau of Statistics is considering the use of an SQD with three questionnaire modules, where only two modules are asked of any respondent.

* Corresponding author (J.O. Chipperfield). doi:10.1016/j.jspi.2010.12.003

Table 1
SQD data patterns for $K = 3$. Each pattern $j$ collects a subset of $y_1$, $y_2$ and $y_3$ (pattern 1 collects only $y_1$, pattern 3 collects $y_1$ and $y_2$, and pattern 7 collects all three items), with sample size $n^{(j)}$ and cost $c^{(j)}$.

Data item pattern (j)   1         2         3         4         5         6         7
Sample size             n^{(1)}   n^{(2)}   n^{(3)}   n^{(4)}   n^{(5)}   n^{(6)}   n^{(7)}
Cost                    c^{(1)}   c^{(2)}   c^{(3)}   c^{(4)}   c^{(5)}   c^{(6)}   c^{(7)}

In contrast, the sample design literature has focused on the optimality of an SPD or of restricted SQDs, where optimality is measured by the survey's cost given that constraints on the accuracy of specific estimates are met. Further, relatively little work in the sample design literature considers designing samples for analytic purposes. Skinner et al. (1989) give the following insight into why this may be the case: analysts "... are somewhat removed from the survey design process. ... much analysis of survey data consists of secondary analysis of survey data, which has been collected for descriptive purposes [e.g. population totals]."

While SQDs provide flexibility to manage respondent burden, an SQD also has two obvious qualities that suggest it would be an efficient design option. Firstly, it allows information on data items with relatively high enumeration cost to be collected from fewer units than data items with relatively low cost. Secondly, the correlation between data items can be exploited to minimise the information loss due to not collecting all data items from all units in the sample.

This paper compares the efficiency of an SQD (restricted and unrestricted) and an SPD when the targets of inference are analytic parameters. The parameters considered are those for the mean, linear regression and the multinomial distribution. It also illustrates how an SQD provides considerable flexibility when exploring the balance between the design's efficiency and the burden it places on respondents. Methods are available for estimating such parameters and their variances given data collected by an SQD (i.e. given $\mathbf{n}$). These methods are well known within the field of analysis of missing data (see Rubin and Little, 2002).
However, limited work has been done to exploit the benefits of specifying $\mathbf{n}$ at the sample design stage (for limited exceptions see Shoemaker, 1973; Raghunathan and Grizzle, 1995; Chipperfield and Steel, 2009), which is the focus of this paper. Chipperfield and Steel (2009) consider the efficiency of SQDs but only when estimating univariate population totals (e.g. total number of employed people) within a design-based framework, where the uncertainty is due to the sampling process alone. This paper considers the efficiency of SQDs when estimating parameters that are multivariate in nature (i.e. parameters of the multivariate normal and multinomial distributions, as well as regression parameters in linear regression). The methodology of this paper is developed within an analytic framework, where uncertainty arises from the model in question.

Section 2 defines the distribution of the data that is to be collected and measures the cost and accuracy of estimates for an SQD. Section 3 measures the efficiency of an SQD (restricted and unrestricted) compared with an SPD when $K = 6$. Section 4 illustrates how an SQD's efficiency and the burden it places on respondents can be jointly considered. Section 5 addresses some practical issues.

2. Costs and variances for an SQD

When designing a survey for analytic purposes it is necessary to make assumptions about the distribution of the data that is to be collected by the survey. Section 2.1 describes the data. Most sample designs are the result of some balance between the survey's cost and the accuracy of the resulting estimates. Section 2.2 describes the cost of an SQD and Section 2.3 gives explicit expressions for the accuracy of an SQD's estimates for a given allocation, $\mathbf{n}$.

2.1. The data

We now describe the distributional assumptions made in this paper. When all data items are continuous we assume the vector $y_i$ is $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}$ has elements $\sigma^2_{kk'}$ and $\boldsymbol{\mu}$ is a $K$ column vector with elements $\mu_k$, and $y_i$ and $y_{i'}$ are independent for $i \neq i'$.
We refer to this model-based parameterisation of the data by (M1). Consider the standard linear relationship between $y_{1i}$ and a vector of explanatory data items $\tilde{y}_i = (y_{2i}, y_{3i}, \ldots, y_{Ki})'$ given by

$$y_{1i} = \beta_{10\cdot\tilde{y}} + \boldsymbol{\beta}'(\tilde{y}_i - \tilde{\boldsymbol{\mu}}) + e_i \qquad (1)$$

where $\tilde{\boldsymbol{\mu}} = (\mu_2, \mu_3, \ldots, \mu_K)'$, the $e_i$ are independent with mean zero and variance $\sigma^2_{11\cdot\tilde{y}}$, and $\boldsymbol{\beta} = (\beta_2, \beta_3, \ldots, \beta_K)'$ is the parameter of interest. We refer to this parameterisation of the data by (M2).


When all data items are categorical we consider the $K$ vector $y_i$ of categorical data items where $y_{ki}$ has $l_k$ levels, such that $y_i$ for $i = 1, \ldots, n$ defines a $K$-way contingency table with $Q = \prod_k l_k$ cells. Again we assume that $y_i$ and $y_{i'}$ are independent for $i \neq i'$. We define $W_i = (W_{i1}, \ldots, W_{iq}, \ldots, W_{iQ})'$ to be a $Q \times 1$ vector where $W_{iq} = 1$ if unit $i$ belongs to the $q$th cell of the contingency table and $W_{iq} = 0$ otherwise, where $q = 1, \ldots, Q$. The distribution of the cell counts in the contingency table is assumed to be multinomial with parameter $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_q, \ldots, \pi_Q)'$. We refer to this parameterisation of the data by (M3).

2.2. Cost of an SQD

The cost of a survey can be defined in terms of payments incurred by the statistical organisation. We define the cost per sample unit, $c_0$, to be the fixed unit cost that is independent of the cost of collecting the information about $y$. This incorporates the costs incurred before any information is collected from the sample unit. The marginal cost of collecting the pattern $j$ data items from a unit is denoted by $c^{(j)}$ and the cost of collecting data item $k$ from a unit is denoted by $c_k$. Define $g^{(j)}$ to be the number of data items collected by pattern $j$. We assume that $c^{(j)} = \sum_{k \in u^{(j)}} c_k$, where $u^{(j)}$ is the set of data items collected by pattern $j$. That is, the cost of collecting a set of data items is the sum of the costs of collecting each individual data item, which is a reasonable assumption and can be relaxed if necessary. The total cost of the survey is

$$C = c_0 n + \sum_j c^{(j)} n^{(j)}$$

Cost can also be defined in terms of the reporting load on the responding unit, measured in terms of interview time. The coefficients c0 and c(1) would then represent set-up time (e.g. time to make contact with the respondent and to explain the purpose of the survey) and the time required to collect only y1 from each unit, respectively.
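As a minimal sketch of the cost model above, the following Python function evaluates $C = c_0 n + \sum_j c^{(j)} n^{(j)}$ under the additive assumption $c^{(j)} = \sum_{k \in u^{(j)}} c_k$. The function name `survey_cost` and all numerical values are hypothetical and chosen only for illustration.

```python
def survey_cost(c0, item_costs, allocation):
    """Total cost C = c0 * n + sum_j c^(j) * n^(j).

    item_costs: dict mapping item number k -> c_k
    allocation: dict mapping a pattern (tuple of item numbers) -> n^(j)
    The pattern cost c^(j) is taken to be the sum of its item costs.
    """
    n = sum(allocation.values())
    variable = sum(
        sum(item_costs[k] for k in pattern) * n_j
        for pattern, n_j in allocation.items()
    )
    return c0 * n + variable

item_costs = {1: 1.0, 2: 2.0, 3: 5.0}        # hypothetical c_1, c_2, c_3
allocation = {(1, 2, 3): 100, (1, 2): 50}    # hypothetical n^(j)
total = survey_cost(c0=0.5, item_costs=item_costs, allocation=allocation)
```

The same function applies when costs are measured in interview time, with `c0` read as set-up time per respondent.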

2.3. Variance of estimates from a Split Questionnaire Design

Let $\boldsymbol{\theta}$ be the vector of parameters that are of interest (e.g. when designing for means $\boldsymbol{\theta} = \boldsymbol{\mu}$). Consider a complete data set, $d_c$, where all $K$ data items are collected from each of the $n$ sample units (i.e. an SPD). In general, maximum likelihood (ML) estimation of a vector of parameters $\boldsymbol{\theta}$ involves solving the score equation $Sc(\boldsymbol{\theta}; d_c) = (\partial/\partial\boldsymbol{\theta})\, l(\boldsymbol{\theta}; d_c) = 0$, where $l(\boldsymbol{\theta}; d_c) = \log L(\boldsymbol{\theta}; d_c)$ and $L$ is the likelihood of $\boldsymbol{\theta}$ based on $d_c$. The expected information for the ML estimate of $\boldsymbol{\theta}$, denoted by $\hat{\boldsymbol{\theta}}$, is $\mathrm{Info}(\boldsymbol{\theta}; d_c) = -(\partial/\partial\boldsymbol{\theta})\, Sc(\boldsymbol{\theta}; d_c)$. In large samples, the estimate of $\boldsymbol{\theta}$ has variance $\mathrm{Var}(\hat{\boldsymbol{\theta}}) = \mathrm{Info}^{-1}(\boldsymbol{\theta}; d_c)$ (see Rubin and Little, 1987, p. 85).

Now consider a set of data, $d_0$, which arises from not collecting all data items from each of the $n$ units. Here $d_0$ represents the data collected by an SQD. Breckling et al. (1994) showed that the expected information matrix for the parameter $\boldsymbol{\theta}$ given $d_0$ is

$$\mathrm{Info}(\boldsymbol{\theta}; d_0) = \mathrm{Info}(\boldsymbol{\theta}; d_c) - E_{d_0}\{\mathrm{Var}_{d_c|d_0}[Sc(\boldsymbol{\theta}; d_c) \mid d_0]\}, \qquad (2)$$

where $\mathrm{Var}_{d_c|d_0}[Sc(\boldsymbol{\theta}; d_c) \mid d_0]$ is the variance of the score function over the distribution of the complete data conditional on the data collected by the SQD, and $E_{d_0}$ denotes expectation over the distribution of the data collected by the SQD. The conditional variance is a function of $d_0$, which has a distribution determined by the selection of the SQD sample (see Section 1) and the parameterisation of the data given in Section 2.1 (see also Rubin and Little, 2002). The expectation in (2) is taken with respect to this distribution. The expected information must be used here rather than the observed information because at the design stage no survey data has been collected. The first term in (2) is the expected information matrix based on $d_c$. The second term in (2) gives the expected reduction in the information due to not collecting all $K$ data items from all units in the sample; an SQD must pay particular attention to this term to ensure that there is no undue loss of information.

Next we derive expressions for the expected information for $\boldsymbol{\mu}$, $\boldsymbol{\beta}$ and $\boldsymbol{\pi}$ for a given allocation, $\mathbf{n}$. By varying $\mathbf{n}$, we get an appreciation of the accuracy of the estimates arising from an SQD, which is important at the design stage.

2.3.1. Means

Next we give an expression for $\mathrm{Info}_0(\boldsymbol{\mu}; d_0)$, where $d_0$ corresponds to the data collected by an SQD. From (M1) we use the fact that $\mathrm{Info}_c(\boldsymbol{\mu}; d_c) = n\boldsymbol{\Sigma}^{-1}$ and $Sc(\boldsymbol{\mu}; d_c) = n\boldsymbol{\Sigma}^{-1}(\bar{y} - \boldsymbol{\mu})$, where $\bar{y} = n^{-1}\sum_i y_i$. It follows from (M1) and (2) that the ML estimator of $\boldsymbol{\mu}$ under an SQD has expected information matrix

$$\mathrm{Info}_0(\boldsymbol{\mu}; d_0) = \boldsymbol{\Sigma}^{-1}(nI - \Lambda_{\mu\mu}\boldsymbol{\Sigma}^{-1}) \qquad (3)$$

where $I$ is the $K \times K$ identity matrix, $\Lambda_{\mu\mu} = \sum_j n^{(j)}\boldsymbol{\Sigma}^{(j)}$, $\boldsymbol{\Sigma}^{(j)}$ has $(k,k')$th element $\sigma_{kk'\cdot u^{(j)}}$, and $\sigma_{kk'\cdot u^{(j)}}$ is the covariance between $y_k$ and $y_{k'}$ conditional on the set of data items which are collected in the $j$th data pattern. To obtain the result corresponding to (3) for a restricted SQD we simply restrict the set of patterns to be monotonic. It is worthwhile noting that the variance of an estimate of the mean using data collected by an SQD is the same whether or not the multivariate normal assumption is made. As this paper is using the ML framework, a distribution for the data must be assumed.
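Expression (3) can be evaluated numerically as sketched below. This is an illustrative pure-Python implementation, not the authors' code: the helper names (`matmul`, `inverse`, `conditional_cov`, `info_means`) are hypothetical, items are indexed from 0, and the covariance matrix and allocation in the example are made-up values.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse(a):
    """Gauss-Jordan inverse of a small non-singular matrix."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def conditional_cov(sigma, observed):
    """Sigma^(j): covariance of y given the observed items (zero for observed items)."""
    K = len(sigma)
    missing = [k for k in range(K) if k not in observed]
    cond = [[0.0] * K for _ in range(K)]
    if not missing:
        return cond
    obs = list(observed)
    s_oo_inv = inverse([[sigma[a][b] for b in obs] for a in obs])
    s_mo = [[sigma[m][o] for o in obs] for m in missing]
    s_om = [list(col) for col in zip(*s_mo)]
    explained = matmul(matmul(s_mo, s_oo_inv), s_om)
    for a, m in enumerate(missing):
        for b, m2 in enumerate(missing):
            cond[m][m2] = sigma[m][m2] - explained[a][b]
    return cond

def info_means(sigma, allocation):
    """Expected information (3): Sigma^{-1} (n I - Lambda Sigma^{-1})."""
    K = len(sigma)
    n = sum(allocation.values())
    lam = [[0.0] * K for _ in range(K)]
    for observed, n_j in allocation.items():
        cond = conditional_cov(sigma, observed)
        for a in range(K):
            for b in range(K):
                lam[a][b] += n_j * cond[a][b]
    s_inv = inverse(sigma)
    lam_s_inv = matmul(lam, s_inv)
    inner = [[(n if a == b else 0.0) - lam_s_inv[a][b] for b in range(K)]
             for a in range(K)]
    return matmul(s_inv, inner)

sigma = [[1.0, 0.8], [0.8, 1.0]]              # hypothetical covariance matrix
allocation = {(0,): 100, (0, 1): 100}         # 100 units with y1 only, 100 with both
variance = inverse(info_means(sigma, allocation))
```

With this allocation $y_1$ is observed on 200 units and $y_2$ on 100, so the variance of $\hat{\mu}_1$ is $1/200$ and the variance of $\hat{\mu}_2$ agrees with the familiar two-phase regression-estimator result $(1-\rho^2)/100 + \rho^2/200$.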


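2.3.2. Regression coefficients

Next we give an expression for $\mathrm{Info}_0(\boldsymbol{\beta}; d_0)$, where again $d_0$ corresponds to the data collected by an SQD. Define $s_b$ to be the subset of the $J$ data patterns where $y_1$ and at least one explanatory data item in the model are collected, and $n_E = \sum_{j \in s_b} n^{(j)}$. Patterns of data where $j \notin s_b$ do not contain any information about the regression coefficients and so should not be considered for an SQD. The score function for $\boldsymbol{\beta}$ defined by (M2) is

$$Sc(\boldsymbol{\beta}; d_c) = \sigma^{-2}_{11\cdot\tilde{y}} \sum_{i \in s} (\tilde{y}_i - \tilde{\boldsymbol{\mu}})\,(y_{1i} - \beta_{10\cdot\tilde{y}} - \boldsymbol{\beta}'(\tilde{y}_i - \tilde{\boldsymbol{\mu}}))$$

It follows from (2), and assuming $E_{d_c|d_0}$ is defined by (M1), that

$$\mathrm{Info}_0(\boldsymbol{\beta}; d_0) = \sigma^{-2}_{11\cdot\tilde{y}}\,[\,n_E \boldsymbol{\Sigma}_{\tilde{y}\tilde{y}} - \Lambda_{\beta\beta}\,] \qquad (4)$$

where $\Lambda_{\beta\beta}$ is a $(K-1) \times (K-1)$ matrix with $(l-1, l'-1)$th element

$$\Lambda_{\beta\beta}(l-1, l'-1) = \sigma^{-2}_{11\cdot\tilde{y}} \sum_{j \in s_b} n^{(j)} \Lambda^{(j)}_{\beta\beta}(l-1, l'-1), \qquad l, l' = 2, \ldots, K,$$

and

$$\Lambda^{(j)}_{\beta\beta}(l-1, l'-1) = \begin{cases} \sigma^2_{ll'}\, \boldsymbol{\beta}' \boldsymbol{\Sigma}^{(j)}_{\tilde{y}\tilde{y}} \boldsymbol{\beta} & \text{if } y_l, y_{l'} \in u^{(j)} \\ \sigma_{l1}\, \boldsymbol{\beta}' V^{(j)}_{1l'} + \boldsymbol{\beta}' V^{(j)}_{2ll'} \boldsymbol{\beta} & \text{if } y_l \in u^{(j)},\ y_{l'} \notin u^{(j)} \\ \sigma^2_{ll'} \sigma^2_{11\cdot\tilde{y}} + \boldsymbol{\beta}' V^{(j)}_{3ll'} \boldsymbol{\beta} - \boldsymbol{\beta}' V^{(j)}_{4ll'} - \boldsymbol{\beta}' V^{(j)}_{4l'l} & \text{if } y_l, y_{l'} \notin u^{(j)} \end{cases}$$

Here $V^{(j)}_{1l'}$ and $V^{(j)}_{4ll'}$ are $(K-1)$ vectors with elements indexed by $r-1$, and $V^{(j)}_{2ll'}$ and $V^{(j)}_{3ll'}$ are $(K-1) \times (K-1)$ matrices with elements indexed by $(r-1, s-1)$, where $r, s = 2, \ldots, K$. Their elements are defined case by case, according to whether $y_r$ and $y_s$ belong to $u^{(j)}$, in terms of the conditional covariances $\sigma_{rs\cdot u^{(j)}}$, the vectors $\boldsymbol{\Sigma}_{u^{(j)}r}$, the coefficients $\beta^{(j)}_l$ and the matrices $A^{(j)}_{rs}$.

In these expressions $\beta^{(j)}_l = \boldsymbol{\Sigma}^{-1}_{u^{(j)}u^{(j)}} \boldsymbol{\Sigma}_{u^{(j)}l}$, where $\boldsymbol{\Sigma}_{u^{(j)}u^{(j)}}$ is the same as $\boldsymbol{\Sigma}$ but is restricted to the $g^{(j)}$ collected data items of pattern $j$; $\boldsymbol{\Sigma}^{(j)}_{\tilde{y}\tilde{y}}$ has $(r-1, s-1)$th element $\sigma_{rs\cdot u^{(j)}}$; $\boldsymbol{\Sigma}_{ru^{(j)}} = \boldsymbol{\Sigma}'_{u^{(j)}r}$ is a $g^{(j)}$ row vector with $(s-1)$th element $\sigma_{rs}$, where $y_s \in u^{(j)}$; $A^{(j)}_{rs}$ is a $g^{(j)} \times g^{(j)}$ matrix of zeros except for $1/2$ in the $(r-1, s-1)$th and $(s-1, r-1)$th elements if $r \neq s$ and for a 1 in the $(r-1, s-1)$th element if $r = s$; and $\beta^{(j)}$ is a $g^{(j)} \times g^{(j)}$ matrix with $l$th column $\beta^{(j)}_l$, where $l \in u^{(j)}$.

The second term in (4) represents the loss of information due to not collecting all data items from all $n$ units. Eq. (4) is novel: other expressions in the literature involve a substantial number of terms, are only approximations (see Beale and Little, 1975) or consider only a very restricted set of patterns (Little, 1992).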

2.3.3. Multinomial distribution

We now consider the parameter $\tilde{\boldsymbol{\pi}} = (\pi_1, \ldots, \pi_c, \ldots, \pi_{Q-1})'$ defined by (M3), where $c = 1, 2, \ldots, Q-1$. As the elements of $\boldsymbol{\pi}$ are constrained to sum to 1, we note that $\pi_Q = 1 - \sum_{c=1}^{Q-1} \pi_c$. The log-likelihood and score equations for $\tilde{\boldsymbol{\pi}}$ (see Agresti, 1996) under an SPD and (M3) are given by (5) and (6):

$$l(\tilde{\boldsymbol{\pi}}; d_c) = \sum_{i=1}^{n} \sum_{c=1}^{Q-1} W_{ic} \ln \pi_c + \sum_i W_{iQ} \ln \pi_Q \qquad (5)$$

and

$$Sc(\tilde{\boldsymbol{\pi}}; d_c) = (Sc(\pi_1; d_c), \ldots, Sc(\pi_c; d_c), \ldots, Sc(\pi_{Q-1}; d_c))', \qquad Sc(\pi_c; d_c) = \sum_i W_{ic}\, \pi_c^{-1} - \sum_i W_{iQ}\, \pi_Q^{-1} \qquad (6)$$

It is easy to show that the information on $\tilde{\boldsymbol{\pi}}$ based on $d_c$ is the $(Q-1) \times (Q-1)$ matrix $\mathrm{Info}(\tilde{\boldsymbol{\pi}}; d_c)$, which has $(c,c)$th element $n(\pi_c^{-1} + \pi_Q^{-1})$ and $(c,c')$th element $n\pi_Q^{-1}$ where $c \neq c'$.

Define $y^{(j)}$ to be the $g^{(j)}$ vector of data items that are collected from a respondent who is allocated to pattern $j$. Also define $S(y^{(j)})$ to be the subset of the $Q$ categories to which a sample unit could belong given $y^{(j)}$. To illustrate, let $K = 2$, $y_1$ take the values 1 or 2 ($l_1 = 2$), $y_2$ take the values 1, 2 or 3 ($l_2 = 3$), and $j = 1$ correspond to the pattern where only $y_1$ is collected. Therefore, when $j = 1$, $y^{(j)} = y_1$, $S(y_1 = 1) = \{(y_1, y_2): (1,1), (1,2), (1,3)\}$ and $S(y_1 = 2) = \{(y_1, y_2): (2,1), (2,2), (2,3)\}$.
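The sets $S(y^{(j)})$ in the illustration above can be enumerated mechanically. The sketch below is illustrative only; the function name `consistent_cells` is hypothetical and items are indexed from 0, so item 0 is $y_1$ and item 1 is $y_2$.

```python
from itertools import product

def consistent_cells(levels, observed):
    """Return S(y^(j)): cells of the full table consistent with the observed items.

    levels: number of levels l_k of each item, e.g. (2, 3)
    observed: dict mapping item index (0-based) -> observed value
    """
    cells = product(*(range(1, l + 1) for l in levels))
    return [cell for cell in cells
            if all(cell[k] == value for k, value in observed.items())]

# K = 2 with l_1 = 2 and l_2 = 3, pattern collecting only y_1
s_y1_is_1 = consistent_cells((2, 3), {0: 1})
s_y1_is_2 = consistent_cells((2, 3), {0: 2})
```

With $l_2 = 3$ each of the two sets contains three cells, one for each possible value of the uncollected item $y_2$.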


It follows from (M3) and (2) that the expected information for $\tilde{\boldsymbol{\pi}}$ for an SQD is

$$\mathrm{Info}(\tilde{\boldsymbol{\pi}}; d_0) = \mathrm{Info}(\tilde{\boldsymbol{\pi}}; d_c) - \Lambda_{\tilde{\pi}\tilde{\pi}} \qquad (7)$$

where $\Lambda_{\tilde{\pi}\tilde{\pi}}$ has elements

$$\Lambda_{\tilde{\pi}\tilde{\pi}}(c, c') = \sum_j n^{(j)} \sum_{y^{(j)}} E_{cc'}(y^{(j)}),$$

$\sum_{y^{(j)}}$ is the sum over all possible values for $y^{(j)}$, for $c = c'$

$$E_{cc'}(y^{(j)}) = \begin{cases} \pi_c^{-1} + \pi_Q^{-1} & \text{if } c, Q \in S(y^{(j)}) \\ \pi_Q^{-1} - \pi_{q^{(j)}}^{-1} & \text{if } Q \in S(y^{(j)}),\ c \notin S(y^{(j)}) \\ \pi_c^{-1} - \pi_{q^{(j)}}^{-1} & \text{if } c \in S(y^{(j)}),\ Q \notin S(y^{(j)}) \\ 0 & \text{if } c, Q \notin S(y^{(j)}) \end{cases}$$

and for $c \neq c'$

$$E_{cc'}(y^{(j)}) = \begin{cases} \pi_Q^{-1} & \text{if } c, c', Q \in S(y^{(j)}) \\ -\pi_{q^{(j)}}^{-1} & \text{if } c', c \in S(y^{(j)}),\ Q \notin S(y^{(j)}) \\ \pi_Q^{-1} & \text{if } c', Q \in S(y^{(j)}),\ c \notin S(y^{(j)}) \\ \pi_Q^{-1} & \text{if } c, Q \in S(y^{(j)}),\ c' \notin S(y^{(j)}) \\ 0 & \text{otherwise} \end{cases}$$

The second term in (7) represents the loss of information due to not collecting all data items from all $n$ units. Using the fact that $\pi_Q = 1 - \sum_{c=1}^{Q-1} \pi_c$, the information on $\pi_Q$ based on $d_0$ can be obtained from

$$\mathrm{Info}^{-1}(\pi_Q; d_0) = \mathbf{1}'\, \mathrm{Info}^{-1}(\tilde{\boldsymbol{\pi}}; d_0)\, \mathbf{1}$$

where $\mathbf{1}$ is a $Q-1$ column vector of 1s.

3. Optimal allocation for an SQD: example with means

3.1. Design problem

The SQD design problem is to find the optimal allocation, $\mathbf{n}$, that minimises cost, $C$, subject to some constraint on the variance of the estimates of $\boldsymbol{\mu}$, $\boldsymbol{\beta}$ or $\boldsymbol{\pi}$. Given an arbitrary value of $\mathbf{n}$, the variances are obtained from their respective information matrices (3), (4) and (7). The optimal value for $\mathbf{n}$ can be found using standard optimisation techniques (e.g. Newton's method). By way of example, we consider the optimal SQD allocation for means. Formally, the design objective for estimating $\boldsymbol{\mu}$ is to find $\mathbf{n}$ that minimises $C$ subject to the constraint

$$CV^2(\hat{\mu}_k) < v^{\mu}_k \quad \text{for all } k \qquad (C1)$$

where $CV^2(\hat{\mu}_k) = V(\hat{\mu}_k)/\hat{\mu}_k^2$, $V(\hat{\mu}_k)$ is the variance of the ML estimator $\hat{\mu}_k$ obtained from (3) and $v^{\mu}_k$ is a pre-specified design constraint. For this design objective, the efficiency of an SQD relative to an SPD is a function of a set of scale-free design parameters listed below:

1. The $K \times K$ correlation matrix $\boldsymbol{\rho} = \{\rho_{kk'}\}$.
2. The proportion of the unit cost that is fixed under an SPD. This is given by $\tilde{c}_0 = c_0/(c_0 + c^{(J)})$, where $j = J$ corresponds to the pattern where all $K$ data items are collected.
3. The cost of collecting only $y_k$ relative to the cost of collecting the other $K-1$ data items. Without loss of generality this can be expressed as $c_{k/K} = c_k/c_K$ for all $k \neq K$.
4. The relative sample sizes required to meet the constraint on each of the means under an SPD. These are given by $n^{\mu}_{k/K} = n^{\mu}_k/n^{\mu}_K$ for all $k \neq K$, where $n^{\mu}_k = CV^2(\hat{\mu}_k)/v^{\mu}_k$ is the sample size required to meet the constraint on $\mathrm{Var}(\hat{\mu}_k)$ under an SPD.

At the design stage, some design parameters will not be known exactly. It is therefore important to consider the sensitivity of the efficiency of an SQD to these design parameters (for more on this see Section 5.2).

3.2. Evaluation

We now consider the optimal allocation problem, described in Section 3.1, when the target of inference is $\boldsymbol{\mu}$. We fix

$$\boldsymbol{\rho} = \begin{pmatrix} 1 & & & & & \\ 0.01 & 1 & & & & \\ 0.97 & 0.04 & 1 & & & \\ 0.45 & 0.67 & 0.55 & 1 & & \\ 0.81 & 0.08 & 0.73 & 0.21 & 1 & \\ 0.30 & 0.71 & 0.31 & 0.60 & 0.07 & 1 \end{pmatrix}$$

and vary the design parameters 2, 3 and 4. Table 2 gives the gains (i.e. reduction in cost) of a restricted and unrestricted SQD relative to an SPD for a range of different values of the design parameters.

Table 2
Means: percentage reduction in $C$ relative to an SPD.

Scenario 1 (a): $\alpha^{\mu} = (1,1)$, $\gamma^{\mu} = (1,1)$, $\rho_B$
Design              $\tilde{c}_0 = 0\%$   $\tilde{c}_0 = 10\%$   $\tilde{c}_0 = 20\%$   $\tilde{c}_0 = 30\%$
Restricted SQD      15                    9                      5                      4
Unrestricted SQD    33                    28                     20                     13

Scenario 2 (a): $\alpha^{\mu} = (1,1)$, $\tilde{c}_0 = 10\%$, $\rho_B$
Design              $\gamma^{\mu} = (1.1, 0.9)$   $\gamma^{\mu} = (1.2, 0.8)$   $\gamma^{\mu} = (1.4, 0.6)$   $\gamma^{\mu} = (1.6, 0.4)$
Restricted SQD      2                             2                             5                             9
Unrestricted SQD    24                            24                            24                            26

Scenario 3 (b): $\gamma^{\mu} = (1,1)$, $\tilde{c}_0 = 10\%$, $\rho_B$
Design              $\alpha^{\mu} = (0.9, 1.1)$   $\alpha^{\mu} = (0.8, 1.2)$   $\alpha^{\mu} = (0.6, 1.4)$   $\alpha^{\mu} = (0.4, 1.6)$
Restricted SQD      17                            30                            42                            51
Unrestricted SQD    29                            33                            45                            51

(a) $\alpha^{\mu} = (\alpha^{\mu}_1, \alpha^{\mu}_2)$, $\alpha^{\mu}_1 = n^{\mu}_{1/6} = n^{\mu}_{2/6} = n^{\mu}_{3/6}$, $\alpha^{\mu}_2 = n^{\mu}_{4/6} = n^{\mu}_{5/6}$.
(b) $\gamma^{\mu} = (\gamma^{\mu}_1, \gamma^{\mu}_2)$, $\gamma^{\mu}_1 = c_{1/6} = c_{2/6} = c_{3/6}$, $\gamma^{\mu}_2 = c_{4/6} = c_{5/6}$.

Scenario 1 shows that when the cost of collecting each data item is the same and the sample size required to meet the variance constraint on each of the means is the same, an unrestricted SQD is significantly more efficient than both a restricted SQD and an SPD. For example, when the fixed unit cost is negligible (i.e. $\tilde{c}_0 = 0$) an unrestricted SQD and a restricted SQD are 33% and 15% more efficient than an SPD, respectively. Even if the fixed cost per unit were to increase substantially (i.e. $\tilde{c}_0 = 30\%$), an unrestricted SQD is still noticeably more efficient than both a restricted SQD and an SPD. Scenario 2 shows that an unrestricted SQD is substantially more efficient than both an SPD and a restricted SQD when the cost of collecting the data items is allowed to vary. Scenario 3 shows that as the sample sizes required to meet the constraint on the means are allowed to vary, the difference between a restricted and unrestricted SQD can be small (though both are much more efficient than an SPD).

The results show that an SQD can be much more efficient than an SPD and that an unrestricted SQD can be somewhat more efficient than a restricted SQD. Importantly, a restricted SQD has $K!$ different possible monotonic patterns. The optimal allocation for a restricted SQD requires finding the optimal allocation for each of the $K!$ monotonic patterns. This makes optimal allocation for a restricted SQD somewhat impractical. It should be noted that while $J = 63$ when designing for means in this example, most optimal allocations were such that fewer than 5 data patterns were ever assigned a non-zero allocation by the optimisation algorithm.
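To make the design problem of Section 3.1 concrete, the sketch below solves a toy version for $K = 2$ means. It is not the optimisation used for Table 2: it replaces Newton's method with a coarse grid search, assumes unit variances, imposes the constraints directly on the variances rather than on $CV^2$, and uses made-up costs and constraints. The information matrix is built in its per-pattern form (each pattern contributes the inverse covariance of the items it observes), which agrees with (3) under the multivariate normal model. The function names are hypothetical.

```python
from itertools import product

def variances(rho, n1, n2, n12):
    """Large-sample variances of the ML means for a bivariate normal with unit
    variances and correlation rho, given n1 units with y1 only, n2 units with
    y2 only and n12 units with both items. Returns None if not estimable."""
    d = 1.0 - rho * rho
    a = n1 + n12 / d            # information for mu_1
    b = n2 + n12 / d            # information for mu_2
    c = -rho * n12 / d          # cross term
    det = a * b - c * c
    if det <= 0:
        return None
    return b / det, a / det

def best_allocation(rho, c0, c1, c2, v1, v2, grid):
    """Cheapest allocation on the grid meeting Var(mu_k) <= v_k for k = 1, 2."""
    best = None
    for n1, n2, n12 in product(grid, repeat=3):
        var = variances(rho, n1, n2, n12)
        if var is None or var[0] > v1 + 1e-12 or var[1] > v2 + 1e-12:
            continue
        cost = c0 * (n1 + n2 + n12) + c1 * n1 + c2 * n2 + (c1 + c2) * n12
        if best is None or cost < best[0]:
            best = (cost, (n1, n2, n12))
    return best

# Hypothetical design parameters: y2 is ten times as costly as y1.
spd_cost = (1.0 + 1.0 + 10.0) * 100      # an SPD needs 100 complete units
cost, allocation = best_allocation(rho=0.9, c0=1.0, c1=1.0, c2=10.0,
                                   v1=0.01, v2=0.01, grid=range(0, 401, 10))
```

In this toy setting the search collects the cheap, highly correlated item $y_1$ from many units and the expensive item $y_2$ from only a subsample, which is the mechanism behind the gains reported in Table 2.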

4. Balancing respondent burden and efficiency

In this section we consider how the balance between burden and efficiency can be explored through the concept of an effective sample size. Under a particular SQD allocation, $\mathbf{n}$, denote the variance of the estimate of a single parameter of interest to the design, $\phi$, by $v(\phi, \mathbf{n})$. The effective sample size of an SQD's estimate of $\phi$ based on the allocation $\mathbf{n}$ is equal to the sample size for which an SPD also has variance $v(\phi, \mathbf{n})$. Below we use the information matrices for $\boldsymbol{\beta}$ and $\boldsymbol{\pi}$, given by (4) and (7), respectively, to calculate the effective sample size for some data patterns.

Consider a situation where a survey planner decides to collect all data items from 100 units. In the past the response rates for such a sample were uncomfortably low which, the survey planner believes, is because of the high burden the survey imposes on its respondents. To address this issue, the survey planner is interested in selecting a further 100 units from which only some of the data items are collected. Consider the case where $K = 3$ and $l_k = 2$ for $k = 1, 2, 3$, which corresponds to a $2 \times 2 \times 2$ contingency table. Here we use the data (from Little, 1988, Table 9.8, p. 187) where the significant interactions are $y_1 y_2$ and $y_1 y_3$. Table 3 shows that when $y_1$ and $y_2$ are collected from 100 additional sample units the increases in the effective sample size for estimates of $\pi_1$, $\pi_3$, $\pi_5$ and $\pi_7$ are 96, 95, 80 and 85, respectively; however, the effective sample sizes for estimates of $\pi_2$, $\pi_4$ and $\pi_6$ are not noticeably increased. It is easy to see that if $\pi_1$, $\pi_3$, $\pi_5$ and $\pi_7$ are key parameters of interest then it may be beneficial to collect only $y_1$ and $y_2$ from some sample units, especially if $y_3$ is particularly burdensome. If, however, only $\pi_6$ is of interest, clearly most of the data patterns would be very inefficient choices for an SQD as they make only very small contributions to the effective sample size. If $\pi_6$ is burdensome and important to the design, an SQD may not be worthwhile and other ways of dealing with high non-response rates should be pursued (e.g. introduction of monetary incentives for respondents). Table 4 considers a similar set-up but for regression parameters, assuming the correlation matrix $\boldsymbol{\rho}$ and $K = 6$.
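The effective sample size calculation can be sketched for a small multinomial example. The code below is an illustration under stated assumptions, not a reproduction of Table 3: it uses a hypothetical $2 \times 2$ table with equal cell probabilities, and it evaluates the information loss in (2) directly as the expected conditional covariance of the unit-level score (6) given the set $S(y^{(j)})$, rather than through the case-by-case elements of (7). All function names are hypothetical.

```python
from itertools import product

def inverse(a):
    """Gauss-Jordan inverse of a small non-singular matrix."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def sqd_information(cells, pi, allocation):
    """Expected information for (pi_1, ..., pi_{Q-1}) via (2); the last cell is Q."""
    Q = len(cells)
    last = Q - 1
    n = sum(allocation.values())
    # Complete-data information: n (diag(1/pi_c) + 1/pi_Q)
    info = [[n * ((1.0 / pi[c] if c == d else 0.0) + 1.0 / pi[last])
             for d in range(last)] for c in range(last)]
    for observed, n_j in allocation.items():
        groups = {}   # cells sharing the same observed values form one set S
        for q, cell in enumerate(cells):
            groups.setdefault(tuple(cell[k] for k in observed), []).append(q)
        for members in groups.values():
            pi_s = sum(pi[q] for q in members)
            score = {q: [(1.0 / pi[c] if q == c else 0.0)
                         - (1.0 / pi[last] if q == last else 0.0)
                         for c in range(last)] for q in members}
            mean = [sum(pi[q] / pi_s * score[q][c] for q in members)
                    for c in range(last)]
            for c in range(last):
                for d in range(last):
                    cov = sum(pi[q] / pi_s * (score[q][c] - mean[c])
                              * (score[q][d] - mean[d]) for q in members)
                    info[c][d] -= n_j * pi_s * cov   # expected information loss
    return info

def effective_sample_sizes(cells, pi, allocation):
    """Sample size at which an SPD would match the SQD variance of each pi_c."""
    var = inverse(sqd_information(cells, pi, allocation))
    return [pi[c] * (1.0 - pi[c]) / var[c][c] for c in range(len(cells) - 1)]

cells = list(product((1, 2), repeat=2))     # (y1, y2) cells of a 2 x 2 table
pi = [0.25, 0.25, 0.25, 0.25]               # hypothetical cell probabilities
ess = effective_sample_sizes(cells, pi, {(0, 1): 100, (0,): 100})
```

In this hypothetical example, adding 100 units that report only $y_1$ to 100 complete units raises the effective sample size for $\pi_1$ from 100 to 120.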


Table 3
Effective sample size for $\tilde{\boldsymbol{\pi}}$ of data patterns of size 100.

Data items collected    $\pi_1$   $\pi_2$   $\pi_3$   $\pi_4$   $\pi_5$   $\pi_6$   $\pi_7$
y1, y2, y3              100       100       100       100       100       100       100
y1, y2                  96        0         95        3         80        4         85
y1, y3                  10        40        23        80        74        5         4
y2, y3                  17        50        78        73        21        26        2
y1                      9         0         21        2         60        2         3
y2                      15        0         73        1         18        0         2
y3                      60        8         1         47        0         3         0

Table 4
Effective sample size for $\boldsymbol{\beta}$ of data patterns of size 100.

Data items not collected    $\beta_{12\cdot\tilde{y}}$   $\beta_{13\cdot\tilde{y}}$   $\beta_{14\cdot\tilde{y}}$   $\beta_{15\cdot\tilde{y}}$   $\beta_{16\cdot\tilde{y}}$
No missing                  100       100       100       100       100
y6                          64        96        95        85        4
y4                          83        85        16        91        89
y3                          91        16        86        89        90
y3, y4                      75        13        13        83        83
y3, y4, y5, y6              41        6         6         19        1

5. Discussion

5.1. Is an SQD practical?

We now consider the applicability of SQDs and argue that they can be an efficient and practical approach to sample design. One possible concern is that allowing all $J$ data patterns to be considered in an SQD will make the sample design, questionnaire design and analysis too complicated. This concern can be addressed in two ways. First, one additional pattern can significantly improve the efficiency of a design. It has been well established that a Two-Phase Design (a restricted SQD with $J = 2$) can be significantly more efficient than an SPD ($J = 1$) and, in the case of estimating population totals for $K = 2$, that an SQD ($J = 3$) can be significantly more efficient than a Two-Phase Design (see Chipperfield and Steel, 2009). Ranking the relative efficiency of the $J$ patterns is discussed in Section 5.3. Second, practical issues can quickly restrict the set of data patterns under consideration. For example, if two data items require invasive measurements (e.g. require a blood sample and a costly examination) then a reasonable approach would be to only consider data patterns which collect one, both or none of these items for an SQD. To maximise response rates from respondents who are asked to report on both data items, monetary incentives, special interviewer procedures and well-trained interviewers could be used. For the remaining respondents a less costly approach could be used.

Another issue is that for regression modelling analysts have a range of different models (and hence regression coefficients) in which they are interested, so designing an SQD for a specific model does not make sense. Consider the situation of designing an SQD collecting two relatively expensive data items which, for a range of models, would tend to be used as independent variables. The expense of collecting these items could be such that data patterns collecting both, one or none of these data items could be efficient for a range of models.
Again, an approach similar to the one in Section 4 can be taken to make this assessment.

Another concern is that the optimal SQD allocation problem requires a number of unknown design parameters (for means see Section 3). Any sample design (including SPDs) requires estimates of design parameters. These estimates are typically obtained from pilot samples or similar surveys. As there will naturally be some degree of error in these estimates, it is advisable to consider the sensitivity of the design's optimum to such errors. This is discussed in Section 5.2.

5.2. Sensitivity of optimum to the design parameters

At the design stage some design parameters are likely to be unknown, and so would need to be estimated. As in any sample design it is important to appreciate the uncertainty in the estimated design parameters through sensitivity analysis. Table 5 gives the effective sample size for estimates of regression parameters when $K = 3$ with different correlation matrices, $\boldsymbol{\rho} = (\rho_{12}, \rho_{13}, \rho_{23})$, for a given allocation. This allows us to see how sensitive the effective sample size is to $\boldsymbol{\rho}$. This is important because the correlations are not known at the design stage and so must be estimated, either through a pilot study or using data from a similar survey. For example, Table 5 shows that for an allocation $n^{(3)} = 50$, $n^{(6)} = 50$ and $n^{(7)} = 100$ the effective sample sizes for $\beta_2$ and $\beta_3$ range between 142–154 and 155–168, respectively, for the different correlations considered. This highlights the importance of reasonably good estimates of $\boldsymbol{\rho}$ at the design stage. In some cases the effective sample size will be too sensitive to the uncertainty in the design parameters. In other cases the cost and respondent burden constraints will outweigh such concerns about sensitivity. A survey planner will need to make a judgement about this on a case-by-case basis.

Table 5
Effective sample size, by allocation $A = (n^{(3)}, n^{(6)}, n^{(7)})$.

Correlation $\boldsymbol{\rho}$   Parameter    (0,0,100)   (0,50,100)   (50,0,100)   (50,50,100)
(0.58, 0.6, 0.28)                 $\beta_2$    100         120          121          154
(0.5, 0.7, 0.3)                   $\beta_2$    100         115          125          142
(0.54, 0.65, 0.3)                 $\beta_2$    100         118          123          148
(0.58, 0.6, 0.28)                 $\beta_3$    100         134          131          155
(0.5, 0.7, 0.3)                   $\beta_3$    100         138          128          168
(0.54, 0.65, 0.3)                 $\beta_3$    100         136          130          160

5.3. Reducing the number of data patterns under consideration

In general it is impractical for all $J$ possible patterns to be considered at the design stage, unless $J$ is small. Next we consider one way to rank the relative efficiency of each of the $J$ patterns. If only $J_0$ patterns are to be considered, where $J_0 \ll J$, then only the $J_0$ patterns with the highest rank need be considered. We suggest the following efficiency measure for pattern $j$ when designing for a vector of parameters $\boldsymbol{\theta} = (\theta_k)$:

$$E^{(j)}_{\theta} = \sum_k n^{\theta}_k\, \mathrm{Info}_u(\boldsymbol{\theta}; d_0)^{(k,k)(j)} / c^{(j)}$$

where $\mathrm{Info}_u(\boldsymbol{\theta}; d_0)^{(k,k)(j)}$ is the contribution of a single unit (emphasised by the subscript $u$) with data pattern $j$ to the $k$th diagonal element of the expected information matrix of $\boldsymbol{\theta}$. For example, when $\boldsymbol{\theta} = \boldsymbol{\beta}$ then $\mathrm{Info}_u(\boldsymbol{\beta}; d_0)^{(k,k)(j)} = \sigma^{-2}_{11\cdot\tilde{y}}\, \sigma^2_{kk} - \sigma^{-4}_{11\cdot\tilde{y}}\, \Lambda^{(j)}_{\beta\beta}(k,k)$. Here $n^{\theta}_k$ is the sample size under an SPD that would be required to meet the constraints on estimates of the $k$th parameter in $\boldsymbol{\theta}$. So $n^{\theta}_k$ aims to reflect the different accuracy constraints on the parameters and $c^{(j)}$ aims to reflect the different data patterns' collection costs.

Practical considerations, such as questionnaire sequencing, will also restrict the set of patterns under consideration. For example, the question "do you smoke" can be followed up with "how many cigarettes do you smoke". Because it is not meaningful to ask the second question without also asking the first, the two questions should be in the same block, where the questions in a block are either all collected or not collected at all.
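The ranking measure $E^{(j)}_{\theta}$ can be sketched for the means of $K = 2$ items. This is an illustrative example only: the function names, the covariance matrix, the item costs and the required sample sizes are all hypothetical, pattern costs are taken as sums of item costs, and the single-unit information for a pattern is taken to be the inverse covariance of the items it observes, which agrees with the per-unit form of (3).

```python
def unit_information_diag(sigma, pattern):
    """Diagonal of one unit's contribution to the information for (mu_1, mu_2)."""
    (s11, s12), (_, s22) = sigma
    if pattern == (1,):
        return (1.0 / s11, 0.0)
    if pattern == (2,):
        return (0.0, 1.0 / s22)
    det = s11 * s22 - s12 * s12
    return (s22 / det, s11 / det)

def efficiency(sigma, pattern, item_costs, n_required):
    """E^(j) = sum_k n_k * Info_u^{(k,k)(j)} / c^(j)."""
    cost = sum(item_costs[k] for k in pattern)
    diag = unit_information_diag(sigma, pattern)
    return sum(n_k * info for n_k, info in zip(n_required, diag)) / cost

sigma = [[1.0, 0.9], [0.9, 1.0]]       # hypothetical covariance matrix
item_costs = {1: 1.0, 2: 10.0}         # hypothetical c_1, c_2
n_required = (100, 100)                # hypothetical SPD sample sizes n_k

patterns = [(1,), (2,), (1, 2)]
ranking = sorted(patterns,
                 key=lambda p: efficiency(sigma, p, item_costs, n_required),
                 reverse=True)
```

With these made-up inputs the cheap item alone ranks first, the complete pattern second and the expensive item alone last, so a planner restricting attention to $J_0 = 2$ patterns would keep the first two.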

Acknowledgements

The authors would like to thank the Australian Bureau of Statistics for supporting this research and Professor Ray Chambers for some useful comments.

References

Agresti, A., 1996. Analysis of Categorical Data. John Wiley and Sons.
Beale, E.M.L., Little, R.J.A., 1975. Missing values in multivariate analysis. Journal of the Royal Statistical Society 37, 129–145.
Breckling, J.U., Chambers, R.L., Dorfman, A.H., Tam, S.M., Welsh, A.H., 1994. Maximum likelihood inference from sample survey data. International Statistical Review 62, 349–363.
Chipperfield, J.O., Steel, D.G., 2009. Design and estimation for split questionnaire surveys. Journal of Official Statistics 25, 227–244.
Cochran, W.G., 1977. Sampling Techniques. John Wiley and Sons.
Little, R.J.A., 1988. Robust estimation of the mean and covariance matrix from data with missing values. Applied Statistics 37, 23–38.
Little, R.J.A., 1992. Regression with missing X's: a review. Journal of the American Statistical Association 87, 1227–1237.
Munger, G., Lloyd, B.H., 1988. The use of multiple matrix sampling for survey research. Journal of Experimental Education 56, 187–191.
Raghunathan, T.E., Grizzle, J.E., 1995. A split questionnaire design. Journal of the American Statistical Association 90, 54–63.
Rubin, D.B., Little, R.J.A., 1987. Statistical Analysis of Missing Data. John Wiley and Sons.
Rubin, D.B., Little, R.J.A., 2002. Statistical Analysis of Missing Data, second ed. John Wiley and Sons.
Särndal, C., Swensson, B., Wretman, J., 1992. Model Assisted Survey Sampling. Springer-Verlag.
Shoemaker, D.M., 1973. Principles and Procedures of Multiple Matrix Sampling. Ballinger, USA.
Skinner, C., Holt, D., Smith, T.M.F., 1989. Analysis of Complex Surveys. John Wiley and Sons.