Journal of Statistical Planning and Inference 7 (1983) 307-316 North-Holland
SCRAMBLED RANDOMIZED RESPONSE METHODS FOR OBTAINING SENSITIVE QUANTITATIVE DATA
Benjamin H. EICHHORN Rider College, Lawrenceville, NJ 08648, USA
Lakhbir S. HAYRE University of Pennsylvania, Philadelphia, PA 19104, USA
Received 30 July 1981
Recommended by M.L. Puri
Abstract: We study a multiplicative randomized response method for obtaining responses to sensitive questions when the answers are quantitative. The method involves the respondent multiplying his sensitive answer by a random number from a known distribution, and giving the product to the interviewer, who does not know the value of the random number and thus receives a scrambled response. Some particular distributions for the random scrambling number are proposed and studied, and ways of generating the scrambling numbers are discussed. Some modifications for increasing the efficiency of the method are proposed, and numerical results are given that show the scrambled response method is generally superior to the previously used method of randomizing questions. AMS 1970 Subject Classification: 62D05. Key words and phrases: Multiplicative randomized responses; Scrambled responses; Unrelated questions.
0378-3758/83/$03.00 © 1983 Elsevier Science Publishers B.V. (North-Holland)
1. Introduction
The randomized response method for estimating the proportion of people with a sensitive characteristic has been extensively studied since its introduction by Warner (1965). The object is to reduce the frequency of false answers by giving the respondent two questions, one of which is the sensitive one. The respondent then selects one of the questions using some specified randomization device. The interviewer does not know which question has been selected, and so does not know whether the answer he receives is for the sensitive or non-sensitive question. It is hoped that this will make the respondent less likely to give a false answer. A survey of the field may be found in the paper by Horvitz et al. (1975).
Greenberg et al. (1971) extended the method to the case where the responses to the sensitive question are quantitative, rather than a simple ‘Yes’ or ‘No’. The
respondent selects, by means of a randomization device, one of two questions: the sensitive one, and an unrelated question, the answers to which are of about the same order of magnitude as for the sensitive question.
However, several difficulties arise when using this unrelated question method. The main one is in choosing the unrelated question. As Greenberg et al. (1971) note, it is essential that the mean and variance of the responses to the unrelated question be close to those for the sensitive question; otherwise, it will often be possible to recognize from the response which question was selected. However, the mean and variance of the responses to the sensitive question are unknown, making it difficult to choose a good unrelated question.
A second difficulty is that in some cases the answers to the unrelated question may be more rounded or regular, making it possible to recognize which question was answered. For example, Greenberg et al. (1971) considered the sensitive question: “How much money did the head of this household earn last year?” This was paired with the question: “How much money do you think the average head of a household of your size earns in a year?” An answer such as $26 350 is more likely to be in response to the sensitive question, while an answer such as $18 618 is almost certainly in response to the sensitive question.
A third difficulty is that some people will be hesitant in disclosing their answer to the sensitive question (even though they know that the interviewer cannot be sure that the sensitive question was selected). For example, some respondents may not want to reveal their income even though they know that the interviewer can be only partly certain that the figure given is the respondent’s income.
In this article we study an alternative method for obtaining quantitative responses which is not subject to these difficulties.
This involves the respondent multiplying his answers to the sensitive question by a random number from a known distribution. The product is given to the interviewer, who does not know the value of the random number. This method, which we will denote as scrambled responses, is a special case of the general linear randomized response model introduced by Warner (1971), which incorporated all randomized response models as special cases. This particular special case was mentioned but not studied by Warner (1971), and it was briefly discussed by Pollock and Bek (1976). We describe the method in Section 2. In Section 3, we discuss the choice of the distribution of the random multiplying number (or scrambling variable), and consider ways of generating this variable. In Section 4 we propose two ways of modifying the procedure to improve its efficiency. Finally, in Section 5, numerical comparisons are made with the unrelated question method, and these suggest that the scrambled responses method is generally more efficient.
2. Scrambled responses
Let \(X\) denote the answer to the sensitive question, and let \(S\) be a random variable independent of \(X\) and with finite mean and variance. We assume that \(X \ge 0\) and \(S > 0\). The situation where \(X\) can take negative as well as positive values is briefly discussed in Section 4. The respondent generates \(S\) using some specified method and multiplies his sensitive answer \(X\) by \(S\). The interviewer receives the scrambled answer \(Y = XS\). The particular values of \(S\) are unknown to the interviewer, but its distribution is known.
Let \(\theta = E(S)\), \(\gamma^2 = V(S)\), \(\mu = E(X)\), and \(\sigma^2 = V(X)\), where \(\theta\) and \(\gamma^2\) are known and \(\mu\) and \(\sigma^2\) are unknown. Then \(E(Y) = E(XS) = E(X)E(S) = \mu\theta\) and
\[ V(Y) = E(X^2)E(S^2) - (EX)^2(ES)^2 = \gamma^2 E(X^2) + \sigma^2\theta^2 = \gamma^2(\sigma^2 + \mu^2) + \sigma^2\theta^2. \tag{1} \]
Hence, if we have observations \(y_1, \ldots, y_n\), an unbiased estimator for \(\mu\) is
\[ \hat\mu = \bar{y}/\theta \tag{2} \]
and, from (1),
\[ V(\hat\mu) = \frac{1}{n\theta^2} V(Y) = \frac{1}{n}\left(\sigma^2 + \gamma^2 E(X^2)/\theta^2\right) = \frac{1}{n}\left(\sigma^2 + \rho^2 E(X^2)\right), \tag{3} \]
say, where \(\rho^2 = (\gamma/\theta)^2\). A confidence interval for \(\mu\) can be obtained if we estimate \(V(Y)\) by
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2. \]
In most applications \(n\) will be at least 30, so assuming \(\hat\mu\) is approximately normal, a \(100(1-\alpha)\%\) confidence interval for \(\mu\) is approximately
\[ \hat\mu \pm z_{\alpha/2}\,\frac{s}{\theta\sqrt{n}}, \tag{4} \]
where \(z_{\alpha/2}\) is the \((1-\alpha/2)\)-th percentage point of the standard normal distribution.
Before discussing the choice of \(S\), we note that it is reasonable to expect fewer false or evasive responses using this method rather than the unrelated question method, as the respondent does not have to reveal his answer to the sensitive question (since we will generally have \(P(S = 1) = 0\)). Also, only one sample is required, instead of the two generally necessary for the unrelated question method, and each observation taken gives information about the desired variable \(X\).
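The estimator (2) and the interval (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `scrambled_estimate`, the uniform distributions used to make up the data, and the choice of the \(n-1\) divisor for \(s^2\) are not taken from the paper.

```python
import math
import random
import statistics


def scrambled_estimate(ys, theta, z=1.96):
    """Estimate mu from scrambled responses y = x * s, following (2) and (4).

    theta is the known mean of the scrambling variable S; z is the normal
    percentage point (1.96 for an approximate 95% interval).
    """
    n = len(ys)
    mu_hat = statistics.fmean(ys) / theta          # estimator (2)
    s = statistics.stdev(ys)                       # sample s.d. of Y (n - 1 divisor)
    half_width = z * s / (theta * math.sqrt(n))    # half-width of interval (4)
    return mu_hat, (mu_hat - half_width, mu_hat + half_width)


# Hypothetical data: X ~ Uniform(20, 80), so mu = 50;
# S ~ Uniform(0.5, 1.5), so theta = 1. Neither choice comes from the paper.
rng = random.Random(0)
ys = [rng.uniform(20, 80) * rng.uniform(0.5, 1.5) for _ in range(5000)]
mu_hat, (lo, hi) = scrambled_estimate(ys, theta=1.0)
print(f"mu_hat = {mu_hat:.2f}, approximate 95% interval = ({lo:.2f}, {hi:.2f})")
```

Only the products `ys` and the known mean `theta` enter the calculation, which mirrors the interviewer's position: the individual values of \(S\) are never needed.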
3. Choice of the scrambling distribution
3.1. General considerations
In choosing the distribution of \(S\), we have a number of often conflicting objectives. To keep the frequency of evasive answers low, \(S\) should be able to assume a wide range of values with high probability. To estimate \(\mu\) as accurately as possible, we should, from (3), make \(\rho = \gamma/\theta\) as small as possible. Furthermore, it should be possible for the respondent to generate values of \(S\) in a way which is reasonably simple and which assures him that the interviewer cannot know the value generated. Finally, it seems reasonable to require that median\((S) = 1\). In Section 3.2, we consider a discrete distribution and a continuous distribution which satisfy these requirements. In Section 3.3, we discuss a measure of the protection offered to a respondent by a given scrambling distribution. In Section 3.4, some methods of generating \(S\) values are discussed.
3.2. Two particular distributions
Suppose we take \(S\) to be the random variable with
\[ P(S = n) = P(S = n^{-1}) = 2^{-n}, \quad \text{for } n = 2, 3, \ldots \tag{5} \]
This distribution can be generated as follows: the respondent tosses a fair coin until he has obtained both a head and a tail. The probability that \(n\) tosses will be required is \(2^{-(n-1)}\) for \(n \ge 2\), and if we set \(S = n\) if the first \(n - 1\) tosses were heads and \(S = n^{-1}\) if the first \(n - 1\) were tails, then \(S\) has the distribution (5). An alternative method of generating \(S\) is to have a bag with equal numbers of, say, red balls and blue balls, with the respondent drawing balls with replacement until both colors have been drawn. Now,
\[ \theta = \sum_{n=2}^{\infty} (n + n^{-1})\,2^{-n} = 1 + \log 2, \]
using binomial and log expansions and straightforward algebra. To calculate \(\gamma^2 = V(S)\), we note that
\[ E(S^2) = \sum_{n=2}^{\infty} (n^2 + n^{-2})\,2^{-n}. \]
Again using binomial expansions and algebra, \(\sum_{n=2}^{\infty} n^2 2^{-n} = 11/2\), while simple calculation yields \(\sum_{n=2}^{\infty} n^{-2} 2^{-n} \approx 0.08\) to two decimal places. Hence
\[ \gamma^2 = E(S^2) - \theta^2 \approx 5.58 - (1.6931)^2 = 2.7134, \]
so that
\[ \rho^2 = (\gamma/\theta)^2 = 0.9465. \tag{6} \]
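The coin-tossing scheme and the moments quoted in (6) can be checked with a short Python sketch. The function names and the truncation of the series at 200 terms are illustrative choices, not part of the paper. Summing the series without intermediate rounding gives \(\rho^2\) of about 0.947; the value 0.9465 in (6) corresponds to rounding \(E(S^2)\) to 5.58 first.

```python
import math
import random


def coin_toss_scrambler(rng):
    """Toss a fair coin until both faces have appeared; return S as in (5).

    If n tosses were needed, S = n when the first n - 1 were heads and
    S = 1/n when the first n - 1 were tails.
    """
    first_is_head = rng.random() < 0.5
    n = 2
    while (rng.random() < 0.5) == first_is_head:   # same face again: keep tossing
        n += 1
    return float(n) if first_is_head else 1.0 / n


def series_moments(terms=200):
    """theta, gamma^2 and rho^2 for distribution (5) by direct summation."""
    theta = sum((n + 1.0 / n) * 2.0 ** -n for n in range(2, terms))
    second = sum((n * n + 1.0 / (n * n)) * 2.0 ** -n for n in range(2, terms))
    gamma2 = second - theta ** 2
    return theta, gamma2, gamma2 / theta ** 2


theta, gamma2, rho2 = series_moments()
rng = random.Random(0)
draws = [coin_toss_scrambler(rng) for _ in range(50000)]
sample_mean = sum(draws) / len(draws)
print(f"theta = {theta:.4f}, gamma^2 = {gamma2:.4f}, rho^2 = {rho2:.4f}")
print(f"simulated mean of S = {sample_mean:.3f}")
```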
The distribution (5) is simple to generate, has median close to 1, and has a reasonable looking value of \(\rho^2\). In 15 out of 16 cases, on the average, \(S\) will take one of the values 1/5, 1/4, 1/3, 1/2, 2, 3, 4, and 5, and this seems a wide enough range to reassure most respondents. However, if a larger number of likely values for \(S\) is desired, we can consider a continuous distribution. One candidate is an \(F\) distribution with (5, 5) degrees of freedom. For this distribution,
\[ \theta = 5/3, \quad \gamma^2 = 80/9, \quad \rho^2 = 3.2. \tag{7} \]
\(S\) has median equal to 1, and the probability that \(S\) is between 1/5 and 5 is about 0.90.
3.3. A measure of protection
To compare scrambling distributions with respect to the protection they give to the respondent, we may consider the following protection measure. For a given scrambling distribution, let \([a, b]\) be a \(100(1-\alpha)\%\) confidence interval for \(S\). Then if \(Y = XS\) is the observed response, \([Y/b, Y/a]\) is a \(100(1-\alpha)\%\) confidence interval for \(X\). The ratio \(b/a\) together with the probability \(\alpha\) provide a measure of the protection given to a respondent; for a given \(\alpha\), the larger the ratio, the greater the protection. If the scrambling distribution is discrete, then \(S\) cannot take every value in \([a, b]\), and this should be taken into account when comparing discrete and continuous distributions. As an example, for the \(F_{5,5}\) distribution,
for \(\alpha = 0.1\), \(b/a = 25\),
for \(\alpha = 0.2\), \(b/a = 11.9\),
for \(\alpha = 0.5\), \(b/a = 3.6\),
which seems adequate protection in most cases.
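The protection ratio \(b/a\) can be approximated by simulation when quantile tables are not to hand. The sketch below assumes an equal-tailed interval and draws \(F_{5,5}\) values as a ratio of two independent chi-square(5) variables, each divided by its degrees of freedom; the function names and the sample size are illustrative. The results are Monte Carlo approximations and should land close to the values quoted above.

```python
import random


def f55_sample(rng):
    """One draw from F(5, 5): ratio of independent chi-square(5)/5 variables.

    A chi-square variable with 5 degrees of freedom is Gamma(shape 2.5, scale 2).
    """
    numerator = rng.gammavariate(2.5, 2.0) / 5.0
    denominator = rng.gammavariate(2.5, 2.0) / 5.0
    return numerator / denominator


def protection_ratio(samples, alpha):
    """b/a for the equal-tailed empirical 100(1 - alpha)% interval [a, b]."""
    xs = sorted(samples)
    n = len(xs)
    a = xs[int(n * alpha / 2)]
    b = xs[int(n * (1 - alpha / 2)) - 1]
    return b / a


rng = random.Random(0)
samples = [f55_sample(rng) for _ in range(100000)]
ratios = {alpha: protection_ratio(samples, alpha) for alpha in (0.1, 0.2, 0.5)}
for alpha, ratio in ratios.items():
    print(f"alpha = {alpha}: b/a is roughly {ratio:.1f}")
```

The same `protection_ratio` function can be applied to draws from any other candidate scrambling distribution, bearing in mind the caveat above about discrete distributions.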
3.4. Generating the values of S
A major practical problem in using a continuous distribution such as \(F_{5,5}\) is generating values of \(S\) in such a way that the respondent will believe its properties and feel protected. A calculator can be programmed to generate an indefinite sequence of random values from the \(F_{5,5}\) or any other distribution. This process can be demonstrated to and tried out by the respondent until he feels secure with it. Then the calculator is locked by setting a counting device to 0, the respondent generates an \(S\) value, multiplies it by \(X\), and gives the product \(XS\) to the interviewer. The locking is needed to ensure that the respondent does not generate successive \(S\) values until he finds an \(S\) he likes, but takes the next randomly generated value after the demonstration is over.
A simpler method is the following, which gives an approximation to any distribution \(G\). We have an urn with, say, 999 marbles inscribed with the values \(S_i = G^{-1}(i/1000)\), \(i = 1, \ldots, 999\). The respondent randomly chooses a marble, notes the value \(S_i\) on the marble, and multiplies his \(X\) answer by \(S_i\) before giving it to the interviewer.
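The urn construction amounts to tabulating the inverse distribution function on an even grid. A minimal Python sketch follows. Because the standard library has no inverse \(F\) distribution function, the example uses a log-normal \(G\) with median 1 purely as a stand-in; that distribution, the parameter `sigma` and the function names are assumptions made for illustration and do not appear in the paper.

```python
import math
import random
from statistics import NormalDist


def urn_values(inv_cdf, size=999):
    """Marble values S_i = G^{-1}(i / (size + 1)), i = 1, ..., size."""
    return [inv_cdf(i / (size + 1)) for i in range(1, size + 1)]


def lognormal_inv_cdf(u, sigma=1.0):
    """Inverse c.d.f. of a log-normal with median 1 (hypothetical choice of G)."""
    return math.exp(sigma * NormalDist().inv_cdf(u))


urn = urn_values(lognormal_inv_cdf)

# The respondent draws one marble at random and reports x * s.
rng = random.Random(1)
x = 120.0                      # made-up sensitive answer
s = rng.choice(urn)
print(f"marble value s = {s:.3f}, reported product = {x * s:.1f}")
```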
4. Methods for reducing \(V(\hat\mu)\)
In this section we propose two methods of reducing the variance of \(\hat\mu\). First, note that, from (3),
\[ V(\hat\mu) = n^{-1}\left(\sigma^2 + \rho^2(\sigma^2 + \mu^2)\right). \]
It is clear that if \(\mu = E(X)\) is large, then \(V(\hat\mu)\) may be large even if \(\rho^2\) is small. Suppose we replace \(X\) by \(X - A\); that is, the respondent subtracts \(A\) from \(X\), then multiplies by \(S\), so that the interviewer receives the answer \(Y = S(X - A)\). Then we estimate \(\mu\) by \(\hat\mu = A + \bar{y}/\theta\), and now
\[ V(\hat\mu) = n^{-1}\left(\sigma^2 + \rho^2 E[(X - A)^2]\right). \tag{8} \]
A judicious choice of \(A\) can considerably reduce \(V(\hat\mu)\); in fact, if we could choose \(A = \mu\), then \(V(\hat\mu) = n^{-1}(\sigma^2 + \rho^2\sigma^2)\), so that the variance has been reduced by \(\rho^2\mu^2/n\), which may be a considerable saving if \(\mu^2\) is large. In practice, of course, we do not know \(\mu\), but a good guess from prior information can still lead to considerable savings.
In replacing \(X\) by \(X - A\), we may face one difficulty. If \(S\) can take only positive values, as was the case in Section 3, then if \(Y > 0\), we know that \(X > A\), and if \(Y < 0\), we know that \(X < A\). (Similarly, if \(X\) can take negative as well as positive values and \(S > 0\), the sign of \(Y\) tells us the sign of \(X\).) This difficulty can be removed by allowing \(S\) to assume negative values. Suppose we use the scrambling variable
\[ S^* = IS, \]
where \(S > 0\) with median\((S) = 1\), and \(I = 1\) with probability \(p\) and \(I = -1\) with probability \(1 - p\), where \(p > 1/2\). Using simple algebra, we obtain
\[ \theta^* = E(S^*) = (2p - 1)\theta, \qquad \gamma^{*2} = V(S^*) = \gamma^2 + 4p(1 - p)\theta^2, \]
so that
\[ \rho^{*2} = \gamma^{*2}/\theta^{*2} = (2p - 1)^{-2}\left[\rho^2 + 4p(1 - p)\right]. \tag{9} \]
Comparing (3) with (8), there is a reduction in \(V(\hat\mu)\) if \(\rho^{*2} E[(X - A)^2] < \rho^2 E(X^2)\); with \(A = \mu\), this is equivalent to
\[ (\mu/\sigma)^2 > 4(1 + \rho^{-2})\,p(1 - p)/(2p - 1)^2. \tag{10} \]
For example, if \(p = 3/4\), then (10) becomes
\[ (\mu/\sigma)^2 > 3(1 + \rho^{-2}). \]
It is clear that there is an improvement only if \(\mu\) is several times larger than \(\sigma\). An alternative to replacing \(S\) by \(S^* = IS\) is to choose \(A\) so that the chances of \(X\) being less than \(A\) are negligible. This is feasible when \(\mu\) is much larger than \(\sigma\). Of course, if whether \(X > A\) or \(X < A\) is not considered sensitive information, we should always replace \(X\) by \(X - \mu_0\), where \(\mu_0\) is a prior estimate for \(\mu\), and use a positive scrambling variable. This may be the case, for example, with the question on incomes; people may not mind revealing whether their incomes are below average or above average, although they do not want to reveal the exact amount.
The second method we consider is taking several observations on each respondent. Respondent \(i\) generates random numbers \(S_{i1}, \ldots, S_{im}\) from the distribution of \(S\), and gives the \(m\) answers \(X_iS_{i1}, X_iS_{i2}, \ldots, X_iS_{im}\), where \(X_i\) is his answer to the sensitive question, and \(m\) is a positive integer. There are two reasons for taking several observations; firstly, as we see below, \(V(\hat\mu)\) is considerably reduced, and secondly, if \(m\) is small, say 2, 3, or 4, the respondent may actually feel more protected giving several scrambled responses rather than one, in the belief that this will make “guessing” more difficult.
Let \(y_{ij} = X_iS_{ij}\), for \(i = 1, \ldots, n\) and \(j = 1, \ldots, m\). Let \(\bar{y}_i = (y_{i1} + \cdots + y_{im})/m = X_i\bar{S}_i\), say. We use \(\bar{y}_i\) as our \(i\)th observation, so that the scrambling variable is \(\bar{S}_i\), which has mean \(\theta\) and variance \(\gamma^2/m\). Hence, from (3),
\[ V(\hat\mu) = n^{-1}\left(\sigma^2 + m^{-1}\rho^2 E(X^2)\right). \]
If we replace \(X\) by \(X - A\) and use \(S^*\) in place of \(S\), then
\[ V(\hat\mu) = n^{-1}\left(\sigma^2 + m^{-1}\rho^{*2} E[(X - A)^2]\right), \tag{11} \]
where \(\rho^*\) is given in (9). There is clearly an advantage in using \(m > 1\); however, if \(m\) is greater than 4 or 5, the respondent may feel uneasy in giving so many answers. A value of \(m\) around 3 seems best. Furthermore, having \(m > 1\) allows us to choose a scrambling distribution with a larger value of \(\rho^2\), and thus more reassurance for the respondents.
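The two modifications of this section can be combined in a small simulation: each respondent subtracts \(A\), then reports \(m\) products scrambled by a signed variable \(IS\). The sketch below is illustrative only. It assumes the coin-tossing distribution of Section 3.2 for \(S\), a fresh sign \(I\) for every one of the \(m\) answers, and made-up uniform answers \(X\); all function names are hypothetical. It also evaluates the right-hand side of (10).

```python
import math
import random

THETA = 1.0 + math.log(2.0)   # E(S) for the coin-tossing distribution of Section 3.2


def coin_toss_s(rng):
    """Positive scrambler: toss a fair coin until both faces have appeared."""
    first_is_head = rng.random() < 0.5
    n = 2
    while (rng.random() < 0.5) == first_is_head:
        n += 1
    return float(n) if first_is_head else 1.0 / n


def respondent_answers(x, a, p, m, rng):
    """The m scrambled answers (x - a) * I * S given by one respondent."""
    answers = []
    for _ in range(m):
        sign = 1.0 if rng.random() < p else -1.0   # I = 1 w.p. p, -1 otherwise
        answers.append((x - a) * sign * coin_toss_s(rng))
    return answers


def estimate_mu(all_answers, a, p):
    """mu_hat = A + y_bar / theta*, with theta* = (2p - 1) * theta."""
    per_respondent = [sum(ans) / len(ans) for ans in all_answers]
    y_bar = sum(per_respondent) / len(per_respondent)
    return a + y_bar / ((2.0 * p - 1.0) * THETA)


def min_mu_over_sigma_sq(rho2, p):
    """Right-hand side of (10): smallest (mu/sigma)^2 for which S* with A = mu helps."""
    return 4.0 * (1.0 + 1.0 / rho2) * p * (1.0 - p) / (2.0 * p - 1.0) ** 2


# Hypothetical survey: X ~ Uniform(40, 60) so mu = 50; prior guess A = 45.
rng = random.Random(7)
A, P, M = 45.0, 0.8, 3
xs = [rng.uniform(40.0, 60.0) for _ in range(20000)]
answers = [respondent_answers(x, A, P, M, rng) for x in xs]
mu_hat = estimate_mu(answers, A, P)
print(f"mu_hat = {mu_hat:.2f}")
print(f"threshold in (10) for rho^2 = 1, p = 3/4: {min_mu_over_sigma_sq(1.0, 0.75):.1f}")
```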
5. Comparison with the unrelated question method
In this section, \(V(\hat\mu)\) is compared for the scrambled response method and the unrelated question method of Greenberg et al. (1971). The different parameters involved make it difficult to give absolute comparisons, but we have attempted to make the situations comparable. A third possible method is the additive model, where a random term is added to the sensitive answer. Pollock and Bek (1976) have given a comparison of the additive and unrelated question methods. We will not consider the additive model here; although it has the advantage that \(V(\hat\mu)\) does not depend on \(\mu\), it has the disadvantages that extreme values of \(X\) will be relatively unaffected by the addition of a random term, and the magnitude of \(X\) will still be known.
For the scrambled response method, we took \(\rho^2 = 1\). The discrete example in Section 3.2 shows that it is possible to choose a reasonable scrambling distribution with \(\rho^2 < 1\), so \(\rho^2 = 1\) seems a justifiable choice. We assume that \(X\) is replaced by \(X - A\), where \(A\) is a prior estimate for \(\mu\). Let \(\mu - A = d\sigma\), where \(d\) is a constant. Let \(\hat\mu_S\) be the estimate obtained using a positive \(S\), and let \(\hat\mu_{S^*}\) be the estimate if we use \(S^* = IS\), where \(I = 1\) with probability \(p\) and \(I = -1\) with probability \(1 - p\), where \(0.5 < p < 1\). As discussed in Section 4, in many cases we may be able to use \(S\) even though \(X\) is replaced by \(X - A\), as the information that \(X > A\) or \(X < A\) may not be considered sensitive. Then
\[ V(\hat\mu_S) = \left[1 + m^{-1}\rho^2(1 + d^2)\right]\sigma^2/n, \]
\[ V(\hat\mu_{S^*}) = \left[1 + m^{-1}(2p - 1)^{-2}\left(\rho^2 + 4p(1 - p)\right)(1 + d^2)\right]\sigma^2/n. \]
For the unrelated question method, let \(\mu_y\) and \(\sigma_y^2\) be the mean and variance of the responses to the unrelated question. Let \(\sigma_y = k\sigma\) and \(\mu - \mu_y = c\sigma\), where \(k > 0\) and \(c\) are constants. In general, \(\sigma_y\) and \(\mu_y\) are unknown, and two samples will be needed. Let these be of sizes \(n_1\) and \(n_2\), with \(n_1 + n_2 = n\). Let \(p_i\) be the probability that the sensitive question is selected by a respondent in sample \(i\), \(i = 1, 2\). Greenberg et al. (1971) recommend that we should have \(p_1 + p_2 = 1\), \(n_1/n_2 = p_1/p_2\), with \(p_1\) (or \(p_2\)) between 0.70 and 0.80. Let \(\hat\mu_R\) be the estimate obtained using the unrelated question method. Then from equations (3.4) and (3.5) in Greenberg et al. (1971), with \(p_1 + p_2 = 1\) and \(n_1/n_2 = p_1/p_2\), we obtain
\[ V(\hat\mu_R) = \frac{p_1^2 + p_2^2 + p_1p_2(2k^2 + c^2)}{(p_1 - p_2)^2}\,\frac{\sigma^2}{n}. \tag{12} \]
Table 1 below gives values of \(V(\hat\mu)\) for the scrambled responses and unrelated question methods, for selected values of the parameters. As can be seen from Table 1, the scrambled responses method using \(S > 0\) generally leads to much smaller values of \(V(\hat\mu)\), especially if \(m\), the number of answers from each respondent, is 2 or 3. Even if we use \(S^*\), \(V(\hat\mu)\) is still usually much smaller if \(m\) is 2 or 3. As \(k\) decreases, \(V(\hat\mu)\) decreases for the unrelated question method (the scrambled responses method, of course, does not require guessing \(\sigma^2\)), but if \(k\) is much less than 1, it will often be possible to guess which question was selected. For the most meaningful comparison with the scrambled responses method, we should use the \(k = 1\) row.
Table 1
Variance of \(\hat\mu\). This table gives values of C, where \(V(\hat\mu) = C\sigma^2/n\). For scrambled responses, \(\rho^2 = 1\); S and S* denote the scrambling variable used; m is the number of answers taken from each respondent.

         Unrelated question method            Scrambled responses method
c = d    k       p1 = 0.70   0.75    0.80     m    S       S*, p = 0.90   S*, p = 0.80
0        0.75    5.10        3.34    2.39     1    2.00    3.13           5.55
         1.00    6.25        4.00    2.78     2    1.50    2.06           3.28
         1.25    7.73        4.84    3.28     3    1.33    1.71           2.52
0.5      0.75    5.43        3.53    2.50     1    2.25    3.66           6.70
         1.00    6.58        4.19    2.89     2    1.63    2.33           3.85
         1.25    8.05        5.03    3.39     3    1.42    1.89           2.90
1.00     0.75    6.41        4.09    2.84     1    3.00    5.25           10.11
         1.00    7.56        4.75    3.22     2    2.00    3.13           5.56
         1.25    9.04        5.59    3.73     3    1.67    2.42           4.04
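The constants C in Table 1 can be recomputed from closed forms, as in the Python sketch below. The scrambled responses constant follows the variance expressions of Sections 4 and 5. The expression used for the unrelated question method is an assumption here: it is what the usual two-sample estimator gives when \(p_1 + p_2 = 1\) and \(n_1/n_2 = p_1/p_2\), and it agrees with the tabulated entries to within about 0.01. Function and argument names are illustrative.

```python
def scrambled_constant(m, d, rho2=1.0, p=None):
    """C for the scrambled responses method, with V(mu_hat) = C * sigma^2 / n.

    m: answers per respondent; d = (mu - A) / sigma; rho2 = (gamma / theta)^2.
    p=None means a positive scrambler S; otherwise S* = I * S with P(I = 1) = p.
    """
    if p is None:
        factor = rho2
    else:
        factor = (rho2 + 4.0 * p * (1.0 - p)) / (2.0 * p - 1.0) ** 2
    return 1.0 + factor * (1.0 + d * d) / m


def unrelated_constant(p1, k, c):
    """C for the unrelated question method (assumed closed form, see lead-in).

    p1: selection probability in sample 1 (p2 = 1 - p1);
    k = sigma_y / sigma; c = (mu - mu_y) / sigma.
    """
    p2 = 1.0 - p1
    return (p1 * p1 + p2 * p2 + p1 * p2 * (2.0 * k * k + c * c)) / (p1 - p2) ** 2


# Recompute the block of the table with c = d = 0.5.
for k, m in zip((0.75, 1.00, 1.25), (1, 2, 3)):
    unrelated = [unrelated_constant(p1, k, 0.5) for p1 in (0.70, 0.75, 0.80)]
    scrambled = [scrambled_constant(m, 0.5, p=p) for p in (None, 0.90, 0.80)]
    print(k, [f"{v:.2f}" for v in unrelated], m, [f"{v:.2f}" for v in scrambled])
```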
6. Concluding remarks
We have studied an alternative to the standard randomized response method for obtaining sensitive quantitative data. Although it remains to be tried out in practice, it is reasonable to think that it will lead to fewer false or evasive answers, and the results of Section 5 indicate that it has superior theoretical properties. We have suggested two possible scrambling distributions in Section 3, but doubtless there are others which may be superior in some respects. Generation of the scrambling values in a simple and effective manner is also an open and important problem, and we have proposed a few possible methods in Section 3.
A more general method would be to combine the multiplicative and additive models; that is, the interviewer receives the response \(Y = SX + U\), where \(S\) and \(U\) are independent random variables. This may be necessary with extremely sensitive questions, such as \(X\) = number of abortions a woman has had. In this case, the respondent may not want to reveal whether \(X = 0\) or \(X > 0\), and the response \(SX\) would reveal this, while the response \(SX + U\) would not. An alternative to this combined model is to use \(X - A\) and \(S^*\), as in Section 4. Then \(Y = (X - A)S^*\), which will be non-zero even if \(X = 0\).
References
Greenberg, B.G., Kuebler, R.R., Abernathy, J.R., and Horvitz, D.G. (1971). Application of the randomized response technique in obtaining quantitative data. J. Amer. Statist. Assoc. 66, 243-250.
Horvitz, D.G., Greenberg, B.G., and Abernathy, J.R. (1975). Recent developments in randomized response designs. In: J.N. Srivastava, Ed. A Survey of Statistical Designs and Linear Models. North-Holland, Amsterdam-New York.
Pollock, K.H. and Bek, Y. (1976). A comparison of three randomized response models for quantitative data. J. Amer. Statist. Assoc. 71, 884-886.
Warner, S.L. (1965). Randomized response: A survey technique for eliminating evasive answer bias. J. Amer. Statist. Assoc. 60, 63-69.
Warner, S.L. (1971). The linear randomized response model. J. Amer. Statist. Assoc. 66, 884-888.