Comparison of statistical graph-size estimators

Abstract: We investigate several unbiased estimators of the size of a graph G which are based on sampled induced subgraphs, stars and dyads. By comparing their variances we find some general dominance relations between them, and we also give some conditions on G which guarantee some other instances of dominance.

Key words: Graph inference; Network sampling.

1. Introduction

Partial information from a large empirical graph G can be collected by sampling, and the sample information can be used to make statistical inferences about G. General problems in statistical graph theory have been discussed by Frank [...]

[...] parameter which measures the mean number of communication channels per node. If G is a model of a reliability network, then its size R can be used as a crude reliability measure since, for instance, $R > \binom{N-1}{2}$ implies that G is connected. Large empirical graphs can be sampled in various ways reflecting the actual application, and a rational choice between the approaches available requires a statistical evaluation of the sampling and estimation procedures. Here we shall investigate a few alternative graph sampling procedures and compare some different estimators of R. In Section 2 we introduce the necessary concepts and notation. Section 3 considers some unbiased estimators based on the subgraph induced by a node sample. Section 4 discusses two estimators which can be used if labeled or unlabeled stars are sampled, and Section 5 gives an estimator based on sampled dyads. Finally, Section 6 compares the variances of different estimators and gives some general and conditional dominance relations between the estimators.

For general graph concepts, we adopt the terminology of Harary (1969) and Capobianco and Molluzzo (1978). Let G be a graph of order N and size R with node set V. The nodes are labeled 1, ..., N and the adjacency matrix has entries

$$a_{ij} = \begin{cases} 1 & \text{if there is an edge between nodes } i \text{ and } j, \\ 0 & \text{otherwise,} \end{cases}$$

with $a_{ii} = 0$ and $a_{ij} = a_{ji}$ for i, j = 1, ..., N. The degree of node i is

$$d_i = \sum_{j=1}^{N} a_{ij}.$$

The sum of the degrees is twice the size of G, i.e.

$$\sum_{i=1}^{N} d_i = 2R,$$

and the sum of the squares of the degrees is denoted by

$$Q = \sum_{i=1}^{N} d_i^2.$$

The mean and variance of the degrees are denoted by

$$\mu = 2R/N, \qquad \sigma^2 = Q/N - 4R^2/N^2.$$
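As a small illustration of this notation, the following Python sketch computes N, R, Q, $\mu$ and $\sigma^2$ from an adjacency matrix. The function name `graph_summary` and the four-node path used as input are assumptions made for illustration only; they do not come from the paper.

```python
from fractions import Fraction


def graph_summary(adj):
    """Return (N, R, Q, mu, sigma2) for a symmetric 0/1 adjacency matrix."""
    n_nodes = len(adj)
    degrees = [sum(row) for row in adj]        # d_i = sum_j a_ij
    size = sum(degrees) // 2                   # sum of degrees = 2R
    q_sum = sum(d * d for d in degrees)        # Q = sum of squared degrees
    mu = Fraction(2 * size, n_nodes)           # mean degree
    sigma2 = Fraction(q_sum, n_nodes) - mu * mu  # degree variance
    return n_nodes, size, q_sum, mu, sigma2


# Hypothetical example: the path 0-1-2-3 (degrees 1, 2, 2, 1)
path4 = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
summary = graph_summary(path4)
```

For this path the degrees are 1, 2, 2, 1, so R = 3, Q = 10, $\mu$ = 3/2 and $\sigma^2$ = 1/4.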

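If S is chosen without replacement by simple random sampling from V, then the induced subgraph G(S) has a fixed order N(S) = n and a stochastic size R(S) = r. An unbiased estimator of R can be shown to be given by

$$\hat R = r/p_2 \tag{3.1}$$

and to have variance

$$\operatorname{Var}\hat R = \frac{R p_2 + (Q - 2R)\,p_3 + (R^2 + R - Q)\,p_4}{p_2^2} - R^2, \tag{3.2}$$

where

$$p_k = \binom{n}{k} \Big/ \binom{N}{k} \tag{3.3}$$

for k = 2, 3, 4 and n > 1. See Frank (1977b) for a proof. We see that $p_k$ is the probability that k specified nodes are included in S. If we approximate $p_k$ by $(n/N)^k$, then the variance (3.2) will be simplified to

$$\operatorname{Var}\hat R = \frac{(Rq + Qp)\,q}{p^2}, \tag{3.4}$$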

where p = 1 − q = n/N. The same variance expression can be obtained by considering the estimator $\hat R = r/p^2$ for a Bernoulli sampling scheme with selection probability p, i.e. a sampling scheme according to which each node in V is independently selected for the sample S with probability p. In what follows we shall use Bernoulli sampling as an alternative to simple random sampling, since it simplifies the theoretical treatment of the estimators, and it is usually sufficient for the degree of approximation needed in practice for sample surveys.

For small p, it follows from (3.4) that $\operatorname{Var}\hat R$ can be approximated with $R/p^2$. For arbitrary fixed p, we may approximate $\operatorname{Var}\hat R$ by assuming a relationship between R and Q in G. One way of doing this is to introduce a Bayesian prior model for G and investigate the expected behavior of R and Q; another is to introduce an increasing sequence of graphs and investigate the asymptotic behavior of R and Q. Frank and Ringström (1979) [...] occurrences with a common probability $\alpha$. According to Frank and Ringström, [...] and the Bayes risk of $\hat R$ is equal to [...] for large N. Now, according to (3.5) and (3.6), $\alpha^2$ has an unbiased estimator $(Q - 2R)/6\binom{N}{3}$, and it follows from (3.7) that $\operatorname{Var}\hat R$ is expected to be close to

$$(Q - 2R)\,q/p \tag{3.8}$$

for large N. Capobianco (1976, p. 7) has approximated (3.2) with $R(N^2 - n^2)/n^2$ when N and n are large. This corresponds to approximating (3.4) with

$$R(1 - p^2)/p^2. \tag{3.9}$$

In order to investigate further the appropriateness of various approximations to $\operatorname{Var}\hat R$, we note that (3.4) is the sum of (3.8) and (3.9), i.e.

$$\operatorname{Var}\hat R = R(1 - p^2)/p^2 + (Q - 2R)\,q/p. \tag{3.10}$$

It follows that

$$\operatorname{Var}\hat R = R(1 - p^2)/p^2 + o(R) \tag{3.11}$$

if G increases in such a way that Q/2R tends to 1. Since Q/2R is an upper bound to the mean degree $\mu$ in G, the approximation (3.9) cannot be supposed to be satisfactory if $\mu > 1$. The condition $Q/2R \to 1$ means that the degrees in G have a variance $\sigma^2$ which is asymptotically equal to $\mu(1 - \mu)$. If we assume instead that the degrees in G have a variance $\sigma^2$ which is asymptotically equal to $\mu$, and $\mu$ tends to infinity, then it follows that Q − 2R is asymptotically equal to $4R^2/N$ and

$$\operatorname{Var}\hat R = 4R^2 q/Np + o(R^2/N). \tag{3.12}$$
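The Bernoulli-sampling variance of the induced-subgraph estimator $r/p^2$ can be checked by brute force on a small graph. The sketch below enumerates every node subset with its Bernoulli probability and compares the exact variance with the closed form $(Rq + Qp)q/p^2$ and with the two-term decomposition (3.10). The helper names, the five-node example graph and the choice p = 1/3 are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def induced_size(edges, sample):
    """r = number of edges with both end nodes in the sample."""
    return sum(1 for u, v in edges if u in sample and v in sample)


def bernoulli_moments(n_nodes, p, statistic):
    """Exact mean and variance of statistic(S) when every node enters S
    independently with probability p (all 2**n_nodes subsets are enumerated)."""
    q = 1 - p
    mean = second = Fraction(0)
    for k in range(n_nodes + 1):
        weight = p**k * q**(n_nodes - k)  # probability of one given subset of order k
        for subset in combinations(range(n_nodes), k):
            value = statistic(set(subset))
            mean += weight * value
            second += weight * value * value
    return mean, second - mean * mean


# Hypothetical example graph on nodes 0..4 (illustration only)
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
n_nodes = 5
p = Fraction(1, 3)
q = 1 - p

degrees = [sum(1 for e in edges if i in e) for i in range(n_nodes)]
R = len(edges)                       # size of the graph
Q = sum(d * d for d in degrees)      # sum of squared degrees

mean, var = bernoulli_moments(n_nodes, p, lambda s: induced_size(edges, s) / p**2)
closed_form = (R * q + Q * p) * q / p**2
decomposition = R * (1 - p**2) / p**2 + (Q - 2 * R) * q / p
```

Exact rational arithmetic is used so that the enumerated variance and the two closed forms can be compared for equality rather than up to rounding.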

The Bernoulli sample S has a stochastic order N(S) = n which is binomially distributed with parameters N and p. If [...] for G described [...], then an alternative to the estimator $\hat R$ is [...] (3.13). Frank and Ringström (1979) have shown that R* is unbiased and [...] Var R* = [...] as well as smaller than the variance of $\hat R$. Using the prior model, the variance of R* is expected to be smaller than the variance of $\hat R$; in [...] equal to [...] E Var R* = [...] (3.15), and this is obviously smaller than the Bayes risk of $\hat R$ given by (3.7) if $\alpha > 0$ and $p > 0$.

[...] of a star means that a node is sampled and all its adjacent nodes are observed. If unlabeled stars are sampled, the sample information consists of the degrees $d_i$ in G for the nodes i in the sample S. If labeled stars are sampled, the sample information consists of the adjacencies $a_{ij}$ for $i \in S$ and $j \in V$. Let us denote by r and s the number of edges between two sampled nodes and between a sampled and a non-sampled node, respectively. Then

$$2r = \sum_{i \in S} \sum_{j \in S} a_{ij} \tag{4.1}$$

and

$$2r + s = \sum_{i \in S} \sum_{j \in V} a_{ij} = \sum_{i \in S} d_i. \tag{4.2}$$

We note that r is equal to the size of the subgraph induced by S, i.e. R(S) = r, as in the previous section. Furthermore, r + s is equal to the number of edges which are not in the subgraph induced by the complement $\bar S$ of S, i.e.

$$r + s = R - R(\bar S). \tag{4.3}$$

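This relationship makes it possible to use the results in the previous section in order to find the expected value and variance of r + s. Introduce

$$q_k = \binom{N-n}{k} \Big/ \binom{N}{k} \tag{4.4}$$

for k = 2, 3, 4, which is the probability of inclusion of k specified nodes in $\bar S$, where S is selected by simple random sampling. From (4.3) we obtain

$$E(r + s) = R - R q_2, \tag{4.5}$$

and it follows that the estimator

$$\hat R' = (r + s)/(1 - q_2) \tag{4.6}$$

is unbiased. Moreover, applying (3.2), we obtain

$$\operatorname{Var}\hat R' = \frac{R q_2 + (Q - 2R)\,q_3 + (R^2 + R - Q)\,q_4 - R^2 q_2^2}{(1 - q_2)^2}, \tag{4.7}$$

and by approximating $q_k$ with $q^k$, where q = 1 − p and p = n/N as before, we find that

$$\operatorname{Var}\hat R' = \frac{(Rp + Qq)\,q^2}{p\,(1 + q)^2}. \tag{4.8}$$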

This is the same expression that would be obtained for the variance of the estimator

$$\hat R' = (r + s)/(1 - q^2) \tag{4.9}$$

in the case of Bernoulli sampling with selection probability p.

From (4.2) we can obtain another unbiased estimator of R. For simple random sampling, (2r + s)/n is the mean of the degrees of the sample nodes, and it has expectation $\mu$ and variance

$$\frac{\sigma^2 (N - n)}{n (N - 1)}. \tag{4.10}$$

Therefore,

$$\hat R'' = N(2r + s)/2n \tag{4.11}$$

is unbiased and has variance

$$\operatorname{Var}\hat R'' = \frac{(N - n)(QN - 4R^2)}{4n(N - 1)}. \tag{4.12}$$

For Bernoulli sampling, we obtain from (4.2) that

$$E(2r + s) = 2Rp \tag{4.13}$$

and

$$\operatorname{Var}(2r + s) = Qpq, \tag{4.14}$$

and this implies that the unbiased estimator

$$\hat R'' = (2r + s)/2p \tag{4.15}$$

has variance

$$\operatorname{Var}\hat R'' = Qq/4p. \tag{4.16}$$

We note that the first estimator $\hat R'$ can be used only if labeled stars are sampled, but the other, $\hat R''$, is possible also for unlabeled stars.
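The two star-based estimators can be checked in the same brute-force way for Bernoulli sampling. The sketch below computes the exact mean and variance of $(r + s)/(1 - q^2)$ and $(2r + s)/2p$ by enumerating all node subsets, and evaluates the closed forms $(Rp + Qq)q^2/p(1 + q)^2$ and $Qq/4p$ for comparison. The function names, the five-node graph and p = 1/3 are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations


def star_counts(edges, sample):
    """Return (r, s): edges inside the sample, and edges with exactly one end node sampled."""
    r = sum(1 for u, v in edges if u in sample and v in sample)
    s = sum(1 for u, v in edges if (u in sample) != (v in sample))
    return r, s


def bernoulli_moments(n_nodes, p, statistic):
    """Exact mean and variance of statistic(S) under Bernoulli(p) node sampling."""
    q = 1 - p
    mean = second = Fraction(0)
    for k in range(n_nodes + 1):
        weight = p**k * q**(n_nodes - k)
        for subset in combinations(range(n_nodes), k):
            value = statistic(set(subset))
            mean += weight * value
            second += weight * value * value
    return mean, second - mean * mean


# Hypothetical example graph on nodes 0..4 (illustration only)
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
n_nodes = 5
p = Fraction(1, 3)
q = 1 - p
R = len(edges)
Q = sum(sum(1 for e in edges if i in e) ** 2 for i in range(n_nodes))


def labeled_star_estimate(sample):
    r, s = star_counts(edges, sample)
    return Fraction(r + s) / (1 - q**2)      # estimator needing labeled stars


def unlabeled_star_estimate(sample):
    r, s = star_counts(edges, sample)
    return Fraction(2 * r + s) / (2 * p)     # uses only the sampled degrees


mean_lab, var_lab = bernoulli_moments(n_nodes, p, labeled_star_estimate)
mean_unl, var_unl = bernoulli_moments(n_nodes, p, unlabeled_star_estimate)
closed_lab = (R * p + Q * q) * q**2 / (p * (1 + q) ** 2)
closed_unl = Q * q / (4 * p)
```

In this example both estimators are unbiased for R = 4, and the labeled-star estimator has the smaller variance (144/25 against 8).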

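A dyad is an induced subgraph of order two. [...] chosen [...] by simple random sampling, then t, the number of edges among the sampled dyads, is hypergeometrically distributed [...]. If we consider Bernoulli-sampled dyads with selection probability p, then the number of sampled dyads is binomially distributed with parameters $\binom{N}{2}$ and p, and t is binomially distributed with parameters R and p. It follows that R has an unbiased estimator $\hat R''' = t/p$ which has variance

$$\operatorname{Var}\hat R''' = Rq/p. \tag{5.4}$$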
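Because t is binomial with parameters R and p under Bernoulli sampling of dyads, the variance Rq/p of t/p can be confirmed by summing over the binomial distribution. The following sketch does this exactly; the function name and the values R = 4 and p = 1/3 are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def dyad_estimator_moments(size, p):
    """Exact mean and variance of t/p when t is binomial with parameters size (= R) and p."""
    q = 1 - p
    mean = second = Fraction(0)
    for t in range(size + 1):
        prob = comb(size, t) * p**t * q ** (size - t)  # P(t edges among the sampled dyads)
        estimate = Fraction(t) / p
        mean += prob * estimate
        second += prob * estimate * estimate
    return mean, second - mean * mean


R = 4
p = Fraction(1, 3)
mean, var = dyad_estimator_moments(R, p)
closed_form = R * (1 - p) / p
```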

We note that the use of $\hat R'''$ as an estimator of R does not require the sampled dyads to be labeled. If labeled dyads were sampled it would be possible to use estimators based on further information from the subgraph of G generated by the sampled dyads. We shall not consider such possibilities here and refer to Frank (1977a, Section 12) for further material pertaining to this case.

6. Comparisons

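We shall now investigate some dominance relations between the estimators given above. We assume that the sampling fractions are the same in the various sampling schemes; this means that all Bernoulli sampling schemes have the same selection probability p, and the simple random sampling of n nodes or stars corresponds to simple random sampling of n(N − 1)/2 dyads. We shall assume N large and denote n/N by p. As noted above, the estimators $\hat R$ and $\hat R'$ then have the same variances for simple random sampling as for Bernoulli sampling. The estimator $\hat R''$ based on sampled stars has different variances for the two schemes. For Bernoulli sampling the variance is given by (4.16), and for simple random sampling we find from (4.12) that the variance is asymptotically equal to

$$(Q - 4R^2/N)\,q/4p. \tag{6.1}$$

This expression is smaller than or equal to (4.16), and they are asymptotically equal if and only if $R^2/NQ$ tends to zero, i.e. if and only if $\mu/\sigma$ tends to zero. Also, the estimator $\hat R'''$ based on sampled dyads has different variances for simple random sampling and Bernoulli sampling. For Bernoulli sampling the variance is given by (5.4), and for simple random sampling we find from (5.2) that the variance is asymptotically equal to

$$(R - 2R^2/N^2)\,q/p, \tag{6.2}$$

which is smaller than or equal to (5.4). The two variances are asymptotically equal if and only if $R/N^2$ tends to zero, i.e. if and only if $\mu/N$ tends to zero.

Let us now compare the variances of $\hat R$, $\hat R'$, $\hat R''$ and $\hat R'''$ for Bernoulli sampling. We find from (4.8) and (4.16) that

$$\frac{\operatorname{Var}\hat R'}{\operatorname{Var}\hat R''} = \frac{4q(Rp + Qq)}{Q(1 + q)^2}, \tag{6.3}$$

and since

$$2R = \sum_{i=1}^{N} d_i \le \sum_{i=1}^{N} d_i^2 = Q, \tag{6.4}$$

it follows that

$$\frac{\operatorname{Var}\hat R'}{\operatorname{Var}\hat R''} \le \frac{2q}{1 + q} < 1. \tag{6.5}$$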



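Thus, if labeled stars are Bernoulli-sampled, then $\hat R'$ dominates $\hat R''$. From (4.16) and (3.4) we obtain

$$\frac{\operatorname{Var}\hat R''}{\operatorname{Var}\hat R} = \frac{Qp}{4(Qp + Rq)} \le \frac{1}{4}, \tag{6.6}$$

so that $\hat R''$ dominates $\hat R$. From (5.4) and (3.4) we obtain

$$\frac{\operatorname{Var}\hat R'''}{\operatorname{Var}\hat R} = \frac{Rp}{Rq + Qp}, \tag{6.7}$$

and according to (6.4) this is less than one; i.e. $\hat R'''$ dominates $\hat R$. By comparing (5.4) and (4.16) we find that $\operatorname{Var}\hat R''' < \operatorname{Var}\hat R''$ if and only if 4R < Q; i.e. if and only if $\sigma^2 > \mu(2 - \mu)$. It follows that $\mu > 2$ is a sufficient condition for $\hat R'''$ to dominate $\hat R''$. By comparing (5.4) and (4.8) we find that $\operatorname{Var}\hat R''' < \operatorname{Var}\hat R'$ if and only if

$$(1 + q + 2q^2)\,R < q^2 Q, \tag{6.8}$$

i.e. if and only if

$$\sigma^2 > \mu\left[(1 + q + 2q^2)/2q^2 - \mu\right]. \tag{6.9}$$

It follows that

$$\mu > (1 + q + 2q^2)/2q^2 \tag{6.10}$$

is sufficient for $\hat R'''$ to dominate $\hat R'$. These dominance relations for Bernoulli sampling are summarized in Fig. 1.

Fig. 1. Dominance relations between estimators based on Bernoulli sampling ($a = \mu(2 - \mu)$, $b = \mu[(1 + q + 2q^2)/2q^2 - \mu]$).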
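The Bernoulli-sampling comparisons can be tabulated directly from N, R, Q and p. The sketch below evaluates the four variance expressions (3.4), (4.8), (4.16) and (5.4) together with the thresholds a and b of Fig. 1, and checks the two conditional dominance relations against them. The function names and the example (a star graph with one centre and five leaves, p = 1/2) are assumptions made for illustration.

```python
from fractions import Fraction


def bernoulli_variances(R, Q, p):
    """Variances of the four estimators under Bernoulli sampling with probability p."""
    q = 1 - p
    return {
        "induced": (R * q + Q * p) * q / p**2,                          # (3.4)
        "labeled_stars": (R * p + Q * q) * q**2 / (p * (1 + q) ** 2),   # (4.8)
        "unlabeled_stars": Q * q / (4 * p),                             # (4.16)
        "dyads": R * q / p,                                             # (5.4)
    }


def bernoulli_thresholds(N, R, p):
    """Thresholds a and b for the degree variance, as in the caption of Fig. 1."""
    q = 1 - p
    mu = Fraction(2 * R, N)
    a = mu * (2 - mu)
    b = mu * ((1 + q + 2 * q**2) / (2 * q**2) - mu)
    return a, b


# Hypothetical example: star graph, degrees 5, 1, 1, 1, 1, 1
N, R, Q = 6, 5, 30
p = Fraction(1, 2)
var = bernoulli_variances(R, Q, p)
a, b = bernoulli_thresholds(N, R, p)
sigma2 = Fraction(Q, N) - Fraction(2 * R, N) ** 2

dyads_beat_unlabeled = var["dyads"] < var["unlabeled_stars"]   # expected iff sigma2 > a
dyads_beat_labeled = var["dyads"] < var["labeled_stars"]       # expected iff sigma2 > b
```

In this example $\sigma^2$ = 20/9 lies between a = 5/9 and b = 35/9, so the dyad estimator has a smaller variance than the unlabeled-star estimator but a larger one than the labeled-star estimator.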

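For simple random sampling the variances are unchanged for $\hat R$ and $\hat R'$ and no larger for $\hat R''$ and $\hat R'''$ than the corresponding variances for Bernoulli sampling. This implies that $\hat R$ is dominated by each of $\hat R'$, $\hat R''$ and $\hat R'''$ for simple random sampling. From (4.8) and (6.1) we obtain

$$\frac{\operatorname{Var}\hat R'}{\operatorname{Var}\hat R''} = \frac{4Nq(Rp + Qq)}{(NQ - 4R^2)(1 + q)^2}, \tag{6.11}$$

and this is less than one if and only if

$$\sigma^2 > 2q\mu(p + 2q\mu)/p(1 + 3q). \tag{6.12}$$

From (6.2) and (4.8) we obtain

$$\frac{\operatorname{Var}\hat R'''}{\operatorname{Var}\hat R'} = \frac{R(1 - \mu/N)(1 + q)^2}{q(Rp + Qq)}, \tag{6.13}$$

and this is less than one if and only if

$$\sigma^2 > \mu\left[(1 - \mu/N)(1 + q)^2/2q^2 - (p + 2q\mu)/2q\right]. \tag{6.14}$$

It follows that (6.10) is sufficient for $\hat R'''$ to dominate $\hat R'$. From (6.1) and (6.2) we obtain

$$\frac{\operatorname{Var}\hat R'''}{\operatorname{Var}\hat R''} = \frac{4R(1 - \mu/N)}{Q - 4R^2/N}, \tag{6.15}$$

and this is less than one if and only if

$$\sigma^2 > 2\mu(1 - \mu/N). \tag{6.16}$$

Consequently, $\sigma^2 > 2\mu$ is sufficient for $\hat R'''$ to dominate $\hat R''$. Fig. 2 summarizes some dominance relations for estimators based on simple random sampling.

Fig. 2. Dominance relations between estimators based on simple random sampling ($a = 2\mu(1 - \mu/N)$, $b = \mu[(1 - \mu/N)(1 + q)^2/2q^2 - (p + 2q\mu)/2q]$, $c = 2q\mu(p + 2q\mu)/p(1 + 3q)$).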
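The corresponding check for simple random sampling can be sketched as follows. It assumes the large-N variance approximations $(Rp + Qq)q^2/p(1 + q)^2$ for labeled stars, $(Q - 4R^2/N)q/4p$ for unlabeled stars and $(R - 2R^2/N^2)q/p$ for dyads, and compares the three pairwise orderings with the thresholds a, b and c of Fig. 2. The function names and the six-node star example with p = 1/2 are illustrative assumptions; for so small a graph the formulas serve only to show the algebra, not as accurate variances.

```python
from fractions import Fraction


def srs_variances(N, R, Q, p):
    """Large-N variance approximations for simple random sampling with n/N = p."""
    q = 1 - p
    return {
        "labeled_stars": (R * p + Q * q) * q**2 / (p * (1 + q) ** 2),
        "unlabeled_stars": (Q - Fraction(4 * R * R, N)) * q / (4 * p),
        "dyads": (R - Fraction(2 * R * R, N * N)) * q / p,
    }


def srs_thresholds(N, R, p):
    """Thresholds a, b, c for the degree variance, as in the caption of Fig. 2."""
    q = 1 - p
    mu = Fraction(2 * R, N)
    a = 2 * mu * (1 - mu / N)
    b = mu * ((1 - mu / N) * (1 + q) ** 2 / (2 * q**2) - (p + 2 * q * mu) / (2 * q))
    c = 2 * q * mu * (p + 2 * q * mu) / (p * (1 + 3 * q))
    return a, b, c


# Hypothetical example: star graph with one centre and five leaves
N, R, Q = 6, 5, 30
p = Fraction(1, 2)
var = srs_variances(N, R, Q, p)
a, b, c = srs_thresholds(N, R, p)
sigma2 = Fraction(Q, N) - Fraction(2 * R, N) ** 2

dyads_beat_unlabeled = var["dyads"] < var["unlabeled_stars"]      # expected iff sigma2 > a
dyads_beat_labeled = var["dyads"] < var["labeled_stars"]          # expected iff sigma2 > b
labeled_beat_unlabeled = var["labeled_stars"] < var["unlabeled_stars"]  # expected iff sigma2 > c
```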

The comparisons above assume that all the sampling fractions are the same, an assumption that might be appropriate if, for instance, repeated use of sampling is planned in a sequence of graphs. If, however, a single graph sampling situation is considered, it might be more appropriate to make comparisons based on cost assumptions. We will elaborate a little on this. Assume that there is a cost $c_1$ per sampling unit and a cost $c_2$ per observation unit. If a subgraph induced by n sampled nodes is chosen, then $\binom{n}{2}$ dyads must be examined. If n labeled or unlabeled stars are sampled, then $\binom{N}{2} - \binom{N-n}{2}$ and n(N − 1) dyads, respectively, must be examined. If n unlabeled dyads are sampled, then also n dyads must be examined. Therefore, an expected total cost c will be obtained by choosing the sampling fraction p as a solution to the equation [...] if an induced subgraph is sampled, [...] if labeled stars are sampled, [...] (6.19) if unlabeled stars are sampled, and [...] (6.20) if dyads are sampled. Comparisons between the variances of the estimators for these p-values can then be made in order to find the minimum-variance estimator subject to a fixed total cost c.

References

Capobianco, M. (1970). Statistical inference in finite populations having structure. Trans. New York Acad. Sci. 32, 401-413.
Capobianco, M. (1972). Estimating the connectivity of a graph. Graph Theory and Applications, ed. by Y. Alavi, D.R. Lick, and A.T. White. Springer, Berlin, 64-74.
Capobianco, M. (1974). Recent progress in stagraphics. Ann. New York Acad. Sci. 231, 139-141.
Capobianco, M. (1976). Introduction to statistical inference in graphs and its applications. Unpublished manuscript.
Capobianco, M. and J. Molluzzo (1978). Examples and Counterexamples in Graph Theory. North-Holland, New York.
Frank, O. (1969). Structure inference and stochastic graphs. FOA Reports 3:2, 1-8.
Frank, O. (1971). Statistical Inference in Graphs. FOA-Repro, Stockholm.
Frank, O. (1977a). Survey sampling in graphs. J. Statist. Planning and Inference 1, 235-264.
Frank, O. (1977b). Estimation of graph totals. Scand. J. Statist. 4, 81-89.
Frank, O. (1977c). A note on Bernoulli sampling in graphs and Horvitz-Thompson estimation. Scand. J. Statist. 4, 178-180.
Frank, O. (1978). Sampling and estimation in large social networks. Social Networks 1, 91-101.
Frank, O. and J. Ringström (1979). Bayesian graph-size estimation. Paper presented at the 12th European Meeting of Statisticians in Varna, Bulgaria.
Harary, F. (1969). Graph Theory. Addison-Wesley, Reading, MA.
Proctor, C.H. (1967). The variance of an estimate of linkage density from a simple random sample of graph nodes. Proc. Social Statist. Sect., Amer. Statist. Assoc., 342-343.