Social Networks 30 (2008) 309–317
Weight matrices for social influence analysis: An investigation of measurement errors and their effect on model identification and estimation quality

Antonio Páez a,∗, Darren M. Scott a, Erik Volz b

a Centre for Spatial Analysis/School of Geography and Earth Sciences, McMaster University, Canada
b Section of Integrative Biology, University of Texas at Austin, Austin, United States
Keywords: Weight matrices; Social influence; Network autocorrelation; Measurement error; Simulation experiments
Abstract

Weight matrices, such as those used in network autocorrelation models, are useful to investigate social influence processes. The objective of this paper is to investigate a key topic that has received relatively little attention in previous research, namely the issues that arise when observational limitations lead to measurement errors in these weight matrices. Measurement errors are investigated from two perspectives: when relevant ties are omitted, and when irrelevant ties are erroneously included as part of the matrix. The paper first shows analytically that these two situations result in biased estimates. Next, a simulation experiment provides evidence of the effect of erroneously coding the weight matrix on model performance and the ability of a network autocorrelation test to identify social influence effects. The results suggest that depending on the level of autocorrelation and the topology attributes of the underlying matrix, there is a window of opportunity to identify and model social influence processes even in situations where the ties in a matrix cannot be accurately observed.
1. Introduction

The issue of autocorrelation in regression analysis is so pervasive in the analysis of spatial and network data that it has been called a foundational problem in the social networks literature (Dow et al., 1982), and forms much of the basis of contemporary methods in spatial statistical analysis (e.g. Anselin, 1988; Cliff and Ord, 1981; Cressie, 1993; Griffith, 1988; Haining, 1990). A model such as (in matrix notation):

Y = ρWY + Xβ + ε   (1)
has been proposed as a way to deal with potential network and spatial autocorrelation in regression analysis, with most recent technical developments following seminal work in the spatial statistics and econometrics literature in the 1980s and 1990s. Technical issues with autocorrelation aside (the focus of much of the literature until relatively recent times), the model shown in (1) above is attractive because, as discussed by Leenders (2002) and Marsden and Friedkin (1994), the lagged term WY can be interpreted as a form of social influence, and thus provides a clear bridge between statistical analysis and social theories where comparison
∗ Corresponding author at: 1280 Main Street West, Hamilton, ON L8S 4K1, Canada. Tel.: +1 905 525 9140x26099; fax: +1 905 546 0463. E-mail address: [email protected] (A. Páez).
and reference processes are important. Not surprisingly, given this interpretation, the topic of autocorrelation has long attracted the attention of researchers interested in social networks (e.g. Doreian, 1980; Doreian et al., 1984; Dow et al., 1982, 1984; Hummon and Doreian, 1990). More recently, in addition, the number of studies in the literature that investigate issues where social patterning and referencing are key elements of the research question is on the rise. For example, Greenbaum's (2002) paper frames research on teachers' salary levels by arguing that parameter ρ will be zero only if pattern bargaining and social comparisons do not affect salaries. McVeigh et al. (2004) use the same type of model to study structure and framing in social movements, and note the existence of information spillovers regarding the operation of organizations such as the KKK. And Worrall (2004) finds that some forms of crime have spillover effects that are relevant for interpretation and inferential purposes. Despite the rising popularity of network models such as (1), and their increasing use from a variety of disciplinary perspectives, there have been few investigations regarding one of the key elements of the model, namely the weight matrix W, and in particular the impacts on model identification and statistical goodness of fit when the ties in this matrix are incorrectly observed or coded. In the specific case of social networks, besides the early work of Doreian and Dow cited above, there were very few examples of this type of research until the recent, mainly conceptual, work of Leenders (2002). While the relevance of correctly defining the weight matrix
has been recognized in the field of geographical analysis (e.g. Florax and Rey, 1995), the question is if anything more critical in social networks applications given the observational challenges involved in coding social ties. Furthermore, additional complexity derives from the fact that many social networks, being spatially unconstrained, tend to present richer variations of network topology than most geographical systems (Farber et al., in press). Keeping the preceding remarks in mind, the objective of the present paper is to approach a problem that has received relatively limited attention in the literature, namely investigation of the issues that arise when, due to observational limitations, the weight matrix W is incorrectly measured, either by coding existing ties as 0, or by coding non-existing ties with values other than 0. These two situations, which are closely related to the issue of sampling network ties, are expected to have important but as yet largely unexplored consequences for model identification and statistical goodness of fit. Two lines of research lead to this paper. One is the discussion by Leenders (2002) of the definition of the weight matrix in models of social influence. In his paper, Leenders is concerned with the implication of defining the connections between individuals in different ways in a fully referenced system, under the implicit assumption that the network has been completely and accurately observed. The second is the work by Stetzer (1982), Florax and Rey (1995), Griffith (1996), and Griffith and Lagona (1998) on the specification of the weight matrix, as the problem is termed in that stream of literature. While this small body of research has addressed questions similar to those posed for this paper, previous work has been mostly based on the use of regular tessellations commonly used in experimental spatial statistics, as opposed to network structures of varying topological attributes.
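To make the interpretation of the lagged term WY in Eq. (1) concrete, the following is a minimal sketch in Python (not the authors' implementation) using a hypothetical four-actor network; the tie pattern and outcome values are illustrative assumptions only.

import numpy as np

# Hypothetical 4-actor network (undirected, binary ties); values are made up.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Row-normalize so that each row sums to one: W[i, j] is the share of
# actor i's "attention" given to actor j.
W = A / A.sum(axis=1, keepdims=True)

y = np.array([2.0, 4.0, 6.0, 8.0])   # an outcome observed for each actor

# The lagged term WY of Eq. (1): for each actor, the average outcome of the
# actors it is tied to, i.e. the social reference value.
lagged = W @ y
print(lagged)   # actor 0 references the mean of actors 1 and 2: 5.0

With a row-normalized W, each element of WY is simply the average outcome among an actor's contacts, which is the quantity against which comparison and reference processes operate in the model.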
2. Brief review of relevant literature

The literature on the impacts of using an inaccurate matrix W in modeling is brief and limited to work conducted in the field of spatial and geographical analysis stemming from research by Cliff and Ord (1981), and in particular papers by Stetzer (1982), Florax and Rey (1995), Griffith (1996), and Griffith and Lagona (1998). Several inferences are derived from these studies. First, Stetzer, working with a simple autoregressive model, compares a number of different ways of formulating the weights, including the use of diverse functional forms such as binary, power, inverse distance weights, contiguity, etc. The results of Stetzer's simulation experiments suggest (among other findings) that: (1) bias results when the weights implemented in the matrix cover an area that is shallower or larger than that indicated by the true weights; (2) in terms of the quality of estimators, it is better to err on the side of steep or small weights (i.e. by using a relatively sparse matrix); and (3) matching the effective area or coverage of the true matrix is in general more important than matching the functional form of the weights. Later work by Florax and Rey (1995) improves on the pioneering work of Stetzer by using a more general model, maximum likelihood estimation techniques, and by considering the issue of testing for autocorrelation. Their results lend support to the main conclusions of Stetzer. In particular, Florax and Rey find that the effects of introducing irrelevant weights tend to be larger than those associated with omitting relevant weights from the matrix. Further, investigation of the power of different tests used to identify the presence of autocorrelation led these investigators to conclude that incorrectly defining the matrix by coding non-existing connections between zones reduced the power of the tests, and vice versa. This latter finding is confirmed by recent work by Farber et al. (in press), Mizruchi and Neuman (2008), and Smith (2007), and is linked to the
topology properties of the weights matrix, in particular the degree of connectivity. The research of Griffith (1996) and Griffith and Lagona (1998) differs from previous studies in that it is concerned with a model of spatial error autocorrelation, as opposed to the spatially lagged dependent variable of model (1). Thus, although the results are not directly comparable, it is worth noting that these researchers find that use of a matrix W that contains measurement errors results in mean estimators that are unbiased even if inefficient, while the error variance estimator is both biased and inefficient. In this research based on a model of spatial error autocorrelation, the most serious problem associated with the use of an incorrect matrix appears to be loss of efficiency. On the other hand, these researchers also find that desirable distributional properties can be regained under increasing domain asymptotics. In social network applications this would imply increasing the size of the sample, a possibility in applications to large networks (e.g. consumer networks, the internet), but something difficult or impossible to achieve in other situations where the networks may be bounded geographically or otherwise (e.g. inter-organization networks). While the parallels between autocorrelation in geographical systems and autocorrelation in social networks have been noted by numerous researchers in the past (e.g. Anselin, 1988; Dow et al., 1982; Leenders, 2002; Páez and Scott, 2007), it is worth noting at this point at least two notable differences in the application of autoregressive models in social network and geographical research problems. The first difference concerns the processes that conceptually lead to substantive autocorrelation, that is, a lagged dependent variable. These processes are usually of an urban or regional economic nature in the analysis of spatial data, including externalities and multiplier effects (Anselin, 2003). As discussed by Leenders (2002) and others, in social analysis autocorrelation is a result of processes that include social influence, imitation, status seeking and other similar or related behaviours. A second difference, of greater relevance for the question at hand, is closely related to the type of data used in research. In geographical systems, the units of analysis are typically zones sharing borders or other types of connections with other zones (e.g. roads in a transportation system, physical flows, etc.). In social systems, on the other hand, the units of analysis are individual social entities linked to other individuals by affective, professional, assistance, and other types of social ties. The ties in a structure of this type may remain constant or not over time, but geographical fixity is not necessarily a characteristic of the elements of the network, and there may even be some interesting social–geographical feedback effects that could affect the structural or compositional characteristics of the network. This lack of fixity however points to an important practical distinction, between what could be called geo-referencing and social-referencing. In spatial analysis, geo-referencing is accomplished by means of surveys, positioning systems, and other observational approaches. Once the zones in a system have been fully referenced, delimitation of a border is a relatively straightforward task, and the relative position of the units of analysis is known, even if the connections between them cannot be unambiguously defined (e.g.
possible alternatives include contiguity, length of common border, distance between centroids, etc.). Social-referencing, in contrast, implies, in addition to the ambiguities involved in defining the connections between individuals, the complexity of observing the ties, defining the borders of the system, and dealing with missing observations, as the analyst may be unaware that observations are missing or where they belong in the network. The present paper is concerned with the second issue listed above, and aims at investigating the implications of using matrices that include measurement error. The error can take one of two forms: when the measurement scheme fails to identify
existing ties (i.e. the matrix fails to include all relevant ties), or when the researchers err by observing non-existing ties (i.e. the matrix includes irrelevant ties).
3. Analytical results

Previous research, as outlined above, has approached the problem of defining matrix W largely from a simulation perspective (but see Griffith and Lagona, 1998). In this section, we provide some analytical results for two forms of measurement error: the omission of relevant ties and the inclusion of irrelevant ties. These results suggest various parameters to control in the simulation experiments to follow, and help to explain some previous findings reported in the literature.

3.1. Case 1: omission of relevant ties

Consider the following mixed autoregressive model that incorporates a lagged dependent variable:

Y = ρWT Y + Xβ + ε   (2)

The error terms in Eq. (2) are defined under the usual assumption that:

ε ∼ N(0, σ²I)   (3)

To explore the case where the matrix fails to include all the relevant ties, the true matrix WT is decomposed as follows:

WT = WS + WO   (4)

where WS is the sampled set of weights (i.e. ties recorded as observed), and WO includes all omitted, but relevant, weights. Rearranging the terms, the true model can be rewritten in the following fashion:

Y = [I − ρ(WS + WO)]^−1 (Xβ + ε)   (5)

Suppose now that the model is estimated using the incorrect structure, that is, using only sampled weights WS. In this case, the maximum likelihood estimators of the model are given by:

b = (X′X)^−1 X′[I − ρ̂WS]Y
  = (X′X)^−1 X′[I − ρ̂WS][I − ρ(WS + WO)]^−1 (Xβ + ε)   (6)

where ρ̂ is the estimated (and possibly biased) value of parameter ρ. If the weight matrix is correctly specified, then WO = 0, and ρ will be estimated without bias (i.e. E[ρ̂] = ρ). Therefore, and since E[ε] = 0, the estimators b are unbiased:

E[b] = E[(X′X)^−1 X′[I − ρ̂WS]Y]
     = E[(X′X)^−1 X′[I − ρ̂WS][I − ρWS]^−1 (Xβ + ε)] = β   (7)

On the other hand, if the weight matrix is completely omitted (i.e. if WS = 0):

E[b] = E[(X′X)^−1 X′Y] = E[(X′X)^−1 X′[I − ρWO]^−1 (Xβ + ε)]
     = (X′X)^−1 X′[I − ρWO]^−1 Xβ   (8)

Clearly in this case, unless ρ = 0 (zero autocorrelation) or equivalently WO = 0 (there is no network structure to the problem), the estimators will be biased as a consequence of an under-specified weight matrix. Bias, furthermore, will be a non-linear function of the underlying level of autocorrelation ρ and the composition of matrix WO, that is, the unobserved but relevant links.

3.2. Case 2: inclusion of irrelevant ties

Next, suppose that the true model is again the one in Eq. (2):

Y = ρWT Y + Xβ + ε

However, in this case, instead of the model above, the following is estimated as if the weights matrix WX, which includes a number of irrelevant ties, were correct:

Y = ρWX Y + Xβ + ε   (9)

with:

WX = WT + WI   (10)

where WT is again the true weights matrix, and WI is a matrix which includes irrelevant weights only (i.e. the measurement error component of matrix WX). The maximum likelihood estimators can now be shown to be:

b = (X′X)^−1 X′[I − ρ̂(WT + WI)][I − ρWT]^−1 (Xβ + ε)
  = [(X′X)^−1 X′ − ρ̂(X′X)^−1 X′WT − ρ̂(X′X)^−1 X′WI][I − ρWT]^−1 (Xβ + ε)
  = [(X′X)^−1 X′[I − ρ̂WT] − ρ̂(X′X)^−1 X′WI][I − ρWT]^−1 (Xβ + ε)
  = (X′X)^−1 X′(I − ρ̂WT)(I − ρWT)^−1 (Xβ + ε) − ρ̂(X′X)^−1 X′WI[I − ρWT]^−1 (Xβ + ε)   (11)
If WI = 0, that is, if the matrix is correct, the estimators will be unbiased. Otherwise, as should be clear from Eq. (11), whenever ρ ≠ 0, bias will result when the matrix includes irrelevant links, possibly including the case when the true weight matrix is WT = 0, depending on the amount of bias introduced by the estimate ρ̂. This result is partly a consequence of the non-linear structure of the model, as this form of bias is common in non-linear specifications even if absent in linear models. When WT ≠ 0 the situation is further complicated, since bias in b will depend on the true level of autocorrelation ρ, the bias (if any) in the estimated value ρ̂, the degree of measurement error embedded in matrix WI, and the underlying structure of the true network WT. If WI is very sparse, the bias will be small. The results above indicate that measurement error in the case of matrix W is problematic regardless of whether ties are coded more or less conservatively, and the magnitude of the problem depends both on the level of autocorrelation and the composition of the network that gives rise to the weights matrix. This helps to explain findings by both Stetzer (1982) and Florax and Rey (1995) regarding the deleterious effects of using a matrix W that covers a wider area (i.e. includes more connections) than indicated by the true matrix. An unfortunate limitation of the derivations above, on the other hand, is the impossibility of expressing analytically the estimate of ρ (Anselin, 1988, p. 181; Griffith, 1988, p. 86), even if approximations have been obtained for computation of the Jacobian term in the likelihood function based on irreducible matrices (Griffith, 2004). The lack of an exact analytical expression complicates further attempts to assess the magnitude of the problem under the two types of error considered. In order to further investigate this question, a set of numerical experiments is performed next, using the results in this section to inform the selection of parameters to control in the simulations.

4. Monte Carlo simulation: experimental design

As the results above indicate, bias in the estimators is a function of several parameters, prominently the true level of autocorrelation,
bias in the estimated value of ρ, and the composition of the network that gives rise to matrix W. In this section, the consequences of using an incorrectly coded weights matrix are investigated while controlling for these parameters.
4.1. Networks with controlled topologies

The composition of a network has long been a topic of interest in the geographical literature (e.g. Boots, 1984; Kanski, 1963). For the purpose of this paper, an important control element in the impact of measurement error is the structural composition of matrix W. Two measures of network topology have been singled out by recent research as important in determining the structure of networks, namely degree distribution and clustering coefficient (e.g. Barabasi and Albert, 1999; Watts and Strogatz, 1998). Given these two statistical descriptions of complex networks, a natural problem arises as to how to simulate networks which possess arbitrary combinations of these measures. In order to generate the networks required for our experiments, in this paper we employ the approach developed by Volz (2004) for generating random networks with any desired combination of degree distribution and clustering. Such networks allow us to control two aspects of network composition that may affect model behaviour. Simulation of the networks is implemented by combining dynamic growth and preferential attachment, two mechanisms commonly used in random network models. Volz's algorithm begins by assigning a degree to each node from the desired degree distribution. Then a single node is selected as a starting point. Nodes are matched with one another in a two-step process: first, a list of nodes two steps away from the current node is formed, and then, for each node in the list, with probability equal to a tuning parameter c that approximates the desired clustering coefficient, a connection is formed between the current node and the node two steps away; secondly, if no nodes at distance two are selected for a connection in this way, then a subsequent node is added to the network. The network continues to grow in this fashion until all nodes are exhausted. A factor to consider in this process is how to select new nodes to connect to during network growth. Clustering imposes a constraint on the joint-degree distribution of two neighboring nodes in the following way. Given a clustering coefficient c, and a node a with degree z, a neighbor b of a should have a degree on average of at least c(z − 1) in order to satisfy all of the joint connections with common neighbors of both a and b. Thus, it is not possible to select neighbors uniformly at random when growing the network. In practice, Markov Chain Monte Carlo techniques are used in order to sample from an alternative distribution which satisfies the clustering constraint (for technical details, see Volz, 2004). For our application, we generate networks with n = 100 nodes, and design them to exhibit a diverse range of topologies, using both large and small levels of clustering and mean degree. The distribution used for generating the networks is the Poisson distribution:

pk = z^k e^−z / k!,   k ≥ 0   (12)

The tuneable parameters in this distribution are z (mean degree), set to vary between 1.5 and 7.5 in 2.0 step increments, and c (clustering coefficient), set to vary between 0.2 and 0.7 in 0.1 step increments. This gives 24 unique networks with controlled topologies for the experiments. The binary matrices thus obtained are row-normalized prior to their use in the data generation and model estimation processes described below.

4.2. Data generating process

Data for the experiments are generated using the following mixed regressive autoregressive model:

Y = ρWT Y + β0 + Xβ1 + ε   (13)

A total of D draws of the error terms ε are obtained from a standard-normal distribution, while the independent variable is obtained from a uniform distribution over the range [0, 10]. The intercept term and regression coefficient are set to 2.0 and 1.0, respectively. Based on these randomly drawn variables, dependent observations are generated by isolating variable Y as follows. This variable is also a function of autocorrelation coefficient ρ, which ranges from 0.0 to 0.5 in 0.1 step increments:

Y = (I − ρWT)^−1 (β0 + Xβ1 + ε)   (14)
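As an illustration of the data generating process of Eqs. (13) and (14), the following hedged sketch draws X and ε and solves for Y. The network used here is a simple Erdos–Renyi stand-in with a target mean degree; it is not the Volz (2004) generator used in the paper, which additionally controls the clustering coefficient c. Parameter values are taken from the ranges described above; variable names are our own.

import numpy as np

rng = np.random.default_rng(0)
n = 100                           # number of nodes, as in the experiments
z = 3.5                           # target mean degree (one of the values used)
rho, beta0, beta1 = 0.3, 2.0, 1.0 # autocorrelation, intercept, slope

# Stand-in network: Erdos-Renyi ties with expected mean degree z.
# (Illustration only; it ignores the clustering control of Volz's algorithm.)
p = z / (n - 1)
upper = np.triu((rng.random((n, n)) < p).astype(float), k=1)
A = upper + upper.T               # symmetric binary ties, no self-loops

# Row-normalize, guarding against isolated nodes.
rowsum = A.sum(axis=1, keepdims=True)
W = np.divide(A, rowsum, out=np.zeros_like(A), where=rowsum > 0)

# Data generating process of Eqs. (13)-(14).
X = rng.uniform(0.0, 10.0, size=n)
eps = rng.standard_normal(n)
Y = np.linalg.solve(np.eye(n) - rho * W, beta0 + beta1 * X + eps)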
The only test considered here for the purpose of identifying the presence of a lagged dependent variable is the likelihood ratio test. Previous research by Farber et al. (in press) compares the power of different tests, only to find that the differences between the likelihood ratio test and other popular alternatives are relatively minor. The test (χ²-distributed with 1 degree of freedom) is defined in the following way:

LR = 2(L*SAR − L*OLS)   (15)

where L*SAR is the log-likelihood function of the SAR model evaluated using the maximum likelihood estimators of the coefficients, and L*OLS is the log-likelihood function for the ordinary least squares regression model, likewise evaluated using the maximum likelihood estimators of this model:

LOLS = −(n/2) log σ² − (n/2) log 2π − (1/(2σ²)) (Y − Xβ)′(Y − Xβ)   (16)

The test is based on the null hypothesis of no dependence:

H0: Y = Xβ + ε   (17)

while the alternative hypothesis is:

HA: Y = ρWY + Xβ + ε   (18)
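A minimal sketch of how the likelihood ratio test of Eqs. (15)–(18) could be computed follows. The SAR log-likelihood is maximized here by a coarse grid search over ρ using the concentrated likelihood; the paper's own simulations used LeSage's MATLAB routines, so the function names and this estimation shortcut are assumptions on our part. X is assumed to be a design matrix that includes a column of ones for the intercept.

import numpy as np
from scipy import stats

def sar_loglik(rho, Y, X, W):
    # Concentrated log-likelihood of the lagged-Y (SAR) model at a given rho.
    n = len(Y)
    Ay = Y - rho * (W @ Y)                        # (I - rho*W) Y
    beta = np.linalg.lstsq(X, Ay, rcond=None)[0]  # beta(rho) by least squares
    e = Ay - X @ beta
    sig2 = e @ e / n                              # ML estimate of sigma^2
    _, logdet = np.linalg.slogdet(np.eye(n) - rho * W)
    return -0.5 * n * (np.log(2 * np.pi) + np.log(sig2) + 1) + logdet

def lr_test(Y, X, W, alpha=0.05):
    # Likelihood ratio test of H0 (Eq. 17) against HA (Eq. 18).
    n = len(Y)
    # OLS log-likelihood, Eq. (16), at the ML estimates of beta and sigma^2.
    beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ beta_ols
    sig2 = e @ e / n
    ll_ols = -0.5 * n * (np.log(2 * np.pi) + np.log(sig2) + 1)
    # SAR log-likelihood maximized by a grid search over rho (includes rho = 0).
    grid = np.linspace(-0.95, 0.95, 381)
    ll_sar = max(sar_loglik(r, Y, X, W) for r in grid)
    LR = 2.0 * (ll_sar - ll_ols)                  # Eq. (15)
    return LR, LR > stats.chi2.ppf(1 - alpha, df=1)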
The test is computed at α = 0.05. The frequency of rejections for each test at this level is stored and divided by the number of draws D and multiplied by 100 to give the percentage of times the test rejects the null hypothesis. When ρ ≠ 0 the test should reject the null hypothesis most of the time (commonly used benchmarks are 90%, 95%, and 99% of the time), and the chance of erring by failing to identify the true model is the percentage of times the test fails to reject the null hypothesis. The more powerful the test is, the smaller the chance of committing this type of error. Similarly, when ρ = 0 the test should fail to reject the null hypothesis most of the time, and the chance of erring by falsely identifying a non-existing process is the percentage of times the test rejects the null hypothesis (commonly used benchmarks are 1%, 5%, and 10% of the time). Ideally, the test should have a false positive probability not larger than α, the significance level. In order to assess the quality of the estimates in the simulations, the mean squared error (MSE) is calculated (q.v. Florax and Rey, 1995; Stetzer, 1982). The MSE combines the estimation variance as well as the bias of an estimate in a single summary measure of goodness of fit. For a given parameter θ, the MSE is calculated as follows:

MSE(θ) = Σr (θ̂r − θ̄)² / R + [Σr (s − θ̂r) / R]²,   for r = 1, 2, . . . , R   (19)
Fig. 1. Mean squared error of the estimate of parameter ρ for different combinations of z, c, and ρ. X axis is sampling rate, and Y axis is MSE. Markers distinguish ρ = 0, 0.1, 0.2, 0.3, 0.4, and 0.5.
where θ̂r is the estimate in replication r, θ̄ is the mean of the estimates over all replications, s is the true value of the parameter, and R is the number of replications in the simulation experiment. The simulations were performed in MATLAB using a combination of custom programs and code written by J.P. LeSage as part of his Spatial Econometrics package [http://www.spatialeconometrics.com/].¹

4.3. Matrix W: coding procedures

4.3.1. Experiment 1: omission of relevant ties

The first experiment implemented simulates the effect of omitting relevant ties from a network, either because of measurement error or by a conscious effort to reduce the cost of observation by sampling from a network. Under the assumption of a closed system, all individuals are observed for their personal attributes. Given a true network WT, on the other hand, a sample of s individuals is selected to obtain their network ties. For the individuals in this sample the network ties obtained are complete and accurate. This can be thought of as using egocentric networks in the model for some but possibly not all individuals in the system. The data is generated using Eq. (13) for all 24 matrices resulting from the combination of network topology parameters z (= 1.5, 3.5, 5.5, 7.5) and c (= 0.2, 0.3, 0.4, 0.5, 0.6, 0.7). Estimation, on the other hand, is conducted assuming that the matrix is:

WS = WT − WO   (20)
A set of 100 draws D are obtained for the error terms and independent variable X. The sample levels are s = 1.0 (complete network) to s = 0.5 (50% sample) in −0.05 steps. Each network is sampled 100 times at each sampling level, to give 100 × 100 = 10,000 independent replications R for this experiment.
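The following is a sketch of one plausible reading of this sampling scheme: a fraction s of actors is selected as egos whose ties are fully observed, so the omitted component WO of Eq. (20) consists of ties between pairs of unsampled actors. Whether the sampled binary matrix is then re-row-normalized before estimation is an assumption on our part, and the function and variable names are hypothetical.

import numpy as np

def sample_ties(A_true, s, rng):
    # One reading of the Experiment 1 scheme: a share s of actors is sampled
    # as egos; their ties are observed completely, so the only ties lost are
    # those between pairs of unsampled actors (the matrix W_O).
    n = A_true.shape[0]
    egos = rng.choice(n, size=int(round(s * n)), replace=False)
    observed = np.zeros(n, dtype=bool)
    observed[egos] = True
    A_s = A_true.copy()
    unobs = ~observed
    # Drop ties for which neither endpoint was sampled.
    A_s[np.ix_(unobs, unobs)] = 0.0
    return A_s

# Example usage: a 70% sampling rate applied to a binary true matrix A_true,
# followed by row-normalization prior to estimation (an assumed step).
# rng = np.random.default_rng(1)
# A_s = sample_ties(A_true, s=0.7, rng=rng)
# rowsum = A_s.sum(axis=1, keepdims=True)
# W_s = np.divide(A_s, rowsum, out=np.zeros_like(A_s), where=rowsum > 0)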
¹ The data and code used to conduct the simulations are available from e-content or for download from http://www.science.mcmaster.ca/geo/faculty/paez/publications.html.
4.3.2. Experiment 2: inclusion of irrelevant ties

The second experiment implemented simulates the effect of introducing irrelevant social ties in a network. As before, all individuals are observed for their personal attributes under the assumption of a closed system. The true network WT includes all relevant ties, but the model is estimated using a matrix that contains more ties than there really are, perhaps as a consequence of an overly liberal protocol for coding ties. The data is generated using Eq. (13) for all 24 matrices resulting from the combination of network topology parameters z (= 1.5, 3.5, 5.5, 7.5) and c (= 0.2, 0.3, 0.4, 0.5, 0.6, 0.7). The matrix used for estimation, on the other hand, is:

WX = WT + WI   (21)
The estimation matrix WX is thus obtained by adding a set of irrelevant links WI. These links are obtained in the form of a matrix with the same level of clustering as WT, but increasing degrees of over-specification as follows: +1.5, +3.5, +5.5, and +7.5. For each true matrix WT a set of 10,000 draws D are obtained for the error terms and independent variable X to give 10,000 independent replications R for this experiment.

5. Results and discussion

5.1. Experiment 1: omission of relevant ties (Figs. 1 and 2)²

Discussion in this and the following section is based on autocorrelation coefficient ρ and the power of the tests. The results are very similar, if quantitatively different, for coefficients β1 and β2, as noted in the concluding section. When ρ = 0, the situation is equivalent to that of no underlying social influence effect. In this case, the risk is to falsely identify a social process that does not exist when applying the likelihood ratio test. Since in this situation the model would be correct for any weight matrix, if the test for ρ has the right significance level
² Plots for coefficients β1 and β2 are available from e-content or for download from http://www.science.mcmaster.ca/geo/faculty/paez/publications.html.
Fig. 2. Rejection rates of the likelihood ratio test for different combinations of z, c, and ρ. X axis is sampling rate, and Y axis is rejection rate. Markers distinguish ρ = 0, 0.1, 0.2, 0.3, 0.4, and 0.5.
the rejection probability should be constant for any matrix W, correctly measured or not. This is confirmed by the results, where it can be seen in addition that the rejection frequency is consistently below 10%. This indicates that, at worst, there is only a 10% chance that a social influence process will be falsely identified regardless of the extent to which relevant ties are omitted from the matrix (at most 50% of all relevant ties in the experiments). In terms of the quality of the estimators, omission of relevant ties does not have any effect on the MSE of the estimates. When ρ = 0.1 the power of the test drops as the mean degree z of the underlying matrix increases, and also as the proportion of missing ties increases. Generally, at this level of autocorrelation, the test never achieves rejection rates greater than 90%, even for low z. At best, rejection rates are about 70% and 50% when z = 5.5 and 7.5, respectively, and drop consistently with increasingly sparse matrices that omit more relevant ties. The effect of c is relatively limited, but displays more variability when the mean degree of the true matrix is high at 7.5. Still, the MSE of the estimates remains flat. When ρ = 0.2 the power of the test improves, and the rejection rates tend to remain above the 90% level even for omission rates around 65%, with the exception being those cases where z is high and/or c is low. The MSE tends to remain small, and there are no noticeable differences associated with various levels of z and c in the underlying matrix. When ρ = 0.3 the power of the test, as above, remains high, with some slight gains since now the rejection rates are greater than 90% for all levels of missing ties, with a small number of exceptions for rates of 50% in situations where c is low. MSE effects are now clearly seen, and tend to increase as sampling rates decrease. At this level, these effects are still largely insensitive to the topology of the underlying network, and the only effect appears to be a small decrease in bias as the mean degree z goes up. When ρ = 0.4 the power of the test is largely the same as for the case of ρ = 0.3. MSE becomes fairly large for all coefficients, and in particular for β1, increasingly so as the proportion of omitted relevant ties increases. The effect becomes more complex with changes in z of the underlying matrix. With respect to the effect of
increasing clustering c, MSE appears in general to decline at all omission levels. There is some initial evidence that the relationships between sampling and the topology of the network are not linear. When ρ = 0.5 the power of the test remains high and drops below 90% only for proportions of omitted ties below 60% when z is moderate (∼3.5) and/or c is low (0.2–0.4). The effect on MSE is quite dramatic now. The non-linear effects first evinced in the case of ρ = 0.4 are clear now. In general the effect of sampling on MSE is to increase when z = 3.5 compared to z = 1.5, but then to decrease for z = 5.5 and then again for z = 7.5, especially for lower levels of measurement error. The effect of increasing c, on the other hand, is generally to decrease the MSE, although this effect is not linear.

5.2. Experiment 2: inclusion of irrelevant ties (Figs. 3 and 4)

As before, the case when ρ = 0 represents a situation where the effect of social influence is absent. The probability of committing a Type I error and falsely inferring a social process that does not exist is consistently below 10% for all degrees of inclusion of irrelevant ties and combinations of z and c of the underlying matrix. The quality of the estimators as measured by the MSE is good, although, and unlike the case of a matrix that fails to include all relevant ties, there is some perceptible (if marginal) increase of the MSE of ρ and β1 with increasing levels of z. This result supports the suggestion, based on Eq. (11), that some small amount of bias will be introduced by the estimate of ρ. The results, on the other hand, appear to be insensitive to changes in c, which does not seem to have an impact on MSE or the power of the test. When ρ = 0.1 and z is low the power of the test is high if the measurement errors are relatively small (i.e. a +1.5 change with respect to the mean degree of the true matrix), but it tends to drop quite rapidly as the errors become larger. This is consistent with findings by Farber et al. (in press) concerning the effect of mean degree on the power of tests, and Smith's (2007) investigation of bias in strongly connected matrices. The effect of c is relatively small, and MSE is also small for all parameters, without any evidence of interactions between z and c.
Fig. 3. Mean squared error of the estimate of parameter ρ for different combinations of z, c, and ρ. X axis is over-specification degree, and Y axis is MSE. Markers distinguish ρ = 0, 0.1, 0.2, 0.3, 0.4, and 0.5.
When ρ = 0.2 the power of the test is as before, but the loss of power as z increases becomes less dramatic. Some interactions between z and c start to become evident in terms of the power of the test, especially when z is low. The MSE is still small, but the effect of misspecification becomes more noticeable in the case of β1. When ρ = 0.3 the power of the test achieves rejection rates greater than 90% for all levels of error in the definition of W when the mean degree of the underlying matrix is 5.5 or 7.5. MSE becomes more important and tends to be slightly larger when z is small. When ρ = 0.4 the power of the test is again greater than 90% when z is 5.5 or 7.5, for all levels of inclusion of irrelevant ties.
MSE continues to increase with both increasing mean degree and increasing numbers of irrelevant ties, but the latter effect is not linear. Again, there is evidence of some complex interactions between z and c, and in general MSE appears to increase with increasing c. When ρ = 0.5 the power of the test remains high, but as before, even at this relatively high level of autocorrelation, rejection rates tend to drop as the extent of the measurement error goes up, when z = 1.5 and 3.5. When z is 5.5 or 7.5 the rejection rates are greater than 90% for all cases of inclusion of irrelevant ties studied. The quality of the estimates becomes more of a problem as MSE now tends to deteriorate, especially when z is low. MSE also tends to
Fig. 4. Rejection rates of the likelihood ratio test for different combinations of z, c, and ρ. X axis is sampling rate, and Y axis is rejection rate. Markers distinguish ρ = 0, 0.1, 0.2, 0.3, 0.4, and 0.5.
increase with increasing clustering coefficient c, although this effect is not linear. The most problematic topology configuration is a combination of moderate z and low c. As z increases, this effect becomes less marked.

6. Discussion

It is worthwhile at this point to consider the implications of the results obtained above. The experiments are based on adding and deleting ties from underlying (true) networks with different topological characteristics, while controlling for different levels of network autocorrelation, or in other words, different levels of strength of the social influence effect. Network topology in the experiments was a function of two control parameters. The first of these was the mean degree of the distribution z, simply the average number of ties per node in the network. The higher the mean degree is, the more ties each node is likely to have, while low degree values are associated with sparser weight matrices. Mean degree provides a simple measure of network connectedness. The second parameter was the clustering coefficient c, a measure of transitivity defined as the number of transitive triads formed in the network as a proportion of the theoretical maximum number of possible transitive triads. This transitivity property is frequently explained as the probability that "the friend of my friend is also my friend". Keeping the interpretations of these parameters in mind, some intuitions can be derived. First, it is worthwhile noting that estimation of the social influence effect suffers when the underlying coefficient ρ is high and z and/or c are low in situations where relevant ties are omitted. In concrete terms, this implies a network in which actors have relatively few ties to others (low z) and in which these ties are not contained in local pockets of the network (low transitivity as given by c). Since the data generation mechanism is one in which social influence plays a role (indicating a degree of similarity between connected actors in the network), this would imply that the effect of social influence on the network occurs through only a limited number of ties. Intuitively, omission of those ties can have a disproportionate effect on estimation, and this effect would be exacerbated as the level of the underlying social influence effect becomes stronger. The situation is somewhat different when irrelevant ties are included, in which case estimation suffers when z is at the lowest level (1.5) and the clustering coefficient c increases. This would tend to indicate a network composed of small pockets of connected actors with relatively few connections between pockets. In this situation, if a researcher mistakenly adds ties to the network (i.e. ties between actors that are in fact less likely to be similar), the model that follows is one in which the overall relatively low similarity of actors in the network is explained through too large a number of ties, thus diluting the social influence estimate.

7. Summary and concluding remarks

Interest in social contact, interaction, and influence appears to be on the rise in a number of disciplines. As Bothner (2003) notes within the context of mathematical sociology, with the field of network analysis moving to a mature stage it becomes increasingly critical to determine when social ties matter most. Similar opinions have been voiced as well in recent works in other fields such as regional science (Anselin, 2003) and travel behaviour (Páez and Scott, 2007).
More generally, as the ideas emanating from social network analysis diffuse to other disciplines, and the number of empirical studies is anticipated to increase, it seems important to improve our understanding of the behaviour of models used to operationalize the notion of social influence. The objective of this paper has been to explore some so far overlooked issues that arise
when the matrix used to represent social structure in empirical models of social influence contains measurement errors or is not properly coded. This paper presents some new results that cast the problem of measurement error in the mould of omission of relevant or inclusion of irrelevant network ties. These results indicate that these two types of errors are problematic in terms of estimate bias. The analytical results are suggestive but limited, given the impossibility of deriving closed expressions for the estimator of the autocorrelation parameter ρ. A numerical experiment was undertaken to further explore the implications of incorrectly measuring ties in a network by omitting relevant ties or including irrelevant ties. The results of this experiment reveal a number of interesting points. First, the numerical results provide evidence of the complex interactions between network topology and measurement error of the matrix. The main control of the power of a test and the quality of the estimates appears to be the level of autocorrelation, followed by the mean degree of the true matrix, and its clustering coefficient. The results confirm previous findings that the power of the likelihood ratio test tends to decrease with increasing z (Farber et al., in press). In addition, this result is shown to extend to the case of a matrix that includes irrelevant ties even at higher levels of autocorrelation, when the power of the test is in general much higher. The results also suggest that the quality of the estimators suffers when the matrix erroneously includes superfluous ties, even when there is no underlying social influence effect. Fortunately, the probability of falsely identifying a substantive process in this case is relatively low, and failure to reject the null hypothesis would indicate the efficiency of estimating a model without matrix W. The problems associated with this type of measurement error become more serious in those situations where ρ is high while z and/or c are low. The interactions between these two topology attributes are also complex, as indicated by the numerical results. In terms of matrices that omit relevant ties, the interactions between topology and sampling levels are relatively more straightforward, and indeed, for lower levels of autocorrelation the results appear to be insensitive to matrix topology. Estimator quality is high when ρ is small or zero, and since the probability of wrongly inferring a process is consistently below 10%, the test has high discriminating power. A more ambiguous situation emerges when autocorrelation is low but not zero, since in this case the test may fail to identify a legitimate process, even if the estimates are of good quality. At higher levels of correlation the process is more easily identified, but the quality of the estimates suffers with more stringent levels of sampling. While not directly comparable to the case of inclusion of irrelevant ties in this paper, it is worth noting that the problem of lower quality estimates appears to be more serious (in terms of the magnitude of MSE), and also more persistent for a wider range of values of z, c, and ρ, when the matrix omits relevant ties. On the other hand, the problem of model identification using the likelihood ratio test appears to be more persistent when irrelevant ties are included.
As a final remark, while there are several situations that lead to a steady deterioration of the quality of the estimates, the relative effect is more marked for the intercept β1 than for the other two parameters, including autocorrelation parameter ρ. It seems reasonable to conclude that whenever the variables are uncorrelated, the intercept will tend to absorb most of the damage produced by measurement error. This in turn would suggest the use of uncorrelated variables in empirical applications. The effect of collinearity is indicated as a topic to further extend the line of research initiated in this paper. In addition, a number of variations of the autoregressive model investigated in this paper have been developed over the years to address error autocorrelation issues, spatial
heterogeneity, and other situations in which two or more of these effects may be simultaneously present (q.v. Anselin, 1988; Griffith, 1988). A direction for further research therefore could be to assess more complex model specifications, and the potential effects in model estimation and identification of using erroneously measured matrices W.

Acknowledgments

The authors are grateful to three anonymous reviewers and the editor for their valuable suggestions. The authors alone are responsible for any remaining errors or omissions.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.socnet.2008.05.001.

References

Anselin, L., 2003. Spatial externalities, spatial multipliers, and spatial econometrics. International Regional Science Review 26, 153–166.
Anselin, L., 1988. Spatial Econometrics: Methods and Models. Kluwer, Dordrecht.
Barabasi, A.L., Albert, R., 1999. Emergence of scaling in random networks. Science 286, 509–512.
Boots, B.N., 1984. Evaluating principal eigenvalues as measures of network structure. Geographical Analysis 16, 270–275.
Bothner, M.S., 2003. Competition and social influence: the diffusion of the sixth-generation processor in the global computer industry. American Journal of Sociology 108, 1175–1210.
Cliff, A.D., Ord, J.K., 1981. Spatial Processes: Models and Applications. Pion, London.
Cressie, N.A.C., 1993. Statistics for Spatial Data. John Wiley & Sons, New York.
Doreian, P., 1980. Linear-models with spatially distributed data—spatial disturbances or spatial effects. Sociological Methods & Research 9, 29–60.
Doreian, P., Teuter, K., Wang, C.H., 1984. Network auto-correlation models—some Monte-Carlo results. Sociological Methods & Research 13, 155–200.
Dow, M.M., Burton, M.L., White, D.R., 1982. Network auto-correlation—a simulation study of a foundational problem in regression and survey-research. Social Networks 4, 169–200.
Dow, M.M., Burton, M.L., White, D.R., Reitz, K.P., 1984. Galton problem as network auto-correlation. American Ethnologist 11, 754–770.
Farber, S., Páez, A., Volz, E. Topology and dependency tests in spatial and network autoregressive models. Geographical Analysis, in press.
Florax, R.J.G.M., Rey, S., 1995. The impact of misspecified spatial structure in linear regression models. In: Anselin, L., Florax, R.J.G.M. (Eds.), New Directions in Spatial Econometrics. Springer-Verlag, Berlin, pp. 111–135.
Greenbaum, R.T., 2002. A spatial study of teachers' salaries in Pennsylvania school districts. Journal of Labor Research 23, 69–86.
Griffith, D.A., 1988. Advanced Spatial Statistics: Special Topics in the Exploration of Quantitative Spatial Data Series. Kluwer, Dordrecht.
Griffith, D.A., 1996. Some guidelines for specifying the geographic weights matrix contained in spatial statistical models. In: Arlinghaus, S.L., Griffith, D.A., Drake, W.D., Nystuen, J.D. (Eds.), Practical Handbook of Spatial Statistics. CRC Press, Boca Raton, FL, pp. 82–148.
Griffith, D.A., 2004. Extreme eigenfunctions of adjacency matrices for planar graphs employed in spatial analyses. Linear Algebra and its Applications 388, 201–219.
Griffith, D.A., Lagona, F., 1998. On the quality of likelihood-based estimators in spatial autoregressive models when the data dependence structure is misspecified. Journal of Statistical Planning and Inference 69, 153–174.
Haining, R., 1990. Spatial Data Analysis in the Social and Environmental Sciences. Cambridge University Press, Cambridge.
Hummon, N.P., Doreian, P., 1990. Computational methods for social network analysis. Social Networks 12, 273–288.
Kanski, K.J., 1963. Structure of Transportation Networks: Relationships Between Network Geometry and Regional Characteristics. University of Chicago Press, Chicago, IL.
Leenders, R.T.A.J., 2002. Modeling social influence through network autocorrelation: constructing the weight matrix. Social Networks 24, 21–47.
Marsden, P.V., Friedkin, N.E., 1994. Network studies of social influence. In: Wasserman, S., Galaskiewicz, J. (Eds.), Advances in Social Network Analysis: Research in the Social and Behavioral Sciences. SAGE, Thousand Oaks, pp. 3–25.
McVeigh, R., Myers, D.J., Sikkink, D., 2004. Corn, Klansmen, and Coolidge: structure and framing in social movements. Social Forces 83, 653–690.
Mizruchi, M.S., Neuman, E.J., 2008. The effect of density on the level of bias in the network autocorrelation model. Social Networks 30, 190–200.
Páez, A., Scott, D.M., 2007. Social influence on travel behavior: a simulation example of the decision to telecommute. Environment and Planning A 39, 647–665.
Smith, T.E., 2007. Biassedness in spatial models with strongly connected matrices. In: Paper presented at the 54th North American Meetings of the Regional Science Association International, Savannah, GA, October.
Stetzer, F., 1982. Specifying weights in spatial forecasting models—the results of some experiments. Environment and Planning A 14, 571–584.
Volz, E., 2004. Random networks with tunable degree distribution and clustering. Physical Review E 70, 5056115–5056121.
Watts, D.J., Strogatz, S.H., 1998. Collective dynamics of 'small-world' networks. Nature 393, 440–442.
Worrall, J.L., 2004. The effect of three-strikes legislation on serious crime in California. Journal of Criminal Justice 32, 283–296.