Networks and geography: Modelling community network structures as the outcome of both spatial and network processes


Social Networks 34 (2012) 6–17


Galina Daraganova a,∗, Pip Pattison a, Johan Koskinen b, Bill Mitchell c, Anthea Bill c, Martin Watts c, Scott Baum d

a School of Behavioural Science, University of Melbourne, Australia
b Nuffield College, Oxford University, United Kingdom
c The University of Newcastle, Australia
d Griffith University, Australia

Article info

Keywords: Exponential random graph models; Endogenous network processes; Clustering; Distance; Spatial processes

Abstract

This paper focuses on how to extend exponential random graph models to take into account the geographical embeddedness of individuals in modelling social networks. We develop a hierarchical set of nested models for spatially embedded social networks, in which, following Butts (2002), an interaction function between tie probability and Euclidean distance between nodes is introduced. The models are illustrated by an empirical example from a study of the role of social networks in understanding spatial clustering in unemployment in Australia. The analysis suggests that a spatial effect cannot solely explain the emergence of organised network structure and it is necessary to include both spatial and endogenous network effects in the model. © 2010 Elsevier B.V. All rights reserved.

∗ Corresponding author. Tel.: +613 92147820. E-mail addresses: [email protected], [email protected] (G. Daraganova). 0378-8733/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.socnet.2010.12.001

1. Introduction

It is not a new idea that physical distance limits people’s capacity to form and maintain relationships. Empirical research on the associations between social and geographic space has occurred in disconnected scientific communities, including human geography, tourism and regional science since at least the 1930s (see summary in Butts, 2011; Carrothers, 1956; Mok et al., 2007). Physical propinquity effects have been demonstrated to occur for different types of relationship and at multiple levels of analysis (Festinger et al., 1950; Merton, 1948; Caplow and Forman, 1950; Blake et al., 1956; Whyte, 1957; Sommer, 1969; Wellman, 1996; Mok et al., 2007; Faust et al., 1999; Axhausen, 2006; Larsen et al., 2006). Furthermore, this relationship appears remarkably robust to advances in technology (internet, phones), transportation (highways, freeways, etc.), and cultural differences (Latane et al., 1995; Carley and Wendt, 1991). Over the last 50 years researchers have consistently argued that human relationships are predominantly “local” and that the probability of a tie diminishes as the distance between actors increases, often following either a power law or an exponential decay function (Brown and Moore, 1970; Freeman and Sunshine, 1976; Irwin and Hughes, 1992; Morrill, 1963; Kleinberg, 2000; Wong et al., 2005; Butts, 2002; Butts and Carley, 2000; Butts et al., 2007). The most recent study (Preciado et al., in press) provides empirical evidence that the log odds of a friendship tie between adolescents decreases smoothly as the logarithm of their distance increases. They have also shown that the strength of distance dependence is negatively related to age and shared meeting places. A number of empirical studies of the features of large-scale, spatially embedded human networks have also been conducted (Bernard et al., 1988; Butts, 2002; Butts and Carley, 2000; Dodds et al., 2003; Killworth and Bernard, 1978; Korte and Milgram, 1970; Liben-Nowell et al., 2005; Milgram, 1967; Travers and Milgram, 1969).

While past research has borne out that the geographical arrangements of individuals have powerful structuring effects on social relationships and social interactions, there have been relatively few attempts to build explicitly spatial models of social networks and to use these models to understand the way in which social networks are embedded geographically. Models proposed by Butts (2002) and Wong et al. (2005) are notable exceptions. They aim to understand how different geographical arrangements of connected actors give rise to a particular social network structure, and so, for example, whether geographical proximity between individuals can explain some of the commonly observed properties of social networks, such as clustering, skewed degree distributions, and short average geodesic distances.

Butts (2002) utilised a family of spatial non-directed inhomogeneous Bernoulli graphs to study the empirical relationship between geographical distance and network tie probability. He proposed a


model that assumes that ties between individuals are independent of one another conditional on an observed distance structure (Butts, 2002):

$$ \Pr(X = x \mid D = d, \varphi_d) = \prod_{i,j} B(x_{ij} \mid \varphi(d_{ij})), \tag{1} $$

where $X$ is an adjacency matrix of network tie variables, with $x_{ij} = 1$ if there is a tie between $i$ and $j$ and $x_{ij} = 0$ otherwise; $D$ is a distance matrix, with $d_{ij}$ the geographical distance between $i$ and $j$; $\varphi_d = \varphi(d)$ is a distance interaction function, mapping distances defined on $(0, \infty)$ onto tie probabilities in $[0, 1]$; $x$ and $d$ refer to realisations of $X$ and $D$, respectively; and $B$ is the Bernoulli probability mass function given by



$$ B(x, \varphi_d) = \begin{cases} p_d & \text{if } x = 1, \\ 1 - p_d & \text{if } x = 0. \end{cases} \tag{2} $$

Here the substantial steps are to identify a choice of distance measure, $D$, as a function of locations in a physical space, and to specify the distance interaction function, $\varphi_d$, that relates distance to tie probability. Given a distance matrix $D$ and a distance interaction function $\varphi_d$, the spatial Bernoulli graph can be constructed as the outcome of a series of independent Bernoulli trials in which the probability of each edge between actors $i$ and $j$ is determined by the distance between them and the distance interaction function. Butts (2002) argued that this model captures some of the commonly observed structural characteristics of the network, such as a high degree of transitivity and the formation of locally dense clusters. Butts emphasised that spatial locations can be considered not only in terms of physical locations of actors (for example, residential location) but also in terms of social positions, for example, positions in Blau space, where social locations of individuals are represented in a socio-demographic coordinate system. The focus of this paper, however, is on physical locations of individuals, represented by “physical distance” or “geographical proximity”.

Wong et al. (2005) also developed a model for networks in which the edge probability between any two nodes is considered to be dependent on the spatial distance between those nodes, and demonstrated that this model is also able to capture many commonly observed properties of social networks. They represented social networks through an extension of non-directed Erdős–Rényi random graphs by proposing a step-function relationship between edge probability and spatial distance, i.e.:



$$ \Pr(x_{ij} \mid d_{ij}) = \begin{cases} p + p_b, & \text{if } d_{ij} \le R, \\ p - \alpha, & \text{if } d_{ij} > R, \end{cases} \tag{3} $$

where $x_{ij}$ is a network tie; $d_{ij}$ is the Euclidean distance between actors $i$ and $j$; $p$ is the density (i.e. average tie probability) of the network; $p_b$ is the proximity bias, which specifies the sensitivity to geographical proximity; $R$ is the neighbourhood radius within which the proximity bias applies; and $\alpha$ is a correction term that is required to ensure that the average density remains the same, given a distance matrix, $d$, for all possible $R$ and $p_b$. It is assumed that the $X_{ij}$ are independent and identically distributed Bernoulli random variables conditional on distance being $\le R$ or $> R$, and hence that the proximity bias, $p_b$, is the same for all actors.

These models demonstrate how the potential importance of geographical proximity to tie formation processes can be parameterised in models for social networks. The main advantage of these models is that they enable us to explore the parametric function relating social interactions to physical distance and can demonstrate simple regularities in a social network that may be associated with spatial proximity of actors. The two models propose different functions. In Butts’ model, tie probabilities vary according to distance via


a continuous spatial interaction function, whereas in Wong et al.’s model, there is a simple threshold, or step-function, relating tie probability and distance, with the proximity bias in operation only when distance is below the threshold.

In spite of this clear difference, the two models are similar in one important respect: they do not incorporate complex dependencies between edges, and they explain the emergence of social network structure solely in terms of spatial proximities among actors. Yet it may be unrealistic to assume that network ties are conditionally independent given spatial proximity. Whether there is a network tie between two actors, say Ben and Sarah, may depend not only on the geographical proximity of Ben and Sarah, but also on whether they have network partners in common. While Butts (2002) and Wong et al. (2005) primarily focus on the cases in which edges are independent given the distance, they both recognise that endogenous social processes may account for some aspects of emergent social structure, so that spatial proximity may be both a by-product and a determinant of social structure. They also indicate the necessity of constructing a nested family of exponential random graph models that can be used to evaluate the empirical evidence for spatial and network dependencies, and demonstrate that both models can be easily translated into the more general exponential family framework with some minor modification. It is worth noting that (to our knowledge) there have been no applications to date of this promising approach.

The first objective of this paper is therefore to formulate models that allow simultaneous estimation of potential spatial and network effects involved in tie formation. We do so by building empirically testable models for network structure that accommodate geographical proximity as well as endogenous network processes in explanations of network structure. In the process of model construction, we utilise exponential random graph models as well as Butts’ (2011) framework for specifying the distance interaction function, a function that describes the relationship between tie probability and the distance between actors. Our second objective is to apply these models to network data gathered using a snowball sampling methodology in a suburban community within Melbourne, Australia, and hence to assess, at the level of a large community, the potentially distinctive roles that spatial proximity and network processes may play in shaping social network structure.
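The conditionally independent spatial models reviewed above can be illustrated with a small simulation. The following is a minimal sketch only, not the procedure used by Butts (2002) or Wong et al. (2005): the function names, the uniformly scattered coordinates and all parameter values are hypothetical, and the step function simply follows the form of Eq. (3).

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distance matrix for an (n, 2) array of planar coordinates."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def simulate_spatial_bernoulli(dist, phi, rng):
    """Draw one non-directed graph with independent ties, Pr(x_ij = 1) = phi(d_ij)."""
    n = dist.shape[0]
    iu = np.triu_indices(n, k=1)           # visit each unordered pair once
    probs = phi(dist[iu])
    ties = rng.random(probs.shape) < probs  # one Bernoulli trial per pair
    x = np.zeros((n, n), dtype=int)
    x[iu] = ties
    return x + x.T                          # symmetric, zero diagonal

def step_phi(p, p_b, radius, alpha):
    """Step function in the form of Eq. (3): p + p_b within radius, p - alpha beyond."""
    return lambda d: np.where(d <= radius, p + p_b, p - alpha)

# Hypothetical illustration: 200 actors scattered over a 10 km x 10 km square.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 10.0, size=(200, 2))
dist = pairwise_distances(coords)
x = simulate_spatial_bernoulli(dist, step_phi(0.02, 0.30, 1.0, 0.01), rng)
```

Any continuous distance interaction function can be passed in place of `step_phi`, which is what makes the two models directly comparable in this sketch. Because every pair is drawn independently, any clustering in the simulated graph arises from geography alone.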

2. Methodology

2.1. Exponential random graph models

As noted earlier, a random graph or network is represented by a binary matrix $X = [X_{ij}]$ of network tie variables on a node set $N$. Each possible edge or tie in the network is regarded as a random variable, with $x_{ij} = 1$ if there is an edge from node $i$ to node $j$, and $x_{ij} = 0$ otherwise. Here, we regard the node set as fixed and possible ties as non-directed ($X_{ij} = X_{ji}$) and disallow self-ties of the form $X_{ii}$. The matrix of all network variables is denoted by $X$, while $x = [x_{ij}]$ refers to a realisation of $X$.

Exponential random graph models (ERGMs) constitute a general class of stochastic models that has been developed to model networks with a fixed number of nodes (Frank and Strauss, 1986; Wasserman and Pattison, 1996). In the formulation of exponential random graph models, a tie is considered the unit of analysis rather than an individual. The distinctive feature of ERGMs is that the assumption of independence among network tie variables may be relaxed, allowing different assumptions about the dependencies among network variables to be incorporated. As Frank and Strauss demonstrated, the dependence assumptions determine the


sufficient network statistics for a model and, in combination with a homogeneity assumption, yield a model in which the sufficient statistics are counts of particular network subgraphs or configurations. For instance, the Markov dependence assumption gives rise to a model whose statistics are the number of edges, the numbers of stars of different sizes, and the number of triangles. More specifically, Frank and Strauss (1986) showed that some fundamental theorems for interdependent observations developed in spatial statistics (Besag, 1974a,b) could be applied to arbitrary dependence structures, and hence to structures with interdependent network tie variables. They applied these results to a network array $X$ to obtain a general expression for a probability model $\Pr(X = x)$ from a specification of which pairs of relational variables are conditionally independent, given the values of all other relational variables. This approach yields the model:

$$ \Pr(X = x) = \frac{1}{\kappa} \exp\Big( \sum_{C} \lambda_C z_C(x) \Big), \tag{4} $$

where the summation is over all possible subgraph configurations $C$; $\lambda_C$ is a parameter corresponding to the configuration $C$ and is nonzero only if all pairs of network variables in $C$ are assumed to be conditionally dependent; $z_C(x) = \prod_{(i,j) \in C} x_{ij}$ is the network statistic corresponding to the configuration $C$ (indicating whether all ties in $C$ are observed in $x$); and $\kappa$ is a normalising quantity which ensures that (4) is a proper distribution. Assuming that isomorphic configurations have equal parameters (the homogeneity assumption), this model expresses the probability of a network as a function of the frequency of occurrence of various types of configurations $C$ (edges, k-stars, triangles and so on) and, hence, allows us to examine a variety of structural effects. When a parameter corresponding to a type of configuration is large and positive, networks with many such configurations are more probable; conversely, when the parameter is large and negative, such networks are less probable (Robins et al., 2005).

Three types of dependence assumption that may influence network topologies can be identified in the literature on modelling networks. The first is Bernoulli dependence, where all tie variables are assumed to be independent; this assumption leads to the well-known Erdős–Rényi model (Erdos and Renyi, 1959), in which only edge configurations have non-zero parameters. The second is Markovian dependence (Frank and Strauss, 1986), which assumes that pairs of tie variables are independent, conditional on the remaining tie variables, unless they share a node. Configurations with non-zero parameters in this case are edge, star and triangle configurations. The third is a realisation dependence assumption (Pattison and Robins, 2002; Snijders et al., 2006), in which the dependence between a pair of tie variables may depend on the realised value of other tie variables.
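To make the homogeneous Markov specification concrete, the sketch below counts the edge, 2-star and triangle statistics of a small non-directed network and combines them into the unnormalised log-probability of Eq. (4). It is an illustration under stated assumptions: the function names and parameter values are hypothetical, only stars of size two are counted, and the normalising quantity is not computed, since doing so would require summing over all possible graphs.

```python
import numpy as np

def markov_statistics(x):
    """Edge, 2-star and triangle counts for a symmetric 0/1 adjacency matrix."""
    x = np.asarray(x)
    degrees = x.sum(axis=1)
    edges = int(degrees.sum() // 2)                        # each tie is counted twice
    two_stars = int((degrees * (degrees - 1) // 2).sum())  # pairs of ties sharing a node
    triangles = int(np.trace(x @ x @ x) // 6)              # closed 3-walks / 6
    return edges, two_stars, triangles

def log_weight(x, params):
    """Unnormalised log-probability: sum over configurations of parameter * statistic."""
    return float(np.dot(params, markov_statistics(x)))

# Toy network: a triangle on nodes 0, 1, 2 with node 3 attached to node 2.
toy = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
stats = markov_statistics(toy)               # (4, 5, 1)
weight = log_weight(toy, (-1.0, 0.1, 0.5))   # hypothetical parameter values
```

With a positive triangle parameter, graphs containing more triangles receive a larger weight, which is the sense in which a large positive parameter makes such networks more probable.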
Versions of this dependence assumption, in combination with the Markov dependence assumption, give rise to models in which alternating triangles and alternating 2-path configurations are involved; these are reviewed below. It is worth emphasising that different dependence assumptions yield different model specifications.

Robins et al. (2001) and Pattison and Robins (2002) showed that exogenous factors may sometimes interact with potential dependencies among network tie variables, and developed a general formulation of ERGMs with node-level covariates. This more general model may also be elaborated to include dyadic covariates, and hence to allow potential interactions between network configurations and spatial proximities. In particular, the presence of a network tie between any two nodes may be conditionally dependent on the distance between the spatial locations of the corresponding nodes. Further, interactions among network tie variables may also depend on spatial proximities among nodes, giving rise to more complex configurations, denoted here by $I$, comprising

both network and proximity variables. This assumption in combination with the Markov dependence assumption gives rise to configurations of star, two-path and triangle types with additional proximity variables. While these more complex configurations are of potential interest, they are not considered in this paper. Let distances be represented by a matrix of continuous distance variables $D = [D_{ij}]$ of all possible physical distances among pairs of individuals in the population, with realisation $d = [d_{ij}]$. Then the general form of ERGMs for continuous spatial dyadic variables is as follows:

$$ \Pr(X = x) = \frac{1}{\kappa} \exp\Big( \sum_{C} \lambda_C z_C(x) + \sum_{I} \delta_I z_I(x, d, \varphi) \Big), \tag{5} $$

where the second summation is over all configurations $I$; $\delta_I$ is a parameter corresponding to the interaction configuration $I$; $z_I(x, d, \varphi) = \prod_{(i,j) \in I} x_{ij} \varphi(d_{ij})$ is the statistic corresponding to the configuration $I$; other terms are as in (4). Distance between individuals can be calculated in a number of ways, and different distance measures may be useful in different contexts. In the applications considered here, Euclidean distance is used since the modelling focuses on distances at a regional scale in a two-dimensional physical space. As the distance between any nodes $i$ and $j$ is a continuous measure, the model also includes the distance interaction function $\varphi(d_{ij})$ (Besag, 1974a,b; Robins et al., 2001). The choice of the interaction function is important and is explained below.

2.2. Distance interaction function

The distance interaction function plays a key role in formulating an exponential random graph model as it determines the spatial properties of the network. Butts (2011) defined four key characteristics of the distance interaction function that should be parameterised when proposing a function, i.e. monotonicity, baseline probability at origin, curvature near the origin, and tail weight. Butts (2011) showed that while a monotonic function adequately describes spatial behaviour in most cases, there are situations when non-monotonicity may occur. For example, criminal behaviour can involve people who live in nearby areas but not in the same local area. Baseline probability at origin refers to the tie probability when the distance is zero. Although it might be assumed that the tie probability tends to 1 as the distance between nodes tends to 0, whether a tie probability actually equals 1 depends primarily on the type of relationship under investigation, since interaction does not always take place when two parties are sufficiently close to one another.
The form of curvature refers to the impact of distance when distances are small, and in particular to how quickly the distance effect diminishes (Butts, 2011). A local positive curvature of the distance interaction function indicates that the impact of distance diminishes rapidly as the distance increases. By contrast, a local negative curvature indicates that the impact of distance diminishes more slowly at short distances than at slightly longer distances. The tail weight governs the number of edges at large distances. Although a tie probability at large distances is approximately zero, “. . . even differences in long-distance tie probability which appear small to the unaided eye can have an easily discernible effect on the number of realised long-ranged ties” (Butts, 2011, p. 34).

Although different functional forms can be posited for the distance interaction function, attention is restricted here to two possible families: (i) an inverse power law and (ii) exponential decay. These families were chosen because they show similar general model behaviour but differ in tail weight. Also, these functions have been widely used in several empirical studies (Latané et al., 1992; Kleinberg, 2000; Butts, 2002) to model the probability of interaction as a function of the distance between individuals. Below, we utilise Butts’ (2011) framework for specifying the distance interaction function. Butts (2011) defines two forms of the inverse power law of the distance interaction function: (i) the standard power law, denoted PL, and (ii) the attenuated power law, denoted AP. The PL model is defined as follows (Butts, 2011):

$$ \varphi(d_{ij}) = \frac{p_b}{(1 + \alpha d_{ij})^{\gamma}}, \tag{6} $$

where $p_b$ is a baseline probability, $\gamma$ is a parameter controlling the effect of distance (i.e. tail weight), and $\alpha$ is a scaling parameter controlling for the growth of $d$, “the actual range over which distance effects are realised” (Butts, 2011, p. 36). The AP is defined as follows (Butts, 2011):

$$ \varphi(d_{ij}) = \frac{p_b}{1 + \alpha d_{ij}^{\gamma}}, \tag{7} $$

where each parameter in Eq. (7) has the same interpretation as in Eq. (6). The difference between these two forms is the local curvature. The PL has a positive local curvature, while the AP has a negative local curvature. The distance interaction function associated with the exponential decay law (EDL) model is defined as follows:

$$ \varphi(d_{ij}) = \frac{p_b}{e^{\alpha d_{ij}}}, \tag{8} $$

where the scaling factor ($\alpha$) and the baseline tie probability ($p_b$) are independently parameterised. The second form in this family is the logistic probability law, denoted LPL. LPL is an attenuated analogue to the exponential decay law and is recognisable as a one-sided logit transform of the (additively inverted) distance (Butts, 2011):

$$ \varphi(d_{ij}) = \frac{(1 + \beta) p_b}{1 + \beta e^{\alpha d_{ij}}}, \tag{9} $$

where the scaling factor ($\alpha$) and the baseline tie probability ($p_b$) have the same interpretation as in the above equation and $\beta$ controls the shape of local curvature. The EDL and LPL also differ in local curvature, i.e. positive and negative, respectively. As mentioned above, the main distinction between the exponential decay law and the power law is in tie probabilities at large distances, i.e. the form of the tail. That is, as distance increases the tie probability tends to 0 more quickly for exponential decay functions than for inverse power law functions. Hence, in an exponential decay function the tie probability is virtually zero beyond some given point, while in power law functions the tie probability at large distances is non-negligible.

A challenge in the present context is to write these functional forms in exponential family form. While it is quite simple to express EDL and LPL within the exponential random graph model framework, there are no equivalent forms in the ERGM for PL and AP. The explicit transformation of these forms into the ERGM framework leads to a curved exponential family graph model and would require estimation of non-linear parameters. Specifically, Butts (2011) showed that a general case for power law dependencies can be expressed as follows: dij = logit(ϕ(dij)).

(10)
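The practical difference between the four families is easiest to see numerically. The sketch below implements Eqs. (6)–(9) as read here, assuming the exponent $\gamma$ appears as $(1 + \alpha d)^{\gamma}$ in PL and as $d^{\gamma}$ in AP; the function names and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def power_law(d, p_b, alpha, gamma):
    """PL, Eq. (6): p_b / (1 + alpha * d) ** gamma."""
    return p_b / (1.0 + alpha * d) ** gamma

def attenuated_power_law(d, p_b, alpha, gamma):
    """AP, Eq. (7): p_b / (1 + alpha * d ** gamma)."""
    return p_b / (1.0 + alpha * d ** gamma)

def exponential_decay(d, p_b, alpha):
    """EDL, Eq. (8): p_b / exp(alpha * d)."""
    return p_b * np.exp(-alpha * d)

def logistic_law(d, p_b, alpha, beta):
    """LPL, Eq. (9): (1 + beta) * p_b / (1 + beta * exp(alpha * d))."""
    return (1.0 + beta) * p_b / (1.0 + beta * np.exp(alpha * d))

# Hypothetical parameter values; distances in km.
d = np.array([0.0, 0.5, 1.0, 5.0, 20.0, 50.0])
p_b, alpha, gamma, beta = 0.5, 1.0, 2.0, 0.1

pl = power_law(d, p_b, alpha, gamma)
ap = attenuated_power_law(d, p_b, alpha, gamma)
edl = exponential_decay(d, p_b, alpha)
lpl = logistic_law(d, p_b, alpha, beta)
```

All four functions return the baseline probability at distance zero and decrease with distance. At the largest distances, however, the two exponential forms are many orders of magnitude smaller than the two power-law forms, which is the tail-weight contrast described in the text.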

To overcome the issue of using a curved exponential family graph model, we consider the simple case in which a tie between nodes $i$ and $j$ is conditionally independent of other ties in the graph given the distance between $i$ and $j$. The probability of the network $x$ can then be written as follows:

$$ \Pr(X = x) = \frac{1}{\kappa} \exp\big( \theta x_{ij} + \theta_{\varphi} x_{ij} \varphi(d_{ij}) \big). \tag{11} $$


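The probability that there is a tie between $i$ and $j$ is:

$$ \Pr(X_{ij} = 1 \mid D_{ij} = d) = \frac{e^{\theta + \theta_{\varphi} \varphi(d)}}{e^{\theta + \theta_{\varphi} \varphi(d)} + 1} = \frac{1}{1 + e^{-(\theta + \theta_{\varphi} \varphi(d))}}. \tag{12} $$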

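One approach to expressing the distance interaction function in an ERGM-like form is then to make the following substitutions:

$$ \theta = -\log \alpha, \qquad \theta_{\varphi} = -\gamma, \qquad \varphi(d) = \log d. $$

The model can then be expressed in the form:

$$ \Pr(X_{ij} = 1 \mid D_{ij} = d) = \frac{1}{1 + \exp\{-(-\log \alpha - \gamma \log d)\}} = \frac{1}{1 + \alpha e^{\gamma \log d}} = \frac{1}{1 + \alpha d^{\gamma}}. \tag{13} $$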
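The algebra leading to Eq. (13) can be checked numerically. The short sketch below evaluates the logistic form of Eq. (12) under the substitutions $\theta = -\log \alpha$, $\theta_{\varphi} = -\gamma$ and $\varphi(d) = \log d$, and compares it with the closed form $1/(1 + \alpha d^{\gamma})$. The function names and the values of $\alpha$ and $\gamma$ are hypothetical, and distances must be strictly positive because of the logarithm.

```python
import math

def tie_probability(d, theta, theta_phi, phi=math.log):
    """Conditional tie probability of Eq. (12): logistic in theta + theta_phi * phi(d)."""
    return 1.0 / (1.0 + math.exp(-(theta + theta_phi * phi(d))))

def log_function(d, alpha, gamma):
    """Closed form of Eq. (13): 1 / (1 + alpha * d ** gamma)."""
    return 1.0 / (1.0 + alpha * d ** gamma)

alpha, gamma = 0.5, 2.0            # hypothetical values
theta, theta_phi = -math.log(alpha), -gamma

for d in (0.1, 1.0, 5.0, 20.0):    # distances must be > 0
    assert math.isclose(tie_probability(d, theta, theta_phi),
                        log_function(d, alpha, gamma))
```

Because the linear predictor is linear in $\log d$, both parameters can be estimated as ordinary (non-curved) exponential family parameters, which is the point of the substitution.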

The model now looks like an attenuated power law function with baseline probability equal to one and with the theta parameter referring to a scaling factor in a simple Bernoulli case. Below we will refer to this function as the LOG function. While it is always possible that the inclusion of more complex dependence assumptions may lead to changes in the form of the relationship between tie probability and distance, for the explanatory purposes of this paper this case is not considered (see footnote 1).

While past researchers have developed models for spatially embedded networks, the model proposed here is rather different. Hoff et al. (2002) and Handcock et al. (2007) developed a model for the probability of a tie as a function of unobserved positions of actors in a d-dimensional Euclidean latent social space. In the current model, the probability of a tie between any two individuals is likewise assumed to depend on the distance between actors, here geographical distance in a two-dimensional Euclidean geographical space, but in this case the distances are observed. Hoff et al. (2002) and Handcock et al. (2007) were mainly focused on how unobserved dyadic proximities might explain some fraction of observed triangulation in social networks, and tended to view this as in some sense a competing mechanism for triangle formation. A second difference in the model proposed here is the explicit incorporation of endogenous network effects. The modelling approach utilised here is also very similar to Butts’s spatial models for social networks (Butts, 2000, 2003, 2011) but is not limited to Bernoulli inhomogeneous graphs. Hoff et al. (2002), Handcock et al. (2007) and Butts (2011) developed important modelling frameworks to which endogenous network effects might be added, but there has been practically no empirical investigation of simultaneous spatial and network effects. The model (2) just described therefore provides a valuable framework for such empirical investigations.

Using this approach, we address the following questions in the empirical study presented below:

• What role might distance play in network tie formation? In particular, what is the functional form of the relationship between distance and tie probability?
• Does taking account of spatial effects reduce the apparent effect of endogenous network processes?

1 For example, this would be the case if the dependence of tie probability on distance was itself dependent on the presence of neighbouring network ties, that is, if there were more complex interactions between network and distance variables than we assume here.


3. Empirical example

3.1. Data

The data used in this example are derived from a survey designed to assess the role of networks and geographical proximity in explaining the distribution of unemployment in a suburban Australian region (see Daraganova, 2008, for details). The survey was a quantitative, interviewer-administered study in which participants were recruited via a 2-wave snowball sampling scheme (e.g., Frank, 2005; Frank and Snijders, 1994; Goodman, 1961; Handcock and Gile, 2010). Specifically, a stratified random sample of individuals was drawn from the population of interest. This sample is termed the initial sample. Individuals in the initial sample were asked to name their network partners. Named individuals could be individuals from the initial sample or not. Some of those named individuals might be named by several individuals from the initial sample. Those who were mentioned by at least one individual in the initial sample and who were not in the initial sample comprised the first wave of the snowball sample. Individuals in the first wave were then asked to name their network partners. Those who were not members of the initial sample or the first wave comprised the second wave of the snowball sample. Participants in the second wave were also asked about their network partners, but the latter were not followed.

Ideally, the snowball sampling design assumes that a respondent nominates all his/her network partners and a researcher follows all of these nominees. In practice, it is difficult for respondents to name and for the researcher to recruit or follow all network partners, so participants were asked to recruit up to 4 network partners out of the list of nominees generated from two name generator questions. Given that the respondent dataset and the nominee dataset are not mutually exclusive, i.e. wave 1 and wave 2 respondents appear as both nominees and respondents, and given that the reference geographical area was approximately 14 km in diameter, it was possible that different respondents could nominate the same people. To identify unique individuals across the sample of respondents and nominees, a probabilistic matching process based on first name, initial of surname, residential street, suburb, gender and age was used. As a result, 551 unique individuals, comprising 306 respondents and 245 non-interviewed participants, were identified. Of the 306 respondents, 58 were respondents in the initial wave, 81 were wave one respondents, and 167 were wave two respondents; the 245 non-interviewed participants were network partners of second wave respondents (see Daraganova, 2008, for details). Of the 551 individuals in the sample, 279 were female. Respondents’ ages ranged between 15 and 70 years and non-interviewed participants’ ages ranged between 13 and 78 years.

The following key variables were measured for all respondents in the sample: network ties and residential location. The network ties were derived as a binary network from two name generator questions: (1) who are you close to? and/or (2) with whom do you discuss employment matters? Nominating any individual on either or both of these relations led to the presence of a tie in the network; the resulting network was an undirected, binary, composite tie network. The argument for combining the responses to these two questions is that both ties are likely to involve discussion of employment matters. Based on the above approach, two networks of particular interest were constructed: first, the network of size 551, referred to as the Augmented Network and comprising both respondents and non-interviewed participants; and, second, the network of size 306, comprising respondents only and referred to as the Respondents Network.
On average, the two name generators yielded 2.06 (sd = 1.81) and 0.96 (sd = 1.06) recruited contacts, respectively, and the average number of recruited contacts overall per respondent was 2.69 (sd = 1.97).
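The two-wave snowball design described in this section can be sketched as a simple traversal of a network. This is a toy illustration only and not the study's fieldwork procedure: the function name, the example network and the random choice of which partners are recruited are all assumptions, and the sketch does not record the non-interviewed partners named by the final wave.

```python
import random

def snowball_sample(neighbours, initial, waves=2, max_recruits=4, seed=0):
    """Return a list of waves; element 0 is the initial sample.

    neighbours maps each person to a list of network partners. Each respondent
    recruits at most max_recruits partners, and a person already sampled in an
    earlier wave is not added again.
    """
    rng = random.Random(seed)
    sampled = set(initial)
    result = [list(initial)]
    for _ in range(waves):
        next_wave = []
        for person in result[-1]:
            partners = list(neighbours.get(person, []))
            # Hypothetical rule: recruits are drawn at random from the partners.
            recruits = rng.sample(partners, min(max_recruits, len(partners)))
            for partner in recruits:
                if partner not in sampled:
                    sampled.add(partner)
                    next_wave.append(partner)
        result.append(next_wave)
    return result

# Hypothetical toy network.
toy_network = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
}
sample_waves = snowball_sample(toy_network, initial=["a"])
```

In the toy network, "b" and "c" form the first wave and "d" the second, while "e" is never reached because partners of the last wave are not followed.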

Fig. 1. Distribution of street lengths.

Residential locations were derived from street name, street type and locality (suburb) and then geocoded. The absence of data on street numbers represented a problem for address geocoding. To overcome this problem three options were considered: (i) to allocate the address point to the centroid of the Collection District (CD); (ii) to allocate the address point to the centroid of the specified street; or (iii) to allocate the address point randomly in the specified street. The first two options could cause potential ambiguity. First, it was impossible to define a unique CD as streets can pass across more than one CD. Second, different sides of the street could be associated with a different CD. Finally, the main disadvantage of the second method is that the geographical distance between individuals residing on the same street will always be set to 0 even though individuals may not be at the same residential location. The last option was the least ambiguous as actual address points for the street could be constructed, and one could be sampled at random, and assigned to the sample address. This method takes account of the fact that streets numbers are not necessarily consecutive and that the same number can refer to different households (apartments/units/townhouses). For example, for address point “Albatross Court, Keilor” all possible street numbers are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 14, and one item from this list is sampled at random. This approach recognises that 11 is missing and 14 appears twice. This method has the advantage that it assigns a unique identifier to each address point, and, affords the calculation of geographical distance between residential locations of individuals taking into account that distance equals 0 for people who live in the same household. 
Using latitude and longitude coordinates, Euclidean distance was calculated from one address point to another, ignoring roads and natural barriers for all address points in the sample data (Daraganova, 2008, p. 166). It is worth noting that the length of a street ranged from 31 m to 5200 m with 83.1% of street lengths being no greater than 1000 m, 11.1% being 1000–2000 m, and only 5.8% greater than 2000 m. It can be seen in Fig. 1 that the distribution of street distances is positively skewed with mode 400 m and mean 666.1 m with standard deviation 841.0 m. The error introduced by the uncertainty in address points is therefore small relative to the estimated distances between residential locations (see Table 1 below).

3.2. Descriptive statistics

Fig. 2 depicts the respondents’ network and the augmented network where vertices are located by standardised geographic coordinates.

G. Daraganova et al. / Social Networks 34 (2012) 6–17


Fig. 2. Spatial layout of respondents’ network. (a) Respondents’ network; (b) augmented network.

Table 1 Geographical data summary for the samples of 306 and 551 individuals.

                                  306        551
Mean (km)                         6.877      11.058
Maximum (km)                      21.938     297.396
People at distance >25 km (%)     0          4.9

On examining Fig. 2a, it is very difficult to draw any conclusions about whether ties in the respondent network are more likely at small distances than at large distances. One apparent feature of this figure is the empty space where the dashed line is drawn. The dashed line refers to the Western Ring Road, a major highway linking Melbourne’s major arterial highways. On examining Fig. 2b, it can immediately be discerned that the augmented network contains a large number of individuals who are in relatively close proximity to one another (95.1% of individuals at distance less than 22 km) and only a small number of individuals at relatively large distances. Table 1 presents a summary of geographical data for the two samples. There is a large discrepancy in the mean distance and maximum distance between the Respondent and Augmented Networks that, in part, reflects the latter’s wave recruitment strategy (which allowed nominees to reside outside the reference area).

Fig. 3 represents a network layout of the two networks. The depiction attempts to optimise the readability of the underlying network structure by separating different components and minimising edge crossing. Table 2 presents basic network characteristics for the two networks. It can be immediately noticed that the Respondent and Augmented Networks comprise 34 and 28 disconnected subgraphs, respectively, which is not unexpected given the snowball sampling design. It can be seen that most of the subgraphs are relatively large in size and highly clustered; the largest components consist of 37 and 74 nodes for the Respondent and the Augmented Networks, respectively. The overall density is quite low for both networks. A key question to be addressed is whether the observed clustering is best modelled as the outcome of endogenous network processes, geographical proximity or some combination of these processes.

Fig. 3. Network layout of respondent and augmented networks. (a) Respondent Network; (b) augmented network.

Table 2 Network characteristics for respondents’ and augmented network.

                          Respondent Network    Augmented Network
Density                   0.011                 0.009
Global clustering         0.58                  0.55
Number of components      34                    28
The largest component     37                    74
The smallest component    3                     3

4. Analysis and results

The results are presented in two parts. The first part describes the analysis of different forms of the distance interaction function. The second part presents the results of fitting the exponential random graph models with geographical proximity to the empirical data.

4.1. Distance interaction function

4.1.1. Statistical analysis

To identify the best functional form for the relationship between tie probability and geographical distance, parameters were estimated for each of the functional forms of the distance interaction function described above. The estimation was conducted in the absence of network effects.

– Attenuated power law with $p_b = 1$ (LOG):

$$\phi(d_{ij}) = \frac{1}{1 + \alpha d_{ij}^{\gamma}};$$

– Standard power law (SPL):

$$\phi(d_{ij}) = \frac{p_b}{(1 + \alpha d_{ij})^{\gamma}};$$

– Attenuated power law (APL):

$$\phi(d_{ij}) = \frac{p_b}{1 + \alpha d_{ij}^{\gamma}};$$

– Exponential decay function (EDL):

$$\phi(d_{ij}) = \frac{p_b}{e^{\alpha d_{ij}}};$$

– Logistic probability law (LPL):

$$\phi(d_{ij}) = \frac{(1 + \beta)\,p_b}{1 + \beta e^{\alpha d_{ij}}}.$$
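As a reading aid, the five functional forms listed above can be written as plain functions. This is a minimal sketch: the function names and the parameter values used to evaluate the curves are illustrative assumptions, not the estimates reported in the paper.

```python
import math


def attenuated_power_law(d, alpha, gamma, p_b=1.0):
    """p_b / (1 + alpha * d**gamma); with p_b = 1 this is the LOG form."""
    return p_b / (1.0 + alpha * d ** gamma)


def standard_power_law(d, alpha, gamma, p_b):
    """p_b / (1 + alpha * d)**gamma."""
    return p_b / (1.0 + alpha * d) ** gamma


def exponential_decay(d, alpha, p_b):
    """p_b / exp(alpha * d)."""
    return p_b * math.exp(-alpha * d)


def logistic_law(d, alpha, beta, p_b):
    """(1 + beta) * p_b / (1 + beta * exp(alpha * d))."""
    return (1.0 + beta) * p_b / (1.0 + beta * math.exp(alpha * d))


# Hypothetical parameter values, chosen only to show the shape of each curve.
P_B, ALPHA, GAMMA, BETA = 0.9, 0.04, 1.0, 0.5
SMALL_ALPHA = 0.0001  # keeps exp(alpha * d) moderate at 14,000 m
distances_m = [0.0, 100.0, 1000.0, 14000.0]

curves = {
    "LOG": [attenuated_power_law(d, ALPHA, GAMMA) for d in distances_m],
    "SPL": [standard_power_law(d, ALPHA, GAMMA, P_B) for d in distances_m],
    "APL": [attenuated_power_law(d, ALPHA, GAMMA, P_B) for d in distances_m],
    "EDL": [exponential_decay(d, SMALL_ALPHA, P_B) for d in distances_m],
    "LPL": [logistic_law(d, SMALL_ALPHA, BETA, P_B) for d in distances_m],
}
```

For positive parameters every form returns the baseline probability at distance 0 (1 for LOG, $p_b$ otherwise) and decreases as distance grows; the forms differ in how quickly the tail flattens.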

A reference distribution was the homogeneous Bernoulli model. The simplest way to estimate the parameters and define the best parametric form for $\phi$ is to fit data using the spatial Bernoulli model via maximum likelihood methods. In the Bernoulli model it is assumed that variables are binomially distributed, hence, the likelihood function $L$ has the following form:

$$L(x \mid d, \theta) = \prod_{i<j} \phi(d_{ij};\theta)^{x_{ij}} \left(1 - \phi(d_{ij};\theta)\right)^{1 - x_{ij}}, \qquad (14)$$

where $x_{ij}$ corresponds to the observed relational tie between actors $i$ and $j$, $d_{ij}$ corresponds to the observed distance between $i$ and $j$, and $\theta$ corresponds to a vector of parameters $(\alpha, \gamma, p_b)$. The log likelihood is:

$$l(x \mid d, \theta) = \sum_{i<j} \left[ x_{ij} \log \phi(d_{ij};\theta) - x_{ij} \log\left(1 - \phi(d_{ij};\theta)\right) + \log\left(1 - \phi(d_{ij};\theta)\right) \right]. \qquad (15)$$

Fig. 4. Average number of ties (binned) by distance (meters). Inset is a corresponding log–log plot.
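The log likelihood in Eq. (15) can be evaluated directly for any candidate parameter values. The sketch below does this for the LOG form and then picks the best point on a coarse grid. The dyad data, the grid and the use of a grid search are all illustrative assumptions: the text does not state which optimiser was used, and the toy data are not from the study.

```python
import math


def log_form(d, alpha, gamma):
    """Attenuated power law with p_b = 1: tie probability at distance d > 0."""
    return 1.0 / (1.0 + alpha * d ** gamma)


def bernoulli_log_likelihood(ties, distances, alpha, gamma):
    """Eq. (15): sum over dyads of x*log(phi) - x*log(1 - phi) + log(1 - phi)."""
    total = 0.0
    for x, d in zip(ties, distances):
        phi = log_form(d, alpha, gamma)
        total += x * math.log(phi) - x * math.log(1.0 - phi) + math.log(1.0 - phi)
    return total


def grid_search_mle(ties, distances, alphas, gammas):
    """Return (log-likelihood, alpha, gamma) for the best grid point."""
    best = None
    for a in alphas:
        for g in gammas:
            ll = bernoulli_log_likelihood(ties, distances, a, g)
            if best is None or ll > best[0]:
                best = (ll, a, g)
    return best


# Hypothetical dyads: tie indicator and distance in metres (all distances > 0,
# since phi = 1 at d = 0 would make log(1 - phi) undefined).
ties = [1, 1, 0, 1, 0, 0, 0, 0]
distances = [10.0, 50.0, 80.0, 200.0, 500.0, 1500.0, 4000.0, 9000.0]
alphas = [0.001, 0.01, 0.05, 0.1]
gammas = [0.5, 1.0, 1.5]

best_ll, best_alpha, best_gamma = grid_search_mle(ties, distances, alphas, gammas)
```

In practice a numerical optimiser would replace the grid, but the objective being maximised is the same sum over dyads.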

Given that network ties and distance between individuals are known, any unknown parameters can be estimated by finding the maximum likelihood estimates $\hat{\theta}$. It is worth noting that the independence assumption means that missing data may be conceived as being independent draws based on observed data; therefore, the snowball sampling design could be ignored while fitting the Bernoulli model (for details see Handcock and Gile, 2010; Koskinen et al., 2010). After the parameters of the different functional forms have been estimated, the functional forms can be compared. Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) were computed for goodness of fit evaluation. For both indices, lower scores indicate a better fit.

4.1.2. Results

The binned relative frequency of ties against the distance interval is depicted in Fig. 4. It can be seen that as distance increases the frequency of ties decreases, indicating that geographic proximity is indeed associated with an increased probability of a tie. However, for distances larger than 14,000 m there is no apparent relationship observed between the presence of a tie and distance. Hence, it is difficult to fit these functional forms to the entire data set. From a theoretical perspective (Butts, 2011), this difficulty might be anticipated since it is very plausible that different processes are at work at different distance scales. Given that we are primarily interested in the process of tie formation at small distances and taking into account that for both datasets the average distance among individuals is less than 14,000 m, we fitted data only for observations at distance no greater than 14,000 m. The amount of data lost when truncating the different data sets can be put in terms of loss of observations (pairs of individuals), loss of edges, or loss of actors (see Table 3). Specifically, when the 14,000 m truncation rule is applied the proportion of retained pairs of individuals is 83.44% and 95.44% for the 551-node and 306-node datasets, respectively; the proportion of retained edges is 87.31% and 97.04% for the two data sets, respectively; the proportion of retained actors is 100% for both datasets. These figures indicate that the amount of data lost by the truncation rule is not substantial.

Table 3 Loss of observations, edges, and actors by truncating the data to 14 km distances.

Sample            551                     306
Cut off (km)      14         300          14         300
Observations      127,090    151,389      44,484     46,607
Edges             1101       1261         492        507
Actors            551        551          306        306

Table 4 presents the maximum likelihood estimates along with the goodness of fit criteria for various functional forms for all observations at distance no greater than 14,000 m. It can be seen from the table that according to AIC and BIC criteria and also model deviance, the best model is the attenuated power law with baseline probability equal to 1 (LOG), followed closely by the standard power law (SPL) and the general attenuated power law (APL). The exponential decay law model reproduced the observed data much less adequately, while it was not possible to estimate the logistic probability law.

Table 4 Maximum likelihood estimates for distances truncated to the interval [5; 14,000].

Model  df  Sample  pb             α                 γ              β          AIC        BIC
LOG    2   551     1              0.041 (0.011)     0.969 (0.016)  –          10341.63   109423.13
LOG    2   306     1              0.031 (0.002)     0.995 (0.016)  –          6124.92    6124.92
SPL    3   551     0.801 (0.146)  0.045 (0.011)     0.972 (0.016)  –          10341.82   10342.56
SPL    3   306     0.840 (0.100)  0.023 (0.001)     0.985 (0.012)  –          6125.11    6125.01
APL    3   551     0.918 (0.163)  0.043 (0.001)     0.975 (0.016)  –          10341.98   10342.89
APL    3   306     0.931 (0.156)  0.021 (0.002)     0.996 (0.014)  –          6125.23    6125.48
EDL    2   551     0.016 (0.001)  0.0001 (0.00001)  –              –          12480.42   12499.92
EDL    2   306     0.026 (0.001)  0.0001 (0.00001)  –              –          8645.55    8721.51
LPL    3   551     Undefined      Undefined         –              Undefined  Undefined  Undefined
LPL    3   306     Undefined      Undefined         –              Undefined  Undefined  Undefined

Note: Undefined means that MLEs could not be obtained.

A graphic example of how different distance interaction functions replicate the data is presented in Fig. 5. The figure represents the attenuated power law and the exponential decay function for the largest dataset where the distance is truncated to the interval [5; 14,000]. It can be seen that the exponential decay function does a very poor job in predicting ties not only at small distances but also at large distances, while the attenuated power law reproduces the relationship between tie probability and distance very well over all distances. Examination of the results in Table 4 indicates that there are no substantial differences in maximum likelihood estimates for the $p_b$ and $\gamma$ parameters and for AIC and BIC values for each dataset across the functions LOG, SPL, and APL.
The differences are only observed for the scaling factor, which is not unexpected given that the samples differ in size and that the maximum distance between individuals varies across the two datasets. It is remarkable that the estimates are so similar not only for each dataset across the different functions, but also for each function across the datasets. Of major interest for the current paper is the comparison between the general attenuated power law and the attenuated power law in which the tie probability at the origin is set to 1. It can be seen that for both networks the parameter for the tie probability at the origin is approximately 1 in the general attenuated power law case, i.e. it equals 0.92 and 0.93, respectively. The estimates of the scaling factor and of gamma are likewise similar across the two forms. The results strongly suggest that the assumption that the tie probability at distance 0 equals 1 is quite reasonable. The results suggest that, first, there is clearly an effect of distance on tie formation. Second, ties are more probable at very short distances and there is a steep decrease in the distance interaction function from a region with relatively large tie probabilities to a rather flat tail. Third, the models indicate that, while the tie probability beyond a certain distance varies only slightly and tends to be very small, some ties nonetheless remain possible at large distances.
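The model comparison in this subsection relies on AIC and BIC. A minimal sketch of the standard definitions is given below; the log-likelihood values, parameter counts and number of observations are hypothetical, and the text does not state which quantity the authors used as the sample size in BIC.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2l."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k*log(n) - 2l."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical fits: (maximised log-likelihood, number of free parameters).
candidates = {
    "two-parameter form": (-100.0, 2),
    "three-parameter form": (-99.5, 3),
}
n_obs = 1000  # hypothetical number of observations

scores = {
    name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
```

In this toy comparison the extra parameter raises the log likelihood by only 0.5, which is not enough to offset the penalty, so the two-parameter form is preferred. This is the same logic by which a form with a fixed baseline probability can rank ahead of its more general counterpart.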

Fig. 5. Examples of how the attenuated power law (a) and the exponential decay model (b) replicate the observed data for the sample of 551 individuals. Black and grey curves represent the observed and expected data, respectively. Horizontal axis corresponds to distances between individuals in meters. Vertical axis corresponds to average number of ties by bin. Insets are corresponding log–log plots.

4.2. Spatial vs. network clustering

4.2.1. Statistical analysis

To address the question of whether taking spatial effects into account reduces the effects of endogenous network processes, a nested sequence of four exponential random graph models was fitted to each of the Respondent and Augmented Networks. By using a hierarchical set of nested models, it is possible to compare simpler with more complex models and hence assess the contribution provided by a more complex dependency structure. One of the specific advantages of using a hierarchical set of nested models is that it allows us to compare the forms of clustering in networks that are modelled as spatial and network processes.

The first model specification is a complete independence or homogeneous Bernoulli model, in which it is assumed that all observed ties are independent and occur with equal probability. The model therefore contains a single choice parameter, referring to the edge count, and is used as a baseline model. This model is constrained by the Bernoulli dependence assumption.

The second model specification, the spatial effects model, assumes that ties are conditionally independent given the distance between the corresponding pair of individuals. This model allows us to assess the unique contribution of spatial effects in the explanation of network clustering. Log of distance was used across all estimations. The model contains two effects, an edge effect and a spatial effect. This model is by definition equivalent to the attenuated power law function with the baseline tie probability set equal to 1 (LOG), with the following correspondence:

$$\theta = -\log\alpha, \qquad \theta_{\phi} = -\gamma, \qquad \phi(d) = \log d,$$

where $\theta$ and $\theta_{\phi}$ denote the edge and spatial parameters. The correspondence holds because, under the LOG function, the log-odds of a tie between $i$ and $j$ equal $-\log\alpha - \gamma \log d_{ij}$. However, the comparison of the estimates of the scaling factor and the power parameter obtained using the exponential random graph models approach with the corresponding ones obtained by fitting the LOG function is not possible. In part, this can be explained by the amount of data fitted. While in the former case all observations are included in the analysis, in the latter only the observations at distance no more than 14 km were considered.

The third model, the network effects model, allows us to model network ties as a function of endogenous network processes and assess the unique contribution of network effects in the absence of spatial effects.
The network effects model is constrained by a combination of Markov and partial dependence assumptions and consists of an edge effect, an alternating star effect and an alternating triangle effect (Snijders et al., 2006). The alternating star effect refers to a collection of configurations in which one actor has ties to various numbers of others (i.e., a collection of so-called star subgraphs) and the alternating triangle effect refers to a collection of configurations in which a pair of connected actors has various numbers of common partners. These parameters allow for interdependence between ties and also assist with modelling multiple connectivity and triangulation within a network. It should be noted that models including only Markov effects failed to converge.

The fourth model, the combined model, is constructed to reflect both network and spatial processes simultaneously while controlling each process for the presence of the other. Table 5 presents a list of all the effects included across the four models.

Table 5 Effects modelled, their configurations and interpretations.

Effect                  Statistic
Edge                    $\sum_{i \neq j} x_{ij}$
Alt-star                $\sum_{k=2}^{n-1} (-1)^{k} S_k / \lambda^{k-2}$ (a)
Alt-triangle            $\sum_{k=2}^{n-1} (-1)^{k} T_k / \lambda^{k-2}$ (b)
Geographic proximity    $\sum_{i \neq j} x_{ij} \log d_{ij}$

See Pattison and Robins (2008) and Snijders et al. (2006) for additional information.
a Where $S_k$ is the number of nodes with degree equal to $k$, $\lambda = 2$.
b Where $T_k$ is the number of mixed triangles with $k$ shared nodes, $\lambda = 2$.

For each network, for all four ERGMs, conditional maximum likelihood estimates were obtained using the procedure for incomplete data (snowball sample data) of Pattison et al. (in preparation). This conditional estimation proceeds in the same way as the simulation-based estimation procedure for completely observed networks proposed by Snijders (2002), with the exception that part of the network is considered fixed in the simulation. The estimation procedure yields, for each effect in the model, a parameter estimate, the standard error of the parameter estimate, a convergence statistic and a t-ratio. The estimate is regarded as having converged if the convergence statistic is less than 0.1 in absolute value. The t-ratio is the ratio of the parameter estimate to the standard error of the estimator and corresponds to an approximate Wald test of whether a parameter is significantly different from 0. If this ratio is larger than 2 in absolute value, the parameter is considered to be significantly different from 0. In the following, a parameter thus deemed significant is asterisked.

The fit of the model to the data is checked by comparing selected graph statistics of the observed graph which were not directly modelled with the corresponding graph statistics in a distribution of graphs simulated from the parameter values obtained in the conditional estimation process (Goodreau, 2007; Hunter et al., 2008; Snijders et al., 2006). A t-ratio, the goodness of fit (GOF) ratio, is computed to compare any observed statistic of interest to the distribution of the statistic in the simulated sample. The statistic, based on the mean and standard deviation of the statistic in the sample, is used as an indicator of good or poor fit. As a heuristic, a GOF ratio of more than two in absolute value suggests that the observed statistic is unlike that of the sample. Local and aggregated local effects have been used for model evaluation. The local effects comprised Markov effects (2-star, 3-star and triangle) and higher order effects (alternating star, alternating triangle and alternating 2-path statistics). The aggregated local effects included the standard deviation of the degree distribution, the skewness of the degree distribution, global clustering, the mean of local clustering, and the variance of local clustering². All models were fitted and tested using the SPnet software (Wang et al., 2008), available at http://www.sna.unimelb.edu.au/.

4.2.2. Results

Table 6 presents the parameter estimates and standard errors for the two networks, for all four models. Table 7 contains the heuristic assessment of goodness-of-fit. We first consider the baseline model (model 1) and the spatial effects model (model 2) for the Respondent Network. The large difference in the edge parameter for models 1 and 2 reflects the introduction of the scaling factor in the second model.
It can be seen that the age effect is significant in both models, suggesting that a tie is more likely between individuals of similar age. The gender effect is not significant. It is worth noting that the inclusion of the geographical effect does not affect the age effect. The GOF ratios obtained from an evaluation of goodness of fit for these models suggest that neither model provides a good fit, with GOF ratios larger than 2 in absolute value for almost all statistics, although the improvement in fit from model 1 to model 2 is clearly apparent. Nonetheless, neither of these models can replicate any of the clustering effects, the degree distribution, or the standard deviation and skewness of the degree distribution. The results suggest that while the geographical arrangement of individuals is important, the emergence of network structure cannot be explained solely by spatial proximity.

2 Degree distribution is the probability distribution of nodes’ degrees over the whole network. Global clustering is the number of triangles proportional to the number of 2-paths. Local clustering of a node i is the number of triangles connected to node i proportional to the number of 2-paths centred on node i (Watts and Strogatz, 1996).

Table 6 Parameter estimates and standard errors for different model specifications for the respondents’ network.

Respondent Network
Effect                  Model 1          Model 2          Model 3          Model 4
Edge                    −4.87* (0.13)    1.56* (0.65)     −4.79* (0.66)    −0.20 (0.87)
Alt-star                –                –                −0.86* (0.18)    −0.86* (0.2)
Alt-triangle            –                –                2.74* (0.15)     2.69* (0.14)
Geographic proximity    –                −0.78* (0.08)    –                −0.56* (0.07)
Age heterophily         −0.07* (0.01)    −0.07* (0.01)    0.001 (0.07)     0.002 (0.06)
Gender homophily        −1.13* (0.61)    −1.13 (0.69)     0.09 (0.83)      0.07 (0.47)

Augmented Network
Effect                  Model 1          Model 2          Model 3          Model 4
Edge                    −4.83* (0.07)    2.92* (0.29)     −5.74* (0.56)    −1.17 (0.74)
Alt-star                –                –                −0.79* (0.13)    −0.83* (0.16)
Alt-triangle            –                –                3.09* (0.09)     2.91* (0.09)
Geographic proximity    –                −0.95* (0.04)    –                −0.51* (0.04)
Age heterophily         −0.03* (0.01)    −0.03* (0.01)    0.002 (0.09)     0.003 (0.07)
Gender homophily        −0.59 (0.48)     −0.72 (0.45)     0.07 (0.63)      0.04 (0.56)

Note: Statistically significant parameters are marked by an asterisk. A dash indicates an effect not included in the model.

A different picture is seen for the network effects model (model 3). While model 3 still underestimates the clustering effects in the data (the t-statistics for the counts of triangles, the global clustering coefficient and the variance of local clustering coefficients are larger than 3 in absolute value), the model nonetheless provides a considerable improvement in fit over models 1 and 2 (see Table 7). It can be seen that in the presence of network effects, the age effect disappears. We see a strong positive effect and a negative effect for the alternating triangle and alternating-star effects, respectively. This is not surprising and suggests that there is a tendency for triangles to occur together in “clumps” and a tendency for individuals to differ in degree.

A remarkable result is obtained when we add the geographical proximity in model 4. While it has been suggested by a number of studies (Butts, 2002, 2007, 2011; Hoff et al., 2002) that geographical proximity can partly explain clustering within a network, we observe a slightly different pattern. Comparison of parameter estimates across models 3 and 4 reveals that the inclusion of the geographical proximity does not affect the network effects in the Respondent Network. The parameter estimates for the network effects remain almost the same and significant. This conclusion is also supported by the goodness of fit results. The network effects model (model 3) reproduces observed network statistics reasonably well even when the geographical proximity is not included. The difference in edge parameters reflects the interaction between the density parameter and the scaling factor. The significant negative spatial effect suggests that people who reside close to each other are more likely to have a tie. But, while geographical proximity among individuals appears to be important to the process of tie formation, geographical processes appear to be relatively independent of network clustering processes. Geographical proximity may be more likely to give rise to initial tie formation than to more complex characteristics of network structure.

Table 7 Heuristic goodness-of-fit results for spatial models for the respondents network. GOF-ratio = (observation − sample mean)/standard deviation

Respondent Network
Statistic                      Model 1    Model 2    Model 3    Model 4
2-Star                         2.78       2.67       1.4        1.13
3-Star                         6.18       5.76       2.17       1.67
Triangle                       70.94      53.01      4.80       4.04
Alt-star                       0.33       0.36
Alt-triangle                   49.42      40.98
Alt-2-path                     −2.60      −2.33      −1.13      −1.01
Geographical proximity         −1.99                 −1.05
Std dev degree distribution    6.24       5.79       2.53       2.28
Skew degree distribution       5.92       5.61       1.55       0.89
Global clustering              18.9       16.1       3.95       3.63
Mean local clustering          43.11      32.26      −0.9       −1.34
Variance local clustering      10.4       6.03       4.09       3.58

Augmented Network
Statistic                      Model 1    Model 2    Model 3    Model 4
2-Star                         4.00       3.17       1.28       0.94
3-Star                         7.50       5.60       2.00       1.45
Triangle                       140.77     57.19      7.30       3.46
Alt-star                       0.51       0.44
Alt-triangle                   82.71      48.35
Alt-2-path                     −6.53      −5.99      −2.29      −1.74
Geographical proximity         −6.33                 −3.01
Std dev degree distribution    9.05       6.75       2.64       1.73
Skew degree distribution       −2.29      −1.95      −0.28      −0.21
Global clustering              40.37      27.93      7.23       3.67
Mean local clustering          90.09      48.30      −1.25      −1.23
Variance local clustering      4.35       −5.03      6.62       4.41

Note: Blank cells refer to fitted effects that have convergence statistic values less than 0.1 in absolute value.

It can be seen from Tables 6 and 7 that a similar pattern is observed for the Augmented Network. The first two independence models have very poor fit. While the spatial effect model (model 2) suggests that the inclusion of geographical proximity improves the model fit, it is still far from being a reasonable model. The network model (model 3) and the combined model (model 4) are of particular interest. While for both models some t-ratio values (for non-fitted statistics) are larger than the criterion proposed above, model 4 represents the data slightly better than model 3. Comparison of the parameter estimates for these models suggests a similar pattern as for the Respondent Network. The inclusion of the spatial effect does not change the parameter values for the alternating-triangle and alternating-star effects substantially. One difference in results for the Augmented Network is observed for the goodness of fit results across models 3 and 4. While the inclusion of the spatial effects did not considerably change the goodness of fit results for the Respondent Network, the inclusion of the geographical proximity parameter reduces the GOF ratios for the triangle counts and the degree distribution in the Augmented Network. A possible interpretation is that, while distance is mainly responsible for initial tie formation at small scale distances (approximately 22 km), there may be an interaction between network and spatial effects in the sense that network closure effects may be dampened over larger scale distances (approximately 100–300 km). Remarkably, the parameter estimates for the alternating-triangle, alternating-star, and spatial effects change only a little in comparison with the corresponding effects for the Respondent Network (see Tables 6 and 7).

It is also noteworthy that, comparing t-ratios for the spatial effects across all models for the Respondent and Augmented Networks, systematic differences can be seen. While the t-ratio is less than 1 across all models for the Respondent Network, suggesting that all models reproduce the spatial effect to an adequate degree, in the Augmented Network the t-ratio is quite large across models, suggesting that the geographical effects are not well captured unless we control for the spatial effect. This difference is observed as a result of differences in the geographical distribution of individuals in the two networks. The maximum distance between individuals in the Respondent Network is 22 km and the average distance is approximately 7 km. In the Augmented Network the maximum distance between two individuals is 297 km and the average distance is approximately 11 km. One possible explanation of the results presented above is that distance and network effects may begin to interact at large scales. At large scales, distance may temper some of the network effects that can operate over smaller distances.
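The heuristic GOF ratio used throughout this section, (observation − sample mean)/standard deviation, can be sketched as follows. The simulated triangle counts are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics


def gof_ratio(observed, simulated):
    """(observed - mean of simulated) / standard deviation of simulated."""
    mean = statistics.mean(simulated)
    sd = statistics.stdev(simulated)  # sample standard deviation (assumed)
    if sd == 0:
        raise ValueError("simulated statistics have zero variance")
    return (observed - mean) / sd


def is_poor_fit(ratio, threshold=2.0):
    """Heuristic from the text: |ratio| above two suggests a poor fit."""
    return abs(ratio) > threshold


# Hypothetical triangle counts in graphs simulated from a fitted model.
simulated_triangles = [8, 10, 12]
ratio_close = gof_ratio(10, simulated_triangles)
ratio_far = gof_ratio(20, simulated_triangles)
```

Applied to a statistic that was not fitted directly, such as the triangle count under models 1 and 2, a large ratio signals that graphs simulated from the model contain far fewer (or far more) of that configuration than the observed network.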

5. Conclusions The main goal of this study was to formulate models that allow simultaneous estimation of spatial and network effects from observed network data, and, then, to assess these effects simultaneously. The assessment was made using a study designed to assess community network structure as a function of spatial locations of individuals in a suburban Australian region. As a method we utilised the exponential random graph model approach as it allows the relaxation of the assumption of independence between observations and is capable of modelling spatial and network processes simultaneously. In line with previous research, the analyses suggested that distance is likely to play an important role in network tie formation. While the probability of a tie decreases rapidly as the distance between individuals increases, a non-negligible interaction probability remains at large distances. We showed that the attenuated power law with baseline probability set to one describes the relationship between tie probability and distance for the particular dataset reasonably well. Further analyses suggested that a spatial effect cannot solely explain the emergence of organised network structure and that, rather, endogenous network processes should be taken into account. Interestingly, it was shown that, while it is necessary to include both effects in the model, the size of each effect is relatively unaffected by the presence of the other, at least over relatively small distances. In particular, it was shown that the inclusion of network effects substantially improves model fit and helps to explain clustering within the network. While spatial effects assist in the explanation of initial tie formation and to improve the fit of the degree distribution, network effects are important to explain network structure. It was also shown that at larger distances spatial and network processes operate slightly differently. 
While geographical proximity and endogenous network processes still provide a unique contribution to the emergence of organised

social system, at large distances network closure effects are not solely able to explain clustering within a network. Moreover, in this particular context the network effects are relatively robust to the omission of geographical proximity, i.e. estimates of the clustering and degree effects are almost the same with and without controls for the geographical arrangements of individuals. This is a very important observation, since most social network research does not have information on geographical proximity. If a strong interaction between network effects and geography had been found, then a validity of inferences drawn from network analyses that did not control for geographical effects would become a concern. Of course, more empirical research is needed to confirm this finding. The following limitations of this study should be mentioned. The first limitation concerns the representativeness of the sample and is a result of the way in which respondents were sampled and traced. Only people who were sufficiently proficient in English were interviewed. With regard to nominees, only those who agreed to participate were followed. Therefore, it may be that those participating are not representative of the area under investigation. The second limitation concerned the spatial data, and, consequently, a level of accuracy in geocoding process. No street numbers were available by design and the street type could be missing or incorrectly supplied. These limitations arise as a result of confidentiality concerns and the fallibility of reporting information about nominees. While it is very difficult to avoid these limitations in network research, in the future it would be of great advantage to collect complete information on residential location for all respondents and nominees, i.e. street number, street type, and suburb. 
In spite of these limitations, the study is one of a very few which have collected relatively precise network and spatial information not only for respondents but also for their network partners. Most survey studies have collected information on the relative locations of alters, for example, distance to alters, where they live in communities, or the time it takes to travel to visit alters. Other studies have collected proxy information on local social networks such as whether people know many of their neighbours, or where family members live in the neighbourhood. Noticeable exceptions are studies by Faust et al. (1999), Butts et al. (2007), and Liben-Nowell et al. (2005), where spatial information to the level of town (village) or state has been collected. A third limitation concerns the distance interaction function which was implemented in the models. While it was observed that this functional form provided a plausible description of the relationship between tie probability and distance, possible modification of this parametric form in the presence of network effects has not been explored. In order to do that, the curved exponential random graph modelling framework (Hunter and Handcock, 2006) should be utilised so that non-linear parameters can be estimated. This study has made a significant advance in the application of exponential random graph models to relational and spatial data. The main argument proposed in the study is that the ERGM framework can be utilised to examine the spatial embeddedness of social networks. In particular, it has been demonstrated that the ERGM framework can be used to successfully model spatially dependent network processes. It has also been demonstrated that a hierarchy of models can be developed to assess the unique contribution of various spatial and network effects and their interactions. 
Moreover, it has been shown how an estimation approach that conditions on the measures observed in some waves of a snowball sample can be used to estimate models successfully from an incomplete set of observations. It is worth noting that, until now, a major difficulty in estimating ERGMs on a large scale has been the requirement of a complete set of observations on network ties.
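
The third limitation above concerns the parametric form of the distance interaction function. As a minimal sketch of what such a function looks like, the Python snippet below implements two decay forms of the kind mentioned in the Introduction (a power law and an exponential decay). The attenuated power-law parameterisation, the function names and all parameter values (`p_base`, `alpha`, `gamma`) are illustrative assumptions and are not the forms or estimates reported in this study.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two planar points given as (x, y) tuples."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def power_law_tie_probability(distance, p_base=0.5, alpha=1.0, gamma=2.0):
    """Assumed attenuated power law: p_base / (1 + alpha * d) ** gamma.

    p_base is the tie probability at distance zero; alpha rescales distance;
    gamma controls how quickly the probability decays. All three values are
    hypothetical defaults chosen only for illustration.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return p_base / (1.0 + alpha * distance) ** gamma


def exponential_tie_probability(distance, p_base=0.5, alpha=1.0):
    """Assumed exponential decay: p_base * exp(-alpha * d)."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return p_base * math.exp(-alpha * distance)


if __name__ == "__main__":
    # Hypothetical coordinates for two individuals (arbitrary units).
    d = euclidean_distance((0.0, 0.0), (3.0, 4.0))
    print("distance:", d)
    print("power law:", power_law_tie_probability(d))
    print("exponential:", exponential_tie_probability(d))
```

In both forms the parameters enter the tie probability non-linearly, which is why estimating them jointly with network effects would call for a curved exponential family treatment rather than fixing them in advance.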

Acknowledgements
We thank Carter Butts for his helpful suggestions and discussion. We would also like to thank Peng Wang for the programming of the simulation and estimation programs.

References
Axhausen, K. (Ed.), 2006. Moving Through Nets: The Physical and Social Dimensions of Travel. Elsevier, Oxford.
Besag, J.E., 1974a. Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B: Methodological 36, 96–127.
Blake, R.R., Rhead, C.C., Wedge, B., Mouton, J.S., 1956. Housing Architecture and Social Interaction. Sociometry 19 (2).
Bernard, H., Killworth, P., Evans, M., McCarty, C., Shelley, G., 1988. Studying social relations cross-culturally. Ethnology 27 (2), 155–179.
Besag, J., 1974b. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B (36), 96–127.
Brown, L.A., Moore, E.G., 1970. Urban acquaintance fields: an evaluation of a spatial model. Environment and Planning 2, 443–454.
Butts, C.T., 2003. Predictability of large-scale spatially embedded networks. In: Breiger, R., Carley, K., Pattison, P. (Eds.), Dynamic Social Network Modelling and Analysis: Workshop Summary and Papers. National Academies Press, Washington, DC, pp. 313–323.
Butts, C.T., 2000. Spatial Models of Large-Scale Interpersonal Networks, PhD, Carnegie Mellon University.
Butts, C.T., 2011. Space and Structure: Methods and Models for Large-Scale Interpersonal Networks. Springer (under contract).
Butts, C.T., 2002. Predictability of Large-scale Spatially Embedded Networks. In: Breiger, R., Carley, K.M., Pattison, P. (Eds.), Dynamic Social Network Modelling and Analysis: Workshop Summary and Papers. National Academies Press, Washington, DC, pp. 313–323.
Butts, C.T., Carley, K., 2000. Spatial Models of Large-Scale Interpersonal Networks. Unpublished paper.
Butts, C.T., Petrescu-Prahova, M., Cross, B.R., 2007. Responder communication networks in the world trade center disaster: implications for modeling of communication within emergency settings. The Journal of Mathematical Sociology 31 (2), 121–147.
Caplow, T., Forman, R., 1950. Neighbourhood interaction in a homogeneous community. American Sociological Review, 357–366.
Carley, K., Wendt, K., 1991. Electronic mail and scientific communication. Science Communication 12 (4), 406–440.
Carrothers, G.A.P., 1956. An historical review of the gravity and potential concepts of human interaction. American Institute of Planners 22, 94.
Daraganova, G., 2008. Statistical models for social networks and network-mediated social influence processes. Thesis dissertation. The University of Melbourne.
Dodds, P.S., Muhamad, R., Watts, D.J., 2003. An experimental study of search in global social networks. Science 301 (5634), 827–829.
Erdos, P., Renyi, A., 1959. On Random graphs. 1. Publicationes Mathematicae (Debrecen) 6, 290–297.
Faust, K., Entwisle, B., Rindfuss, R., Walsh, S., Sawangdee, Y., 1999. Spatial arrangements of social and economic networks among villages in Nang Rong district, Thailand. Social Networks 21 (4), 311–337.
Festinger, L., Schachter, S., Back, K., 1950. Social Pressures in Informal Groups. Stanford University Press, Stanford, California.
Frank, O., 2005. Network sampling and Model fitting. In: Carrington, J., Wasserman, S. (Eds.), Models and Methods in Social Network Analysis. Cambridge University Press, New York.
Frank, O., Strauss, D., 1986. Markov graphs. Journal of American Statistical Association 81, 832–842.
Frank, O., Snijders, T., 1994. Estimating the size of hidden populations using snowball sampling. Journal of Official Statistics 10 (1), 53–67.
Freeman, L.C., Sunshine, M.H., 1976. Race and intra-urban migration. Demography 13 (4), 571–575.
Goodman, L.A., 1961. Snowball sampling. The Annals of Mathematical Statistics 32, 148–170.
Goodreau, S., 2007. Advances in exponential random graph (p*) models. Social Networks 29 (2).
Handcock, M., Gile, K., 2010. Modeling social networks from sampled data. Annals of Applied Statistics 4 (1), 5–25.
Handcock, M.S., Raftery, A.E., Tantrum, J., 2007. Model-based clustering for social networks. Journal of the Royal Statistical Society, Series A 170 (2), 301–354.
Hoff, P.D., Raftery, A.E., Handcock, M.S., 2002. Latent space approaches to social network analysis. Journal of American Statistical Association 97 (460), 1090–1098.
Hunter, D.R., Handcock, M.S., 2006. Inference in curved exponential family models for networks. Journal of Computational and Graphical Statistics 15 (3), 565–583.
Hunter, D., Goodreau, S., Handcock, M., 2008. Goodness of fit of social network models. Journal of the American Statistical Association 103 (1), 248–258.
Irwin, M., Hughes, H., 1992. Centrality and structure of urban interaction: measures. Concepts and Application. Social Forces 71 (1), 17–51.
Kleinberg, J.M., 2000. Navigation in a small world. Nature 406, 845.
Killworth, P.D., Bernard, H.R., 1978. Reversal small-world experiment. Social Networks 1 (2), 159–192.
Koskinen, J.H., Robins, G.L., Pattison, P.E., 2010. Analysing exponential random graph (p-star) models with missing data using Bayesian data augmentation. Statistical Methodology 7 (3), 366–384.
Korte, C., Milgram, S., 1970. Acquaintance networks between racial groups – application of small world method. Journal of Personality and Social Psychology 15 (2).
Larsen, J., Urry, J., Axhausen, K., 2006. Mobilities, Networks, Geographies. Aldershot, Ashgate.
Latane, B., Liu, J.H., Nowak, A., Bonevento, M., Zheng, L., 1995. Distance matters: physical space and social impact. Personality and Social Psychology Bulletin 21 (8), 795–805.
Liben-Nowell, D., Novak, J., Kumar, R., Raghavan, P., Tomkins, A., 2005. Geographic routing in social networks. Proceedings of the National Academy of Sciences of the United States of America 102 (33), 11623–11628.
Merton, R., 1948. The social psychology of housing. In: Dennis, W. (Ed.), Current Trends in Social Psychology. University of Pittsburgh Press, Pittsburgh, PA, pp. 163–217.
Milgram, S., 1967. Small-world problem. Psychology Today 1 (1), 61–67.
Mok, D., Wellman, B., Basu, R., 2007. Did distance matter before the Internet? Interpersonal contact and support in the 1970s. Social Networks 29, 430–461.
Morrill, R., 1963. The distribution of migration distances. Papers of the Regional Science Association 11, 75–84.
Pattison, P., Robins, G., 2002. Neighborhood based models for social networks. Sociological Methodology 32, 301–337.
Pattison, P.E., Robins, G.L., 2008. Probabilistic network theory. In: Rudas, T. (Ed.), Handbook of Probability Theory with Applications. Sage Publications, Thousand Oaks, CA, pp. 291–312.
Pattison, P.E., Robins, G.L., Snijders, T.A.B., Wang, P., in preparation. Conditional estimation of exponential random graph models from snowball and other sampling designs.
Preciado, P., Snijders, T., Burk, W., Stattin, H., Kerr, M., in press. Proximity matters: exploring the distance dependency of adolescent friendships, Social Networks, doi:10.1016/j.socnet.2011.01.002.
Robins, G., Pattison, P., Elliott, P., 2001. Network models for social influence models. Psychometrika 66, 161–190.
Robins, G., Pattison, P., Woolcock, J., 2005. Social networks and small worlds. American Journal of Sociology 110, 894–936.
Snijders, T., 2002. Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure 3.2 (April 19, 2002).
Snijders, T., Pattison, P., Robins, G., Handcock, M., 2006. New specifications for exponential random graph models. Sociological Methodology, 99–153.
Sommer, R., 1969. Personal Space. Prentice-Hall, Englewood Cliffs, NJ.
Travers, J., Milgram, S., 1969. Experimental study of small world problem. Sociometry 32 (4), 425–443.
Wasserman, S., Pattison, P., 1996. Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika 61, 401–425.
Wang, P., Robins, G., Pattison, P., 2008. PNet: Program for the Simulation and Estimation of p* Exponential Random Graph Models. University of Melbourne.
Watts, D.J., Strogatz, S.H., 1996. Collective dynamics of ‘small world’ networks. Nature 393, 440–442.
Wellman, B., 1996. Are personal communities local? A Dumptarian reconsideration. Social Networks 18, 347–354.
Whyte, W.H., 1957. The Organisation Man. Doubleday, Garden City, NY.
Wong, H., Pattison, P., Robins, G., 2005. A spatial model for social networks. Physica A: Statistical Mechanics and its Applications 360 (1), 99–120.