International Journal of Industrial Organization 48 (2016) 270–290
International Journal of Industrial Organization www.elsevier.com/locate/INDOR
Network dynamics and knowledge transfer in virtual organisations✩
Neil Gandal a,∗, Uriel Stettner b
a Berglas School of Economics, Tel Aviv University, Israel
b Coller School of Management, Tel Aviv University, Israel
Article info
Article history: Received 23 October 2015; Revised 28 June 2016; Accepted 30 June 2016; Available online 14 July 2016
Keywords: Network dynamics; Knowledge spillovers; Modification of code; Social network; Open source
Abstract
Employing a model of knowledge spillovers, we find empirical evidence consistent with both direct and indirect spillovers among open source software projects. We further find that programmers who work on many other projects have a positive effect on the success of a project beyond the effect they have on connectivity of the network. We also find that both “modifications” and “additions” are positively associated with project success. © 2016 Elsevier B.V. All rights reserved.
✩ We appreciate the invaluable research assistance of Yaniv Friedensohn and Peter Naftaliev. We gratefully acknowledge the financial support of the Israel Science Foundation (Grant nos. 1287/12 and 1069/15) and the support of a grant by the Research Program for the Economics of Knowledge Contribution and Distribution. We are especially grateful to the editor, Pierre Dubois, and two anonymous referees for comments and suggestions that significantly improved the paper. We also thank Sarit Weisburd and seminar participants at UT-Austin, Tel Aviv University and the Economics of Knowledge Contribution and Distribution conference at Georgia Institute of Technology for helpful comments. Any opinions expressed are those of the authors.
∗ Corresponding author. Fax: +972 3 640 9908. E-mail addresses: [email protected], [email protected] (N. Gandal), [email protected] (U. Stettner).
http://dx.doi.org/10.1016/j.ijindorg.2016.06.010 0167-7187/© 2016 Elsevier B.V. All rights reserved.
1. Introduction
This study examines whether the success of open source software (OSS) products depends on knowledge spillovers across distinct OSS development projects. It also evaluates
the relative contribution of product modifications and functional additions by programmers of software code to product success. We do so by taking into consideration that programmers may work on multiple projects simultaneously.
Product development in community-based organisations is becoming an increasingly important setting in which individuals create and disseminate knowledge in joint efforts to develop products. In such work environments, knowledge spillovers enable fellow software programmers, researchers and firms to benefit from innovations of others. Software programming is a vocation in which knowledge spillovers are likely to be important for product development given the rapid advancements in technologies, development methodologies, changing product-market preferences, and increasing competitive pressures. In particular, OSS can facilitate spillovers in R&D because the underlying software code is freely available in human-readable form to the broad public.
In its traditional practice, OSS development is a collaborative effort of loosely coordinated and geographically dispersed programmers who contribute their time and knowledge to establishing and improving software. Members create innovations as a collaborative effort in which they reveal and share knowledge not only with their project peers but often with potential competitors (Harhoff, Henkel & Von Hippel, 2003; Hippel, 2005). Indeed, OSS innovations are typically developed by consumers and end users, rather than manufacturers, but are freely revealed to manufacturers in hopes of having them produce the product (Von Hippel and Von Krogh, 2003). These “lead users” require specialised solutions to existing product limitations and thus develop their own modifications to existing products, or entirely new products (Von Hippel, 1986).
OSS projects, like virtual teams, are semi-structured groups of skilled programmers working on interdependent tasks using informal, non-hierarchical, and decentralised communication with the common goal of creating a valuable product (Lipnack and Stamps, 1997). Virtual development teams, as opposed to traditional work teams that enjoy the benefits of face-to-face communication, may also encounter challenges in forming personal relationships (Beyerlein et al., 2001), communicating (Pinto and Pinto, 1990), and performing (Jehn and Shah, 1997). Consequently, the resulting lack of strong connections and social support may have negative effects on productivity through reduced commitment, trust and leadership as well as willingness to share knowledge (Cascio, 2000; Townsend et al., 1998; Whiting and Reardon, 1998; Wong and Burton, 2000). Accordingly, by the nature of their organisational design and structure, members of dispersed virtual development teams are restricted in their exposure to knowledge and know-how.
On the other hand, there are numerous advantages to the open source “team” model of innovation. In the case of OSS, the contribution of each individual programmer is known and measurable, since each addition or modification to the software is associated with a particular programmer. Hence, moral hazard problems that arise from joint output produced by teams (Holmstrom, 1982) are less likely to arise in OSS settings than in proprietary “cooperative” research settings like research joint ventures. Additionally, OSS development teams make the underlying project knowledge accessible to the general
population under a variety of OSS licenses (Laurent, 2004). Such licenses typically grant the rights to use the entire work, to create a derivative work, or to share or market such a work (Bonaccorsi et al., 2006; Von Hippel and Von Krogh, 2003; Lerner and Tirole, 2002). Hence, intellectual property barriers in the form of patent thickets are less likely to adversely affect innovation in open source settings.
Indeed, one of the central aspects of OSS development is the ability to share and absorb knowledge that has been created outside of a distinct OSS project. Such spillovers facilitate the transfer of knowledge and ideas among individual programmers and across OSS projects. Knowledge spillovers across projects often occur via programmers who, for example, adapt software so that it can function in a computing environment that is different from the one for which it was originally designed, provide alternative value propositions on the basis of previously existing software solutions, or apply distinct pieces of functionality to different development settings in an attempt to benefit from previous development efforts. Direct spillovers occur when projects have a common programmer who transfers information and knowledge, primarily in the form of source code, from one project to another. In contrast, indirect spillovers occur when knowledge is transferred from one project to another when the two projects are not directly linked through a common programmer.
The objectives of this paper are three-fold. First, we examine whether knowledge spillovers occur in OSS projects and, if so, determine whether they are direct or indirect. Second, we wish to differentiate software development efforts by evaluating whether and how modifications and additions of code impact project success. Third, we aim to examine whether and how programmers who work on multiple projects affect project success.
To do so, we first construct a unique data set using publicly available data from SourceForge, a platform that hosts tens of thousands of OSS projects and their programmers. We then construct the project network by defining two OSS projects to be connected if they have a programmer in common. We also construct the related programmer network by defining two programmers to be connected if they work on a common project. We then compute the number of modifications and additions made to the code for each project over the period between 2005 and 2012. A modification is defined as a change made by a programmer to existing code within a distinct file, while an addition occurs when a programmer adds a new file that contains a block of code that was not previously part of a focal OSS project. Thus, a modification captures an activity that affects a particular set of code with the desire to, for example, make the code more efficient or stable. Accordingly, modifications are a good proxy for incremental innovation that, for example, improves how the software product works via the refinement, reutilisation, and elaboration of established ideas and technologies. Additions are a proxy for new knowledge and technologies that provide additional functionality (Lewin et al., 1999). To better illustrate the process, we include an example of a modification in Appendix B.
We find that there are both direct and indirect knowledge spillovers. We also find that the number of additions and modifications are positively associated with the number of downloads. Thus, projects with more additions and modifications are positively
associated with project success. Moreover, our analysis reveals that programmers who work on many projects have a positive effect on project success that goes beyond their contribution to the network structure. This finding is novel and opens up opportunities for future, fine-grained studies that could, for example, associate individual knowledge contributions with the programmers who made them. Perhaps more importantly, our study is the first to measure input to open source projects at the “micro–micro” level, i.e., by taking account of every contribution to the code, both via modifications of existing code and via additions of new blocks of code, and by examining the relationship between changes to the code and project success.
Prior research has examined the relationship between network structure and performance (Ahuja, 2000; Calvó-Armengol et al., 2009; Claussen et al., 2012; Fershtman and Gandal, 2011).1 In particular, Fershtman and Gandal (2011) focus on spillovers that occur by means of the interactions of different programmers in OSS projects. Using cross-sectional data, they find that the structure of the product network is associated with the project’s success, which under the assumptions of their model, provides support for knowledge spillovers.
Our paper is closest to Fershtman and Gandal (2011) and Claussen et al. (2012). Fershtman and Gandal (2011) used a single year of data, so they are not able to address potential endogeneity arising from unobserved time-invariant project factors. By using panel data, our study deals with potential endogeneity from unobserved time-invariant project factors. By controlling for these time-invariant project factors, we find (unlike Fershtman and Gandal, 2011) that the presence of a contributor who works on many projects is associated with project success. Further, their study did not include data on modifications and additions, which is one of the key contributions of our study.
Our study also goes beyond Claussen et al. (2012), who study the economic effect of a developer’s connectedness in the electronic game industry. Whereas that study deals with the effect of direct ties on project success, our study establishes the importance of both direct and indirect knowledge spillovers on project success. Further, their study did not examine the effect of modifications and additions.
1 Other recent studies have examined the relationship between network structure and behavior (e.g., Ballester, Calvó-Armengol, and Zenou, 2006; Calvo-Armengol and Jackson, 2004; Goyal, van der Leij, and Moraga-Gonzalez, 2006; Jackson and Yariv, 2007; Karlan, Mobius, Rosenblat, and Szeidl, 2009).
2. Research setting and data
This paper uses data from Sourceforge.net, a free and accessible online platform for managing software development projects and facilitating developer collaboration and communication. Sourceforge.net was the largest repository of registered OSS development projects during the period of our study, hosting tens of thousands of projects and their programmers. Each project links to a standardised “Project page” that lists descriptive information on a particular project, including a statement of purpose, software categories, intended audience, the license, and the operating system for which the
application is designed. The most popular categories are “Internet Software”, “Development Software”, “System” and “Communications Software”. Other popular software categories are “Games/Entertainment” and “Scientific/Engineering” Software.2 Similarly, a standardised “Statistics page” shows various project activity measures, including the number of project page views and downloads registered for the project. Moreover, each OSS project contains a list of registered members who contribute their time and knowledge to the advancement of the project. Each project links to a standardised “programmer page” that contains meta-information on a particular programmer, including the unique user name, the date the programmer joined the project, the programmer’s functional description (e.g., administrator, programmer) and his or her geographic location.
Within this environment, direct knowledge spillovers may occur when two projects have a common programmer who transfers information and knowledge embedded in the code from one project to the other. In contrast, indirect project spillovers occur when knowledge is transferred from one project to another when the two projects are not directly linked through a common programmer. For example, suppose that programmer “A” works on projects I and II, while programmer “B” works on projects II and III. Programmer A could take innovative code from project I and use it in project II. Programmer B might find that code useful and port it from project II to project III. In such a case, knowledge is transferred from one project to another by programmers who work on more than one project. There is a direct spillover from project I to project II, and an indirect spillover from project I to project III, since projects I and III are not directly connected. Following Grewal et al. (2006) and Fershtman and Gandal (2011), we construct a project network by defining two OSS projects to be connected if they have a programmer in common.
In the example above, there is a direct link between projects I and II and a direct link between projects II and III. We also construct a related programmer network by defining two programmers to be connected if they work on the same project. In the example above, there is a direct link in the programmer network between programmers A and B, since they both work on project II. Then, in addition to constructing the project and programmer networks, we calculate the number of modifications and additions made to the code for each project over the period of our study.
2.1. Dependent variable
Consistent with prior research, we measure project performance or success (denoted S) by examining the number of times a project has been downloaded (Fershtman and Gandal, 2011; Grewal et al., 2006). We focus on downloads of the executable, compiled product because end-users do not typically download the code. In the case of software, downloading code and getting it to work takes time and effort; hence, engineers and computer scientists consider downloads to be an excellent proxy for success and the perceived quality of the product (Fershtman and Gandal, 2011; Grewal et al., 2006). Although some data are available for other periods, statistics on downloads are available only for the 2006–2009 period. Therefore, we deploy yearly panel data from 2006–2009 in our analysis.
2 See Appendix C for details regarding products by software categories.

Table 1 Distribution of components in project networks – 2009.
Project network                                        Programmer network
Programmers per project   Percent of total projects    Projects per programmer   Percent of total programmers
1                         69.9                         1                         77.2
2                         14.4                         2                         14.1
3–4                       9.2                          3–4                       6.5
5–9                       4.8                          5–9                       1.9
10 or more                1.7                          10 or more                0.2
2.2. The project and programmer networks
We constructed two distinct two-mode networks: (i) the project network and (ii) the programmer network.3 In the case of the project network in 2009, we find that 84.3% of the projects have either one or two programmers, 9.2% have three to four programmers and 6.5% have five or more programmers (see Table 1). With regard to the programmer network in January 2009, 91.3% of the programmers worked on one or two projects, 6.5% of the programmers worked on three to four projects, and 2.1% of the programmers worked on five or more projects.4
Most empirical networks, including ours, consist of multiple distinct components: one very large component and several very small components. Indeed, our project network has one extremely large component (henceforth “giant component”) consisting of more than 14,000 connected projects in 2009, while the next largest component consists of fewer than 30 connected projects. In our analysis, we use measures that are defined only for projects in a connected component. Hence, similar to other papers in the literature, our analysis will focus exclusively on the giant component.
Whereas we focus on the project network, our analysis also includes a key feature of the programmer network: programmers who work on five or more projects. In the giant component, approximately 50% of the projects have one programmer who works on five or more projects. Indeed, we have defined this variable (five or more projects) so that no project has more than one programmer who works on five or more projects.
3 Recall that in the project network, the nodes are the OSS projects, and two projects are linked when there are common contributors who work on both. In the contributor network, the nodes are the contributors, and two contributors are linked if they participated in at least one OSS project together.
4 Percentages were virtually identical in other years as well.
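To make the construction of the project network and its giant component concrete, the following is a minimal Python sketch. It is an illustration only, not the procedure used on the SourceForge data: the function names `build_project_network` and `connected_components` and the membership list are hypothetical, and the toy data reuse programmers A and B and projects I–III from the example in Section 2, plus an assumed isolated project IV.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_project_network(memberships):
    """Link two projects whenever they share at least one programmer.

    memberships: iterable of (programmer, project) pairs (hypothetical format).
    Returns a dict mapping each project to the set of its neighbours.
    """
    projects_of = defaultdict(set)
    adjacency = defaultdict(set)
    for programmer, project in memberships:
        projects_of[programmer].add(project)
        adjacency[project]  # register the project even if it stays isolated
    for projects in projects_of.values():
        for p, q in combinations(sorted(projects), 2):
            adjacency[p].add(q)
            adjacency[q].add(p)
    return dict(adjacency)


def connected_components(adjacency):
    """Return the components of the network, largest first."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy data: A works on I and II, B works on II and III, C works only on IV.
memberships = [("A", "I"), ("A", "II"), ("B", "II"), ("B", "III"), ("C", "IV")]
network = build_project_network(memberships)
giant = connected_components(network)[0]
```

In this toy network, projects I, II and III form the largest component, while project IV is a separate single-project component and would be excluded from an analysis restricted to the giant component.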
2.3. Degree and closeness
While we do not directly observe spillovers, we adopt a simple model from Fershtman and Gandal (2011) allowing us to proxy spillovers by two network centrality measures: (i) a project’s degree, which is the number of projects with which the focal project has a direct link or common programmers, and (ii) a project’s closeness centrality, which is the inverse of the sum of all distances between a focal project and all other projects multiplied by the number of other projects. Intuitively, closeness centrality measures how far each project is from all the other projects in a network and is calculated as:5
\[ C_i \equiv \frac{N-1}{\sum_{j \in N} d(i,j)}, \qquad (1) \]
where N is the number of projects and d(i,j) is the distance between projects i and j. For two projects that are directly connected, d(i,j) = 1. For two projects that are indirectly linked via a third project, d(i,j) = 2. In the case of a network with a single project that is connected to all other projects, the closeness centrality of that project equals 1, which is the maximum value for closeness centrality. Projects that indirectly link other projects have a higher closeness centrality measure than projects at the edge of a network.6
2.4. Model specification
Having defined degree and closeness centrality as our proxies for spillovers, we continue by assuming that the expected success level of each project i without any spillovers is given by
\[ S_{it} = \alpha + X_{it}\omega + \varepsilon_{it}, \qquad (2) \]
where the variable $S_{it}$ is the success of project i at time t, $\alpha_i \equiv \alpha + A_i\delta$, where $\alpha$ is a constant, $A_i$ is a vector of unobserved time-invariant project factors, $X_{it}$ is a vector of observable time-varying factors, and $\varepsilon_{it}$ is an error term. There are likely many important unobserved time-invariant project factors (in the vector A), including project management structure, conditions potential programmers have to meet in order to join the project, and rules about who can make edits and changes to the code. Given these important unobserved time-invariant project factors, Eq. (2) should be estimated using a fixed effects model in which $\alpha_i \equiv \alpha + A_i\delta$ is a parameter to be estimated. As Angrist and Pischke (2009) note, treating $\alpha_i$ as a parameter to be estimated is equivalent to estimating in deviations from means.7
5 See Freeman (1979, pp. 225–226) and Wasserman and Faust (1994, pp. 184–185).
6 Closeness centrality lies in the range [0,1]. In the case of a star network with a single project in the middle that is connected to all other projects, the closeness centrality of the project in the center is one.
7 Fixed effects are also equivalent to estimating in differences if there are only two periods of data.
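The equivalence between fixed effects and estimation in deviations from means can be illustrated with a small Python sketch. The data are invented for illustration (two hypothetical projects with a true slope of 0.5 and project effects that are correlated with the regressor), and the helper names are assumptions, not part of the paper's estimation code.

```python
from collections import defaultdict


def demean_by_project(rows):
    """Subtract each project's own mean from x and y (within transformation).

    rows: list of (project, x, y) tuples.
    """
    groups = defaultdict(list)
    for project, x, y in rows:
        groups[project].append((x, y))
    demeaned = []
    for observations in groups.values():
        mean_x = sum(x for x, _ in observations) / len(observations)
        mean_y = sum(y for _, y in observations) / len(observations)
        demeaned.extend((x - mean_x, y - mean_y) for x, y in observations)
    return demeaned


def ols_slope(pairs):
    """Slope of a one-regressor OLS fit with an intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    s_xx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return s_xy / s_xx


# Hypothetical panel: y = alpha_i + 0.5 * x, with alpha_p1 = 1 and alpha_p2 = 5.
# The project with the larger fixed effect also has larger x values.
rows = [("p1", 1, 1.5), ("p1", 2, 2.0), ("p1", 3, 2.5),
        ("p2", 4, 7.0), ("p2", 5, 7.5), ("p2", 6, 8.0)]

pooled_slope = ols_slope([(x, y) for _, x, y in rows])   # ignores alpha_i
within_slope = ols_slope(demean_by_project(rows))        # removes alpha_i
```

In this constructed example the pooled slope is pushed well above 0.5 because the omitted project effect is correlated with x, whereas the within (demeaned) slope recovers 0.5.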
Having a panel rather than cross-sectional data simplifies the process of determining causality, since unobserved time-invariant fixed project effects might be driving success. A cross-section cannot control for time-invariant project effects; they are included in the error term in cross-sectional analysis. If these unobserved effects are correlated with the right-hand-side variables, the estimates from the cross-sectional analysis will be biased; however, we eliminate this problem by using fixed effects models. Further, as we show below, panel data enable us to develop a novel test for reverse causality. We believe that performing such a test is important when working with network data.
We adopt Fershtman and Gandal’s (2011) assumptions that (a) each project may receive a positive spillover denoted $\beta$ from all “connected” projects, and (b) that a project may enjoy positive spillovers from projects that are indirectly connected, but (c) that these spillovers are subject to decay that increases linearly as the distance between the projects in the project network increases. When the distance between projects i and j is d(i,j), this spillover is $\gamma / \sum_j d(i,j)$. Under these assumptions, the success level of each project i at time t can be written
\[ S_{it} = \alpha + X_{it}\omega + \beta D_{it} + \frac{\gamma}{\sum_j d(i,j)} + \varepsilon_{it}, \qquad (3) \]
where $D_{it}$ is the degree of project i in the network at time t, and $\beta$ and $\gamma$ are greater than or equal to zero. Using (1), the expression for closeness centrality, project i’s success at time t can be rewritten as
\[ S_{it} = \alpha + X_{it}\omega + \beta D_{it} + \gamma C_{it}/(N-1)_t + \varepsilon_{it}. \qquad (4) \]
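As a check on the step from Eq. (3) to Eq. (4), the sketch below computes degree and closeness centrality by breadth-first search on a small hypothetical network (the chain I–II–III from the example in Section 2) and compares the two forms of the spillover term. The function names and the value of gamma are assumptions made for illustration; the code presumes the network is a single connected component.

```python
import math
from collections import deque


def distances_from(adjacency, source):
    """Shortest-path distances (in links) from source to every reachable project."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def degree(adjacency, project):
    """Number of projects directly linked to the focal project."""
    return len(adjacency[project])


def sum_of_distances(adjacency, project):
    dist = distances_from(adjacency, project)
    return sum(d for node, d in dist.items() if node != project)


def closeness(adjacency, project):
    """Eq. (1): (N - 1) divided by the sum of distances to all other projects."""
    n = len(adjacency)  # assumes one connected component
    return (n - 1) / sum_of_distances(adjacency, project)


# Hypothetical chain network: I -- II -- III.
chain = {"I": {"II"}, "II": {"I", "III"}, "III": {"II"}}
n_projects = len(chain)
gamma = 1.0  # illustrative value only

# Spillover term as written in Eq. (3) and as rewritten in Eq. (4), for project I.
term_eq3 = gamma / sum_of_distances(chain, "I")
term_eq4 = gamma * closeness(chain, "I") / (n_projects - 1)
same_term = math.isclose(term_eq3, term_eq4)
```

Project II, which links the other two projects, has a closeness of 1, while the edge project I has a closeness of 2/3; for any project, the two expressions of the spillover term coincide because 1/Σ d(i,j) equals C_i/(N − 1).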
This spillover specification is simple but quite general. When β and γ equal zero, there are no spillovers at all. When β > 0 and γ = 0, there are only direct spillovers. When β = 0 and γ > 0, there are both direct and indirect spillovers, which are exclusively measured by the project’s closeness centrality. When β > 0 and γ > 0, there are additional spillovers from directly connected projects above and beyond those captured by the closeness measure: the spillovers have a “hyperbolic” structure.
2.5. Functional form
“Success” as measured by the number of downloads is skewed in our data, with a few projects having great success, and many others having less success. For this reason, we follow prior research in the network literature and use the natural log of success as the dependent variable (e.g., Claussen et al., 2012; Fershtman and Gandal, 2011). From this discussion, it follows that either a log/log or a log/linear model is appropriate.8 A log/log model is appealing because the relationship between the number of programmers and downloads may be non-linear; that is, additional programmers are likely associated with a larger number of downloads, but the marginal effect of each additional programmer declines as the number of programmers increases. The same is probably true for the relationship between network variables and downloads as well. On the other hand, it is very attractive to use a log/linear model in which the independent variables enter linearly; in this case, zero values on modifications and additions are much more easily dealt with. Perhaps not surprisingly, the results using the log/log and the log/linear model are qualitatively similar.9 We employ the log/linear model in the body of the paper. The results with the log/log model were reported in a previous version of the paper.10 We denote the dependent variable as ldownloads ≡ ln(downloads), where “ln” means the natural logarithm. All projects have at least one download in every year, so every project is included in the analysis.
2.6. Independent variables
2.6.1. Network variables
For the empirical analysis, we use the following project network variables:
• degree = degree of the project.
• closeness = closeness of the project.
where project degree and project closeness were defined earlier. Other key variables of interest include the following:
• The dummy variable “Many_Projects” takes on the value one if the project has a contributor who was a member of five or more projects. This variable stems from the programmer network rather than the project network. Clearly, having such a programmer join a project bestows that project with additional connections to other projects. An interesting question is whether adding such a programmer to the team of programmers has an effect on the success of a project beyond the effect it has on connectivity (i.e., network structure). Recall that no project has more than one such programmer.
8 In the literature, some authors use the log/log model; other authors use the log/linear model.
2.6.2. Additions and modifications
Source code encapsulates a collection of computer instructions written in a human-readable computer language such as C++ or Java.11 Generally, these source code files are stored in a database of a source-code version control system (VCS). Individual software programmers can add files containing source code to the VCS. Alternatively, programmers can retrieve existing files from the system and return modified versions to the VCS that correct errors (i.e., bug fixes), make the code more efficient (i.e., require less processing power), make the code more stable to avoid crashes (e.g., Windows’s infamous Blue Screen of Death) or introduce enhancements. In fact, moderately complex software often requires the compilation of hundreds of different source code files, each of which may have undergone dozens of modifications over time by different software programmers. In this study, we have gained access to the VCS of all software projects, enabling us to track each and every addition and modification to the code by project. Thus, for each project in each year, we count the number of modifications and additions. Hence, in 2009, the total number of modifications (additions) for each project is the sum of the modifications (additions) made during the 2006–2009 period.
• We define “Num_mods” as the number of modifications on the project.
• We define “Num_adds” as the number of additions to the project.
• Since some projects do not have any additions or modifications for the period for which we have data, we also include dummy variables denoted modification and addition, where the variable “modification” takes on the value one if the project had at least one modification during the period for which we have data. The variable “addition” is similarly defined.
9 We are grateful to the referee for his/her suggestions in this section.
10 See Gandal and Stettner (2014), CEPR discussion paper.
11 The actions to be performed are generally transformed by a compiler program into low-level machine code (i.e., executable file) for execution at a later time. Most software applications, and in particular closed, proprietary software products, are distributed in a form that includes executable files, but not their source code.
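The counting rule described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' extraction code: the event format `(project, year, path)`, the function name `count_changes`, and the toy log are hypothetical. The sketch also assumes that the first change touching a file path counts as an addition and every later change to that path counts as a modification, and that each dummy switches to one once the corresponding cumulative count is positive.

```python
from collections import defaultdict


def count_changes(events, years):
    """Build cumulative num_adds / num_mods and the two dummies per project-year.

    events: chronologically ordered (project, year, path) tuples (assumed format).
    years:  ordered list of panel years.
    """
    known_files = defaultdict(set)
    yearly = defaultdict(lambda: {"adds": 0, "mods": 0})
    for project, year, path in events:
        # First appearance of a file is an addition; later changes are modifications.
        kind = "mods" if path in known_files[project] else "adds"
        known_files[project].add(path)
        yearly[(project, year)][kind] += 1

    panel = {}
    for project in sorted(known_files):
        num_adds = num_mods = 0
        for year in years:
            counts = yearly.get((project, year), {"adds": 0, "mods": 0})
            num_adds += counts["adds"]  # cumulative sum up to and including this year
            num_mods += counts["mods"]
            panel[(project, year)] = {
                "num_adds": num_adds,
                "num_mods": num_mods,
                "addition": int(num_adds > 0),
                "modification": int(num_mods > 0),
            }
    return panel


# Hypothetical VCS log for a single project.
events = [("p1", 2006, "main.c"), ("p1", 2006, "main.c"),
          ("p1", 2007, "util.c"), ("p1", 2008, "main.c")]
panel = count_changes(events, [2006, 2007, 2008, 2009])
```

With this toy log, the 2009 row for project p1 reports two additions and two modifications, which are the sums over 2006–2009.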
2.6.3. Control variables
In addition to the variables of interest, we have data for a group of control variables:
• The variable years_since is defined as the number of years that have elapsed since the project was first launched on Sourceforge.
• The variable cpp is defined as the number of contributors that participated in the project.
• The data from Sourceforge.net include information on the six possible (formal) stages of development for each product. The stages are: 1 – Planning, 2 – Pre-Alpha, 3 – Alpha, 4 – Beta, 5 – Production/Stable, 6 – Mature. The variable stage takes on values between one and six.12
12 A few of the projects have multiple stages listed. We exclude these projects from the analysis. Including these projects and taking the average stage as the stage of the project has no effect on the results.
3. Empirical analysis
3.1. Informal examination of the data focusing on the giant component
Closeness centrality is only defined for connected projects; hence the formal analysis in our paper focuses on the giant component. Nevertheless, it is worth briefly comparing the giant component with the other very small components. Projects in the giant component
have on average many more downloads than projects outside of the giant component (96,998 vs. 18,839). Further, projects in the giant component have (i) more contributors (4.07 vs. 1.63), (ii) a larger degree (6.26 vs. 1.18), and (iii) a greater number of contributors who work on five or more projects (0.50 vs. 0.08). Additionally, projects in the giant component receive on average 506 modifications compared to 112 for projects outside of the giant component. Similarly, projects in the giant component receive on average 271 additions compared to 73 for projects outside of the giant component.
We will, of course, formally estimate the model. Before we do so, however, it is interesting to report the number of downloads conditional on various characteristics of the projects. We do this for 2009. The other years are similar.13 The median number of downloads for projects with a contributor who worked on at least five projects was 3015, while the median number of downloads for projects without such a contributor was 1723.14 In the case of single-contributor projects, the median number of downloads was 661. In the case of projects with two contributors, the median number of downloads was 1219, while the median number of downloads for projects with more than two contributors was 4419. The median number of downloads for projects with values of degree above the median was 3848, while the median number of downloads for projects with values of degree below the median was 1333. Similarly, the median number of downloads for projects with values of closeness above the median was 5582, while the median number of downloads for projects with values of closeness below the median was 1746. Of those projects that had at least one addition, the median number of downloads for those with additions above the median was 15,151, while the median number of downloads for those projects with additions below the median was 4593.
Similarly, for those projects that had at least one modification, the median number of downloads for those with modifications above the median was 16,909, while the median number of downloads for those projects with modifications below the median was 3602. Not all projects in the giant component benefited from modifications or additions during the 2006–2009 period. In total, 36% had at least one addition or modification during that period; in 2009, these projects had on average significantly more downloads (257,023 vs. 94,343). Moreover, in 2009, projects that benefited from at least one modification or addition during the 2006–2009 period had (i) on average more programmers (6.60 vs. 2.91), (ii) a larger degree (9.11 vs. 5.46), and (iii) a larger number of programmers who worked on five or more projects (0.57 vs. 0.43). This descriptive examination of the data shows that degree, closeness, “many_projects”, additions and modifications are posi-
13 Descriptive statistics are in Table A.1. Correlations among the network centrality variables are in Table A.2.
14 Note that the mean number of downloads for projects with a contributor who worked on at least five projects was 171,710, while the mean number of downloads for projects without such a contributor was 133,765. This illustrates how skewed the downloads variable is. For this reason, in this exploratory discussion, we report medians rather than means.
Table 2 Results using projects in giant component dependent variable: ldownloads. Model 1
Model 2
Model 3
∗∗∗
∗∗∗
∗∗∗
Model 4
Constant
2.392
(34.93)
4.046
(67.05)
5.251
(51.46)
5.169∗∗∗
(41.36)
cpp
0.0374∗∗∗
(8.43)
0.00952∗
(2.09)
0.00802∗∗
(3.10)
0.0092
(1.01)
Degree
0.0157∗∗∗
(7.28)
0.0171∗∗∗
(9.45)
0.00799∗∗∗
(7.21)
0.0111∗∗∗
(9.50)
(15.94)
14.49∗∗∗
(14.28)
Closeness
28.72
∗∗∗
many_projects −0.0600∗∗
(14.35)
12.84
(−2.82)
∗∗∗
∗∗∗
(7.37)
15.16
0.105∗∗∗
(7.76)
0.0458∗∗∗
(5.29)
0.0294∗∗
(3.21)
0.144∗∗∗
(5.67)
0.145∗∗∗
(4.45)
Stage
0.546∗∗∗
(74.57)
0.327∗∗∗
(9.42)
years_since
0.301∗∗∗
(68.05)
0.280∗∗∗
(120.08) 0.211∗∗∗
num_adds
3.7∗ 10−5
(0.40)
3.7∗ 10−5∗∗ (3.43)
num_mods
2.8∗ 10−6
(0.63)
2.1∗ 10−5∗
(2.27)
Modifications
0.536∗∗∗
(5.69)
0.0810∗∗
(2.71)
−0.0266
Additions
0.555∗∗∗
(14.33)
0.0479∗
(1.75)
0.0431∗
Single
−0.696∗∗∗
(−31.36) −0.222∗∗∗
1.4∗ 10−5∗∗
(3.48)
1.3∗ 10−5∗∗∗ (4.03)
(−4.94) −0.117∗∗
Yes
(129.60) 0.196∗∗∗
Yes
1.2∗ 10−6
5.6∗ 10−5∗∗ (2.94)
(−1.30) −0.0251 (2.07)
(117.77) (0.73)
0.0554∗
(−3.08) −0.221∗
Fixed effects
No
Δcpp = 0
No
No
No
Yes
Data
2006–2009
2006–2009
2007–2009
2007–2009
Observations
53,608
53,608
40,507
35,445
(−1.27) (2.46) (−2.58)
Yes
We employ robust standard errors (without clustering). t-statistics in parentheses. ∗ p < 0.05. ∗∗ p < 0.01. ∗∗∗ p < 0.001.
tively correlated with success.15 The descriptive data further point to the significance of the social network underlying open source product development.

3.2. Analysis

As discussed above, a “log/linear” model is appropriate for our analysis. Thus, we use the following equation and estimate it using a fixed effects model:

\[
\begin{aligned}
\text{ldownloads} = {} & \alpha_i + \beta_0 + \beta_1\,\text{cpp} + \beta_2\,\text{degree} + \beta_3\,\text{closeness} + \beta_4\,\text{Many\_Projects} \\
& + \beta_5\,\text{Stage} + \beta_6\,\text{years\_since} + \beta_7\,\text{num\_mods} + \beta_8\,\text{num\_adds} \\
& + \beta_9\,\text{modification} + \beta_{10}\,\text{addition} + \beta_{11}\,\text{single} + \varepsilon,
\end{aligned}
\tag{5}
\]

where16 the variable single is a dummy variable that takes on the value 1 if the project has only a single contributor and zero otherwise. We estimate (5) for 2006–2009, the period for which we have data on downloads. The main results are shown in Table 2. In Model 1 of Table 2, we do not include fixed effects; in Model 2, we add fixed effects. We now discuss the results:

15 Complete descriptive statistics and correlations among degree, closeness, cpp and “many_projects” are in Appendix A.
16 We suppress the time subscript (t) and, except for the fixed effects, we suppress the project subscript (i) for ease of presentation in (5).
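As a rough illustration of what estimating an equation such as (5) with project fixed effects involves, the sketch below applies the within transformation: each variable is demeaned by project before the slope is computed, so that any time-invariant project effect drops out. It is a minimal sketch under simplifying assumptions (a single regressor, a tiny made-up panel) and is not the code used for the estimates reported here; the function name `within_slope` and the data are hypothetical.

```python
from collections import defaultdict


def within_slope(rows):
    """Slope of y on x after demeaning both variables within each group.

    rows: iterable of (group_id, x, y) tuples, one per project-year.
    Demeaning by group removes any group-specific constant (the fixed effect).
    """
    groups = defaultdict(list)
    for gid, x, y in rows:
        groups[gid].append((x, y))

    sxy = 0.0
    sxx = 0.0
    for obs in groups.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    if sxx == 0:
        raise ValueError("regressor has no within-group variation")
    return sxy / sxx


# Hypothetical panel: (project, degree, ldownloads). Project "b" has a much
# lower baseline than project "a", which plays the role of a fixed effect.
panel = [
    ("a", 1, 5.0), ("a", 2, 5.5), ("a", 3, 6.0),
    ("b", 4, 1.0), ("b", 5, 1.5), ("b", 6, 2.0),
]

# Fixed-effects (within) slope: uses only variation inside each project.
slope_fe = within_slope(panel)

# Pooled slope: the same formula with every observation in a single group.
slope_pooled = within_slope([("all", x, y) for _, x, y in panel])
```

In this made-up panel the pooled slope is negative while the within slope is positive, because the project with the higher degree happens to have the lower baseline. This is only meant to show how omitting fixed effects can change the sign of an estimated association; it makes no claim about the magnitudes in the actual data.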
3.3. Results

3.3.1. Direct and indirect knowledge spillovers across projects

Both of the regressions (Models 1 and 2) show that degree centrality is positively associated with the number of downloads and that this association is statistically significant. Further, both regressions show that changes in closeness centrality are positively associated with downloads, and this association is also statistically significant. Since the estimated coefficients on degree and closeness are both positive and significant, this suggests that there are both direct and indirect project spillovers and that the spillovers have a hyperbolic structure, so that the direct spillover is quite large relative to the indirect spillovers.

3.3.2. Contributors who work on many projects

Table 2 shows that without the fixed effects, the presence of a contributor who works on many projects is not positively associated with success. Once we include the fixed effects, however, we find that the presence of such a contributor is positively associated with success. This result is interesting because the effect obtains even after accounting for the network structure induced by such contributors.

3.3.3. Additions and modifications

In both Models 1 and 2, the numbers of additions and modifications are positively associated with the number of downloads. Further, the estimated coefficients on the dummy variables “modification” and “addition” are statistically significant. (From above, these dummy variables take on the value one if there were any modifications or additions, respectively.) Thus, projects with either modifications or additions have more downloads than projects without any modifications or additions.

3.3.4. The role of fixed effects

What does the introduction of fixed effects do?
With the exception of the estimated parameter on “Many_Projects”, the introduction of fixed effects does not change the factors that are associated with success. However, the introduction of fixed effects does affect the magnitude of the estimated associations. This is because some of the unobserved time invariant factors (for example, rules controlling whether and how programmers can join the project) are likely correlated with the number of programmers and hence also degree, closeness, and the number of programmers who contribute to many projects. Hence, we would expect that the introduction of fixed effects would change the estimates of these parameters. Indeed, the introduction of fixed effects in Model 2 reduces the estimated coefficients on closeness centrality, the number of programmers, as well as the estimated coefficients on modification and addition. The last effect likely occurs because there is correlation between whether there are modifications or additions to a project and rules about who has
“executive power” to commit a modification or addition to the project. Hence, it seems important to employ the fixed effect model in the analysis. Note that in all specifications using fixed effects, we strongly reject the hypothesis that the fixed effects are equal to zero.17
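Because the dependent variable is in logs, the magnitude of an estimated coefficient can be read as an approximate percentage effect on downloads. The short sketch below makes the conversion explicit. It assumes that ldownloads denotes the natural logarithm of downloads; the function name and the coefficient values passed to it are hypothetical and are not taken from Table 2.

```python
import math


def pct_effect(beta, delta_x=1.0):
    """Percentage change in downloads implied by a log-linear coefficient.

    If ln(downloads) = ... + beta * x, then changing x by delta_x multiplies
    downloads by exp(beta * delta_x).
    """
    return 100.0 * (math.exp(beta * delta_x) - 1.0)


# Hypothetical coefficient on a count variable such as degree.
effect_one_link = pct_effect(0.01)

# Hypothetical coefficient on a dummy variable: switching it from 0 to 1.
effect_dummy = pct_effect(0.05)
```

For small coefficients the result is close to 100 times the coefficient, which is why a coefficient of 0.01 is commonly read as "about one percent"; for larger coefficients, and for dummy variables in particular, the exact expression is preferable.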
4. Robustness analysis

4.1. Addressing endogeneity from reverse causality

Fixed effects models help address the endogeneity bias associated with time-invariant project factors. In the case of network analysis, there is an additional potential source of endogeneity: although our analysis focuses on how the network structure affects success, the reverse may hold as well, because contributors may want to join popular or successful projects. This “joining popular projects” effect would make the number of contributors and degree endogenous. Since our network is fairly thin, with many projects and relatively few developers per project, it is unlikely that the “joining popular projects” effect is an important phenomenon in our setting. Nevertheless, we would like to examine this issue.

Closeness could also be endogenous under the following unlikely scenario: developers may want to work on project “A” so that a developer on that project can “introduce” them to a developer on project “B” whom they would like to meet. Again, since our network is a fairly thin one, it is unlikely that this indirect contact mechanism would play any role. It is probably much easier and much more effective to simply contact the programmer directly. Nevertheless, we wish to address this potential endogeneity as well.

The panel data set enables us to employ a novel test to investigate potential endogeneities. We do so by restricting the analysis to those projects that had no changes in the number of contributors from one year to another.
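The restriction can be sketched as follows. The example keeps only projects whose contributor count (cpp) is the same in two consecutive years and reports their degree in each year. It is an illustration under stated assumptions, not the code used for the analysis: it assumes that two projects are linked when they share at least one contributor and that degree is the number of such linked projects, and all project names, contributor names and function names are hypothetical.

```python
def project_stats(memberships):
    """Return {project: (cpp, degree)} for one year.

    memberships: dict mapping project -> set of contributor ids.
    Assumption: two projects are linked if they share a contributor, and
    degree is the number of projects a project is linked to.
    """
    stats = {}
    for project, devs in memberships.items():
        linked = {
            other
            for other, other_devs in memberships.items()
            if other != project and devs & other_devs
        }
        stats[project] = (len(devs), len(linked))
    return stats


def stable_team_degrees(prev_year, curr_year):
    """Degree in both years for projects whose contributor count is unchanged."""
    prev = project_stats(prev_year)
    curr = project_stats(curr_year)
    return {
        p: (prev[p][1], curr[p][1])
        for p in prev.keys() & curr.keys()
        if prev[p][0] == curr[p][0]  # the "delta cpp = 0" restriction
    }


# Hypothetical data: between the two years, "bob" (already on project A)
# also joins project B. Project A's own team does not change.
y2007 = {"A": {"ann", "bob"}, "B": {"cat"}, "C": {"dan", "eve"}}
y2008 = {"A": {"ann", "bob"}, "B": {"cat", "bob"}, "C": {"dan", "eve"}}

kept = stable_team_degrees(y2007, y2008)
```

In this toy example project B is dropped because its team grew, while project A is retained and its degree rises from 0 to 1 even though its own contributor count is unchanged. The restricted sample therefore still contains variation in degree.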
In such a case, reverse causality (i.e., the effect that describes the tendency to join popular projects) is absent.18 The key point is that the degree can change for projects that have no changes in the number of their contributors. The mechanism by which this change can occur is that the degree centrality of the original project also increases when a contributor on a particular project joins another project. Similar to degree, the project closeness and the variable
17 The standard Hausman test rejects a “random effects” model in favor of a fixed effects model.
18 Of course, it is possible that some contributors joined and some left with a net change of zero during the year, but the vast majority of projects had no changes or small changes in personnel from year to year. This is because, as noted, the number of contributors per project is quite small.
“Many_Projects” can also change even when the number of contributors on the project does not change.19,20

When we conduct the robustness analysis to control for possible endogeneities, we only include projects for which Δcpp = 0; in such a case, we lose the data for 2006. Hence, in Model 3 of Table 2, we re-run the fixed effects model (Model 2 in the same table) using data from 2007–2009 to get a “baseline” result before we test for feedback endogeneity. Our results when we test for this endogeneity and include projects only if Δcpp = 0 are reported in Model 4 of Table 2. When we compare Models 3 and 4, our main results continue to hold. The results concerning the estimated coefficients on degree, closeness, and “Many_Projects” are qualitatively unchanged: all remain significant and the coefficients are of very similar magnitude. This suggests that contributors do not typically join projects for popularity reasons. This result makes sense, since the project network on SourceForge is thinly connected and most contributors work on either one or two projects. Comparing Models 3 and 4, all of the other main results hold as well.21 This shows that our results are robust to possible endogeneities resulting from reverse causality.

4.2. Long differenced data

In order to perform an additional robustness check, we also examined “long differenced” data from 2006 to 2009. This approach suffers from the disadvantage of losing a large part of the data, namely data from 2007 and 2008, yet allows us to employ data from 2006, which we could not use in the fixed effects robustness analysis (when Δcpp = 0). A previous version of the paper (Gandal and Stettner, 2014) shows that the results using a “long differenced” model are qualitatively unchanged.

4.3. Alternative interpretations of the results

The coefficient on degree measures how the number of downloads increases as we increase the number of other projects on which the programmers of the project in question work, holding the number of programmers constant. This can happen either because a current programmer increases the number of projects she works on, or because a programmer with fewer projects is replaced with a programmer with more projects. In our

19 In the case of “Many_Projects,” the value of the variable can change when a contributor on a particular project joins other projects and that contributor “transitions” from working on fewer than five to working on five or more projects.
20 One alternative way to test for endogeneity is to only consider relatively young projects. The “joining popular projects” effect is likely to be less of a factor for relatively young projects. This is the approach of Fershtman and Gandal (2011). We repeated their test and our results are qualitatively unchanged. But based on our discussions with many colleagues, we (and they) believe that the test we propose is preferable.
21 The estimated coefficient on num_mods is positive and statistically significant in both Models 3 and 4. The coefficient on the number of additions is insignificant in Model 4, but the coefficient on whether the project has any additions is positive and statistically significant.
setting, there are virtually no cases where programmers are replaced by other programmers. Hence, the interpretation of degree as a spillover measure is strengthened.22 Furthermore, there is also an alternative explanation regarding the positive association between closeness and success: if a few highly productive programmers work together on several projects, their projects will have a high value of “closeness” even though no spillover occurs. While this story is plausible in a small, relatively tightly connected network, it is unlikely in our network, which is huge and thinly connected. In summary, these arguments suggest that the interpretation of degree and closeness as knowledge spillovers is reasonable in our case.

5. Further discussion and concluding remarks

Prior research studying the relationship between network structure and performance has ignored the implications of the dynamics of knowledge spillovers that occur by means of the interaction of different programmers collaborating in different software development projects over time. Estimating a model of knowledge spillovers, we provide empirical evidence that is consistent with both direct and indirect spillovers. We also find evidence that the addition of a programmer who works on many projects is associated with greater success, even after controlling for the induced changes in the network structure that are associated with that programmer’s additional direct connections to other projects. We further find modifications and additions to the code to be positively associated with project success, and this result is extremely robust. Clearly, end-users react positively to product reliability and stability, which are most directly associated with continuous, incremental product modification and adjustment. In fact, while end-users also appreciate new product features, decisions to consume are driven more directly by the former.
In our setting, a modification was defined as a change made by a programmer to existing code. Accordingly, modifications act as a good proxy for incremental innovations that improve how the software product works. Similarly, an addition was defined as a case in which a programmer adds previously non-existing functionality. Additions thus act as a good proxy for new product features that extend the functionality of the product (Lewin et al., 1999). Our results support these assertions, but also indicate that both product stability (i.e., modifications) and the introduction of new product features (i.e., additions) are critical aspects of product success. These findings may help open source product development teams in considering their resource allocation decisions towards these distinct activities. Indeed, the findings may further inform firms in their effort to create distinct value propositions. It is often difficult to measure incremental innovations. By quantifying this measure and by showing that these innovations are positively associated with project success, our results suggest that even small innovations can lead to project success in the open

22 We thank the referee for this suggestion and discussion.
Table A.1
Descriptive statistics.

Variable          Obs        Mean        Std. dev.    Min      Max

Outside of giant component
Downloads         107,534    18838.9     829,788      1        1.85e+08
years_since       107,534    5.009       1.922        0.967    10.152
Degree            107,534    1.267       2.039        0        33
cpp               107,534    1.628       0.567        1        53
many_projects     107,534    0.084       0.277        0        1
Stage             107,534    3.759       1.160        1        6
num_adds          107,534    72.53       799.68       0        91,752
num_mods          107,534    111.821     2294.84      0        469,000

Inside of giant component
Downloads         53,608     96998.17    3,269,947    1        4.74e+08
years_since       53,608     5.712       1.935        0.969    10.160
Degree            53,608     .847        8.535        1        264
Closeness         53,608     0.033       0.005        0.013    0.052
cpp               53,608     4.072       6.947        1        267
many_projects     53,608     0.497       0.500        0        1
Stage             53,608     3.944       1.120        1        6
num_adds          53,608     271.199     2164.149     0        103,368
num_mods          53,608     505.764     3324.277     0        209,882
innovation process. It would be interesting to see whether this result obtains in other settings as well. The inclusion of modifications and additions in our analysis suggests that (i) such painstaking efforts bear fruit and that (ii) our methodology could be employed in future studies of networks in order to learn more about the open innovation process. Such future work could associate modifications and additions with each programmer who made the particular contribution. In such a way, the effort of each programmer could be measured directly.

Finally, the institutional context in which open source software is being developed provides interesting insights into innovation activities within dispersed teams. Whereas prior research has stressed the differences between virtual development teams and their commercial counterparts, our study points to something they do have in common. That is, in spite of the social and structural differences underlying such organisations, these distinct institutions both compete for market share by serving their customers via a stable and feature-rich value proposition.
Appendix A. Descriptive statistics and correlations Tables A.1 and A.2.
Table A.2
Correlation among network centrality variables, cpp and “many_projects” (giant component, N = 53,608).

                  cpp     Degree    Closeness    Many projects
cpp               1.00
Degree            0.62    1.00
Closeness         0.27    0.44      1.00
many_projects     0.14    0.49      0.30         1.00
Appendix B. Example of a modification in Project aMSN

Project aMSN is an MSN-compatible messenger application. On May 28, 2008, a user with username square87 made a modification to the file guicontactlist.tcl. This revision, with unique identifier [r9986], is described in a comment by square87 as “A minor code improvement.” The modification covers the deletion of some lines of code (indicated in red) and the addition of new lines of code (indicated in green).23
MSN compatible messenger application Commit [r9986] A minor code improvement Authored by: square87 2008-05-28 Changed: guicontactlist.tcl Line # a/trunk/amsn/guicontactlist.tcl 7 # * change cursor while dragging (should we ?) 10 # * ... cfr. "TODO:" msgs in code ::Version::setSubversionId {$Id: guicontactlist.tcl 9911 12 2008-05-22 21:51:56Z tom $} 14 15
namespace eval ::guiContactList { namespace export drawCL
Line # b/trunk/amsn/guicontactlist.tcl 7 # * change cursor while dragging (should we ?) 10 # * ... cfr. "TODO:" msgs in code ::Version::setSubversionId {$Id: guicontactlist.tcl 9986 12 2008-05-28 07:43:18Z square87 $} namespace eval ::guiContactList { 14 15
namespace export drawCL
1050
if { [lindex $unit 1] == "reset" } {
1050
if { [lindex $unit 1] == "reset" } {
1051 1052 1053 1055 1056 1057
set font_attr [font configure $defaultfont] } else { set font_attr [font configure [lindex $unit 1]] } array set current_format $font_attr } else { array set current_format $font_attr
1051 1052 1053
set font_attr [font configure $defaultfont] } else { set font_attr [font configure [lindex $unit 1]] }
1055 1056
} else { array set current_format $font_attr
23 For improved readability, empty lines and some comments have been removed from the source code. Note that an earlier modification to the file guicontactlist.tcl with unique identifier [r9911] is referenced in the text. It was made on May 22, 2008 by a different user whose username we have shortened to preserve privacy.
1058 1059 1060 1409 1411
array set modifications [lindex $unit 1] foreach key [array names modifications] { set current_format($key) [set modifications($key)] # Function that draws a contact proc drawContact { canvas element groupID } {
1057 1058 1059 1408 1410
1413 1414
# We are gonna store the height of the nicknames variable nickheightArray #Xbegin is the padding between the beginning of the contact and the left edge of the CL variable Xbegin
1413 1414
1411
1416 1417
1416 1417
array set modifications [lindex $unit 1] foreach key [array names modifications] { set current_format($key) [set modifications($key)] # Function that draws a contact proc drawContact { canvas element groupID } { i { ${::guiContactList::external_lock} || !$::contactlist_loaded } { return } # We are gonna store the height of the nicknames variable nickheightArray #Xbegin is the padding between the beginning of the contact and the left edge of the CL variable Xbegin
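The kind of line-level comparison shown in this appendix can be reproduced in outline with a generic diff routine. The sketch below uses Python's standard `difflib` module to list the lines removed from and added to a file between two revisions. It is only an illustration: the two short revisions are hypothetical stand-ins rather than the aMSN source, the function name is made up, and this is not the procedure used to construct the num_mods and num_adds variables.

```python
import difflib


def changed_lines(old_lines, new_lines):
    """Return (removed, added) lines between two revisions of a file.

    ndiff prefixes removed lines with "- " and added lines with "+ ";
    unchanged lines ("  ") and hint lines ("? ") are ignored here.
    """
    removed, added = [], []
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
    return removed, added


# Hypothetical Tcl-like revisions: one line is inserted and one is deleted.
old_revision = [
    "set a 1",
    "set b 2",
    "puts $b",
    "set c 3",
]
new_revision = [
    "set a 1",
    "if { $a } { return }",
    "set b 2",
    "set c 3",
]

removed, added = changed_lines(old_revision, new_revision)
```

Here `removed` holds the deleted line (shown in red in a side-by-side view such as the one above) and `added` holds the new line (shown in green).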
Appendix C. Distribution of software by topics

Topic                     Relative frequency in            Relative frequency in
                          single-programmer projects (%)   multi-programmer projects (%)
Internet                  16                               17
Software development      14                               14
System                    11                               13
Communications            11                               10
Games/entertainment       11                               8
Scientific/engineering    7                                7
Multimedia                8                                8
Office/business           5                                5
Database                  5                                5
Other                     13                               14
References

Ahuja, G., 2000. Collaboration networks, structural holes, and innovation: A longitudinal study. Adm. Sci. Q. 45 (3), 425–455.
Angrist, J., Pischke, J., 2009. Mostly Harmless Econometrics. Princeton University Press, Princeton, New Jersey.
Ballester, C., Calvó-Armengol, A., Zenou, Y., 2006. Who’s who in networks. Wanted: the key player. Econometrica 74 (5), 1403–1417.
Beyerlein, M.M., Johnson, D.A., Beyerlein, S.T., 2001. Virtual teams. Jai.
Bonaccorsi, A., Rossi, C., Giannangeli, S., 2006. Adaptive entry strategies under dominant standards: Hybrid business models in the open source software industry. Manag. Sci. 52 (7), 1085–1098.
Calvó-Armengol, A., Jackson, M.O., 2004. The effects of social networks on employment and inequality. Am. Econ. Rev. 94 (3), 426–454.
Calvó-Armengol, A., Patacchini, E., Zenou, Y., 2009. Peer effects and social networks in education. Rev. Econ. Stud. 76 (4), 1239–1267.
Cascio, W.F., 2000. Managing a virtual workplace. Acad. Manag. Exec. 14 (3), 81–90.
Claussen, J., Falck, O., Grohsjean, T., 2012. The strength of direct ties: Evidence from the electronic game industry. Int. J. Ind. Organ. 30 (2), 223–230.
Fershtman, C., Gandal, N., 2011. Direct and indirect knowledge spillovers: the “social network” of open-source projects. RAND J. Econ. 42 (1), 70–91.
Freeman, L., 1979. Centrality in social networks: Conceptual clarification. Soc. Netw. 1 (3), 215–239.
Gandal, N., Stettner, U., 2014. Network Dynamics and Knowledge Transfer in Virtual Organizations: Overcoming the Liability of Dispersion. CEPR Discussion Paper, 9980, Center for Economic and Policy Research.
Goyal, S., Van Der Leij, M.J., Moraga-González, J.L., 2006. Economics: An emerging small world. J. Polit. Econ. 114 (2), 403–412.
Grewal, R., Lilien, G.L., Mallapragada, G., 2006. Location, location, location: How network embeddedness affects project success in open source systems. Manag. Sci. 52 (7), 1043–1056.
Harhoff, D., Henkel, J., Von Hippel, E., 2003. Profiting from voluntary information spillovers: how users benefit by freely revealing their innovations. Res. Policy 32 (10), 1753–1769.
Hippel, E., 2005. Democratizing Innovation. MIT Press, Cambridge, MA.
Holmstrom, B., 1982. Moral hazard in teams. Bell J. Econ. 13 (2), 324–340.
Jackson, M.O., Yariv, L., 2007. Diffusion of behavior and equilibrium properties in network games. Am. Econ. Rev. 97 (2), 92–98.
Jehn, K.A., Shah, P.P., 1997. Interpersonal relationships and task performance: An examination of mediation processes in friendship and acquaintance groups. J. Pers. Soc. Psychol. 72 (4), 775.
Karlan, D., Mobius, M., Rosenblat, T., Szeidl, A., 2009. Trust and social collateral. Q. J. Econ. 124 (3), 1307–1361.
Laurent, A.M., 2004. Understanding Open Source and Free Software Licensing. O’Reilly Media, Inc.
Lerner, J., Tirole, J., 2002. Some simple economics of open source. J. Ind. Econ. 50 (2), 197–234.
Lewin, A.Y., Long, C.P., Carroll, T.N., 1999. The coevolution of new organizational forms. Organ. Sci. 10 (5), 535–550.
Lipnack, J., Stamps, J., 1997. Virtual Teams: Reaching Across Space, Time, and Organizations with Technology. John Wiley & Sons Inc.
Pinto, M.B., Pinto, J.K., 1990. Project team communication and cross-functional cooperation in new program development. J. Prod. Innov. Manag. 7 (3), 200–212.
Townsend, A.M., DeMarie, S.M., Hendrickson, A.R., 1998. Virtual teams: Technology and the workplace of the future. Acad. Manag. Exec. 12 (3), 17–29.
Von Hippel, E., 1986. Lead users: a source of novel product concepts. Manag. Sci. 32 (7), 791–805.
Von Hippel, E., Von Krogh, G., 2003. Open source software and the “private-collective” innovation model: Issues for organization science. Organ. Sci. 14 (2), 209–223.
Whiting, V.R., Reardon, K.K., 1998. Communicating From a Distance: Establishing Commitment in a Virtual Office Environment. Academy of Management.
Wasserman, S., 1994. Social Network Analysis: Methods and Applications. Cambridge University Press.
Wong, S.-S., Burton, R.M., 2000. Virtual teams: what are their characteristics, and impact on team performance. Comput. Math. Organ. Theory 6 (4), 339–360.