Science of Computer Programming (
)
–
Contents lists available at ScienceDirect
Science of Computer Programming journal homepage: www.elsevier.com/locate/scico
An empirical study of FOSS developers patterns of contribution: Challenges for data linkage and analysis Sulayman K. Sowe a,∗ , Antonio Cerone b , Dimitrios Settas b a
Information Services Platform Lab, NICT, Kyoto, Japan
b
UNU-IIST, Macau
article
info
Article history: Received 13 October 2011 Received in revised form 15 November 2013 Accepted 19 November 2013 Available online xxxx Keywords: Open Source Software developers Open Source Software projects Software repositories Concurrent Versions System Mailing lists
abstract The majority of Free and Open Source Software (FOSS) developers are mobile and often use different identities in the projects or communities they participate in. These characteristics pose challenges for researchers studying the presence and contributions of developers across multiple repositories. In this paper, we present a methodology, employ various statistical measures, and leverage Bayesian networks to study the patterns of contribution of 502 developers in both Version Control System (VCS) and mailing list repositories in 20 GNOME projects. Our findings shows that only a small percentage of developers are contributing to both repositories and this cohort is making more commits than they are posting messages to mailing lists. The implications of these findings for understanding the patterns of contribution in FOSS projects and on the quality of the final product are discussed. © 2013 Elsevier B.V. All rights reserved.
1. Introduction In the world of Free and Open Source Software (FOSS) development, developers are identified by what they can do rather than by where they are located. Generally, FOSS developers are like nomads; freely moving from one project to another. They commit bits and pieces of code, report and fix bugs, take part in discussions in various mailing lists, forums, IRC channels, document coding ethics and guidelines, and help new entrants. Along the way they create and archive a wealth of knowledge and experience associated with their art [1]. Participants in various projects use tools (versioning systems, mailing lists, bug tracking systems, etc.) to enable the distributed and collaborative software development process to proceed. These tools serve as repositories which can be data mined to understand who is involved, who is talking to whom, what is talked about, how much someone contributes in terms of code commits or email postings. Thus, by applying cyber-archeology [2] to these repositories, we can learn and better understand the patterns of contribution [3–5] of FOSS developers in the projects concerned. An important aspect of software engineering research, and the certification of FOSS products in particular, is understanding and measuring the contribution of individuals, particularly developers, who work on a project [6,1]. A host of factors which have both empirical and industrial implications motivates this kind of research. Factors include, but are not limited to; – helping practitioners understand and monitor the rate of project development, – characterizing FOSS projects in terms of developers turnover and extent of contribution [7,4,8],
∗
Corresponding author. Tel.: +81 774 98 6871; fax: +81 774 98 6960. E-mail addresses:
[email protected],
[email protected] (S.K. Sowe),
[email protected] (A. Cerone),
[email protected] (D. Settas).
0167-6423/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.scico.2013.11.033
2
S.K. Sowe et al. / Science of Computer Programming (
)
–
– identifying bottlenecks and isolate exceptional cases in terms of projects and individuals’ contributions [6], – using the research results to help develop new metrics or evaluate an existing taxonomy [9] of metrics for FOSS quality attributes [10] and the certification process, – address empirical research challenges associated with identifying, visualizing, and quantifying FOSS developers’ contributions across multiple software repositories [5]. Furthermore, as argued by Shaikh and Cerone [11], communication and patterns of contribution are factors that contribute to measure the efficiency of the development process. Indeed, the patterns of (code) contribution in FOSS projects have emerged as an important measure in quality assurance for FOSS products [9,10]. Traditional frameworks for software quality assurance and certification, based on the conformance to prescriptive standards and strict guidelines, which are sometimes incorporated in legal prescriptions for software certification, are in strong contrast to FOSS philosophy. In fact, the philosophy of freedom and absence of hierarchical organization typical of FOSS communities results in collaborative production environments in which there is no space for prescriptive standards and strict guidelines. Thus, a descriptive approach that analyses the FOSS community of practise and its activities is likely to define better indicators of the quality of the software product than a prescriptive approach that tries to check whether these activities follow prescribed standards and guidelines [12]. Posting and committing are two key activities that can be used in the definition of quality indicators for FOSS. At individual level, the number of postings is an indicator of engagement while number of commits is an indicator of productivity [13]. At community level, contents of postings incorporate information about collaboration among community members [13] as well as information that may be interpreted as symptoms of community antipatterns [14]. Cerone [12] discusses how to combine engagement and productivity as well as other individual factors, such as reputation and learning, with collaboration effectiveness and community antipatterns towards the definition of a comprehensive framework for the analysis of FOSS software quality to be used in a certification process. In this work we quantitatively analyze developers’ contributions in terms of numbers of postings and commits, and investigate two research hypotheses on the correlation of these two activities. The purpose of our analysis is to study the patterns of contributions of developers in a number of successful FOSS projects, characterize projects in terms of patterns of contributions and discuss how patterns of contributions affect the engagement and productivity of developers and, ultimately, the quality of the produced code. Even though a strong linkage exists between the information in FOSS repositories (e.g. bug reports and source code repositories [15,16]), few researchers, with the exception of [5], strive to understand how developers’ contributions vary across repositories. In this work we try to fill this niche by establishing links between source code repositories and mailing lists to locate developers who are present in both repositories and quantify their contribution in terms of commits and posts. Furthermore, the unique characteristics of each FOSS project introduces uncertainty in estimating the balance between posting messages to mailing lists and committing code to a project’s code repository. In order to manage this uncertainty, and further understand how developers’ contributions vary across repositories, we compute the probability of achieving a specific contribution in terms of commits and posts using Bayesian Network analysis. Bayesian Belief Networks (BBNs) are a suitable formalism that can be used to build models of domains with inherent uncertainty [17,18]. BBNs provide a natural, logical and probabilistic framework to examine the relationship between posts and commits. Three considerations justify the decision of modeling the FOSS developers’ contribution with BBNs:
• The formalism of Bayesian Networks enables the development of a precise mathematical model of FOSS developer contribution.
• The suggested model can be used by researchers to quantify the probability of FOSS developers making more commits to a project’s Version Control System (VCS) or code repository than they are posting messages to mailing lists or the probability of posting more messages than code commits. All the VCS data from the 20 GNOME projects used in this research originates from SVN (Apache Subversion), which is one type or class of VCS. Thus, we will be referring to SVN to describe our commits data, instead of VCS throughout this paper. • The proposed BBN model can be used by researchers to estimate an adequate balance between coding and posting to mailing lists in terms of the probability of achieving the desired contribution size of mailing list posts or code commits. 1.1. Research questions and hypotheses For software projects to evolve, it is by design that developers must continuously commit and review the codebase. In the eyes of the developer, user, and business community, an active mailing list is a proxy of project success. The presence of project’s leads, core and active developers in mailing lists has a profound effect on the way individuals within and outside the project see the commitment of most influential members in the project. For software companies and private enterprisers, developer presence in lists may indicate that software support activities are not only available from ordinary users, but also come from individuals behind the software and project. Thus, developers should strive to balance their coding activity with their involvement in mailing lists. This raises a number of productivity and governance questions which may be of great interest to both FOSS project administrators and researchers. For instance; How many developers are willing to commit code and patches and at the same time participate in discussions in mailing lists and other project’s fora? If developers are
S.K. Sowe et al. / Science of Computer Programming (
)
–
3
coding more than they are participating in mailing lists, what does this tell us about the maintenance and dynamics of the software and project? How much effort can a developer allocate to one activity and at what stage in the project’s life-cycle? This pattern of contribution raises further questions such as, if attaining a balanced activity is greatly required in a project, how can project administrators schedule and assign or dedicate one activity at the expense of another? What is the impact on the performance of the project of having developers specializing in one activity? Our research hypothesis is stated thus;
• Hypothesis: FOSS developers make more commits to a project’s code (SVN) repository than they are posting messages to mailing lists. To help us answer some of the above questions and address our research hypothesis, we used data provided by the FLOSSMetrics project [19] to proposed a methodology to help us shed light on some of the above issues. FOSS researchers (e.g. [20–22]) study and report developers coding activities separately from their mailing list’s activities. However, research on the FOSS development process [23,3] informs us that in many projects, a small number of talented core developers or ‘‘code gods’’ [24] are busily (as if in a software beehive) submitting patches and tinkering with the code to produce good and usable software for the rest of the community. This cohort also contributes to discussions in mailing lists; interacting with other software developers and users, keeping abreast with project activities and monitoring what goes on in their respective projects or packages [25]. Nevertheless, we conjecture that not all the developers who commit or make changes to a project’s source repository also participate in [developer] mailing lists. Thus, this research investigates the contributions of FOSS developers to both SVN and developer mailing lists and presents a methodology to overcome the empirical research challenges associated with integrating or linking data from the two repositories. That is, we find whether developers are coding through commits in SVN as much as they are participating in mailing lists. This involves correlating developers commits activities with their corresponding mailing lists activities within the same project. The rest of the paper is organized as follows. In Section 2, we present the background and work related to our research. Our research methodology, characteristics of our data, and developers’ contributions are discussed in Section 3. This is followed by an analysis and discussion of our results in Section 4. Our analysis is further extended in Section 5 to incorporate Bayesian networks to help us visualize developers patterns of contribution. In our concluding remarks, in Section 6, we summarize and discuss findings and state the implications if this research FOSS quality assurance and certification. This section also highlights our future work we plan to undertake. 2. Background and related work A lot of research utilizes data from a single repository to analyze code contribution of developers [24,4], trends and inequality in posting and replying activities in Apache and Mozilla [20], KDE [21], Debian [25], and FreeBSD [22]. Most of these researchers use data from Concurrent Versions Systems (CVSs) or mailing lists as these are de facto repositories in FOSS projects. SVN is mainly used to coordinate the coding activities of software developers and manage software builds and releases. Mailing lists, on the other hand, are the main communication channels [25]. Many important aspects of a project are negotiated in (developer) lists: software configuration details, the project roadmap and how to deal with future requests, how tasks are distributed, issues concerning package dependencies, scheduling online and off-line meetings, etc. Thus, for a developer to keep abreast with developments in a project, committing code to SVN alone is not sufficient. She/he needs to participate in the respective lists, communicate her/his ideas, and engage with colleagues. To bolster this view, [26] pointed out the essence of communication as a means to foster long term success of software projects. This may take the form of a bi-directional developer to developer, developer to user, and developer to community communication. Bayesian networks are used in situations that require statistical inference and have extensively received attention in the FOSS literature. For instance, Bayesian networks have been used to model causal relationships between FOSS success metrics [27], study code reuse in FOSS [28], predict defects in the Eclipse codebase [29], assess the interoperability and guide developers to improve the quality of FOSS [30], as a means to assess the complexity and predict software quality using data from NASA projects and FOSS systems [31], etc. In [27], the authors focused on the development of a model for predicting software project success that leverages the wealth of available FOSS data in order to accurately forecast the behavior of software engineering groups. The model accounts for both the technical elements of software engineering and the social elements that drive the decisions of individual developers. In [28], a Bayesian algorithm was used to estimate the probability of a code modification to be a knowledge reuse incident, using conditional probabilities, by rating the occurrence of specific words based on training data. Using Bayesian analysis, the authors in [28] were able to calculate the probability that a source code modification comment related to imported lines of code from another project. A combination of statistical reasoning and Bayesian networks have even been used to infer resource management in debugging in several large FOSS projects [32]. BBNs are causal networks that consist of a set of variables and a set of directed links between variables and provide the means to structure a situation for reasoning under certainty [18,17]. Each variable has a finite set of mutually exclusive states. Furthermore, to each variable A (e.g. commit) with influencing variables (parent posts) B1 , . . . , Bn , there is attached the potential table Pr(A|B1 , . . . , Bn ). These directed network graphs represent causal relations between different events. The directions of the graph indicate the direction of the impact. Causal networks can be used in order to follow how a change
4
S.K. Sowe et al. / Science of Computer Programming (
)
–
Fig. 1. Methodology to identify developers from multiple repositories.
of certainty in one variable may change the certainty for other variables. Formally, the relation between the two nodes is based on Bayes’ rule, and can be expressed as: Pr(B|A) =
Pr(A|B) Pr(B) Pr(A)
.
(1)
Building a BBN for the domain of FOSS developer contribution in terms of posts and commits involves three tasks. The first task is to identify the network variables that are of importance, together with their possible values. The second task is to identify the cause and effect relationships between the variables and express them in a graphical structure. The last task is the quantitative analysis that requires building a probabilistic network by obtaining the required probabilities. The model can then be explored using Bayesian propagation in order to set specific values to either the commits or posts variables and explore the effect to the probabilities of the other variable. This method, formally presented in Section 5, can be used to quantify the probability of achieving a specific number of posts or commits given a number of posts or commits. In this way, researchers can visualize the different patterns of contributions according to the specific characteristics of each software project. 3. Research methodology The methodology employed in this research investigates the simultaneous occurrence of developers in SVN and mailing lists. That is, identifying developers who make both commits and posts, and ensuring that the developer making the SVN commit(s) is the same individual posting to the developer mailing list(s) of the same project. The methodology also ensures that developers with multiple identities in the mailing lists are only counted once. The methodology, as represented in Fig. 1, shows the FLOSSMetrics database as the data source from which we extracted SVN and mailing lists data for the 20 projects in our study. In our data acquisition, a fundamental question is always asked; ‘‘Is this the same developer we have in both repositories?’’ Fig. 1 also shows the MySQL database tables and fields from which we extracted commits and posts. These two attributes are used in our analysis as variables for identifying developers in the projects. The identification of developers across the two repositories is discussed in more detail in Section 3.3. The links between the tables as indicated by the arrows (with ‘‘IS’’) shows the path taken to locate a developer and counting his/her contribution to both SVN and mailing lists. 3.1. Data The data for this research comes from the 20 GNOME projects shown in Table 1. The FLOSSMetrics database retrieval system uses a combination of tools to retrieve and analyze data from FOSS projects (e.g. GNOME, Apache, etc.) repositories (SVN, mailing lists, bug tracking systems, etc.) and forges (e.g. SourceForge) and computes various code and community metrics. 3.1.1. SVN data The CVSAnalY2 [33–35,8] tool retrieves Source Content Management systems (SCM) data and stores committers attributes into various tables. Each project in the FLOSSMetrics database can have one or more SVN dumps. For each project, we extracted the SVN committers and the number of commits they made. These two values act as fields in our commits table for committer identification.
S.K. Sowe et al. / Science of Computer Programming (
)
–
5
Table 1 List of GNOME projects studied. No.
Projects
No.
Projects
1 2 3 4 5 6 7 8 9 10
Balsa Brasero Deskbar Applet Ekiga Eog Epiphany Evince Evolution GDM gedit
11 12 13 14 15 16 17 18 19 20
GNOME Control Center GNOME Games GNOME Media GNOME Power Manager GNOME Screensaver GNOME System Tools Libsoup Metacity Nautilus Seahorse
3.1.2. Mailing lists data The MLStats [36,8] tool extracts one or more mailing lists archives of a particular project. We downloaded *.sql files dumps of each project and extracted data contained in two tables. The messages_people table contains two fields; type_of_recipient and email_address. The type_of_recipient field has the format ‘‘From’’, ‘‘To’’, and ‘‘Cc’’. The ‘‘From’’ email header is used to identify lists posters [37,1] and count their contribution to mailing lists [1]. The people table contains three fields; email_address, name, and username. As shown by the ‘‘IS’’ arrows in Fig. 1, two of these fields are used to link and reference developers’ information in the messages_people table and commits table. 3.2. Data cleaning FOSS data is widely and freely available in various repositories in various formats. This provides researchers with an unprecedented abundance of easily accessible data for research and analysis of FOSS community participation [38]. However, it is becoming increasingly evident that collecting and analyzing FOSS data has become a problem of abundance, reliability, aggregation, cleaning and filtering useful information to address the research question at hand [38]. Having identified fields needed to analyze developers participation in SVN and mailing lists, we proceeded with data cleaning. From the people table we need both the ‘‘name’’ and ‘‘username’’. Therefore, all posters without recognizable names and/or usernames were removed. Some of the names contained unrecognizable characters such as ‘‘=?ISO88591?Q?g=FCrkan_g=FCr?=’’. Some of the posts with null posters or developers were also removed. Furthermore, since the full name (first +last) is needed to identify a developer, all posters with a single name were deleted from the mailing lists data. That is, delete developer ‘‘Foo’’ but retain developer ‘‘Foo Bar’’. For the SVN data in the commits table, all commits without committers or authors were removed. Aggregate numbers of items deleted in each of the above categories were; unrecognizable characters = 28, posts with null posters = 30, posters with a single name = 14, and commits without authors = 5093. 3.3. Identification of developers across repositories As depicted in Fig. 1, a poster in the mailing lists can be identified in two ways. In the messages_people table, a poster is identified by his email address. By using the ‘‘From’’ field, all the emails posted by a particular person can be aggregated. The people table is used to identify a poster through his ‘‘email address’’, poster ‘‘name’’ in the form of first name + last name (e.g. Pawel Salek), and ‘‘username’’ (eg. pawsa). For the SVN data, the committer field from the commits table was used to identify a committer or author of a commit. In SVN, an individual is simply identified as a ‘‘Committer’’ or an ‘‘Author’’ of one or more commits. Mailing lists participants, on the other hand, can be identified by means of message identifiers like ‘‘From:’’ in email headers [1]. The identification process proceeds thus; 1. SVN data: For each project in the commits table, LIST all the committers AND for each committer (unique commit_id) SUM all his commits and store the value as ncommits variable. 2. Mailing lists data: • For each project in the people table, LIST (‘‘email address’’ + ‘‘name’’ + ‘‘username’’ or poster_id) WHERE both name and username are the same for this committer as in the commits table, AND • From the messages_people table, LIST developers ‘‘email_address’’, WHERE people.email_address = messages_people. email_address. For each developer, COUNT all the posts and store the value as nposts variable. The results of a typical query are shown in Fig. 2, with developers emails made anonymous. From the query, it can be seen that a developer may appear many times. This is because while a developer has only one identification in SVN, his commit id, the same developer may use many email addresses when posting messages to developer mailing lists. 3.3.1. Unmasking aliases and removing duplicates The volunteering nature of the FOSS development process and participation in public repositories means that participants may use different emails. For example, as shown in Fig. 2, a developer (e.g. Felix Riemann) has his identity masked in
6
S.K. Sowe et al. / Science of Computer Programming (
)
–
Fig. 2. Identification of FOSS developers from SVN and mailing lists.
three email aliases;
[email protected],
[email protected],
[email protected]. The fundamental problem in email alias unmasking [39,25] is finding out whether those aliases all belong to one participant or developer. The algorithm for checking duplicate records and unmasking aliases in the mailing lists data proceeded thus; BEGIN { For ALL records in project X IF ncommits OCCURS MORE THAN ONCE for THIS developer AND poster_id = commit_id RETURN ‘‘THIS is a duplicate’’ RECORD ONLY 1 value of ncommits for THIS developer TOTAL posts for this developer = SUM of nposts }END The query scenario in Fig. 3 shows the result when the algorithm is applied to the dataset. This literally means; a developer (e.g. Federico Mena Quintero in Fig. 2) with a unique commit_id (federico) made 100 commits to the project’s SVN. However, he contributed to mailing lists using two emails (
[email protected] and
[email protected]). He posted 28 messages using the first email and 1 message using the second email. The developer’s overall email postings is the sum of the two posts he made using the different emails, i.e. 28 + 1 = 29. All duplicate records are identified and developers nposts and ncommits are calculated in a similar manner. There were, on average, 115 duplicate records of this nature per project in our dataset. This means that many developers are using multiple email addresses. 4. Analysis and discussion Since we have little support from the literature about how a similar research or methodology was being carried out, we begin our analysis using a technique called exploratory data analysis of (EDA) [40]. This technique is useful in helping us examine the distribution, the nature of the commits and posts, and prepare the ground for what may be the appropriate analysis technique to be used to answer the research hypotheses [5]. 4.1. Trends in mailing lists and SVN Tables 2 and 3 shows the descriptive statistics of the developers’ posting and committing activities after data cleaning.
S.K. Sowe et al. / Science of Computer Programming (
)
–
7
Fig. 3. Query scenario to identify developers in SVN and mailing lists. Table 2 Descriptive statistics of posts. Projects
Balsa ** Brasero Deskbar Applet Ekiga ** Eog Epiphany ** Evince ** Evolution ** GDM ** gedit ** GNOME Power Manager ** GNOME Control Center GNOME Games GNOME Media GNOME Screensaver GNOME System Tools ** Libsoup ** Metacity Nautilus ** Seahorse Total
N Posters
1,088 63 97 729 134 889 451 4,769 658 571 203 174 173 289 27 297 52 60 2,065 62
Mean
Median
Std. Dev.
Skewness
Std. Err. of Skewness
Max.
Sum
12.98 4.13 7.00 9.24 4.17 5.91 3.46 7.44 3.99 3.95 5.58 8.36 8.79 5.39 5.59 4.51 3.73 4.82 8.61 6.16
2.00 1.00 2.00 3.00 1.50 1.00 1.00 2.00 1.00 1.00 2.00 2.00 2.00 2.00 3.00 1.00 1.00 2.00 2.00 2.00
67.273 8.071 19.187 59.999 8.914 23.795 13.093 46.274 25.597 15.653 33.046 20.936 25.146 12.270 7.846 11.019 8.761 11.029 61.402 18.382
13.942 3.498 5.200 22.103 4.533 12.657 14.013 25.619 20.006 14.252 13.881 5.261 5.909 5.884 3.322 6.076 6.326 5.301 32.822 5.390
.074 .302 .245 .091 .209 .082 .115 .035 .095 .102 .171 .184 .185 .143 .448 .141 .330 .309 .054 .304
1465 45 137 1509 67 470 238 1760 595 306 470 186 224 115 39 112 63 77 2475 122
14,125 260 679 6,734 559 5,250 1,562 35,478 2,628 2,253 1,133 1,455 1,521 1,557 151 1,339 194 289 17,782 382
12,851
95,331
As shown in Table 2, for each project the total number of posters (N posters), the mean post per poster, the median, standard deviation, skewness, the maximum posts made by one individual, and the total or sum of postings for that project are shown. For all the projects, the mode and minimum numbers of posts made equals 1. A total of 12,851 posters contributed 95,331 email messages. Table 3 shows the same descriptive statistics for the committers (N Committers) in each project. A total of 4,762 developers (fewer than about a third of the posters) made 140,665 commits. Evident from the statistics is that each project has its unique characteristics [7] in terms of developers’ postings and committing activities, as well as the number of developers involved in each activity. For instance, 45% (N = 9) of the projects (marked with ++ in Table 3) have more committers than posters. The other 55% (N = 11) of the projects (marked with ** in Table 2) have more posters than committers. Generally, the contributions of the developers to mailing lists is characterized by smaller means (post per poster). However, the posting data has many outliers; with many developers posting few emails and a few making large numbers of posts. On the contrary, the commits are characterized by larger means (commits per committer). These characteristics are reminiscent of power distributions observed in FOSS participants’ contributions to mailing lists [25] and CVS [41] activities. 4.2. Trends in mailing lists and SVN over time In order to explore the trend in developer participation in the projects’ repositories over time, we divided the distribution of commits and posts into three main groups, as shown in Figs. 4–6.
8
S.K. Sowe et al. / Science of Computer Programming (
)
–
Table 3 Descriptive statistics of commits. Projects
Balsa Brasero ++ Deskbar Applet ++ Ekiga Eog Epiphany Evince Evolution gdm gedit GNOME Power Manager GNOME Control Center ++ GNOME Games ++ GNOME Media ++ GNOME Screensaver ++ GNOME System Tools ++ Libsoup Metacity ++ Nautilus Seahorse ++ Total
N Committers
181 86 133 186 298 252 203 430 282 329 148 423 321 324 126 207 37 264 395 137
Mean
Median
Std. Dev.
Skewness
Std. Err. of Skewness
44.09 26.05 19.67 41.99 16.59 34.84 17.30 81.11 23.63 20.68 22.75 21.39 27.68 13.05 12.79 20.55 32.49 15.50 37.52 21.39
4.00 5.00 5.00 5.00 4.00 6.00 4.00 10.00 5.00 5.00 5.00 6.00 7.00 4.00 4.00 5.00 1.00 4.00 7.00 5.00
241.309 137.976 84.751 286.865 53.660 217.618 59.881 309.253 103.297 81.699 161.952 55.908 89.618 31.803 74.388 82.615 111.261 61.675 126.202 99.481
9.233 8.869 8.413 12.130 8.231 14.340 7.494 8.099 9.653 10.704 12.060 6.917 8.559 6.804 11.097 10.079 5.067 8.547 8.529 9.603
.181 .260 .210 .178 .141 .153 .171 .118 .145 .134 .199 .119 .136 .135 .216 .169 .388 .150 .123 .207
4762
(a) ncommits.
(a) ncommits.
Max.
2688 1271 834 3757 581 3352 535 4061 1266 1153 1974 634 1164 345 838 1043 647 600 1712 1087
Sum
7,981 2,240 2,616 7,810 4,944 8,780 3,511 34,877 6,663 6,804 3,367 9,049 8,884 4,228 1,611 4,254 1,202 4,091 14,822 2,931 140,665
Fig. 4. Steady increase in commits and posts.
Fig. 5. Fluctuations (Crests and troughs) in commits and posts.
(b) nposts.
(b) nposts.
1. Group 1: Projects which show a steady increase in commits and posts during their life-cycle (Fig. 4). For this group, 75% (N = 15) of the projects show a steady increase in the number of commits. Only 45% (N = 9) of projects show a steady increase in postings. A small percentage, 30% (N = 6) show a steady increase in both commits and posts.
S.K. Sowe et al. / Science of Computer Programming (
(a) ncommits.
)
–
9
(b) nposts. Fig. 6. Great fluctuations and inconsistency in commits and posts.
2. Group 2: Projects which show slight fluctuation, indicated by the presence of one or two crests and troughs in commits and posts (Fig. 5). There are a few projects in this group; 3 projects with fluctuations in commit activities and 4 projects in posting activity, respectively. No project shows a fluctuation in commits and posts at the same time. 3. Group 3: Projects which show great fluctuations and inconsistency in the commits and posting activities of developers (Fig. 6). Only two projects (evince and libsoup) show significant fluctuation and inconsistency in their commit activities. The evince projects show great fluctuations and inconsistency in the commits between April 1999 until September 2003 when commit activities began to rapidly increase. In the libsoup project, commits fluctuated from the beginning until the end of our study in July, 2011. Seven projects displayed a significant fluctuation and inconsistency in their posting activities, with gnome media and gnome system tools showing exceptional fluctuations when compared to the rest. 4.3. Developers in both SVN and mailing lists In order to analyze the simultaneous occurrence of the developers in both repositories, we queried the SVN and mailing lists data of each project and computed developers’ contributions in terms of the ncommits and nposts variables discussed in Section 3.3. Table 4 shows the number of developers (N_dev) in each project who contributed to both SVN and mailing lists. For the 20 projects, 502 developers made more commits (mean = 152.1; Std. deviation = 431.171) than posts (mean = 43.19; Std. deviation = 164.353). Furthermore, as shown in Fig. 7, our identification technique and algorithm revealed a relatively small, but varying percentage of developers who are involved in both activities. The percentages for a set of posts and commits in each project are calculated as follows:
% posters =
N_dev
N Poster
% committers =
∗ 100
N_dev
N committers
(2)
∗ 100.
(3)
The percentage of developers in each activity varies across the projects. For example, the Ekiga, Gnome Power Manager, and Seahorse projects have less than 5% of their developers committing to VCS and at the same time posting messages to their respective projects’ mailing lists. Projects such as Balsa and Nautilus have few posters (3.68% and 3.23%), but a higher percentage of committers (22.1% and 34.43%). 4.4. Hypothesis validation In validating our hypothesis, for each project, we compared the total number of commits made to SVN with the total number of messages developers posted to the mailing lists. The pattern of contribution for all the 502 developers in the 20 projects is shown in the boxplots in Fig. 8. In the boxplots, the median line and error T-bar widths for each set of project data (nposts and ncommits) are shown. The domination of SVN commits, with larger means of commits per developer (mean = 150.32, Std. deviation = 424.986) over posts (mean = 42.63, Std. deviation = 161.852) is evident in all the projects. Furthermore, while the distribution of posts and commits (Fig. 8) showed skewness towards more commits, we further explored the possibility that, in some projects, FOSS developers’ contribute equally to code repository and mailing lists. In the exploration, we used correlation between commits and posts to study how developers’ activities in SVN and mailing lists are related. The scatter plot in Fig. 9 shows the correlation between commits and posts in all projects. In the plot, data points are fitted to a line to show the trend in the commits and posting activities of the developers. Previous research [25, page 414]
10
S.K. Sowe et al. / Science of Computer Programming (
)
–
Table 4 Developers’ contribution to both SVN and mailing lists. N_dev.
Projects
nposts Mean
Median
Std. dev.
Max.
Sum
ncommits Mean
Median
Std. dev.
Max.
Balsa Brasero Deskbar_applet Ekiga Eog
40 6 8 4 16
37.23 19.33 20.13 438.25 18.81
5.5 2 6 121.50 4.5
133.76 30.936 35.64 722.49 25.95
851 77 106 1509 78
1489 116 161 1753 301
112.53 69.17 120.25 2170.25 129.38
25 8.5 5.5 2417 37.5
206.33 98.3 289.5 1876.01 196.16
751 196 834 3757 581
4,501 415 962 8,681 2,070
Epiphany Evince Evolution gdm gedit g_control_center g_games g_media g_power_manager
40 18 92 21 19 35 14 23 7
55.73 27.17 56.47 26.38 20.84 21 43.57 21.87 3.43
7 2 4.5 2 2 4.00 7 5 4
116.69 58.61 172.68 56.63 69.34 51.753 83.15 36.67 1.51
470 238 1481 227 306 296 304 130 6
2229 489 5195 554 396 735 610 503 24
146.82 100.89 283.29 112.9 103 69.54 178.93 39.22 8
16.5 10.5 46 17 4 19 13.5 6 2
536.03 180.31 622.01 242.76 267.13 125.13 341.49 84.03 11.4
3352 535 4061 939 1153 527 1164 345 32
5,873 1,816 26,063 2,371 1,957 2,434 2,505 902 56
4 22 3 10 136 2
15.75 17.14 28.33 14.8 49.95 73.5
15.84 33.42 39.63 23.85 225.14 101.12
39 154 74 77 2475 145
63 377 85 148 6793 147
211.5 92.59 219.67 184.2 86.63 198
3.5 24 8 7 13 198
417.67 228.27 370.09 270.12 220.1 275.77
838 1043 647 600 1712 393
846 2,037 659 1,842 11,782 396
g_screensaver g_system_tools Ibsoup Metacity Nautilus Seahorse
10 3 8 6 8.5 73.5
Sum
Seahorse
Nautilus
Metacity
Libsoup
Gnome System Tools
Gnome Screensaver
Gnome Media
Gnome Games
Gnome Control Center
Gnome Power Manager
gedit
GDM
Evolution
Evince
Epiphany
Eog
Ekiga
Deskbar Applet
Brasero
Balsa
Percentage
Where g_ indicates GNOME related projects.
Projects
Fig. 7. Percentage of developers involved in posting and committing.
showed that FOSS developers and users’ mailing lists activities have fractal or self-similarities properties and could best be explained by a polynomial model of third order, i.e. a cubic relation of the type LogN = b0 + b1 ∗ logr + b2 ∗ (logr )2 + b3 ∗ (logr )3 .
(4)
As shown in Fig. 9, our fit method could explain 30.4% (r 3 = 0.304) of the variability in commits and posting activities. This translates to 26.5% or r 2 = 0.265 in linear terms. The linear association between ncommits and nposts as measured by Pearson correlation = 0.594, and this is significant at the 0.01 level (2-tailed) with ρ = 0.000. However, the ncommits and nposts data are not normally distributed and have outliers. Thus, nonparametric correlations using Spearman’s rho and Kendall’s tau_b statistics, which work regardless of the distribution of the variables [42], are used to report the association between commits and posts. Table 5 shows that, overall, there is a low correlation between commits and posts, with Spearman’s coefficient (ρ) = 0.426 (p = 1.000). Furthermore, Wilcoxon signed ranks for the Two-Related-Samples Tests procedure was used to compare the distributions of two variables. The results of the test in Table 6 show that for the 502 developers in the 20 projects; 140 developers had a negative rank or ncommits < nposts, 327 developers had positive rank or ncommits > nposts, and 35 developers had a balanced activity with ncommits = nposts.
S.K. Sowe et al. / Science of Computer Programming (
)
–
11
10,000
1,000
100
10
1
Seahorse
Nautilus
Metacity
Libsoup
Gnome System Tools
Gnome Screensaver
Gnome Power Manager
Gnome Media
Gnome Games
Gnome Control Center
gedit
GDM
Evolution
Evince
Epiphany
eog
Ekiga
Deskbar Applet
Brasero
Balsa
0.001
Projects
Fig. 8. Distributions of posts and commits (y-axis in log scale).
Fig. 9. Relationship between posts and commits (both axis in a log scale). Table 5 Correlations between posts and commits (N = 502). Non-parametric t-test
Variables
Kendall’s tau_b
nposts ncommits
Spearman’s rho
nposts ncommits
nposts
ncommits
Cor. Coeff. Sig. Cor. Coeff. Sig.
1.000 . .308 .000
.308 .000 1.000 .
Cor. Coeff. Sig. Cor. Coeff. Sig.
1.000 . .426 .000
.426 .000 1.000 .
5. Analyzing patterns of contribution using Bayesian networks The exploratory data analysis and subsequent discussions presented in previous sections analyzed the distribution and the nature of the commits and posts in order to examine the research hypotheses of this paper. The analysis discovered that a
12
S.K. Sowe et al. / Science of Computer Programming (
)
–
Table 6 Ranks of developers’ contribution. Variables
Tests
Ndev.
Mean rank
Sum of ranks
ncommits -nposts
Negative ranks Positive ranks Ties
140 327 35
175.46 259.06
24565.00 84713.00
Total
502
Table 7 Results of discretization process for the variable ‘‘Ekiga’’. Interval
Explanation
x_137_1818 x_137_1818_274_3636 x_1371_818 x_274_3636_411_5455
X<137.1818 137.1818
1371.818 274.3636
small percentage of developers are contributing to both repositories and that this cohort is making more commits than they are posting messages to mailing lists. We further examine the hypotheses that were previously put forward in this paper, using Bayesian networks. For hypothesis H1, we explore the probabilistic values of posting messages and commits for each project in the dataset. For hypothesis H2, we use Bayesian propagation to discover the exact balance between coding and posting to mailing lists. This is quantified in terms of the probability of achieving the desired discretized categorical values in both posts and commits. For the creation of the Bayesian Network, two variables for each project are used regarding the posts and the commits of FOSS developers. However there is a problem in using the posts and commits as BBN variables. These are continuous variables, which need to be quantified into discrete valued categories. In order to address this problem, the posts and commits were discretized using Data PreProcessor [43] with BN PowerConstructor and BN PowerPredictor for preprocessing the training data. Data pre-processing involved discretization of continuous fields into a finite number of intervals using the default equal width binning method of the Belief Network Powersoft Data Preprocessor [43] software tool. Equal width binning divides the range of possible values into N subranges of the same size using a rule which can be expressed as: binwidth =
(Maxvalue − Minvalue) N
.
(5)
For example, using the number of posts of each FOSS developer that contributed to the project ‘‘Ekiga’’, this will create 137.1818 bins. Table 7 describes an example of the intervals used in the ‘‘Ekiga’’ variable. This variable contains the probabilistic values of the posts of the ‘‘Ekiga’’ GNOME project. This analysis confirms previous findings that the percentage of developer contribution in each activity varies across the projects. This is illustrated in Fig. 10 using the sizes of the intervals (bins). Some project BBN discretized variables, automatically generated by the BBN software, have 2 intervals while others have up to 9 intervals. The Belief Network Powersoft BN PowerConstructor [44] tool was then used to provide our domain knowledge and construct the Bayesian Network model shown in Fig. 10. In the model, two BBN nodes are employed in order to model the posts and commit probabilistic values of each of the 20 different projects. The default threshold was used in order to detect only strong relations between variables. In the ordering of fields (variables), the posts variable of a project is an indirect cause of the commits variable for all projects. This procedure sped up the construction process and defined the directed cause-effect relations between the variables. In this specific model, the direction of this relationship is not important because in case a model has only 2 variables, setting evidence to any of the variables directly affects the probabilistic values of the other node. The results of the model were improved by providing domain knowledge compared with a model that contained no domain knowledge at all. However, even without providing any domain knowledge, PowerConstructor [44] would be able to detect correct non-directed cause-effect relations. In the dataset used in this paper, the different projects could not be inter-connected with cause and effect relationships because FOSS developers of the example dataset participate in different projects and the same developer does not contribute to every project of the dataset. Genie 2.0 [45] was used for further processing in order to set evidence to the BBN in Fig. 11 and produce meaningful results. Every time evidence was set to either the posts or commits node of a project, the belief values were updated in order to investigate how Bayesian propagation affects the values of the other node. This technique allowed us to set evidence to both posts and commits variables of each of the 20 projects and explore the probabilities of achieving a specific number of posts or commits given a specific number of either posts or commits. For example by setting to both of the variables posts and commits of the Nautilus project, we are able to quantify and visualize how changes in one variable affect the other. Fig. 11 illustrates three different scenarios for each variable of the Nautilus project. The Nautilus posts variable had 5 value
S.K. Sowe et al. / Science of Computer Programming (
)
–
13
Fig. 10. BBN model showing FOSS developers’ contribution.
intervals of contribution size, while the BN Pre-processor software automatically defined 8 intervals for the commits node. All nodes use the same notation shown in Table 7. Both nodes start with the smallest possible interval which indicates the size of the bin. By setting evidence to the next largest bin (206.25
14
S.K. Sowe et al. / Science of Computer Programming (
)
–
Fig. 11. Exploring the FOSS Developer Contribution BBN model.
results in an increased probability of achieving the smallest interval of the commits node of 56% and a probability of 6% for all other intervals. In order to visualize and understand how contributing a larger number of posts in this project affects the number of commits, evidence is set to the next interval which is 412.5
S.K. Sowe et al. / Science of Computer Programming (
)
–
15
of FOSS developers. Another threat caused by the application of BNs is the steep learning curve that is required to allow researchers to use BNs for research purposes. This is not a threat to validity as a plethora of research papers and books on BNs as well as software tools that are available free of charge can be used in order to obtain guidelines and best practices for the creation of BNs. Regarding the external validity of our analysis, BNs use prior data and therefore it is not possible to generalise our results to a different dataset. This is not considered a threat in our work as the resulting BNs offered meaningful results from the probabilistic exploration of the two hypotheses. 6. Concluding remarks In this paper we have put forward research questions to investigate whether FOSS developers are making more commits to code repositories than they are posting messages to mailing lists, and whether developers should aim at a balanced activity by contributing equally to the two repositories. Despite the fact that FOSS data is widely available and can be easily extracted [38], this kind of research is made difficult because of the problem associated with linking data and the subsequent identification of developers from multiple repositories (SVN and mailing lists). We have presented and discuss a methodology which alleviates these empirical research obstacles. The methodology and algorithm enabled us to locate and count the quantitative contribution of FOSS developers in 20 GNOME projects. An exploratory data analysis or EDA technique [5] was used to show that each project has its unique characteristics and developers’ contribution to either coding or mailing lists can vary tremendously. This trend was validated through descriptive statistics of the data and by observing developers participation in the projects repositories over time. Our data consists of 12,851 posters and 4762 committers who, respectively, posted 95,331 email messages and made 140,665 commits. From this data, we found out that in 55% (N = 11) of the projects there are more developers as posters, with smaller means (post per poster), than committers, with larger means (commits per committer). From this sample of posters and committers we are able to extract 502 developers who simultaneously contribute to both SVN and mailing lists. This cohort made more commits (mean = 152.1; Std. deviation = 431.171) than posts (mean = 43.19; Std. deviation = 164.353). However, this group accounts for a relatively small percentage of the overall developer community in each project. But a close examination of the percentage of developers involved in posting and committing shows that projects with small number of posters will also have a small number of committers. This is valid in 60% (N = 12) of the projects studied. There is a 50–50 split (20%; N = 4 on either side) between projects with a small percentage of posters but a large percentage of committers (Balsa, Epiphany, Evolution, and Nautilus) and those with a large percentage of posters but a smaller percentage of committers. The analysis supports our first hypothesis that developers are making more commits to a project’s code (SVN) repository (mean = 150.32, Std. deviation = 424.986) than they are posting messages to the developers mailing lists (mean = 42.63, Std. deviation = 161.852). However, while this may be true for about 60% of the projects studied, a low but significant correlation (ρ = .0426; p = 1.000) between developers’ commits and posting was also observed. This shows that, in some projects, developers are contributing equally to code repositories and mailing lists. Wilcoxon signed ranks for the Two-Related-Samples Tests further revealed that only 35 developers (less than 10%) had a balanced or tied activity. The implications of these findings may provide assurance that FOSS developers, apart from coding and committing bits of code to a project’s VCS, are also involved in knowledge brokerage [1] in mailing lists. We can conjecture from earlier findings [1,34,46,5] and our experience in this research, that this serendipity has implications for the quality of code since a large number of developers are externalizing and discussing their coding activities with other community members in the mailing lists. This kind of engagement may enable the developers to improve the quality of their code base, do more refactoring and learn about how the quality of the produced code may be improved. Future work In the future we aim to consolidate our understanding of developer dynamics and the development of appropriate community metrics [11] or indicators for the certification of both the FOSS product and process. We plan to add a qualitative element to this research by interviewing some of the α [47] or star [25] developers. This future work may also incorporate content analysis of the postings, addition of new metrics like posts/commits and how such metrics vary over time. This kind of data, metrics, and commits analysis may help us better understand the quality of the developers’ contribution, reveal any bottlenecks which may hinder the incorporation of developers’ code into the released product, and further reveal what kinds of metrics may be most appropriate when characterizing FOSS developers and projects. Furthermore, we have introduced Bayesian networks as an additional tool to help us model, visualize, and quantify the probability FOSS developers making more commits to a project’s code repository than they are posting messages to mailing lists. In our future work, we are exploring means of using Bayesian analysis to help us find the exact balance between coding and posting to mailing lists. Finally, while the conclusion drawn from this study points out certain trends in the GNOME projects, our next research will use a more heterogenous sample of projects. This will help us find out whether the patterns observed in this research can be generalized to other FOSS projects, not specifically GNOME based.
16
S.K. Sowe et al. / Science of Computer Programming (
)
–
Acknowledgments We are grateful to the FLOSSMETRICS project for providing access to the data and tools used in this study. We also wish to acknowledge the excellent comments we received from anonymous reviewers of both the OpenCert 2010 Workshop and the Science of Computer Programming Journal. The first author wishes to acknowledge the Japan Society for the Promotion of Science (JSPS) for sponsoring this research, JSPS FOSSINA Project ID: P10807. This work has also been supported by Macao Science and Technology Development Fund, File No. 019/2011/A1, in the context of the PPAeL project. The first author’s research was conducted when the author was with the United Nations University Institute of Advanced Studies (UNU-IAS), Yokohama, Japan. References [1] S.K. Sowe, I. Stamelos, L. Angelis, Identifying knowledge brokers that yield software engineering knowledge in OSS projects, Inf. Softw. Technol. 48 (2006) 1025–1033. [2] S.K. Sowe, I.G. Stamelos, G.I. Samoladas (Eds.), Emerging Free and Open Source Software Practices, IGI Global, 2007. [3] W. Scacchi, J. Feller, B. Fitzgerald, S.A. Hissam, K. Lakhani, Understanding free/open source software development processes, Softw. Process. Improv. Pract. 11 (2) (2006) 95–105. [4] G. Gousios, E. Kalliamvakou, D. Spinellis, Measuring developer contribution from software repository data, in: MSR ’08: Proceedings of the 2008 International Workshop on Mining Software Repositories, ACM, 2008, pp. 129–132. [5] S.K. Sowe, A. Cerone, Integrating data from multiple repositories to analyze patterns of contribution in FOSS projects, in: Proceedings of OpenCert 2010, in: Electronic Communications of the EASST, vol. 33, 2010, pp. 442–460. [6] SQO-OSS, Novel Quality Assessment Techniques. Deliverable Report-D7, Technical Report, Software Quality Observatory for Open Source Software. Project Number: IST-2005-33331, 2009. [7] A. Capiluppi, P. Lago, M. Morisio, Characteristics of open source projects, in: CSMR ’03: Proceedings of the Seventh European Conference on Software Maintenance and Reengineering, IEEE Computer Society, Washington, DC, USA, 2003, p. 317. [8] S.K. Sowe, I. Samoladas, I. Stamelos, L. Angelis, Are FLOSS developers committing to cvs/svn as much as they are talking in mailing lists? challenges for integrating data from multiple repositories, in: 3rd International Workshop on Public Data about Software Development (WoPDaSD). September 7th–10th 2008, Milan, Italy. [9] SQO-OSS, Overview of the State of the Art. Deliverable Report-D2, Technical Report, Software Quality Observatory for Open Source Software. Project Number: IST-2005-33331, 2009. [10] I. Stamelos, L. Angelis, A. Oikonomou, G. Bleris, Code quality analysis in open-source software development, Inf. Syst. J. 12 (1) (2002) 43–60. 2nd Special Issue on Open-Source, Blackwell Science. [11] S.A. Shaikh, A. Cerone, Towards a metric for open source software quality, in: Electronic Communications of the EASST, in: Foundations and Techniques for Open Source Certification 2009, vol. 20, 2009. [12] A. Cerone, Learning and activity patterns in OSS communities and their impact on software quality, in: Proceedings of OpenCert 2011, in: Electronic Communications of the EASST, vol. 48, 2011. [13] A. Cerone, S. Fong, S.A. Shaikh, Analysis of collaboration effectiveness and individuals’ contribution in FLOSS communities, in: Proceedings of OpenCert 2011, in: Electronic Communications of the EASST, vol. 48, 2011. [14] A. Cerone, D. Settas, Using antipatterns to improve the quality of FLOSS development, in: Proceedings of OpenCert 2011, in: Electronic Communications of the EASST, 2012, vol. 48, 2011. [15] J.M. Dalle, M. den Besten, Different bug fixing regimes? a preliminary case for superbugs, in: J. Feller, B. Fitzgerald, W. Scacchi, A. Sillitti (Eds.), Open Source Development, Adoption and Innovation, in: IFIP International Federation for Information Processing, vol. 234, Springer, 2007, pp. 247–252. [16] T. Zimmermann, R. Premraj, A. Zeller, Predicting defects for eclipse, in: PROMISE ’07: Proceedings of the Third International Workshop on Predictor Models in Software Engineering, IEEE Computer Society, Washington, DC, USA, 2007, p. 9. [17] F. Jensen, Bayesian Networks and Decision Graphs, Springer, 2001. [18] D. Settas, S.K. Sowe, I. Stamelos, Detecting similarities in antipattern ontologies using semantic social networks: implications for software project management, Knowl. Eng. Rev. 24 (2009) 287–307. [19] FLOSSMetrics—free/libre/open source metrics and benchmarking (www.flossmetrics.org), 2006–2009. Last accessed: April 18, 2011. [20] A. Mockus, R. Fielding, J. Herbsleb, Two case studies of open source software development: Apache and Mozilla, Trans. Softw. Eng. Methodol. 11 (3) (2002) 1–38. [21] G. Kuk, Strategic interaction and knowledge sharing in the KDE developer mailing list, Manag. Sci. 52 (2006) 1031–1042. [22] T.T. Dinh-Trong, J.M. Bieman, The FreeBSD Project: a replication case study of open source development, IEEE Trans. Softw. Eng. 31 (2005) 481–494. [23] B. Massey, Longitudinal analysis of long-timescale open source repository data, in: PROMISE ’05: Proceedings of the 2005 Workshop on Predictor Models in Software Engineering, ACM, New York, NY, USA, 2005, pp. 1–5. [24] G. Robles, J. Gonzalez-Barahona, Contributor turnover in libre software projects, in: E. Damiani, B. Fitzgerald, W. Scacchi, M. Scotto, G. Succi (Eds.), IFIP International Federation for Information Processing, in: Open Source Systems, vol. 203, Springer, Boston, 2006, pp. 273–286. [25] S.K. Sowe, I. Stamelos, L. Angelis, Understanding knowledge sharing activities in free/open source software projects: an empirical study, J. Syst. Softw. 81 (3) (2008) 431–446. [26] F. Brooks, The Mythical Man-Month. Essays on Software Engineering, Addison-Welsey Publishing, 1975. [27] J. Beaver, X. Cui, J. Charles, T. Potok, Modeling success in FLOSS project groups, in: PROMISE ’09: Proceedings of the 5th International Conference on Predictor Models in Software Engineering. [28] S. Haefliger, G. Krogh, S. Spaeth, Code reuse in open source software, J. Manage. Sci. 54 (2008) 180–193. [29] L. Yuyang, P.C. Wooi, K. Byung-Ki, P. Hyukro, Predict software failure-prone by learning Bayesian network, Int. J. Adv. Sci. Technol. 1 (2008) 35–42. [30] L. Yuqing, Z. Tong, Bayesian network to construct interoperability model of open source software, in: Proceedings of the 2008 International Conference on Computer Science and Software Engineering - Vol. 03, CSSE ’08, IEEE Computer Society, Washington, DC, USA, 2008, pp. 758–761. [31] S. Wagner, A Bayesian network approach to assess and predict software quality using activity-based quality models, Inf. Softw. Technol. 52 (2010) 1230–1241. [32] T. Kremenek, From uncertainty to bugs: inferring defects in software systems with static analysis, statistical methods, and probabilistic graphical models, Ph.D. Thesis, Stanford, CA, USA, 2009. [33] LibraSoft, Cvsanaly2; https://projects.libresoft.es/wiki/cvsanaly, 2011. Last accessed: Apr 15, 2011. [34] J. Amor, G. Robles, J. Gonzalez-Barahona, Discriminating development activities in versioning systems: a case study, in: Proceedings PROMISE 2006: 2nd. International Workshop on Predictor Models in Software Engineering co-located at the 22th International Conference on Software Maintenance (Philadelphia, Pennsylvania, USA). [35] G. Robles, J. Gonzalez-Barahona, D. Cortazar, I. Herraiz, Tools for the study of the usual data sources found in libre software projects, Int. J. Open Source Softw. Process. 1 (2009) 24–45. [36] LibreSoft, Mlstats; https://projects.libresoft.es/projects/show/mlstats, 2011. Last accessed: Apr 15, 2011.
S.K. Sowe et al. / Science of Computer Programming (
)
–
17
[37] S.R.Q. Jones, G. Ravid, Information overload and the message dynamics of online interaction spaces: a theoretical model and empirical exploration, Inf. Syst. Res. 15 (2) (2004). [38] S.K. Sowe, L. Angelis, I. Stamelos, Y. Manolopoulos, Using repository of repositories (RoRs) to study the growth of F/OSS projects: a meta-analysis research approach, in: Open Source Development, Adoption and Innovation, in: IFIP International Federation for Information Processing, vol. 234/2007, Springer, Boston, 2007, pp. 147–160. [39] C. Bird, A. Gourley, P. Devanbu, M. Gertz, A. Swaminathan, Mining email social networks, in: MSR ’06: Proceedings of the 2006 International Workshop on Mining Software Repositories, ACM Press, New York, NY, USA, 2006, pp. 137–143. [40] U. Sekaran, Research Methods for Business: A Skill Building Approach, fourth ed., Wiley, 2006. [41] G. Madey, V. Freeh, R. Tynan, The open source software development phenomenon: an analysis based on social network theory, in: Americas conf. on Information Systems (AMCIS2002), pp. 1806–1813. [42] M. Norusis, Statistical Procedures Companion, Prentice Hall, Inc., 2004. [43] J. Cheng, PreProcessor System, 2000. http://www.cs.ualberta.ca/jcheng/prep.htm. [44] J. Cheng, PowerConstructor System, 1998. http://www.cs.ualberta.ca/jcheng/bnpc.htm. [45] M. Druzdzel, Genie2.0 system, 2010. [46] J. Long, Understanding the role of core developers in open source development, J. Inf. Inf. Technol. Organ. 1 (2006) 75–85. [47] S. Valverde, G. Theraulaz, J. Gautrais, V. Fourcassie, R.V. Sole, Self-organization patterns in wasp and open source communities, IEEE Intell. Syst. 21 (2006) 36–40.