Journal of Informetrics 11 (2017) 498–510
Contents lists available at ScienceDirect
Journal of Informetrics journal homepage: www.elsevier.com/locate/joi
Patterns of authors contribution in scientific manuscripts Edilson A. Corrêa Jr. a , Filipi N. Silva b , Luciano da F. Costa b , Diego R. Amancio a,∗ a b
Institute of Mathematics and Computer Science, University of São Paulo, PO Box 668, 13560-970 São Carlos, Brazil São Carlos Institute of Physics, University of São Paulo, PO Box 369, 13560-970 São Carlos, SP, Brazil
a r t i c l e
i n f o
Article history: Received 1 October 2016 Received in revised form 6 March 2017 Accepted 6 March 2017 Keywords: Authorship contributions Science of science Entropy Text mining
a b s t r a c t Science is becoming increasingly more interdisciplinary, giving rise to more diversity in the areas of expertise. In such a complex environment, the participation of authors became more specialized, hampering the task of evaluating authors according to their contributions. While some metrics were adapted to account for the order (or rank) of authors in a paper, many journals are now requiring a description of their specific roles in the publication. Surprisingly, the investigation of the relationships between credited contributions and author’s rank has been limited to a few studies. Here we analyzed such a kind of data and show, quantitatively, that the regularity in the authorship contributions decreases with the number of authors in a paper. Furthermore, we found that the rank of authors and their roles in papers follow three general patterns according to the nature of their contributions: (i) the total contribution increases with author’s rank; (ii) the total contribution decreases with author’s rank; and (iii) the total contribution is symmetric, with most of contributions being performed by first and last authors. This was accomplished by collecting and analyzing the data retrieved from PLoS One and by devising a measurement of the effective number of authors in a paper. The analysis of such patterns confirms that some aspects of the author ranking are in accordance with the expected convention, such as the first and last authors being more likely to contribute more diversely in a scientific work. Conversely, such analysis also revealed that authors in the intermediary positions of the rank contribute more in specific roles, such as collecting data. This indicates that the an unbiased evaluation of researchers must take into account the distinct types of scientific contributions. © 2017 Elsevier Ltd. All rights reserved.
1. Introduction Scientific publications have been consolidated as the most efficient means of communication and dissemination of discoveries/findings in science, as well as being an indicator of success/productivity of a given researcher, evidenced by quantitative measures, for instance the h-index (Amancio, Oliveira, & Costa, 2012c; Hirsch, 2005; Schreiber, 2015) and its variants (such as m-index (Hirsch, 2005) and g-index (Egghe, 2006)). Even though these and other measures are used to evaluate researchers, their indiscriminate use is still controversial given the existence of problems such as lack of informative context (Wendl, 2007) and the presence of possible bias towards nationality, gender and age (Kelly & Jennions, 2007; Mishra, 2008). In addition, the use of citations based indexes are oftentimes very dependent on authors’ research field, which makes comparisons of individuals prominence across distinct disciplines a hard task (Anauati, Galiani, & Gálvez, 2016; Hutchins, Yuan, Anderson, & Santangelo, 2016; Ruiz-Castillo & Waltman, 2015; Viana, Amancio, & Costa, 2013; Waltman
∗ Corresponding author. E-mail address:
[email protected] (D.R. Amancio). http://dx.doi.org/10.1016/j.joi.2017.03.003 1751-1577/© 2017 Elsevier Ltd. All rights reserved.
E.A. Corrêa Jr. et al. / Journal of Informetrics 11 (2017) 498–510
499
& van Eck, 2013; Waltman, Yan, & van Eck, 2011). Another limitation in evaluating authors according to the number of citations motivated by their papers is that the contributions of authors in each work is mostly overlooked, as in traditional indexes such as the h-index all authors usually receive the same credit (Sekercioglu, 2008; Zhang, 2009). In recent years, however, some attempts have been proposed to take into account additional information such as the number of authors in papers (Bornmann, Mutz, & Daniel, 2008) and the consideration of authors rank (Tscharntke, Hochberg, Rand, Resh, & Krauss, 2007). The evaluation of authors contribution in scientific articles has drawn attention in recent years, mainly because of its implication in the author credit system (Greene, 2007) for quantifying authors roles in a scenario of increasing number of multi-authored papers (Regalado, 1995). The precise specification of authors contribution is essential, for example, to give credit for the idea/knowledge being conveyed in a scientific manuscript, especially in areas where the amount of authors makes unclear how each author provided its own particular contribution. The exact specification also allows for the identification of authors’ responsibilities, which might be useful to identify who is behind a scientific fraud (Wray, 2006). The growing interest in the quantification of authors role in every step of the research led several journals to adopt policies encouraging authors to disclose information regarding individual contributions. This is the case of journals such as Nature,1 Science,2 PNAS,3 PLoS One4 and Journal of Informetrics5 . This information allows for the quantification of unprecedented patterns, such as the variability of contributions, the relationship between rank (position in the list of authors) and the amount of contributions. More important than that, the knowledge of authors contribution may lead to a more informed quantification of researchers abilities, thus improving the author credit system as a whole. Despite the growing interesting in acknowledging contributions in pieces of articles, only a few studies have probed the general patterns of contributorship in scientific manuscripts. One of the first large-scale studies on this subject was conducted by Larivière et al. (2016). This study identified several interesting contributorship patterns in papers obtained from PLoS journals. Concerning the homogeneity (or regularity) of contributions, the labor division in medical discipline papers was found to be more frequent than in other disciplines. The seniority of authors was also found to play a different role in scientific manuscripts, as young scientists are oftentimes associated with technical tasks, while senior researchers are more engaged in conceptual contributions. The work in Larivière et al. (2016) also observed an U-shaped relationship between the total amount of contributions and author byline, thus showing than intermediary authors contribute less than than first and last authors. In Haussler and Sauermann (2014), the authors focus on the analysis of labor division in papers written by team members, by analysing similarities and differences between knowledge construction and existing theories created to understand general production systems (Hamilton, Nickerson, & Owan, 2003). An empirical analysis over 13,000 articles providing individual contribution statements revealed that some activities are oftentimes performed by specialized authors, which is the case of contributions associated to data analysis and manuscript writing. Another interesting finding revealed that the division of labor increases with team size, while the number of authors occupied in conceptual contributions grows slowly as the team size increases. Concerning the heterogeneity of contributions, the authors found that such quantity takes high values. However, the average individual usually perform half of all activities. In this paper, we analyze the information describing the authors contributions present in research papers. More specifically, we are interested in analyzing the variability of authors contributions. Differently from the previous studies approaching the contributorship subject, here we introduce a framework to compute the variability of contributions in a single paper using concepts from network science. As we shall show, the simple proposed framework can be used to measure the symmetry of contributions as a proxy to quantify the homogeneity in labor division. Particularly, we are also interested in measuring how many authors effectively contribute to a manuscript, a piece of information that can be further used to normalize various bibliometric indexes (Ruiz-Castillo & Waltman, 2015). As an effective contribution, we consider the main tasks associated to the creation of a scientific paper: (i) data analysis, (ii) experiment design, (iii) experiment realization, (iv) data collection, and (v) manuscript writing. These five contributions are listed in several journals requiring the specification of authors’ individual contributions. Here, the effective number of authors takes into consideration the homogeneity (or regularity) of the labor distribution among the authors of a paper. This is quantified in terms of the true diversity, which can be understood as a normalization for the Shannon entropy (Jost, 2006). Upon devising a simple framework to handle the contributions of authors as a bipartite network, we propose a measurement to quantify the effective number of authors in a paper. Unlike previous works, here we focused our analysis on a single mega journal, PLoS One, which is dominated by biomedical research papers. This particular choice of dataset allowed us to verify if the contributorship variability patterns also appears in a set of papers less heterogeneous (in comparison to datasets comprising papers of very different specialized disciplines). We found that the effective number of authors, in general, is a linear function of the total number of authors. Most importantly, we could identify that the deviation between the actual and the effective number of authors increases with the total number of authors. Using a measure to quantify
1 2 3 4 5
http://www.nature.com/nature. http://www.sciencemag.org. http://www.pnas.org. http://journals.plos.org/plosone/. http://www.journals.elsevier.com/journal-of-informetrics.
500
E.A. Corrêa Jr. et al. / Journal of Informetrics 11 (2017) 498–510
the symmetry/diversity of contributions in papers, we also found out that the regularity of contributions decreases as the number of authors increases from 1 to 10. Another object of study in this manuscript concerned the specialization of authors in specific tasks. While it has been found in the literature that first and last authors are the ones who most contribute, we found out that this may not be true if we analyze specific contributions. This study revealed the existence of three general patterns that govern how authors are ranked according to their specific roles in scientific publications. In the first pattern, we observed the specialization of first authors in performing experiments, a fact already noticed in previous works (Larivière et al., 2016). The second pattern describes high contributions of second-to-last authors in collecting data for analysis. Finally, the third pattern describes a high level of participation of first and last authors in writing the manuscript. This paper is organized as follows. Section 2 describes the considered dataset and the framework to represent and evaluate the contributions of authors in papers. Section 3 details the main results of this paper. Finally, in Section 4 we present the conclusions of the study.
2. Methodology In this section, we briefly describe the dataset used in our analysis and the pre-processing steps applied to obtain authors contributions in scientific articles. We also describe a measure to quantify the irregularity of individual contributions in papers in terms of the following quantities: the effective number of authors; and the symmetry of contributions.
2.1. Database In order to probe the authors contribution patterns in scientific papers, we created a dataset comprising articles retrieved from the PLoS One journal. We particularly chose this journal because it spans many disciplines and provides parseable information concerning authors contribution in each paper. We also chose this journal because it provides a large number of papers, which makes the analysis less prone to undesirable small sample effects. We retrieved 82,981 articles, which corresponds to all papers published between 2006 and 2014. For each article, we recovered the authors’ names and the full list of contributions. To investigate the relationship between contributions and rank, we also recovered the position of authors in the list of authors. Concerning the retrieved information, only the list of contributions had to be processed. In PLoS One, contributions are associated with the acronym of authors’ names. Because there is no clear rule about generating acronyms from names, authors may generate acronyms in distinct ways. To obtain an association between acronyms and actual names, the following procedure was adopted: (i) we created an acronym for each author based on the capital letters of his/her name; (ii) we associated perfect matches between provided acronyms and the ones generated in step (i); (iii) if a perfect match is not obtained in the previous step, a similarity measure was used in order to associate acronyms. More specifically, we used the Tanimoto’s distance because of its widespread use in comparing strings and acronyms (Tanimoto, 1958). This last step was performed in 21.4% of all names obtained in our dataset. To simplify our analysis, we first captured all possible contributions informed by authors. Then, these items were reduced to a smaller set comprising only the most frequent contributions. This set includes the following contributions, as informed by authors: (i) analyzed the data, (ii) collected the data, (iii) conceived the experiments, (iv) performed the experiments, (v) wrote the paper, and (vi) revised the manuscript. In particular, we disregarded contribution (vi) because it appeared only in 0.30% of all papers. We have also clustered very similar contributions to one of the above mentioned items. The groups of equivalent or similar contributions are: 1 Analyzed the data: this contribution also includes the following items: “analyzed the data”, “interpretated the data”, “statistical analysis”, “performed a statistical analysis”, “interpreted the results”, “data interpretation” and “contributed to the discussion”. 2 Conceived the experiments: this contribution also includes the following items: “conceived and designed the experiments”, “designed the software used in the analysis”, “designed the study”, “designed the experiments”, “conceived and designed the study”, 3 Performed the experiments: this contribution refers to a single item: “performed the experiments”. 4 Wrote the paper: this contribution also includes the following items: “wrote the paper”, “wrote the manuscript”, and “contributed writing the manuscript”. 5 Collected the data: this contribution also includes the following items: “contributed with reagent materials and analysis tools” and “collected the data”. Another pre-processing step applied to obtain relevant information from the original dataset concerns the consideration of institutions (universities, departments, etc) as collaborators. Because no specific contribution is provided for institutions, we have ignored all papers with this type of information. In addition, we have also disregarded all articles with no explicit
E.A. Corrêa Jr. et al. / Journal of Informetrics 11 (2017) 498–510
501
Contributions Analyzed the data
Authors 1st author
Collected the data Conceived the experiments
2nd author Performed the experiments 3rd author
Wrote the paper
Fig. 1. Example of bipartite network representing the relationship between authors and contributions. The list of all possible contributions was obtained from the information provided by authors who published in the PLoS One journal. Note that the total amount of authors and particular contributions vary from article to article.
information concerning authors contribution. In total,15,787 articles were removed from the analysis. The pre-processed dataset is available for download in our website.6 2.2. Effective number of authors and symmetry of contributions The analysis of the contribution of a particular author can be accomplished in several ways (Slone, 1996), as our dataset directly indicates which contributions an author may have, as shown in Fig. 1. The information provided in each article of our dataset can be regarded as a bipartite network (Newman, 2010), i.e. a network with links established only among nodes belonging to distinct groups. As illustrated in Fig. 1, the bipartite network derived from each paper establishes links between authors and its possible contributions. A interesting feature of bipartite networks representing contributions is the regularity of contributions in a paper. For simplicity’s sake, we consider that all contributions are equally important and, for this reason, we study regularity in terms of the diversity of contributions made by authors. To measure how regular is the contribution of authors in papers, we used a measure inspired in the accessibility concept, a centrality measurement employed in network science (Amancio, 2015; Travenc¸olo & Costa, 2008). This measurement was originally devised (Travenc¸olo & Costa, 2008) to capture the effective number of accessed nodes when an agent performs a random walk from a starting node. Differently from traditional measurements, such as the node degree (ki ), the accessibility use network features that go beyond the simple static network topology. The accessibility uses the probability of reaching each adjacent node to quantify the effective number of neighbors. Let pij the probability of a random walker to go from node i to node j. The accessibility ˛i is given by
⎛
˛ = exp ⎝−
⎞
pij log pij ⎠ ,
(1)
j
where the sum is performed for all neighbors such that pij > 0. In other words, the accessibility is the exponential of the (max)
= ki , which entropy of the distribution of p. As a consequence, the maximum value that the accessibility can obtain is ˛i occurs when pi1 = pi2 = . . . = piki (Travenc¸olo & Costa, 2008). Whenever the distribution of p is uneven, the random walkers tend to access preferentially specific nodes, thus decreasing effective number of accessed nodes (Travenc¸olo & Costa, 2008). This effect is illustrated in Fig. 2. In (a), the central node has an accessibility value ˛A = kA = 6, as all six neighbors are equally accessed in a random walk (edges weight are equally distributed). Conversely, in (b), the distribution of visits is more concentrated in two neighbors, because the strength of their links are much higher than the strength linking A and other neighbors. Owing to the large deviation in the distribution of p in this scenario, the effective number of neighbors drops to ˛A = 3.52, as defined in Eq. (1). Here we show that the accessibility in Eq. (1) is a particular case of the well-known Shannon diversity (or true diversity) index D(q) (Jost, 2006), which measures the effective number of types in a distribution. If there are R types, and each has a probability i of appearance, the true diversity can be written as D(q)
6
1 = = Mq−1
R 1/(1−q) q i
,
i=1
PLoS One dataset: http://cyvision.ifsc.usp.br/patternsauthors (18 sep 2016).
(2)
502
E.A. Corrêa Jr. et al. / Journal of Informetrics 11 (2017) 498–510
10 1
A
1 10
A 1
1
(b)
(a)
Fig. 2. Quantification of the accessibility (a) in a network whose links possess the same weight; and (b) in a heterogeneous network. In (a), a random walker leaving node A access all neighbors with the same probability. As such, the accessibility is the same as the node degree, i.e. ˛A = kA = 6. In (b), two nodes receives most of the random walkers leaving node A. Therefore, the effective number of neighbors drops to ˛A = 3.52, according to Eq. (1). Similar preferential random walks used applied to bibliographic analysis have been proposed e.g. in Amancio, Nunes, Oliveira, and Costa (2012), Ghosh, Kuo, Hsu, Lin, and Lerman (2011).
where q is the exponent associated to the weighted generalized mean in Eq. (2) and serves to measure the sensitivity of the index to low differences between i ’s. When q = 1, Eq. (2) is not defined; however the mathematical limit can be obtained
lim D(q) = lim
q→1
R 1/(1−q) q i
q→1
⎛ =
R
1
i=1
i=1
i
i
= exp ⎝−
⎞ i log i ⎠ ,
(3)
j
which matches the definition of accessibility in Eq. (1). Note that we have used q = 1 because this allows for a simple interpretation of the diversity measure in terms of network quantities, as it has been extensively done in several works (Amancio, Oliveira, & Costa, 2012d; Silva, Amancio, Bardosova, Costa, & Oliveira, 2016; Travenc¸olo & Costa, 2008). The proposed measure for quantifying the effective number of authors in a scientific article takes its maximum value when all authors contribute equally to the study, a scenario similar to the network configuration in Fig. 2(a). In a paper, if a given author performs most of the contributions and many authors only perform a single task, the effective number of authors drops to a value close to 1. Consequently, the effective number the authors takes low values whenever the distribution of authors contributions is uneven. The contribution of authors is quantified using the following representation of a bipartite network. Let B be the matrix storing the relationship between authors and contributions, i.e.
Bij =
1, if the i − th author performs the j − th contribution; 0,
otherwise.
(4)
In the example provided in Fig. 1, B2j = 1 only for j = “ collected the data and j = “ conceived the experiments . The total contribution of a given author a is given by
wB j j aj ca = , i
j
(5)
wj Bij
where wj is the weight associated to the j-th contribution. In this study, we consider all contributions equally important and, for this reason, we set wj = 1 for all contributions. Note that the contribution of each author (ca ) defined in Eq. (5) ranges in the interval 0 ≤ ca ≤ 1. Therefore, we can measure the diversity of the distribution of cA using the entropy H for all elements in the set of authors A: H=−
ca log ca .
(6)
a∈A
As proposed in the definition of the accessibility in Eq. (1), the definition of the effective number of authors according to the diversity of contributions can then be defined as
N = eH = exp
−
a∈A
ca log ca
.
(7)
E.A. Corrêa Jr. et al. / Journal of Informetrics 11 (2017) 498–510
503
Fig. 3. Histogram of number of authors in the PLoS One dataset reveals a irregular distribution: most of the papers includes the collaboration of 2–9 authors. Interestingly, we also observe a significant number of papers including more than 10 authors, which is in accordance with recent reported trends (Regalado, 1995; Wuchty, Jones, & Uzzi, 2007). Table 1 Frequency of appearance for each type of contribution considering all paper of the dataset. Each author was counted as a distinct occurrence, even if they appear in more than one paper of the dataset. Contribution
Fraction
Analyzed the data Performed the experiments Conceived experiments Wrote the paper Collected the data
49.97% 48.47% 45.33% 42.08% 34.07%
In our analysis, we also probed how contributions varies across authors using a normalization of the accessibility defined in Eq. (1). The normalized accessibility, referred to as symmetry of contributions, takes a range of values restricted in the interval [0, 1] and is defined as eH 1 = exp = nA nA
−
ca log ca
,
(8)
a∈A
where nA is the total number of authors in the paper. Note that is a symmetry measure because it reaches its maximum value when all authors contribute equally to the paper. 3. Results and discussion We start our analysis by quantifying specific statistics of our dataset (see Section 2.1). The distribution of the number of authors per paper is shown in Fig. 3. Note that only a few papers contain a single author, while the most common scenario are papers from 2 to 10 authors. This is perhaps the common scenario in most academic fields, where shared authorship contributes to a wider view on the academic research. However, some criticism has been put on paper authored by many authors, as this pattern of collaboration may not reflect major author contributions. In fact, multi-authored papers may contribute to boost authors individual performance, as authors with minor contributions in several papers may easily benefit from the attributed authorship (Sekercioglu, 2008). Such bias towards authors with minor contributions may even occur when other factors such as number of citations is analyzed. To clarify specific contributions in multi-authored papers, individual contributions have been described in several journals. In our dataset, the main contributions (according to their frequency in articles) are shown in Table 1. Almost half of all authors contributed to analyze the data. A similar percentage was obtained for the contributions “perform the experiments” and “conceived the experiments”. A slight smaller percentage of authors contributed to write the manuscript. Finally, about one third of all authors collected the data for the study. To investigate how individuals contributions varies in papers, we used the effective number of authors (N) as a measure of variability, as defined in Section 2.2. Since the value of ˛ varies according to the total (actual) number of authors (nA ), we show separately the values of N for each nA . In Fig. 4, the red dotted line is the reference curve N = nA and the blue circles denote the points observed in our dataset. When nA = 1, N = 1, as one should expect from Eq. (1). When nA increases, the effective number of authors also increases, thus confirming a strong correlation between these quantities. The largest deviations between these quantities (i.e. nA − N) were found for the papers authored by many authors. Note e.g. that, in general, contributions in papers authored by 22 authors are so irregular that one can consider that, in average, contributions are effectively performed by 20 authors. Considering this set of articles comprising 22 authors, the paper with the most irregular distribution of contributions has an effective number of authors of of only about 17 authors. Despite these discrepancies, we can conclude that in a typical paper authored by 1–10 authors, the difference between nA and N is very small, as the differences in amount contributions performed in these cases is not significant.
504
E.A. Corrêa Jr. et al. / Journal of Informetrics 11 (2017) 498–510
Fig. 4. Effective number of authors (N) as a function of the actual number of authors (nA ). Because in some cases some authors contribute more than others, the effective number of authors is lower than the total amount of authors in the paper. The highest deviations occur for the papers authored by many authors. (For interpretation of the references to color in text near the reference citation, the reader is referred to the web version of this article.)
Symmetry of contributions
1.00 0.95 0.90 0.85 0.80 0.75 0.70
0
5
10
15
20
25
Number of authors Fig. 5. Symmetry of contributions in papers () as a function of the total number of authors. For papers comprising 1–10 authors, the symmetry is inversely proportional to the number of authors. For papers comprising more than 10 authors, the symmetry of contributions is practically constant.
The irregularity of contributions was also investigated in terms of the symmetry of contributions, as defined in Eq. (8). In Fig. 5, we show, for each value of nA , the corresponding value of symmetry. The red dotted line represents the curve obtained by linking the points representing the average symmetry obtained for each nA . As expected, the symmetry takes its maximum value when nA = 1. The average symmetry monotonically decreases when the number of authors goes from nA = 1 to nA = 10. This means that contributions become more irregular with the number of authors. However, when more authors run in the authorship list, the symmetry of contributions remains almost constant. Owing to the large number of papers authored by 2–10 authors, outliers are more common in this subset of articles. This is evident if we note that values of symmetry ≤ 0.80 are not frequent in articles with more than 10–15 co-authors. A recurrent problem in assigning credit to researchers concerns the choice of adequate rankings according to authors specific contributions. Although using the rank of authors may lead to a good approximation of the real credit that a author deserves, there is no straightforward manner to perform ranking (Laurance, 2006). Some studies suggest good practices to rank authors, such as giving important credits to first and last authors, with distinct roles (Tscharntke et al., 2007). It has been suggested that first authors are the ones making the greatest contributions, while last authors are the ones who designed and proposed the study (Tscharntke et al., 2007). While this assertive remains true across most of disciplines, there is no major consensus on how to rank intermediary co-authors. For this reason, in this study, we also analyzed the contributions
E.A. Corrêa Jr. et al. / Journal of Informetrics 11 (2017) 498–510
505
Fig. 6. Contributions of authors in scientific papers as a function of their rank. The results are shown considering the following number of authors: (a) 2; (b) 3; (c) 4; and (d) 8. Additional figures are shown in the Supplementary Information (Fig. S1). The total contributions, ca , is given in Eq. (5). Interestingly, both first and last authors are the ones with the largest number of distinct contributions.
Fig. 7. Average author contribution as a function of its ranking. In general, the first and last authors are the ones who most contribute to the paper (quantified in number of distinct contributions). (For interpretation of the references to color in text near the reference citation, the reader is referred to the web version of this article.)
as a function of ranks to identify if there is a implicit factor leading the organization of rankings according to contributions. We also investigate how first and last authors compare to intermediary authors in terms of specific contributions. In Fig. 6, we show the distribution of contributions made by each author according to their rank. The results are organized by the total number of authors considered in Fig. 6, with papers authored by 2 (a), 3 (b), 4 (c) and 8 (d) authors (additional results are shown in the Supplementary Information, Figs. S2 and S3). In Fig. 6(a), as expected (Aziz & Rozing, 2013; Biswal, 2014), it is evident that first authors, in general, make more contributions than last authors. Nonetheless, the amount of contributions are not very different, since, in average, first and last authors make, about 60% and 40% of the contributions, respectively. When more authors are included, one can observe a very similar pattern: while first authors make most of the contributions, last authors usually are ranked as the second most contributive authors. Interestingly, disregarding first and last authors, the remaining ranking of authors reflects the amount of contributions made, i.e. second authors make more contributions than third authors, who in turn make more contributions than the fourth author and so forth. These patterns can also observed in Fig. 7, which summarizes the average contribution made in terms of authors ranking. First authors (upper blue curve) always make most of the contributions, while last authors usually appear in the second position in the ranking of contributions. As the number of authors increases, however, there is not a large difference in contributions among the authors located from fourth to second to last positions. The relationship between contributions and rankings in authorship lists may not be clear when one analyzes intermediary rankings. It is conjectured that, in general, first authors are responsible for performing experiments, while last authors usually supervise the research. However, guidelines for ranking authors are not always strictly followed, and therefore there is no widespread evidence in relating ranking of intermediary co-authors and specific contributions. To investigate the presence of patterns in ranking intermediary co-authors according to the type of their contributions, we show, in Fig. 8, the total amount of authors in a particular ranking who made specific contributions. In Fig. 8(a), we show that in papers authored
506
E.A. Corrêa Jr. et al. / Journal of Informetrics 11 (2017) 498–510
Fig. 8. Authors’ contributions organized by type of contributions and ranks. The number of authors considered are: (a) 2 authors; (b) 3 authors; (c) 4 authors; and (d) 8 authors. Additional figures are shown in the Supplementary Information (Figs. S2 and S3). The five most frequent contributions can be classified into three distinct patterns by taking into the account the relationship between amount of contribution made and authors rank.
by only 2 authors, both authors usually collect the data, write the paper and design the experiments in similar proportions. However, in most cases, first authors are responsible for performing the experiments, as one should expect. Interestingly, data analysis is mostly performed also by first authors. Specific contributions made by authors in papers with 3 authors is shown in Fig. 8(b). Note that, when comparing contributions of first and last authors, the proportions of contributions are very similar. The intermediary author also makes an intermediary contribution in performing the experiments. Concerning the data analysis and acquisition, the contribution of intermediary and last authors are very similar. However, when considering paper writing and experiment design, the intermediary author usually make less contributions than main authors. Similar patterns of contributions have also been found for papers co-authored by 4 authors (see Fig. 8(c)). In papers co-authored by many authors, three patterns of contributions could be identified (see e.g. Fig. 8(d) or Figs. S1 and S2 of the Supplementary Information): 1 Pattern A: The total amount of author contributions decreases with its ranking. This is the case of the contribution “Performed the experiments”: first authors are the ones making most of this type of contribution, while last authors are usually the ones with the lowest contributions on preparing experiments. This pattern reveals the specialization of authors in specific tasks: first authors are specialized in performing experiments. This effect in accordance with previous patterns described for specific fields (Larivière et al., 2016). 2 Pattern B: The total amount of author contributions increases from first to second-to-last authors. The last author performs an intermediary amount of contribution: he/she typically contributes more than first authors, but typically less than second-to-last authors. This pattern occurs for the contribution related to data collection and is more clear for papers authored by many authors (see e.g. Fig. 8(d)). 3 Pattern C: The total amount of author contributions displays a symmetric behavior as a function of author ranking. From the first to the middle position, the amount diminishes; while from middle to last positions, the amount increases. In other words, authors located in border positions (e.g. first, second, second-to-last and last positions) makes more contributions than intermediary authors. This type of behavior occurs for the following contributions: data analysis, manuscript writing and experiment design. The presence of three distinct patterns evidence that rankings and contributions are strongly related. These patterns also confirms that particular authors tend to make specific contributions. While most credit is devoted to first and last authors, here we show that intermediary authors may also contribute with a major frequency in contributions characterized by
E.A. Corrêa Jr. et al. / Journal of Informetrics 11 (2017) 498–510
507
patterns A and B. We note here that the above mentioned patterns are weakly dependent on the subfields represented in the PLoS One journal, as detailed in Appendix A. However, it has been shown in the literature that contributorship patterns may vary considerably if very distinct fields are compared (Larivière et al., 2016).
4. Conclusion The authorship of articles has gained attention in recent years, mainly due to lack of protocols on how the authorship should be treated. It is well known that several journals make available guidelines on authorship, but even with this information the existence of gift and ghost authorships is still a problem (Tscharntke et al., 2007). As the authorship in papers confers credit and has important academic, social and financial implications, a solution that has been adopted by several scientific journals is the identification of the role of each author in preparing a scientific paper. This solution has allowed the study of quantitative aspects of authors contribution. For example, in Larivière et al. (2016), the authors studied the contribution information provided in the PLoS One journal to address interesting features of contributorship. In their work, they focused on the analysis of the unevenness of contributions for distinct areas of science under a comparative approach. They found that the division of contributions is unbalanced in most of the disciplines and this lack of symmetry in labor division depends on each subject area. Another interesting finding arising from their study is the fact that there is a strong relationship between the contribution made by an author and his/her academic age. In this study, we complemented previous studies on contributorship by treating the relationship between authors and contributions as a bipartite network. Using a large dataset recovered from the PLoS One journal, we have first probed the symmetry of contributions as a function of the total number of authors in the paper. The symmetry/homogeneity of contributions obtained its highest average value in papers authored by 2 authors, and it diminished when the number of authors increased from 2 to 10. Interestingly, we have also found that there is no significant difference in the average symmetry of contributions when considering papers authored by more than 10 authors. We have also devised a measure to quantify the effective number of authors based on a true diversity measure. This measure could be used to correct the estimated number of authors in bibliometrics indexes and algorithms, since the number of authors plays a prominent role to normalize many variables (Amancio, Oliveira, & Costa, 2012b; Rahman, Mac Regenstein, Kassim, & Haque, 2017; Waltman, 2016). In this work, we have also found that there is a strong relationship between authors ranking and the total of contribution made by authors. We have confirmed that, in general, first and last authors make the most of the contributions to a scientific manuscript. Despite this well-known pattern, we have also identified that patterns of contributions can be described in a threefold manner, depending on the relationship between the amount of contributions and authors ranking. For example, intermediary authors are less probable to perform experiments when compared to first authors, however, intermediary authors usually make more contributions in the experimental part than last authors. Another very interesting pattern concerns the collection of data for the study, as intermediary authors (especially those closer to the final positions in the authors list) usually contribute more than first authors. Here, we considered all tasks to have the same weight when calculating the contributions of authors. However, as proposed in Rahman et al. (2017), a good choice of weights for each task depends on the type and emphasis of each paper. For instance, a paper focused on experiments must have high weights given to the tasks: “Conceived the experiments” and “Performed the experiments”; while, for a paper based on data acquisition and analysis, high weights should be associated to the tasks: “Collected the data” or “Analyzed the data”. In the first case, by considering the contribution patterns found for distinct tasks (Fig. 8), one should expect an increase of contributions made by first and last authors. In the case of defining high weights for data handling tasks, one would expect an increase of contributions for intermediary authors. As the information regarding individual contributions is provided in a growing number of journals, in future works, we intend to analyze how author contributorship profiles might be used to provide different productivity indexes for distinct contributions. This same analysis could be perform in specific journals to better understand the relationship between contributions and ranking in specific journals. This type of information could be useful to unveil novel patterns of credit assignment and to analyze if collaborations are also driven by contributorship patterns, in addition to other well-known factors (Amancio et al., 2012b; Hennemann, Rybski, & Liefner, 2012). The network framework proposed in this manuscript could also be used to classify authors according to their contributorship patterns in the bipartite network using a series of well-established network measurements (Newman, 2010). However, perhaps the most important implication of studying individual contributions in papers is its potential to improve the process of researchers evaluation through the creation of role-driven measures (Rahman et al., 2017), which could be combined with traditional, well-established indexes such as the number of citations or the h-index.
Acknowledgements E. A. Corrêa Jr. and D. R. Amancio acknowledge financial support from Google (Google Research Awards in Latin America grant). D. R. Amancio also thanks São Paulo Research Foundation (FAPESP) for support (grants no. 14/20830-0 and 16/190699). F. N. Silva acknowledges FAPESP (grant no. 15/08003-4). L. da F. Costa thanks CNPq-Brazil (grant no. 307333/2013-2) and NAP-PRP-USP for support. This work has been supported also by FAPESP grant no. 11/50761-2.
508
E.A. Corrêa Jr. et al. / Journal of Informetrics 11 (2017) 498–510
Table A.1 Groups and the respective keywords obtained from the application of the technique in Silva et al. (2016) to the co-citation network derived from the considered dataset. Group
Keywords
# articles
% articles
A B C D E F G H I J K L
Cell, expression, mouse, protein, cancer, tumor, role Brain, task, participant, measure, cognitive, performance, cortex Conclusion, method, risk, association, age, patient Specie, area, habitat, environmental, change, ecosystem, population Infection, conclusion, method, hiv, background, virus, health Sequence, gene, bacterial, strain, bacteria, genome, sequencing Sequence, specie, genetic, phylogenetic, gene, analysis, diversity Gene expression, analysis, identify, pathway, profile, microarray, network Structure, protein, structural, binding, bind, molecular, residue Protein, plant, activity, enzyme, acid, stress, arabidopsis Gene expression, protein, express, cell, mrna, level, role Meta-analysis, conclusion, search, ci, method, review, include
26,261 10,718 8066 6473 6302 5832 5199 3223 2690 2163 1434 1226
31.6% 12.9% 9.7% 7.8% 7.6% 7.0% 6.3% 3.9% 3.2% 2.6% 1.7% 1.5%
Author contributions Conceived and designed the analysis: Diego Raphael Amancio, Filipi Nascimento Silva and Edilson Anselmo Corrêa Jr. Collected the data: Edilson Anselmo Corrêa Jr. and Filipi Nascimento Silva. Contributed data or analysis tools: Edilson Anselmo Corrêa Jr., Filipi Nascimento Silva, Luciano da Fontoura Costa and Diego Raphael Amancio. Performed the analysis: Edilson Anselmo Corrêa Jr., Filipi Nascimento Silva, Luciano da Fontoura Costa and Diego Raphael Amancio. Wrote the paper: Diego Raphael Amancio, Filipi Nascimento Silva and Edilson Anselmo Corrêa Jr. Appendix A. Analysis of PLoS One subfields To better understand the contributorship results obtained from the PLoS One journal, we investigated whether the different patterns of contributions are a consequence of distinct subfields. While PLoS One provides a series of subject areas for each article, we found this information unsuitable for the analysis by automatic tools. Since articles can have many subject areas and considering that such categories are not standardized, obtaining a mesoscopic categorization of articles becomes a complex task. To avoid such problems, we considered instead the patterns of citations among these articles, which have been reported to be a good feature to identify scientific areas (Silva et al., 2016). Making such a consideration allow us to employ the methodology described in Silva et al. (2016). To derive the discipline of each article in our dataset, we initially obtained a co-citation network (Üsdiken & Pasadeos, 1995; Chen, 2004) comprehending the timespan of our dataset. For this task, we employed the normalized Jaccard index as the measurement of similarity among journals (Leydesdorff, 2008). As a consequence, we obtained an undirected weighted network. The groups representing the scientific fields were obtained by applying the summarization technique presented in (Silva et al., 2016). In short, we detected a set of communities for the network (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008; Newman, 2006), which are then labelled according to keywords presenting high importance index. The importance index was calculated in terms of the relative frequency of words present in the paper abstracts. More specifically, the importance index is the difference between the relative frequencies taking into consideration papers that are inside a certain community and those which are not. The resulting groups which comprehend a substantial portion of the dataset are shown in Table A.1. As expected, the great majority of articles are from medicine or biological fields, which represent, in total, more than 95% of the considered
Fig. A.1. Authors’ contributions organized by type of contributions and ranks in group B.
E.A. Corrêa Jr. et al. / Journal of Informetrics 11 (2017) 498–510
509
dataset. As a consequence, other disciplines, such as physics and information sciences are not very representative in our dataset. Thus, unfortunately, this undermines the statistical analysis of the authors contributions for such distinctive fields. Nevertheless, we can still find some diversification among the obtained groups. For instance, group D, which seems to be related to both ecological and environmental sciences, significantly differs, in methodological terms, from group I, defined by studies focused on the structure of proteins and their relationships. However, even among such groups, we found that their contribution distributions are no different than those found considering the entire dataset. There is still, however, a exception to this behavior, that is the distribution of contributions for the task of collecting data obtained for group B, shown in Fig. A.1. This distribution is considerably more uniform than those found for other groups and for the analysis considering all the data. Appendix B. Supplementary data Supplementary data associated with http://dx.doi.org/10.1016/j.joi.2017.03.003.
this
article
can
be
found,
in
the
online
version,
at
References Amancio, D. R. (2015). Authorship recognition via fluctuation analysis of network topology and word intermittency. Journal of Statistical Mechanics: Theory and Experiment, 3, P03005. Amancio, D. R., Nunes, M. G. V., Oliveira, O. N., Jr., & Costa, L. F. (2012). Using complex networks concepts to assess approaches for citations in scientific papers. Scientometrics, 91(3), 827–842. Amancio, D. R., Oliveira, O. N., Jr., & Costa, L. F. (2012b]). On the use of topological features and hierarchical characterization for disambiguating names in collaborative networks. EPL (Europhysics Letters), 99(4), 48002 Amancio, D. R., Oliveira, O. N., Jr., & Costa, L. F. (2012c]). Three-feature model to reproduce the topology of citation networks and the effects from authors’ visibility on their h-index. Journal of Informetrics, 6(3), 427–434. Amancio, D. R., Oliveira, O. N., Jr., & Costa, L. F. (2012d]). Unveiling the relationship between complex networks metrics and word senses. EPL (Europhysics Letters), 98(1), 18002. Anauati, V., Galiani, S., & Gálvez, R. H. (2016). Quantifying the life cycle of scholarly articles across fields of economic research. Economic Inquiry, 54(2), 1339–1355. Aziz, N. A., & Rozing, M. P. (2013). Profit p-index: The degree to which authors profit from co-authors. PLoS ONE, 8(4), 1–8. Biswal, A. K. (2014). An absolute index (ab-index) to measure a researcher’s useful contributions and productivity. PLoS ONE, 8(12), 1–10. Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10, P10008. Bornmann, L., Mutz, R., & Daniel, H.-D. (2008). Are there better indices for evaluation purposes than the h index? A comparison of nine different variants of the h index using data from biomedicine. Journal of the American Society for Information Science and Technology, 59(5), 830–837. Chen, C. (2004). Searching for intellectual turning points: Progressive knowledge domain visualization. Proceedings of the National Academy of Sciences United States of America, 101(Suppl. 1), 5303–5310. Egghe, L. (2006). Theory and practise of the g-index. Scientometrics, 69(1), 131–152. Ghosh, R., Kuo, T. T., Hsu, C. N., Lin, S. D., & Lerman, K. (2011). Time-aware ranking in dynamic citation networks. In IEEE 11th International Conference on Data Mining Workshops (pp. 373–380). Greene, M. (2007). The demise of the lone author. Nature, 450(7173), 1165. Hamilton, B., Nickerson, J. A., & Owan, H. (2003). Team incentives and worker heterogeneity: An empirical analysis of the impact of teams on productivity and participation. Journal of Political Economy, 111(3), 465–497. Haussler, C., & Sauermann, H. (2014). The anatomy of teams: Division of labor and allocation of credit in collaborative knowledge production. SSRN. https://ssrn.com/abstract=2434327 Hennemann, S., Rybski, D., & Liefner, I. (2012). The myth of global science collaboration – collaboration patterns in epistemic communities. Journal of Informetrics, 6(2), 217–225. Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National academy of Sciences of the United States of America, 16569–16572. Hutchins, B. I., Yuan, X., Anderson, J. M., & Santangelo, G. M. (2016). Relative citation ratio (rcr): A new metric that uses citation rates to measure influence at the article level. bioRxiv. Jost, L. (2006). Entropy and diversity. Oikos, 113(2), 363–375. Kelly, C. D., & Jennions, M. D. (2007). H-index: Age and sex make it unreliable. Nature, 449(7161), 403. Larivière, V., Desrochers, N., Macaluso, B., Mongeon, P., Paul-Hus, A., & Sugimoto, C. R. (2016). Contributorship and division of labor in knowledge production. Social Studies of Science, 46(3), 417–435. Laurance, W. F. (2006). Second thoughts on who goes where in author lists. Nature, 442(7098), 26. Leydesdorff, L. (2008). On the normalization and visualization of author co-citation data: Salton’s cosine versus the Jaccard index. Journal of the American Society for Information Science and Technology, 59(1), 77–85. Mishra, D. (2008). Citations: Rankings weigh against developing nations. Nature, 451(7176), 244. Newman, M. (2010). Networks: An introduction. Oxford University Press, Inc.: New York, NY, USA. Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23), 8577–8582. Rahman, M. T., Mac Regenstein, J., Kassim, N. L. A., & Haque, N. (2017). The need to quantify authors’ relative intellectual contributions in a multi-author paper. Journal of Informetrics, 11(1), 275–281. Regalado, A. (1995). Multiauthor papers on the rise. Science, 268(5207), 25. Ruiz-Castillo, J., & Waltman, L. (2015). Field-normalized citation impact indicators using algorithmically constructed classification systems of science. Journal of Informetrics, 9(1), 102–117. Schreiber, M. (2015). Restricting the h-index to a publication and citation time window: A case study of a timed Hirsch index. Journal of Informetrics, 9(1), 150–155. Sekercioglu, C. H. (2008). Quantifying coauthor contributions. Science, 322(5900), 371. Silva, F. N., Amancio, D. R., Bardosova, M., Costa, L. F., & Oliveira, O. N., Jr. (2016). Using network science and text analytics to produce surveys in a scientific topic. Journal of Informetrics, 10(2), 487–502. Slone, R. M. (1996). Coauthors’ contributions to major papers published in the AJR: Frequency of undeserved co-authorship. American Journal of Roentgenology, 167(3), 571–579. Tanimoto, T. (1958). An elementary mathematical theory of classification and prediction. pp. 1. IBM Internal Report.
510
E.A. Corrêa Jr. et al. / Journal of Informetrics 11 (2017) 498–510
Travenc¸olo, B., & Costa, L. F. (2008). Accessibility in complex networks. Physics Letters A, 373(1), 89–95. Tscharntke, T., Hochberg, M. E., Rand, T. A., Resh, V. H., & Krauss, J. (2007). Author sequence and credit for contributions in multiauthored publications. PLoS Biol, 5(1), e18. Üsdiken, B., & Pasadeos, Y. (1995). Organizational analysis in North America and Europe: A comparison of co-citation networks. Organization Studies, 16(3), 503–526. Viana, M. P., Amancio, D. R., & Costa, L. F. (2013). On time-varying collaboration networks. Journal of Informetrics, 7(2), 371–378. Waltman, L. (2016). A review of the literature on citation impact indicators. Journal of Informetrics, 10(2), 365–391. Waltman, L., & van Eck, N. J. (2013). A systematic empirical comparison of different approaches for normalizing citation impact indicators. Journal of Informetrics, 7(4), 833–849. Waltman, L., Yan, E., & van Eck, N. J. (2011). A recursive field-normalized bibliometric performance indicator: An application to the field of library and information science. Scientometrics, 89(1), 301–314. Wendl, M. C. (2007). H-index: However ranked, citations need context. Nature, 449(7161), 403. Wray, K. B. (2006). Scientific authorship in the age of collaborative research. Studies in History and Philosophy of Science Part A, 37(3), 505–514. Wuchty, S., Jones, B. F., & Uzzi, B. (2007). The increasing dominance of teams in production of knowledge. Science, 316(5827), 1036–1039. Zhang, C.-T. (2009). A proposal for calculating weighted citations based on author rank. EMBO Reports, 10(5), 416–417.