Disaggregated research evaluation through median-based characteristic scores and scales: a comparison with the mean-based approach


Journal of Informetrics 11 (2017) 748–765


Gabriel-Alexandru Vîiu
Research Institute of the University of Bucharest – Social Sciences Division, University of Bucharest, Panduri 90, Bucharest 050663, Romania

Article history: Received 11 March 2017; Received in revised form 17 April 2017; Accepted 17 April 2017
Keywords: Characteristic scores and scales (CSS); Scientometrics; Research evaluation; Disaggregation

Abstract

Characteristic scores and scales (CSS) were proposed in the late 1980s as a powerful tool in evaluative scientometrics but have only recently begun to be used for systematic, multilevel appraisal. By relying on successive sample means found in citation distributions the CSS method yields performance classes that can be used to benchmark individual units of assessment. This article investigates the theoretical and empirical consequences of a median-based approach to the construction of CSS. Mean and median-based CSS algorithms developed in the R language and environment for statistical computing are applied to citation data of papers from journals indexed in four Web of Science categories: Information Science and Library Science, Social work, Microscopy and Thermodynamics. Subject category-level and journal-level comparisons highlight the specificities of the median-based approach relative to the mean-based CSS. When moving from the latter to the former substantially fewer papers are ascribed to the poorly cited CSS class and more papers become fairly, remarkably or outstandingly cited. This transition is also marked by the well-known "Matthew effect" in science. Both CSS versions promote a disaggregated perspective on research evaluation but differ with regard to emphasis: mean-based CSS promote a more exclusive view of excellence; the median-based approach promotes a more inclusive outlook.

1. Introduction

Research evaluation and the specific instruments used in its service constitute one of the main topics of debate in contemporary academia and in higher education policy. Although motivated by many political, social and economic reasons, the increased attention towards assessing research can be explained by governments' need to monitor and manage the performance of higher education institutions, by the need to elicit accountability of these institutions to stakeholders and also by the quest to base funding decisions on objective evidence (Penfield et al., 2014). While there is consensus among academics and policy makers on the importance of competitive, high quality research for economic and social prosperity, there is no universally accepted instrument for assessing research performance, quantifying scientific impact or measuring scholarly influence.



The lack of a unique answer to such scientometric problems and the idea that divergent, even contradictory evaluations are always possible is a recurrent theme in the recent literature – see for instance Leydesdorff et al. (2016), Waltman et al. (2016) and Abramo and D'Angelo (2015). On a fundamental level all types of evaluation, including scientometric appraisal and the many indicators in its toolkit, hinge on the idea of aggregation and on the specific form that the aggregative process takes. The way in which specific informational inputs are combined to yield evaluative outcomes is critical. This fact has sparked an already long debate in scientometrics, particularly in the wake of the h-index (Hirsch, 2005) and its many variants like the g-index (Egghe, 2006) which, in essence, only offer an alternative aggregation of the underlying citation data. The moral of the continuing debates surrounding the h-index, of the separate debates around the journal impact factor, as well as of the more recent discussions hosted in the pages of this very journal regarding size-independent indicators versus efficiency indicators (Abramo and D'Angelo, 2016), is that in order to be confident in the outcomes of an evaluation it is crucial to be confident in the instrument used to conduct it.

As aggregated scientometric indicators have become more important within national assessment processes and international university rankings, their properties, advantages and limitations have attracted increased attention, and the meaningful use of citation data has become a critical issue in research evaluation and in policy decisions (van Raan, 2005). There seems to be an indisputable consensus in the scientometric community that aggregated indicators are inadequate for the purpose of research evaluation since each indicator, taken separately, can only provide a partial and potentially distorted view of the performance attained by a specific unit of assessment (Hicks et al., 2015; Moed and Halevi, 2015; Van Leeuwen et al., 2003; van Raan, 2006; Vinkler, 2007). This wisdom has been affirmed a fortiori following the introduction of the Hirsch index in 2005 and the wave of Hirsch-type indicators (Bornmann et al., 2011; Schreiber, 2010) that were subsequently proposed as improvements. The overt consensus regarding the rejection of single-number indicators such as the h-index has as its corollary an implicit consensus around a more general principle: when faced with the option between an aggregated approach and a disaggregated approach to research evaluation, the latter is to be preferred. In other words, one should use research evaluation instruments that discard as little information as possible and offer a wide and comparatively rich picture of performance.1

One of the contemporary research evaluation instruments that adhere to these desiderata is given by characteristic scores and scales (CSS) for scientific impact assessment (Glänzel and Schubert, 1988; Schubert et al., 1987), which represent an effort towards achieving a multi-dimensional, disaggregated perspective on research performance. The CSS method was proposed in the late 1980s to assess the eminence of scientific journals on the basis of the citations received by the articles they publish. Its cornerstone idea is that of allowing a parameter-free characterization of citation distributions in such a way that impact classes are defined recursively by appealing to successive (arithmetic) means found within a given empirical distribution.
The approach is highly relevant to scientometric evaluation because it addresses one of the fundamental problems associated with the adequate statistical treatment of citation data – the skewness of science (Albarrán et al., 2011; Seglen, 1992) – which makes analysis through standard statistical practice difficult and potentially biased.

The aim of this article is to explore a theoretically grounded proposal to modify CSS by changing the reference thresholds used in this evaluative instrument from arithmetic means to medians. To the knowledge of the author this possibility has only been noted in a single previous study (Egghe, 2010b), where it received only a formal, theoretical exploration in a continuous Lotkaian framework. All empirical studies devoted to the application of CSS (see Section 2.1) have so far relied on the original mean-based approach. As a result, to date there are neither empirical analyses that leverage median-based CSS, nor factual comparisons of any results produced by this instrument with results relying on the mean-based approach. This article addresses these knowledge gaps by examining both mean and median-based CSS in an application to citation data of journals indexed in the Web of Science categories Information Science and Library Science, Social work, Microscopy and Thermodynamics. The article also offers a practical implementation of the CSS algorithms in the freely available R language and environment for statistical computing. More generally, the article argues in favor of a disaggregated, inclusive approach to research evaluation and performance assessment.

The article is structured as follows: Section 2 presents the CSS mechanism in more detail, reviews the state of the art with regard to the use of this instrument and puts forward the arguments that justify the need for the alternative, median-based approach; this section also examines the theoretical implications of this shift and provides information on the data used in the empirical investigation together with adjacent methodological notes. Section 3 presents the comparative results of the empirical analyses and highlights the distinctiveness inherent in the application of median-based CSS to citation data. Section 4 summarizes the results and provides a few concluding remarks.

1 Note for instance that widely used indicators like the h-index and the impact factor discard a great deal of information precisely because of their underlying aggregation.

2. Theoretical background

2.1. Mean-based CSS in evaluative scientometrics

The fundamental idea of CSS is that of recursively defining certain performance classes for a given empirical distribution of published papers based on the observed number of citations they receive.2 Considering a set of n papers published in a particular field of science one starts by sorting in descending order the observed citations X1, ..., Xn received by each paper. An ordered list of the form X1* ≥ X2* ≥ ... ≥ Xn* is obtained and the parameters β0 = 0 and v0 = n are defined to derive the characteristic scores and scales of the citation distribution. β1 is now given by the initial sample mean of the full distribution of citations:

$$\beta_1 = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{\sum_{i=1}^{n} X_i^{*}}{v_0} \qquad (1)$$

and the value v1 is jointly defined by

$$X_{v_1}^{*} \ge \beta_1 \quad \text{and} \quad X_{v_1+1}^{*} < \beta_1. \qquad (2)$$

Starting from these foundations the procedure can be iterated in the form

$$\beta_k = \frac{\sum_{i=1}^{v_{k-1}} X_i^{*}}{v_{k-1}} \qquad (3)$$

to define subsequent sub-sample means, with the understanding that vk is chosen so that

$$X_{v_k}^{*} \ge \beta_k \quad \text{and} \quad X_{v_k+1}^{*} < \beta_k, \quad k \ge 2. \qquad (4)$$

In theory the iteration can proceed until vk = 1 for some k > 0, in other words until a βk threshold value is reached that is so high it only allows a single (i.e. the most highly cited) paper above it, thus rendering the computation of a further mean redundant. As a result of the iteration process one can expect to obtain an increasing sequence β0 ≤ β1 ≤ ... and a decreasing sequence v0 ≥ v1 ≥ ...; these yield distinct classes of papers based on their citation characteristics. In the initial description of the CSS method (Schubert et al., 1987) the following five classes were proposed: class 0, made up entirely of uncited papers, the class of "poorly cited" papers defined on the interval (β0, β1), the class of "fairly cited" papers defined on [β1, β2), "remarkably cited" papers defined on the interval [β2, β3) and, finally, the class of "outstandingly cited" papers defined as belonging to the interval [β3, ∞). It is of course possible to opt for more or fewer classes. The more recent studies either collapse the class of uncited papers into the poorly cited category (thereby obtaining a total of four classes), or use as few as three classes by further collapsing the classes of outstandingly cited and remarkably cited papers into a single category.

CSS can be used in the comparative evaluation of journals, countries, research groups and even individuals to produce several interesting items of information: at a general level CSS outline the threshold values βk and the overall (full sample) distribution of papers across the five citation classes; at a more granular level CSS can be used to gauge the performance of individual units of assessment by determining the distributions of their own (sub-sample of) papers across the citation classes. This effectively leads to a benchmarking of each unit of assessment relative to (1) the overall reference set taken into consideration and (2) all other individual units of assessment that make up the reference set.

The CSS method has excellent potential to inform and enrich the general policy use of scientometric data, as argued by Glänzel et al. (2014). By dispensing with distributional assumptions altogether the CSS method is uniquely suited to the task of comparing research performance at multiple levels of analysis. It has been the focus of renewed interest in several studies that seek to identify publication and citation characteristics of scientific fields and subfields (Albarrán et al., 2011; Albarrán and Ruiz-Castillo, 2011; Glänzel, 2007; Li et al., 2013; Ruiz-Castillo and Costas, 2014), test the tail properties of scientometric distributions (Glänzel, 2010, 2013), evaluate and rank scientific journals (Glänzel, 2011) and conduct institutional and national comparisons of research performance (Albarrán and Ruiz-Castillo, 2011; Perianes-Rodriguez and Ruiz-Castillo, 2014, 2015). CSS have also been studied from an informetric perspective by Egghe (2010a, 2010b). More recently the use of CSS has also been suggested for the assignment of institutions to meaningful groups within international university rankings (Bornmann and Glänzel, 2016). Given the wide-scale use and intuitive appeal of this scientometric instrument it is worth devoting serious thought to the theoretical and practical possibility of complementing it with similar disaggregation-oriented evaluation instruments.
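The recursion in Eqs. (1)–(4) is straightforward to implement. The following minimal R sketch is illustrative only and is not the implementation provided in Supplementary material 1; the function names and the optional center argument are assumptions of this sketch, with center = mean reproducing the mean-based recursion and center = median giving the variant discussed in Section 2.2 below.

```r
# Minimal sketch of the CSS recursion (illustrative; not the published Supplementary code).
# With center = mean it follows Eqs. (1)-(4); center = median gives the Section 2.2 variant.
css_thresholds <- function(citations, n_thresholds = 3, center = mean) {
  x <- sort(citations, decreasing = TRUE)   # X*_1 >= X*_2 >= ... >= X*_n
  v <- length(x)                            # v_0 = n
  thresholds <- numeric(0)
  for (k in seq_len(n_thresholds)) {
    if (v <= 1) break                       # stop once only the top paper remains (v_k = 1)
    b <- center(x[1:v])                     # beta_k (or m_k) over the current truncation
    thresholds <- c(thresholds, b)
    v <- sum(x >= b)                        # v_k: papers at or above the new threshold
  }
  thresholds
}

# Assign each paper to one of the five CSS classes given the three thresholds:
# uncited (0), poor (0, t1), fair [t1, t2), remarkable [t2, t3), outstanding [t3, Inf)
css_classes <- function(citations, thresholds) {
  cls <- ifelse(citations == 0, "uncited",
         ifelse(citations < thresholds[1], "poor",
         ifelse(citations < thresholds[2], "fair",
         ifelse(citations < thresholds[3], "remarkable", "outstanding"))))
  factor(cls, levels = c("uncited", "poor", "fair", "remarkable", "outstanding"))
}
```

Passing the centering statistic as an argument keeps a single recursion for both variants; the published code may of course be organized differently.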

2 The presentation in this sub-section is based on the account given by Glänzel (2010, pp. 704–705).

2.2. Constructing median-based CSS: theory and implications

While its disaggregated outlook is a positive feature that makes CSS an appealing evaluation instrument, there are some aspects that deserve more critical reflection. Among them is the fact that there seems to be an inherent contradiction in the construction and use of mean-based CSS: they are meant to be applied to citation data that are known to be skewed (most often highly so), yet they rely on a measure of central tendency – the arithmetic mean – which is equally known for not being robust and representative when applied to skewed distributions. This contradiction constitutes the primary motivation for exploring in this article alternative CSS algorithms based on other measures of central tendency.

Potential candidates for substituting the arithmetic mean used in the CSS method are the median values found in citation distributions or other types of means: trimmed arithmetic means, geometric means or harmonic means. Other theoretical considerations notwithstanding, the latter two prove impractical due to their calculation: if even a single value of 0 is found in the distribution (and there are bound to be uncited papers in any citation distribution) then the resulting initial threshold value will be 0 and it is not possible to construct sensible evaluation classes. Trimmed means may prove to be a workable solution but require the specification of a clear percentage of extreme values to be discarded, and it is hard to offer convincing arguments that would justify, for instance, a 5% trimmed mean versus a 20% trimmed mean. Using the median not only avoids such arbitrary choices but also has powerful theoretical justifications which are commonplace in statistical analysis: the median is a uniquely robust measure of central tendency (it is unaffected by extreme values) and, in the case of highly skewed distributions, it is a much more representative indicator of the typical values encountered within the distribution (Agresti and Finlay, 2009; Sheskin, 2004). In order to construct representative CSS for research evaluation the median therefore seems to be a more theoretically grounded option than the arithmetic mean.

Using the median in technical problems related to research evaluation already has many precedents which specifically justify this approach through the fact that in the presence of skewed citation counts the median should be used as a measure of central tendency instead of the arithmetic mean. Bornmann et al. (2008) use this argument in support of their m-index (an alternative to the h-index based on the median number of citations received by papers in the so-called Hirsch core). Calver and Bradley (2009) argue that although mean and median values tend to correlate, the median is a better measure of central tendency to use for comparing journal citation distributions. In a similar vein Costas et al. (2010) incorporate the median impact factor of publications in their methodology for assessing research performance of individual scientists. Leydesdorff and Opthof (2011) explicitly suggest the use of a median normalized citation score as an alternative to the mean normalized citation score (MNCS), the "new crown indicator" advocated by Waltman et al. (2011).3
Smolinsky (2016) also notes that it is possible to use the median instead of the expected number of citations in the construction of the MNCS; this also yields an indicator which satisfies the properties of homogeneous normalization and consistency (as is the case for the original MNCS). Finally, although not strictly limited to research evaluation, it is worth noting that the idea of constructing rank groups based on median values is a methodological option used in the recent edition of U-Multirank, the European initiative to compare higher education institutions.4

The specificity of citation data – large variability in citation counts among individual papers coupled with skewed citations among papers in any field (Moed, 2005) – makes a theoretically compelling case for extending the median-based approach to the construction of CSS. One might even argue that through this technical modification more characteristic scores and scales can be obtained. If one adheres to the idea that "in bibliometrics it is desirable to find the value that is most typical" (Wildgaard et al., 2014, p. 148, emphasis added) then a median-based approach to CSS might prove inherently more appealing than the mean-based approach.

Algorithmically, the median-based CSS mirror the functional form of the equations presented in Section 2.1, with the obvious difference that one has to redefine the βk arithmetic mean values as mk medians. This does not pose any ostensible practical problems.5 It is, however, worthwhile to devote some thought to the theoretical implications of this shift. In statistics there is a well-known mean–median–mode inequality attributed to Karl Pearson which states that in a given distribution either μ ≥ m ≥ M (if the distribution has a positive skew) or μ ≤ m ≤ M (if the distribution has a negative skew), μ being the arithmetic mean, m the median and M the mode (Sen, 1989). Since citation distributions are right-skewed one would expect the inequality μ ≥ m to hold. Since CSS rely on multiple iterations it is expected, by generalizing the mean–median inequality, that each mk median will be lower than (or at most equal to) the corresponding βk mean threshold. In other words, the inequality should hold not only at the level of the complete initial distribution but also at the level of the subsequent truncated distributions.6

3 The proponents of the MNCS themselves concede (p. 45) as a first limitation of this indicator the fact that it is defined as an arithmetic average – a statistical tool which they note should be used with care specifically because citation distributions tend to be highly skewed.
4 See http://www.umultirank.org/cms/wp-content/uploads/2016/03/Rank-group-calculation-in-U-Multirank-2016.pdf (last accessed March 8th 2017).
5 Note that although the algorithmic implementation is straightforward, the median-based CSS remains vulnerable to the following potential problem: for distributions with very large shares of uncited papers the first median may be 0; in this case it is not possible to construct the five CSS classes. More generally, if a distribution has a near-uniform nature then the first and second (and perhaps even the third) median values will coincide and it becomes impossible to construct the five clearly discernible CSS classes outlined above. For large empirical samples this problem of indiscernible thresholds that would lead to indiscernible ranking classes is not very likely to be encountered; nonetheless, the problem is theoretically interesting enough to be noted.

In theory, then, the change from mean-based to median-based CSS should have the following joint effects:

I. the class of uncited papers will remain unchanged since the same absolute number of uncited papers will be related to the same overall reference set;
II. the class of poorly cited papers should be marked by a certain reduction since some of the papers classified as poorly cited under mean-based CSS will rise above the first mk threshold under the median-based approach and will therefore move to the fairly cited class; this reduction will greatly depend on the exact size of the gap between mean and median values;
III. the classes of fairly, remarkably and outstandingly cited papers should generally be marked by a certain increase which ensues from the redistribution (across these three classes) of papers previously classified as poorly cited under mean-based CSS.

Before moving to a presentation of the data and methods a final aspect should be noted. If the mean–median inequality holds for citation data,7 then there is a clear practical distinction between the original mean-based CSS and the alternative median-based approach discussed in this article: mean-based CSS will naturally single out few remarkable papers and even fewer outstanding papers, while median-based CSS, which are better suited to the task of capturing more typical performance, will implicitly single out a greater number of papers as remarkable and outstanding, thus redefining high-end achievement. One might say in a sense that median-based CSS are more inclusive than mean-based CSS because they make allowance for a greater share of papers to rank in the higher CSS classes.

2.3. Data and methods

To conduct an empirical comparison of mean and median-based CSS, raw citation counts of individual publications were retrieved from the Web of Science for all journals belonging to the categories Information Science and Library Science, Thermodynamics, Microscopy and Social work.8 Only articles, reviews and letters were taken into consideration and for each category a five-year citation window was employed: all items published within the period 2009–2013 were taken into account, thus allowing even the more recent articles at least three years in which to accumulate citations. In order to obtain an inclusive picture of the Information Science and Library Science category, Thomson Reuters' Journal Citation Reports (JCR) issued for the years 2009–2013 were consulted. Any journal listed within the category in at least one JCR edition was taken into account, yielding a final list of 89 journals. Each journal was also inspected individually within the JCR platform to check for publication frequency, coverage and title changes. A single journal was found to have altered its name in the period of interest: Libraries & the Cultural Record, which became Information & Culture in 2012. For the purpose of analysis articles published under the two names were merged under the new name of Information & Culture, thus bringing the total number of unique journals in Information Science and Library Science to 88. The same steps were taken in the case of journals that constitute the categories Thermodynamics (where 56 journals unaffected by any title changes were identified for the 2009–2013 period), Microscopy (where 11 journals were identified) and Social work (where 43 journals were identified).
Four distinct citation datasets were obtained: a dataset of 18,012 documents published by the 88 journals indexed in the Information Science and Library Science category, a separate dataset of 36,631 documents published by the 56 journals indexed in the Thermodynamics category, a third dataset consisting of 5,382 documents published by the 11 journals indexed in the Microscopy category and, finally, a dataset consisting of 9,539 documents published by the 43 journals indexed in the category Social work. In total, citation data for 69,564 papers published by 198 journals were collected. The four subject categories chosen for analysis obviously differ substantially with regard to the total number of journals and publications that they circumscribe and, implicitly, with regard to citation density. Microscopy and Thermodynamics are indexed in the Science Citation Index Expanded while Social work and Information Science and Library Science are indexed in the Social Science Citation Index. Citation data from the four categories have the potential to offer contrasting results regarding the effects of shifting from the mean to the median-based CSS, and these categories were chosen to provide a comparative perspective on the outcomes that such a shift might have.

To conduct the empirical comparison of mean and median-based CSS, functions were written for both instruments in the R language and environment for statistical computing (R Core Team, 2016). These functions are provided in Supplementary material 1 in Appendix F. First, a global mean-based CSS function was written which is meant to be applied at the level of the full reference set taken into consideration (for example at the level of all papers published in journals indexed in Information Science and Library Science). This global function yields the βk reference set thresholds and the corresponding partitioning of papers across the CSS classes. The reference set thresholds can then be incorporated in a further function that is meant to be applied at the level of individual units of assessment (for example to individual journals whose publications make up the full reference set) in order to obtain the partitioning of their own papers across the CSS classes. Two analogous functions were written for median-based CSS: a global function that yields the mk reference set thresholds and the global partitioning of papers at the subject category level, and an adjacent function meant to be applied at the level of journals. The application of the functions yields detailed results regarding the distribution of the papers of a particular unit of assessment (in this case journals indexed in the four subject categories) across the mean and median-based CSS classes.

By applying the mean and median-based algorithms four secondary datasets were obtained, one for each subject category previously mentioned. These secondary datasets contain the journal-level results: the percent distribution of journal papers across the five CSS classes under both mean and median-based approaches and the changes in class composition when moving from mean to median-based CSS (for example the difference between the percent of papers ranked as poorly cited under mean-based CSS and the percent of papers ranked as poorly cited under median-based CSS). Once these secondary datasets were obtained from the primary citation data the gaps and differences between the results of mean and median-based CSS could be analyzed. The findings are presented in Section 3 and the secondary datasets themselves are provided as Supplementary material 2 in Appendix F.
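As a rough illustration of how such a workflow might be organized (an assumption-laden sketch rather than the published Supplementary code: the data frame, column names, journal codes and simulated citation counts below are all invented), the global thresholds are computed once over the full reference set and then reused to benchmark a single journal under both variants, reusing the css_thresholds and css_classes functions sketched in Section 2.1:

```r
# Illustrative use on invented data (the real datasets are in Supplementary material 2)
set.seed(1)
papers <- data.frame(
  journal   = sample(paste0("J_", sprintf("%02d", 1:10)), 2000, replace = TRUE),
  citations = rnbinom(2000, size = 0.6, mu = 9)   # skewed, citation-like counts
)

# Global (reference set) thresholds under both variants
beta <- css_thresholds(papers$citations, center = mean)    # beta_1, beta_2, beta_3
m    <- css_thresholds(papers$citations, center = median)  # m_1, m_2, m_3

# Global partitioning of papers across the five classes (percentages)
round(100 * prop.table(table(css_classes(papers$citations, beta))), 2)
round(100 * prop.table(table(css_classes(papers$citations, m))), 2)

# Journal-level benchmarking: a journal's own papers are classified
# against the *global* thresholds, not against journal-level ones
j01 <- papers$citations[papers$journal == "J_01"]
round(100 * prop.table(table(css_classes(j01, beta))), 2)
round(100 * prop.table(table(css_classes(j01, m))), 2)
```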

6 Von Hippel (2005) notes the mean–median inequality can sometimes fail in the case of discrete distributions and since citation distributions are by definition discrete there is the theoretical possibility that the inequality might fail in their regard. Consider for example a purely hypothetical reference set made up of 2500 papers: 1050 papers are uncited, 1080 papers are cited only once, 270 papers are each cited two times, 80 papers are each cited three times, and 20 papers are each cited four times. Although the data are clearly right-skewed the mean value is 0.78 but the median is 1.
7 Note some special features of the distribution from the hypothetical example in the previous footnote: the range of values is quite reduced (4) and the skewness is mild (about 1.12). If these two features are both needed for the mean–median inequality to fail then in the case of citation distributions (which are usually more spread out and which exhibit more pronounced skewness) empirical violations of the inequality may indeed be quite rare phenomena.
8 The data were retrieved from the Web of Science on February 17th, 2017, for Information Science and Library Science journals, on February 20th for journals in Thermodynamics, on February 23rd for journals in Microscopy and on February 24th for journals in Social work.
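The counter-example in footnote 6 can be checked directly; a two-line R snippet (using only the paper counts stated in the footnote) reproduces a mean of about 0.78 against a median of 1, i.e. a violation of the usual mean–median ordering for right-skewed data:

```r
# Hypothetical distribution from footnote 6: 2500 papers with 0-4 citations
x <- rep(c(0, 1, 2, 3, 4), times = c(1050, 1080, 270, 80, 20))
c(mean = mean(x), median = median(x))   # mean ~ 0.78 < median = 1
```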

Table 1
Descriptive statistics of citation counts: distribution by deciles, mean and standard deviation.

| Category | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% | Mean | St. dev. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information Science and Library Science | 0 | 0 | 1 | 2 | 4 | 5 | 8 | 12 | 21 | 504 | 8.75 | 17.71 |
| Microscopy | 1 | 2 | 3 | 4 | 6 | 7 | 10 | 13 | 21 | 268 | 9.33 | 13.12 |
| Social work | 0 | 1 | 2 | 3 | 4 | 6 | 7 | 11 | 16 | 260 | 7.15 | 11.23 |
| Thermodynamics | 0 | 2 | 3 | 5 | 7 | 9 | 13 | 18 | 28 | 654 | 12.19 | 20.28 |

3. Results and discussion

3.1. Citation distributions in the four subject categories

Before discussing the comparative results of mean and median-based CSS in the four subject categories selected for analysis it is useful to first provide some preliminary information regarding the citation patterns within each category. A detailed review of the citation patterns found in the four categories is provided in Table 1, which presents the upper thresholds of the deciles of the distributions of citation counts (considered from lowest to highest values) within each category together with the mean values (effectively the β1 thresholds from mean-based CSS) and associated standard deviations.9 One can also note that the values of the 5th decile (50th percentile) are effectively equivalent to the m1 thresholds from median-based CSS. Already a clear gap between βk and mk values is visible, widest in the case of Thermodynamics and narrowest for Social work. This already implies that the shift from mean to median-based CSS should lead to a certain reduction in the class of poorly cited papers and a corresponding increase in the other classes. The subsequent sections continue the exploration of this point and of the detailed changes in class composition that occur when shifting from mean to median-based CSS, first at the aggregated level of subject categories, then at the more granular level of individual journals.
3.2. Comparison of mean and median-based CSS: subject categories

At the general level of the four subject categories selected for analysis the application of the median-based CSS method yields substantial changes compared to the mean-based approach. Table 2 provides a detailed account of this fact. The left side of the table presents the threshold values that emerge when applying the two alternative algorithms and the right side lists the corresponding distribution of papers across the CSS classes. For each subject category and at each step of the algorithm mean β1 values are larger than m1 values, β2 values are larger than m2 values and β3 values are larger than m3 values. There is also a noticeable trend whereby the gaps between the initial thresholds (β1 − m1) are the narrowest, while those between the successive thresholds (β2 − m2 and β3 − m3) become increasingly wider. At the level of the third thresholds the mean–median gaps reach values as high as 31.84 for Information Science and Library Science, 30.98 for Thermodynamics, 21.73 for Microscopy and 18.7 for Social work. These point-wise gaps inevitably lead to substantial variation between the distribution of papers across mean-based CSS classes and the distribution of the same papers across median-based CSS classes, presented in the right side of the table.

9 Although not shown in the table the statistical indicator of skewness was also computed for each subject category. This indicator was found to have a value of 8.87 in the case of Information Science and Library Science, 5.70 in the case of Microscopy, 7.80 for Social work and 8.93 in the case of Thermodynamics. The distributions in all four categories exhibit a severe skewness typical of citation data.

Table 2
Global differences across subject categories between mean and median-based CSS. The first three columns give the threshold values (βk for mean-based CSS, mk for median-based CSS and the gaps βk − mk); the remaining columns give the number of papers in each CSS class (with percentages), the difference rows reporting the change when moving from mean to median-based CSS.

Information Science and Library Science
| | β1 / m1 | β2 / m2 | β3 / m3 | Uncited | Poor | Fair | Remarkable | Outstanding |
|---|---|---|---|---|---|---|---|---|
| Mean-based CSS | 8.75 | 24.33 | 49.84 | 4070 (22.60%) | 8748 (48.57%) | 3722 (20.66%) | 1028 (5.71%) | 444 (2.47%) |
| Median-based CSS | 4 | 10 | 18 | 4070 (22.60%) | 4870 (27.04%) | 4371 (24.27%) | 2328 (12.92%) | 2373 (13.17%) |
| Difference | 4.75 | 14.33 | 31.84 | 0 (0.00%) | −3878 (−21.53%) | 649 (3.61%) | 1300 (7.21%) | 1929 (10.70%) |

Microscopy
| | β1 / m1 | β2 / m2 | β3 / m3 | Uncited | Poor | Fair | Remarkable | Outstanding |
|---|---|---|---|---|---|---|---|---|
| Mean-based CSS | 9.33 | 21.37 | 38.73 | 489 (9.09%) | 3190 (59.27%) | 1188 (22.07%) | 350 (6.50%) | 165 (3.07%) |
| Median-based CSS | 6 | 11 | 17 | 489 (9.09%) | 2146 (39.87%) | 1263 (23.47%) | 690 (12.82%) | 794 (14.75%) |
| Difference | 3.33 | 10.37 | 21.73 | 0 (0.00%) | −1044 (−19.40%) | 75 (1.40%) | 340 (6.32%) | 629 (11.68%) |

Social work
| | β1 / m1 | β2 / m2 | β3 / m3 | Uncited | Poor | Fair | Remarkable | Outstanding |
|---|---|---|---|---|---|---|---|---|
| Mean-based CSS | 7.15 | 17.46 | 31.7 | 1184 (12.41%) | 5504 (57.70%) | 1997 (20.94%) | 596 (6.25%) | 258 (2.70%) |
| Median-based CSS | 4 | 8 | 13 | 1184 (12.41%) | 3163 (33.16%) | 2341 (24.54%) | 1339 (14.04%) | 1512 (15.85%) |
| Difference | 3.15 | 9.46 | 18.7 | 0 (0.00%) | −2341 (−24.54%) | 344 (3.60%) | 743 (7.79%) | 1254 (13.15%) |

Thermodynamics
| | β1 / m1 | β2 / m2 | β3 / m3 | Uncited | Poor | Fair | Remarkable | Outstanding |
|---|---|---|---|---|---|---|---|---|
| Mean-based CSS | 12.19 | 29.85 | 54.98 | 3772 (10.30%) | 21697 (59.23%) | 7771 (21.21%) | 2422 (6.61%) | 969 (2.65%) |
| Median-based CSS | 7 | 15 | 24 | 3772 (10.30%) | 14332 (39.13%) | 9020 (24.62%) | 4594 (12.54%) | 4913 (13.41%) |
| Difference | 5.19 | 14.85 | 30.98 | 0 (0.00%) | −7365 (−20.10%) | 1249 (3.41%) | 2172 (5.93%) | 3944 (10.76%) |

In the case of mean-based CSS a typical pattern – noted for instance in Glänzel et al. (2016) – emerges for each of the four subject categories: about 69 to 70% of the papers are either uncited or poorly cited, about 21 to 22% of the papers are fairly cited, about 6 to 7% are remarkably cited and only about 2 to 3% are outstandingly cited. These figures are also consistent with findings reported in Ruiz-Castillo and Waltman (2015) and underscore the already well documented scale- and replication-invariance properties of mean-based CSS: despite the diversity of scientific disciplines, and despite using different aggregation levels or citation windows, CSS tend to yield a stable and extraordinarily similar distribution of papers across the evaluation classes. For median-based CSS about 50% of the papers are either uncited or poorly cited (except for Social work where these two classes make up about 55% of the papers), about 23 to 24% of the papers are fairly cited, 13 to 14% are remarkably cited and about 13 to 15% are outstandingly cited. It is important to note that even though only four scientific subject categories were analyzed the results seem to converge towards a discernible pattern, similar to the pattern typical of mean-based CSS. However, more extensive applications are needed to verify whether the pattern sketched for the median-based CSS has the same stability and replicability as the 70/21/6-7/2-3 pattern found in the case of the mean-based approach.

With the exception of the class of uncited papers there are visible gaps between the results of the two CSS algorithms, as argued in Section 2.2. For each of the four subject categories there is a considerable reduction (ranging between 19.40 and 24.54%) in the class of poorly cited papers. This decrease is mirrored by increases in the other classes. Specifically, the class of fairly cited papers increases by 1.40 to 3.61%, the class of remarkably cited papers increases by about 6 to 8% and the class of outstandingly cited papers increases by about 11 to 13%. The most striking changes when moving from mean to median-based CSS seem to occur at the level of opposite ranking classes: not only is there a substantial reduction in the class of poorly cited papers, but the better part of this reduction seems to translate into improvement in the highest ranking class of outstanding papers. The class of remarkably cited papers is affected to a lesser extent and the class of fairly cited papers is only marginally affected by the shift.

Following the practice found in some previous studies (Albarrán et al., 2011; Ruiz-Castillo and Waltman, 2015), Table 3 provides an additional comparative analysis which focuses not on papers but on the share of citations that each CSS class accounts for, together with mean values, standard deviations and coefficients of variation at the level of each CSS class (with the natural exception of the uncited class which by definition includes only papers with 0 citations). In the case of mean-based CSS the class of poorly cited papers accounts, on average across the four subject categories, for about 25% of the total number of citations, the class of fairly cited papers accounts for about 33% of citations, that of remarkably cited papers for about 20% and the class of outstandingly cited papers for about 22%. A similar analysis of median-based CSS reveals that only 9% of the total citations are accounted for by the class of poorly cited papers, while the classes of fairly and remarkably cited papers each account for about 19% of citations. There is a substantial increase in the percent of citations accounted for by the class of outstandingly cited papers: on average across the four subject categories this class accounts for no less than 53% of the total citations.

Table 3
Percent of total citations accounted for by each CSS class, mean, standard deviation (SD) and coefficient of variation (CV) at the level of each CSS class within the four subject categories.

Mean-based CSS
| Category | Class | % of total citations | Mean | SD | CV |
|---|---|---|---|---|---|
| Information Science and Library Science | Poor | 19.78 | 3.56 | 2.21 | 0.62 |
| | Fair | 33.65 | 14.24 | 4.33 | 0.30 |
| | Remarkable | 21.79 | 33.4 | 6.81 | 0.20 |
| | Outstanding | 24.78 | 87.92 | 56.68 | 0.64 |
| Microscopy | Poor | 27.51 | 4.33 | 2.47 | 0.57 |
| | Fair | 32.76 | 13.84 | 3.31 | 0.24 |
| | Remarkable | 19.30 | 27.67 | 4.64 | 0.17 |
| | Outstanding | 20.44 | 62.17 | 31.38 | 0.50 |
| Social work | Poor | 27.03 | 3.35 | 1.90 | 0.57 |
| | Fair | 33.27 | 11.36 | 2.73 | 0.24 |
| | Remarkable | 19.78 | 22.63 | 3.74 | 0.17 |
| | Outstanding | 19.92 | 52.66 | 34.56 | 0.66 |
| Thermodynamics | Poor | 25.41 | 5.23 | 3.31 | 0.63 |
| | Fair | 32.84 | 18.88 | 4.64 | 0.25 |
| | Remarkable | 20.91 | 38.56 | 6.91 | 0.18 |
| | Outstanding | 20.83 | 96.04 | 64.98 | 0.68 |

Median-based CSS
| Category | Class | % of total citations | Mean | SD | CV |
|---|---|---|---|---|---|
| Information Science and Library Science | Poor | 5.75 | 1.86 | 0.80 | 0.43 |
| | Fair | 16.85 | 6.07 | 1.68 | 0.28 |
| | Remarkable | 19.05 | 12.89 | 2.28 | 0.18 |
| | Outstanding | 58.36 | 38.74 | 34.81 | 0.90 |
| Microscopy | Poor | 12.26 | 2.87 | 1.40 | 0.49 |
| | Fair | 19.61 | 7.79 | 1.41 | 0.18 |
| | Remarkable | 17.96 | 13.06 | 1.68 | 0.13 |
| | Outstanding | 50.17 | 31.72 | 21.74 | 0.69 |
| Social work | Poor | 8.95 | 1.93 | 0.82 | 0.42 |
| | Fair | 18.08 | 5.27 | 1.10 | 0.21 |
| | Remarkable | 19.09 | 9.73 | 1.42 | 0.15 |
| | Outstanding | 53.88 | 24.30 | 19.70 | 0.81 |
| Thermodynamics | Poor | 10.22 | 3.18 | 1.68 | 0.53 |
| | Fair | 20.18 | 9.99 | 2.25 | 0.22 |
| | Remarkable | 18.91 | 18.39 | 2.55 | 0.14 |
| | Outstanding | 50.69 | 46.08 | 38.71 | 0.84 |

Considering these comparative aspects it becomes apparent that under median-based CSS the word "outstanding" takes on a truly compelling empirical substance, since the outstandingly cited papers now account for more than half of the entire citation output at the level of each Web of Science subject category. Recall that under mean-based CSS the outstandingly cited class of papers accounts for less than a quarter of all citations. Finally, note that for both CSS algorithms Table 3 indicates that larger coefficients of variation are found for the highest and lowest ranking classes (poorly and outstandingly cited papers) while visibly smaller coefficients of variation (meaning greater degrees of homogeneity) obtain in the case of the two intermediate classes of fairly and remarkably cited papers.

3.3. Comparison of mean and median-based CSS: individual journals

The previous section outlined the general patterns of change that emerge when moving from mean to median-based CSS at the level of subject categories. The present section undertakes a more detailed comparison of paper distributions across the CSS classes at the level of individual journals within each subject category. From an evaluation perspective it is more interesting to see the shifts which occur at this lower level of individual units of assessment. A concise picture of the differences in class composition is provided by Fig. 1 and detailed journal-level comparisons of the distributions of papers across the mean-based and median-based CSS classes are given in Fig. 2 and Appendices A–E. Note that the journal codes (having the form J xy) used in these figures are given in Supplementary material 2 in Appendix F, which also includes the underlying data.

Careful inspection of each of the figures reveals that the trend identified at the general level of subject categories is also echoed at the individual level of journals: in the overwhelming majority of cases – 163 of the 198 journals analyzed – replacing mean-based CSS with median-based CSS leads to a substantial decrease (more than 10%) in the class of poorly cited papers coupled with corresponding increases in the other classes. For the Microscopy journals the reduction in the class of poorly cited papers extends as far as 25.49% (in the case of journal J 02), for Thermodynamics as far as 28.69% (J 06), for Social work as far as 35.06% (J 19) and for Information Science and Library Science as far as 40.66% (J 27). The visible contraction of the class of poorly cited papers is the most common phenomenon associated with the shift from mean to median-based CSS. Less common phenomena are the reduction in the class of fairly cited papers (which affects only 38 of the 198 journals) and the decrease in the class of remarkably cited papers (which affects only 6 of the 198 journals). In these rarer cases where reductions occur outside the class of poorly cited papers they are always translated upwards as increases in the class of outstandingly cited papers. As a result, with the exception of the uncited class of papers, the class of outstandingly cited papers is the only one which does not experience any reduction. At the level of this highest ranking class the shift from mean to median-based CSS either leaves journals with the same percent of outstandingly cited papers (this is the case for 36 journals) or leads to some degree of expansion which is generally non-negligible. For 94 out of the 198 journals the increase in the class of outstanding papers exceeds 5% and reaches values as high as 19.32% for Microscopy (J 10), 27.18% for Thermodynamics (J 13), 38.77% for Information Science and Library Science (J 68) and 46.71% for Social work (J 12).


Fig. 1. Journal-level distribution of percent differences between citation classes when moving from mean to median-based CSS (P = Poor, F = Fair, R = Remarkable, O = Outstanding).a
a All figures in the article were produced in R. Figure 2 and the figures in Appendices A–E were produced with package ggplot2 (Wickham, 2009).


Fig. 2. Mean versus median CSS class comparison in Microscopy (all 11 journals).

In order to profile the journals which experience the highest percent increase in the class of outstandingly cited papers some additional indicators were investigated: the number of published papers, total citations, average citations per paper, h-indices and g-indices.10 Note that these journal-level indicators are included in Supplementary material 2 in Appendix F. The conclusion of this investigation is that the journals which experience the highest percent increase in the class of outstandingly cited papers also tend to score very well on the other metrics. In other words, they are journals that even typical (aggregated) scientometric indicators such as average citations per paper and the h-index identify as high performers. It seems that the transition from mean to median-based CSS is marked to a certain extent by a type of "Matthew effect" (Merton, 1968) inherent in scientific production: in general it is the journals that already had a considerable share of papers in the outstanding category under mean-based CSS that enjoy an even higher share of papers in this category under median-based CSS. The position of poorer performers also improves, but to a lesser extent.

All things considered, only 4 out of the 198 journals analyzed are completely unaffected by changing to the median-based version of CSS (two in Information Science and Library Science: J 09 and J 16; one in Social work: J 22; one in Thermodynamics: J 05). These journals are easily identifiable in the bar chart graphs because their paper output is partitioned only between the classes of uncited and poorly cited papers under both versions of CSS. A commonality of these four journals is that their papers have jointly accumulated very few citations (at most 35 for the entire citation window). In addition to the very small number of journals that experience no changes when shifting from mean to median-based CSS there are also several journals that are subject to very limited overall improvement. These are journals with a high proportion (more than 50%) of uncited papers, for example J 01 in Microscopy or journals J 15, J 22, J 34, J 59, J 62 in Information Science and Library Science. Similar examples are to be found in Social work (J 05, J 31) and Thermodynamics (J 03).

3.4. A caveat regarding semantics

Before moving to the concluding section of the article it is worth stressing an important feature of the application of CSS that usually goes unrecognized. Whether based on mean or median values, what the CSS method offers – and what perhaps constitutes a subtle and unrecognized source of its appeal – is a common language of evaluation discernible in its threshold-based classes. This common language based on terms with a highly intuitive meaning – "outstanding", "remarkable", "fair", "poor", "uncited" – makes possible something that eludes aggregated, single-number indicators: a multi-dimensional assessment that immediately renders differences in performance and status intelligible. The great strength of this approach is that it has a universal application which can transcend important barriers such as those of scientific disciplines and their idiosyncratic citation practices. For example, knowing that a paper in microscopy is outstandingly cited essentially conveys the same substantive qualitative information as knowing that a paper in social work is outstandingly cited.
On the other hand, knowing that a researcher in microscopy has an h-index of, say, 5 does not offer the same qualitative information as knowing that a researcher in social work has the same h-index of 5; although in technical terms both researchers have 5 papers with at least 5 citations each, in one field this may be quite an accomplishment while in the other it may only represent a modest performance.

10 The last two indicators were calculated in R with functions developed in a previous study (Vîiu, 2016).


The apparent conceptual strength of the CSS method as a common language of evaluation does not come free of caveats. One has to concede that there is an air of arbitrariness in the use of the labels that describe the CSS classes and that coercing numbers into value-laden words (which is what the CSS algorithm ultimately achieves) is anything but an exact science. Consider for example the case of Thermodynamics where, as shown in Section 3.1, 90% of the papers have at most 28 citations. One can make the case that papers with 29 citations are quite "remarkable" (or even "outstanding") relative to the typical citation values encountered in this category – papers belonging to the top 10% in their field are in fact usually acknowledged as representing scientific excellence (Bornmann and Marx, 2014) – yet mean-based CSS would only label such papers as "fair" because they fall short of the β2 threshold of 29.85 (see Table 2). The point here is that there is a certain gap between the day-to-day use of the labels chosen to designate CSS classes and the technical meaning of these labels in the context of research evaluation via this method. Even the term "uncited" does not necessarily have an absolute meaning because the notion of citedness is contextually defined relative to a particular bibliometric database. It is entirely possible for a paper to have no citations in one database (say the Web of Science) and some citations in another (for example Scopus). The seemingly absolute meaning of the words that make up the evaluation language should not lead one to forget the essentially relative foundations upon which these words rest. Even disaggregated research evaluation instruments such as CSS may be subject to the problem of "misplaced concreteness and false precision" (Hicks et al., 2015, p. 431) which typically affects aggregated indicators.

4. Concluding remarks

Disaggregation-oriented research evaluation instruments such as characteristic scores and scales are a powerful and persuasive alternative to aggregated indicators. They eschew the methodological pitfalls and simplifying tendencies of the latter and preserve a multidimensional view of research performance. Scientometrics should continually develop disaggregation-oriented instruments, especially in light of the increased policy use of scientometric data.

This article has offered a comparative analysis of mean-based and median-based characteristic scores and scales for research evaluation. Starting from the premise that there is an internal methodological tension in the construction of mean-based CSS, a theoretical argument was developed in favor of the median-based alternative. As argued in Section 2.2, median-based evaluation instruments are increasingly advocated in the scientometric literature due to the well-documented skewness of citation data. The use of median-based CSS instead of, or in tandem with, mean-based CSS could be viewed as a corollary of this more general trend. In this article the theoretical implications of changing from mean to median-based CSS were explored and then empirically tested on citation data of papers circumscribed to four Web of Science subject categories within a five-year citation window.
At the level of subject categories the shift from mean to median-based CSS was found to produce a substantial reduction in the share of papers belonging to the poorly cited class and a corresponding increase in the other citation classes, most notably in the highest class of outstandingly cited papers, which under median-based CSS accounts for more than half of all citations in each subject category. These results are determined by the mean–median inequality which, in the iterated case of the CSS algorithm, seems to consistently lead to small gaps at the level of first-order thresholds (β1 − m1) and progressively wider gaps in the case of higher-order thresholds. At the level of individual journals within each subject category the same patterns emerge and a familiar Matthew effect is found to permeate the transition from mean-based CSS to median-based CSS: journals that are already doing well under the former do even better under the latter, a fact which is most visible in the increases they accrue in the class of outstandingly cited papers.

The shift from mean to median-based CSS can be justified by appealing to standard statistical practice leveraged in contexts that involve skewed data. From this perspective it might be argued that the current mean-based approach leads to a mild but nonetheless unfortunate misnomer: characteristic scores and scales are not in fact characteristic, at least not insofar as they rely on the arithmetic mean as the preferred measure of central tendency. If the idea of capturing the truly typical, characteristic performance of units of assessment outweighs the desire to reduce excellence to the extreme values which naturally pull the mean upwards, then median-based CSS deserve thoughtful consideration. Whether one is inclined to accept a shift from mean to median-based CSS may largely depend on one's attitude towards the quantification of excellence, which remains an important and legitimate part of research evaluation. While both CSS versions share the fundamental commonality of promoting a disaggregated perspective on research evaluation they differ with regard to emphasis: mean-based CSS promote a more exclusive view of excellence whereas the median-based approach promotes a more inclusive outlook. From a practical perspective, if the aim of a specific evaluation is to single out the most atypical, high-end performers, then mean-based CSS might seem a natural choice. If, however, the purpose of evaluation is to gauge the more typical, characteristic performance of a unit of assessment relative to a reference group, then median-based CSS might be more appropriate. Regardless of which approach is selected, it will lead to a more comprehensive and intrinsically fairer picture than the snapshot offered by an aggregated indicator.

Funding

This research was funded by the Research Institute of the University of Bucharest through a grant awarded to the author in December 2016.


Acknowledgements

The author would like to express his gratitude for the valuable comments and suggestions made by the anonymous reviewers and by the Editor-in-Chief of the journal during the review and revision stages. These helped to improve several aspects of the initial manuscript and led to a fruitful expansion of the empirical analysis.

Appendix A. Mean versus median CSS class comparison in Thermodynamics (journals J 01 to J 28)

Fig. A1.

Appendix B. Mean versus median CSS class comparison in Thermodynamics (journals J 29 to J 56) Fig. B1


Fig. B1.

Appendix C. Mean versus median CSS class comparison in Information Science and Library Science (journals J 01 to J 44) Fig. C1


Fig. C1.


Appendix D. Mean versus median CSS class comparison in Information Science and Library Science (journals J 45 to J 88) Fig. D1

Fig. D1.


Appendix E. Mean versus median CSS class comparison in Social work (all journals) Fig. E1

Fig. E1.


Appendix F. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.joi.2017.04.003.

References

Abramo, G., & D'Angelo, C. A. (2015). Evaluating university research: Same performance indicator, different rankings. Journal of Informetrics, 9(3), 514–525, https://doi.org/10.1016/j.joi.2015.04.002.
Abramo, G., & D'Angelo, C. A. (2016). A farewell to the MNCS and like size-independent indicators. Journal of Informetrics, 10(2), 646–651, https://doi.org/10.1016/j.joi.2016.04.006.
Agresti, A., & Finlay, B. (2009). Statistical Methods for the Social Sciences (4th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
Albarrán, P., Crespo, J. A., Ortuño, I., & Ruiz-Castillo, J. (2011). The skewness of science in 219 sub-fields and a number of aggregates. Scientometrics, 88(2), 385–397, https://doi.org/10.1007/s11192-011-0407-9.
Albarrán, P., & Ruiz-Castillo, J. (2011). References made and citations received by scientific articles. Journal of the American Society for Information Science and Technology, 62(1), 40–49, https://doi.org/10.1002/asi.
Bornmann, L., & Glänzel, W. (2016). Applying the CSS method to bibliometric indicators used in (university) rankings. Scientometrics, 110(2), 1077–1079, https://doi.org/10.1007/s11192-016-2198-5.
Bornmann, L., & Marx, W. (2014). How to evaluate individual researchers working in the natural and life sciences meaningfully? A proposal of methods based on percentiles of citations. Scientometrics, 98(1), 487–509, https://doi.org/10.1007/s11192-013-1161-y.
Bornmann, L., Mutz, R., & Daniel, H.-D. (2008). Are there better indices for evaluation purposes than the h index? A comparison of nine different variants of the h index using data from biomedicine. Journal of the American Society for Information Science and Technology, 59(5), 830–837, https://doi.org/10.1002/asi.
Bornmann, L., Mutz, R., Hug, S. E., & Daniel, H.-D. (2011). A multilevel meta-analysis of studies reporting correlations between the h index and 37 different h index variants. Journal of Informetrics, 5(3), 346–359, https://doi.org/10.1016/j.joi.2011.01.006.
Calver, M. C., & Bradley, J. S. (2009). Should we use the mean citations per paper to summarise a journal's impact or to rank journals in the same field? Scientometrics, 81(3), 611–615, https://doi.org/10.1007/s11192-008-2229-y.
Costas, R., van Leeuwen, T. N., & Bordons, M. (2010). A bibliometric classificatory approach for the study and assessment of research performance at the individual level: The effects of age on productivity and impact. Journal of the American Society for Information Science and Technology, 61(8), 1564–1581, https://doi.org/10.1002/asi.21348.
Egghe, L. (2006). Theory and practise of the g-index. Scientometrics, 69(1), 131–152, https://doi.org/10.1007/s11192-006-0144-7.
Egghe, L. (2010a). Characteristic scores and scales based on h-type indices. Journal of Informetrics, 4(1), 14–22, https://doi.org/10.1016/j.joi.2009.06.001.
Egghe, L. (2010b). Characteristic scores and scales in a Lotkaian framework. Scientometrics, 83(2), 455–462, https://doi.org/10.1007/s11192-009-0009-y.
Glänzel, W. (2007). Characteristic scores and scales. A bibliometric analysis of subject characteristics based on long-term citation observation. Journal of Informetrics, 1(1), 92–102, https://doi.org/10.1016/j.joi.2006.10.001.
Glänzel, W. (2010). The role of the h-index and the characteristic scores and scales in testing the tail properties of scientometric distributions. Scientometrics, 83(3), 697–709, https://doi.org/10.1007/s11192-009-0124-9.
Glänzel, W. (2011). The application of characteristic scores and scales to the evaluation and ranking of scientific journals. Journal of Information Science, 37(1), 40–48, https://doi.org/10.1177/0165551510392316.
Glänzel, W. (2013). High-end performance or outlier? Evaluating the tail of scientometric distributions. Scientometrics, 97(1), 13–23, https://doi.org/10.1007/s11192-013-1022-8.
Glänzel, W., Debackere, K., & Thijs, B. (2016). Citation classes: A novel indicator base to classify scientific output (pp. 1–9). Retrieved from https://www.oecd.org/sti/051 - Blue Sky Biblio Submitted.pdf.
Glänzel, W., & Schubert, A. (1988). Characteristic scores and scales in assessing citation impact. Journal of Information Science, 14(2), 123–127, https://doi.org/10.1177/016555158801400208.
Glänzel, W., Thijs, B., & Debackere, K. (2014). The application of citation-based performance classes to the disciplinary and multidisciplinary assessment in national comparison and institutional research assessment. Scientometrics, 101(2), 939–952, https://doi.org/10.1007/s11192-014-1247-1.
Hicks, D., Wouters, P., Waltman, L., de Rijcke, S., & Rafols, I. (2015). The Leiden Manifesto for research metrics: Use these ten principles to guide research evaluation. Nature, 520(7548), 429–431, https://doi.org/10.1038/520429a.
Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the USA, 102(46), 16569–16572, https://doi.org/10.1073/pnas.0507655102.
Leydesdorff, L., & Opthof, T. (2011). Remaining problems with the "New Crown Indicator" (MNCS) of the CWTS. Journal of Informetrics, 5(1), 224–225, https://doi.org/10.1016/j.joi.2010.10.003.
Leydesdorff, L., Wouters, P., & Bornmann, L. (2016). Professional and citizen bibliometrics: Complementarities and ambivalences in the development and use. Scientometrics, 109(3), 2129–2150, https://doi.org/10.1007/s11192-016-2150-8.
Li, Y., Radicchi, F., Castellano, C., & Ruiz-Castillo, J. (2013). Quantitative evaluation of alternative field normalization procedures. Journal of Informetrics, 7(3), 746–755, https://doi.org/10.1016/j.joi.2013.06.001.
Merton, R. K. (1968). The Matthew effect in science. Science, 159(3810), 56–63, https://doi.org/10.1126/science.159.3810.56.
Moed, H. F. (2005). Citation analysis in research evaluation. Dordrecht: Springer.
Moed, H. F., & Halevi, G. (2015). Multidimensional assessment of scholarly research impact. Journal of the Association for Information Science and Technology, 66(10), 1988–2002, https://doi.org/10.1002/asi.23314.
Penfield, T., Baker, M. J., Scoble, R., & Wykes, M. C. (2014). Assessment, evaluations, and definitions of research impact: A review. Research Evaluation, 23(1), 21–32, https://doi.org/10.1093/reseval/rvt021.
Perianes-Rodriguez, A., & Ruiz-Castillo, J. (2014). Within- and between-department variability in individual productivity: The case of economics. Scientometrics, 102(2), 1497–1520, https://doi.org/10.1007/s11192-014-1449-6.
Perianes-Rodriguez, A., & Ruiz-Castillo, J. (2015). University citation distributions. Journal of the Association for Information Science and Technology, https://doi.org/10.1002/asi.23619.
R Core Team. (2016). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. Retrieved from https://www.r-project.org/.
Ruiz-Castillo, J., & Costas, R. (2014). The skewness of scientific productivity. Journal of Informetrics, 8(4), 917–934, https://doi.org/10.1016/j.joi.2014.09.006.
Ruiz-Castillo, J., & Waltman, L. (2015). Field-normalized citation impact indicators using algorithmically constructed classification systems of science. Journal of Informetrics, 9(1), 102–117, https://doi.org/10.1016/j.joi.2014.11.010.
Schreiber, M. (2010). Twenty Hirsch index variants and other indicators giving more or less preference to highly cited papers. Annalen der Physik (Berlin), 522(8), 536–554, https://doi.org/10.1002/andp.201000046.
Schubert, A., Glänzel, W., & Braun, T. (1987). Subject field characteristic citation scores and scales for assessing research performance. Scientometrics, 12(5), 267–291, https://doi.org/10.1007/BF02016664.


Seglen, P. (1992). The skewness of science. Journal of the American Society for Information Science, 43(9), 628–638, https://doi.org/10.1002/(SICI)1097-4571(199210)43:9<628::AID-ASI5>3.0.CO;2-0.
Sen, P. K. (1989). The mean-median-mode inequality and noncentral chi-square distributions. Sankhya – The Indian Journal of Statistics, Series A, 51, 106–114.
Sheskin, D. J. (2004). Handbook of parametric and nonparametric statistical procedures (3rd ed.). Boca Raton: Chapman & Hall/CRC.
Smolinsky, L. (2016). Expected number of citations and the crown indicator. Journal of Informetrics, 10(1), 43–47, https://doi.org/10.1016/j.joi.2015.10.007.
Van Leeuwen, T. N., Visser, M. S., Moed, H. F., Nederhof, T. J., & Van Raan, A. F. J. (2003). The holy grail of science policy: Exploring and combining bibliometric tools in search of scientific excellence. Scientometrics, 57(2), 257–280, https://doi.org/10.1023/A:1024141819302.
van Raan, A. F. J. (2005). Measurement of central aspects of scientific research: Performance, interdisciplinarity, structure. Measurement, 3(1), 1–19, https://doi.org/10.1207/s15366359mea0301_1.
van Raan, A. F. J. (2006). Comparison of the Hirsch-index with standard bibliometric indicators and with peer judgment for 147 chemistry research groups. Scientometrics, 67(3), 491–502, https://doi.org/10.1556/Scient.67.2006.3.10.
Vîiu, G.-A. (2016). A theoretical evaluation of Hirsch-type bibliometric indicators confronted with extreme self-citation. Journal of Informetrics, 10(2), 552–566, https://doi.org/10.1016/j.joi.2016.04.010.
Vinkler, P. (2007). Eminence of scientists in the light of the h-index and other scientometric indicators. Journal of Information Science, 33(4), 481–491, https://doi.org/10.1177/0165551506072165.
von Hippel, P. T. (2005). Mean, median, and skew: Correcting a textbook rule. Journal of Statistics Education, 13(2). Retrieved from http://www.amstat.org/publications/jse/v13n2/vonhippel.html.
Waltman, L., van Eck, N. J., van Leeuwen, T. N., Visser, M. S., & van Raan, A. F. J. (2011). Towards a new crown indicator: An empirical analysis. Journal of Informetrics, 5(1), 37–47, https://doi.org/10.1007/s11192-011-0354-5.
Waltman, L., van Eck, N. J., Visser, M., & Wouters, P. (2016). The elephant in the room: The problem of quantifying productivity in evaluative scientometrics. Journal of Informetrics, 10(2), 671–674, https://doi.org/10.1016/j.joi.2015.12.008.
Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag, https://doi.org/10.1007/978-0-387-98141-3.
Wildgaard, L., Schneider, J. W., & Larsen, B. (2014). A review of the characteristics of 108 author-level bibliometric indicators. Scientometrics, 101(1), 125–158, https://doi.org/10.1007/s11192-014-1423-3.