LOTKA'S LAW: A TESTING PROCEDURE†
MIRANDA LEE PAO
Matthew A. Baxter School of Information and Library Science, Case Western Reserve University, Cleveland, OH 44106
Abstract: Instead of the commonly accepted inverse square law, Lotka's original formulation was based on a more general inverse power law: x^n * y = c. The exponent and the constant must be estimated from the given set of author productivity data. A step-by-step outline is presented for testing the applicability of Lotka's law. Steps include the computation of the values of the exponent and the constant based on Lotka's method, and the test for significance of the observed frequency distribution against the estimated theoretical distribution derived from Lotka's formula.

† Supported by NIH Grant 04177 from the National Library of Medicine.
Lotka's law states that the number of authors, y_x, each credited with x papers, is inversely proportional to x^n, where x is the output of each individual author[1]. The relation is expressed as:

    x^n * y_x = c    (1)

where y_x is the number of authors making x contributions to the subject, and n and c are the two constants to be estimated for the specific set of data. Lotka noted that eqn (1) applied to a variety of phenomena. Based on the two sets of data taken from two different subject areas, physics and chemistry, he formulated this general rule for scientific productivity. It would be of interest to demonstrate the general applicability of this relation to authors and their productivity in any subject or in any identified group. To do so, each investigation should follow a comparable procedure of data collection. Each researcher should agree upon an operational definition of what constitutes an observed data pair. After the data are collected and arranged into a frequency distribution, one proceeds to find the proper theoretical formula to test the observed distribution. Thus, the parameters of the theoretical distribution should be optimally estimated from the data set. Finally, an appropriate statistical test of goodness-of-fit, at a specified level of significance, should be chosen to test the conformity of the observed distribution against the theoretical distribution function. Vlachy and Potter have shown that many studies were done on a wide variety of subjects and that there were wide variations in the procedures used[2-4]. Disturbingly, there are serious flaws in some of the study designs, so that results from these tests for conformity conducted in the last decade cannot be meaningfully compared[4, 5]. Their conclusions are unreliable at best. In order to make meaningful statistical comparisons with Lotka's work, his method should be adhered to as closely as possible. In this study, we shall review some of the major discrepancies found in the literature, replicate Lotka's original calculation, attempt to extract the necessary steps to organize the data, and, finally, present a step-by-step outline of a testing procedure.

REVIEW OF PREVIOUS WORKS
In his continuing interest in the skewed distribution of author productivity, Vlachy has collected a truly remarkable collection of data relating to the applications of Lotka's law to various subjects under varying conditions[2, 3, 6]. He cited studies with data taken from a mere half-year to extensive 110-year data from the subject of paper chromatography. An amazingly large variety of subjects have been shown to exhibit this general skewed distribution. Data were also drawn from world communities, as opposed to those drawn from more localized groups, such as those representing national and institutional groups. Still others were selected from a single journal publication, versus those selected from more comprehensive abstracting and indexing services. In a footnote, Lotka specifically indicated that "joint contributions have in all cases been credited to the senior author only"[1]. Yet many studies utilized all co-authors in the observed distribution. Recasting of such data is not always possible without details of the raw figures. Coile showed that data counts with senior authors and those with all authors produced significantly different conclusions[5]. Examination of Vlachy's survey and studies published in the United States has convinced us that no uniform method exists for collecting and organizing data for the Lotka test[7]. Wide variations in the observed frequency distributions predictably produce questionable parameter estimates for the Lotka formula. For example, a distribution which included all co-authors would produce a value for the exponent in eqn (1) which would be significantly different from the one without all the co-authors. Furthermore, many investigators simply assigned 2 as the value of n without estimating the value from the observed data. Lotka's original article derived two different values for n in his two experiments. That is, n was estimated from each empirical distribution. "The fact that the exponent has, in the examples shown, approximately the value 2"[1] enabled him to draw his often quoted conclusion:

    In the cases examined it is found that the number of persons making 2 contributions is about one-fourth of those making one; the number making 3 contributions is about one-ninth, etc.; the number making n contributions is about 1/n^2 of those making one; and the proportion, of all contributors, that make a single contribution, is about 60 percent.
What has not been emphasized is the fact that his conclusion is emphatically a qualified statement based on the two cases examined, and his results were arrived at by a more generalized method. He reported the differences and he listed the different theoretical distributions for each case. As early as 1974, Vlachy specifically stated that although many investigators chose to attribute to Lotka the inverse square relation between authors and their productivity, Vlachy found serious discrepancies between empirical data and the inverse square law[6]. In other words, in the inverse square relation the value of n is 2 in eqn (1), whereas his survey of the literature found that the value varied from 1.2 to over 3.5. Another variable that can change the value of n is the number of pairs of observations to be included in the calculation of n, which is commonly computed by the least-square method. In most cases, data points representing persons of high productivity in the sample do not fall within the linear regression line. Both Lotka and Price brought out this fact[1, 8]. The slope is usually computed without the data points from the high scorers. Lotka simply cut off his two sets of data at the first 17 and 30 points for the physics and chemistry distributions, respectively. His decisions were based on visual inspection of the two graphs. However, the value of n would change according to the number of points utilized in the calculation. It seems that one ought to rely on the "best" cutoff with less arbitrariness. The estimate of the parameter c is even more problematic. The simplest solution is to accept Lotka's qualified conclusion that "the proportion of all contributors that make a single contribution, is about 60 percent", even though it was an approximation from the inverse square law based on his two samples. The value of c can then be computed by the simple formula 6/π^2. Most investigators chose this inverse square rule for testing because of its elegance and most probably because of its simplicity in computation. Yet there is no doubt that in Lotka's paper his original formulation indicated an inverse power relation between the number of authors and their contributions. Lotka proceeded to derive a procedure to compute the constants from his experimental data,
and he obtained c = 0.6079 and c = 0.5669 for his physics and chemistry data, respectively. Although Potter and Coile[4] both discussed this very point, Potter's test of Lotka's chemistry data utilized the theoretical inverse square law instead of Lotka's theoretical distribution based on n = 1.888 and c = 0.5669. Coile's retesting of Murphy's data was also based on the inverse square relation instead of a recalculation of the constant c and the exponent n based on the given data[5]. Vlachy did not extend his interest in the variations that existed in the value of n to the computation of c nor to the statistical test of significance. Many investigators stopped short of performing an appropriate goodness-of-fit test on the observed data set. Lotka did not test his observed distributions against his theoretical constructs, even though both the observed and expected frequency distributions were included in his article. Coile criticized the use of the chi-square goodness-of-fit test[5]. The power of this test is often compromised by the need to combine data in several categories. The Kolmogorov-Smirnov one-sample test has been suggested as a more powerful statistical test. Applying this latter test to Lotka's original data sets, and using c = 0.6079 and n = 2, instead of c = 0.5669 and n = 1.888, Potter found that the fit of the chemistry distribution to the inverse square theoretical distribution has been shown to be statistically insignificant at the 0.01 level. Therefore, it is imperative that standardized testing procedures be performed on other data sets, since the generally accepted law has been based only on the conformity of one single experiment.

ESTIMATION OF THE EXPONENT n
Estimating the exponent n from the nonlinear eqn (1) necessitates the transformation of this equation into a linear relation between the logarithm of x and the logarithm of y, i.e.

    x^n * y = c
    log(x^n * y) = log c
    n log x + log y = log c.

Thus, Lotka plotted the logarithmic values of the numbers or percentages of authors who have made 1, 2, 3, ... contributions to the chosen subject against the logarithms of these numbers 1, 2, 3, ... of contributions. If x and y fit the general inverse power law, eqn (1), the resulting graph should be a straight line with a negative slope n. The value of n can then be calculated by the least-square method. This was exactly what Lotka did in his two experiments. However, in practice, the linear relation of log x and log y does not hold for those points representing the few highly prolific authors who are responsible for many articles. Therefore, by observation, he selected the first 17 points of the Auerbach physics data and the first 30 points of the chemistry data. The determination to cut off at these two points was not disclosed, except for a footnote stating that "beyond this point fluctuations become excessive owing to the limited number of persons in the sample"[1]. Thus, 1.2% and 1.03% of the most prolific physicists and chemists were excluded from his calculation of the values of n. He obtained n = 2.021 and n = 1.888, respectively, for the two sets of data. Lotka suggested that the highly productive authors be considered separately. This suggestion was also supported by Price[8]. Since a formula which could hold for both the high and low producers has yet to be found, one must exclude the high scorers in testing Lotka's law. Except by observation, how can one best determine the cutoff? Vlachy suggested cutting off the most prolific √(Σy_x) of the population, where Σy_x is the total sample taken. This procedure is based on Price's approximation of the elite group of any scientific body[9]. For a sample of 500, it means the removal of 4.47%; for a small sample of 100, it means the removal of 10%. Therefore, for a large sample, the exclusion ratio can be tolerated.
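As an illustration (not part of the original paper), the cutoff rule can be written out in a few lines of Python; the routine name vlachy_cutoff and the sample call are ours, and a real study would feed in the full frequency table:

    import math

    def vlachy_cutoff(pairs):
        """Drop the most prolific sqrt(total authors) from the (x, y_x) pairs.

        pairs: list of (x, y_x), where x is papers per author and y_x is the
        number of authors credited with exactly x papers.
        Returns the retained pairs and their (log10 x, log10 y) transforms.
        """
        total_authors = sum(y for _, y in pairs)
        elite = math.sqrt(total_authors)        # size of the Price/Vlachy elite
        kept, removed = [], 0.0
        # Walk down from the most productive end, stripping whole rows until
        # roughly sqrt(total) authors have been set aside for separate study.
        for x, y in sorted(pairs, reverse=True):
            if removed < elite:
                removed += y
            else:
                kept.append((x, y))
        kept.sort()
        logs = [(math.log10(x), math.log10(y)) for x, y in kept]
        return kept, logs

    # Illustration on a few points of Lotka's Auerbach distribution plus one
    # high scorer; any real study would use the full frequency table.
    sample = [(1, 784), (2, 204), (3, 127), (4, 50), (5, 33), (48, 2)]
    kept, logs = vlachy_cutoff(sample)
    print(len(kept), "points retained for the regression")

The rows set aside in this way correspond to the highly productive authors whom Lotka and Price suggest treating separately.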
Although there are other methods available to compute the value of the slope of a regression line, the most commonly used method is the linear least-square method. One of the several equivalent forms is:

    n = [N ΣXY − ΣX ΣY] / [N ΣX^2 − (ΣX)^2]    (2)

where N is the number of pairs of data considered, X is the logarithm of x, and Y is the logarithm of y. This method of calculation is straightforward. Our concern is to estimate the best value from the observed distribution. Therefore, as Lotka suggested, the logarithmic values of x and y should be plotted on log-log graph paper to identify the approximate region of the cutoff to form a straight line. One may also make several computations of n, using different values of N, and using the graph as a guide. The median or the mean may be identified as the best slope for the observed distribution. Table 1 shows the different values for the slope for Lotka's physics and chemistry data. Clearly, the data tend to fluctuate after the 17th point in the Auerbach data. For the chemistry data, n clusters around 1.88 at the 21st and the 30th points. Although the method of least square produces different values of the slope with different numbers of points in the same set of data, we could maximize our chances of obtaining the best n by combining a visual inspection to identify the approximate region of cutoff with several computations of n, using different numbers of pairs of data. From the previous discussion, n = 2 is obviously but a special case in the inverse power law. In comparing different research data, Vlachy showed that n = 2 tends to hold true for subjects in the physical sciences. The value of n increases for the technical, life and social sciences, and the humanities. The value of n is also dependent on the concentration of the group of authors studied; that is, the more specialized the group, the smaller the value of n.
Table 1. Sample values of the exponent n for Lotka's data

    Auerbach data           Chemical Abstract data
    N of points   Slope n   N of points   Slope n
    13            2.041     15            1.8611
    14            2.0533    16            1.8548
    15            2.0182    17            1.8644
    16            2.0255    18            1.8594
    17            2.0210    19            1.8566
    18            2.0953    20            1.8615
    19            2.1385    21            1.8878
    20            2.0786    22            1.8924
    21            2.0726    23            1.9106
    22            1.9887    24            1.9213
    23            1.9989    25            1.9023
    24            1.9946    26            1.9149
    25            1.9754    27            1.9132
    26            1.9490    28            1.8971
    27            1.8485    29            1.8908
                            30            1.8878
                            31            1.8911
                            32            1.9425
                            33            1.9353
                            34            1.9420
                            35            1.9960
There seem to be other variables noted by Vlachy; they are not yet well understood. Nevertheless, it is clear that the value of n is a function of the set of observed data, and its value should be optimally estimated from the data rather than assumed to be 2 in all cases.
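To make eqn (2) and the effect of the cutoff concrete, the following sketch (our own Python rendering, not code from the paper) recomputes the slope for several values of N on the Auerbach distribution listed in Appendix A, part c:

    import math

    def lotka_slope(pairs, n_points):
        """Least-squares slope of log10(y) on log10(x) for the first n_points
        rows of the frequency distribution, as in eqn (2)."""
        rows = pairs[:n_points]
        N = len(rows)
        X = [math.log10(x) for x, _ in rows]
        Y = [math.log10(y) for _, y in rows]
        sx, sy = sum(X), sum(Y)
        sxy = sum(a * b for a, b in zip(X, Y))
        sxx = sum(a * a for a in X)
        return (N * sxy - sx * sy) / (N * sxx - sx * sx)

    # Lotka's Auerbach (physics) distribution, first 17 points (Appendix A, part c).
    auerbach = [(1, 784), (2, 204), (3, 127), (4, 50), (5, 33), (6, 28), (7, 19),
                (8, 19), (9, 6), (10, 7), (11, 6), (12, 7), (13, 4), (14, 4),
                (15, 5), (16, 3), (17, 3)]
    for n_points in range(14, 18):
        print(n_points, round(-lotka_slope(auerbach, n_points), 4))
    # At 17 points the slope is about -2.021, i.e. n = 2.021 as Lotka reported.

Scanning such output alongside the log-log plot is one way to pick the "best" N with less arbitrariness.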
CALCULATION OF THE CONSTANT C

Lotka wrote, "for the special case that n = 2, the value of the constant in eqn (1) is found" to be 0.6079 or 6/π^2[1]. Some investigators merely assumed the inverse square relation as the basis for computation. In Lotka's calculation of the two sets of data from Auerbach and Chemical Abstract, he gave the values of the constant in each case as 0.6079 for n = 2, and 0.5669 for n = 1.888. His article presents the procedure to compute C for n = 2. We will generalize his derivation as follows. Lotka's general law states that

    y_x = c(1/x^n).    (1)

Dividing both sides of eqn (1) by Σy_x, the total number of authors,

    y_x / Σy_x = (c/Σy_x)(1/x^n).    (3)

Let f(y_x) = y_x/Σy_x, the fraction of authors making x contributions, and C = c/Σy_x, the new constant expressed as a fraction of the total sample of authors. Thus eqn (3) may be written as:

    f(y_x) = C(1/x^n).    (4)
Equation (4) is merely another form of Lotka's general law, eqn (1): the percentage of authors, f(y_x), each with x number of publications, is inversely proportional to x raised to the nth power. Extrapolating from Lotka's calculation of the special case for n = 2, the general formulation of eqn (1) for any value of n is as follows:

    y_1 = c(1/1^n)
    y_2 = c(1/2^n)
    y_3 = c(1/3^n)
    ...
    y_x = c(1/x^n).

Summing both sides of these equations,

    Σy_x = c(1/1^n + 1/2^n + 1/3^n + ...) = c Σ(1/x^n).    (5)

Dividing both sides by the total number of authors,

    Σy_x / Σy_x = (c/Σy_x)(Σ 1/x^n).

Since the summation Σy_x / Σy_x gives unity, and c/Σy_x = C,

    1 = C Σ(1/x^n)
    C = 1/(Σ 1/x^n).    (6)
For eqn (6), x assumes all possible integer values. Fortunately, for n = 2 the series Σ 1/x^2 converges to π^2/6. Therefore, eqn (6) becomes:

    C = 1/(Σ 1/x^2) = 6/π^2.

Thus for the special case of n = 2, C is the inverse of the summation of the infinite series Σ 1/x^2, the limit of which equals π^2/6. However, for other non-negative fractional values of n, the summation of the series in its general form, Σ 1/x^n, may only be approximated by a function which calculates the sum of the first P terms. In Lotka's footnote to the infinite series, he referred to two texts in which the summation may be computed. He further noted that "for method of summation when exponent is fractional, see Whittaker and Robinson, Calculus of Observations: 136, 1924". Unfortunately, this reference does not offer a method of calculation, as was promised, although this series is found to converge for n greater than 1. These texts assert that for n = 2, Σ 1/x^2 = π^2/6, and for n = 4, Σ 1/x^4 = π^4/90. However, for other non-negative fractional values of n, there is no easy formula for computing the sum of this infinite series. Both Potter and Coile referred to the footnote and to the special case of the inverse square without offering a method of calculating the constant for values other than n = 2 [4, 5]. With the help of Professor David Singer, we followed Lotka's derivation of eqn (6) closely and derived a function approximating the summation Σ 1/x^n for non-negative fractional values of n (see Appendix B). It is found that the residual error is negligible if P is set to 20. The estimation is:

    Σ[x=1 to ∞] 1/x^n ≈ Σ[x=1 to P−1] 1/x^n + 1/[(n−1)P^(n−1)] + 1/(2P^n) + n/[24(P−1)^(n+1)].    (7)
We checked the calculated summation for n = 2 and found that eqn (7) yielded 1.644925393 with P = 20, whereas the limit of Σ 1/x^2 is π^2/6, or 1.644934067. The error is less than 1/110,000. For n = 4, the calculated summation is 1.082323197, whereas the limit of Σ 1/x^4 is π^4/90, or 1.082323234. The error is less than 1/25,000,000. This method appears to produce a close estimation to Lotka's computation. We compared Lotka's value of C for his chemistry data with n = 1.888 and P = 20, using our calculated summation. The following was found:

    Σ 1/x^1.888 ≈ Σ[x=1 to 19] 1/x^1.888 + 1/[0.888(20^0.888)] + 1/[2(20^1.888)] + 1.888/[24(19^2.888)]
                = 1.68347 + 0.078754 + 0.00174834 + 0.00001595
                = 1.763989.

Therefore, C = 1/1.763989 = 0.566897. This is in complete agreement with Lotka's result of 0.5669.
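Equation (7) is easy to evaluate directly. The following is a minimal Python sketch of the computation of C (the function names are ours, not from the paper):

    def zeta_estimate(n, P=20):
        """Approximate the sum of 1/x**n for x = 1, 2, 3, ... using eqn (7):
        the first P-1 terms plus three correction terms."""
        head = sum(1.0 / x**n for x in range(1, P))
        tail = (1.0 / ((n - 1) * P**(n - 1))
                + 1.0 / (2 * P**n)
                + n / (24.0 * (P - 1)**(n + 1)))
        return head + tail

    def lotka_constant(n, P=20):
        """C = 1 / (sum of 1/x**n), the expected fraction of one-paper authors."""
        return 1.0 / zeta_estimate(n, P)

    print(round(lotka_constant(2.0), 4))    # about 0.6079, the inverse square case
    print(round(lotka_constant(1.888), 4))  # about 0.5669, Lotka's chemistry value

With P = 20 the two calls reproduce the constants used throughout this paper.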
STATISTICAL TEST FOR CONFORMITY
To assert that the observed author productivity distribution is not significantly different from a presumed theoretical distribution, a goodness-of-fit test is the appropriate statistical test. Since only one sample is taken, and the sample is known to be taken from a non-normally distributed population, a non-parametric test should be used. The hypothesis under test concerns a comparison between observed and expected frequencies. There are two one-sample non-parametric tests available: the chi-square one-sample test and the Kolmogorov-Smirnov one-sample test. The chi-square test requires that data be categorized in discrete classes, and that the expected frequencies are sufficiently large. If more than 20% of the expected frequencies are smaller than 5, values in adjacent categories must be combined to form larger frequencies. Since in author productivity distributions there are invariably only a few authors in the high-frequency groups, combination of categories is a necessity. Since information is partially lost in combining categories, the power of the test is reduced. On the other hand, in the Kolmogorov-Smirnov test, the cumulative frequency distribution occurring under the known theoretical distribution is compared with the cumulative observed frequency distribution. The expected distribution would be the expected values under the null hypothesis. It is expected that the difference between each pair of cumulative frequencies is small. This test allows the determination of the associated probability that the observed maximum divergence occurs within the limits of chance. Since the test treats individual observations separately, it need not lose information through the grouping of values in several categories. It has been suggested that the Kolmogorov-Smirnov test be used only when one can be assured that the variable has a continuous expected distribution[10, 11]. Several statistical texts assert that if the test is used when the resulting distribution is discontinuous, the error occurring in the resulting probability statement is in the conservative direction[12]. Thus, in the event that the null hypothesis is rejected, we can have real confidence that the observed distribution is significantly different from the theoretical distribution. Consequently, in all cases where it is applicable, the Kolmogorov-Smirnov one-sample goodness-of-fit test is the most powerful test available.
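For reference, the comparison described above can be sketched as follows (our own illustration; the 1.63/√N critical value is the 0.01-level approximation used later in this paper, valid for samples larger than 35):

    def ks_test(pairs, n, C):
        """One-sample Kolmogorov-Smirnov comparison of an observed author
        productivity distribution with Lotka's theoretical distribution C/x**n.

        pairs: list of (x, y_x) sorted by increasing x.
        Returns D_max and the 0.01-level critical value (valid for N > 35)."""
        total = sum(y for _, y in pairs)
        cum_obs = cum_exp = d_max = 0.0
        for x, y in pairs:
            cum_obs += y / total          # cumulative observed fraction
            cum_exp += C / x**n           # cumulative expected fraction
            d_max = max(d_max, abs(cum_obs - cum_exp))
        critical = 1.63 / total**0.5
        return d_max, critical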
DATA PREPARATION
The inclusion of co-authors has been a common practice among many investigators[13-17]. For reasons unknown, Lotka excluded all co-authors and credited each contribution solely to its senior author. In many scientific fields, a large percentage of scientists derive their total output in co-authorship, which is a means of gaining visibility and professionalism[18, 19]. That is, they would only co-author one or two papers in their subject. Therefore, if co-authors are included in the observed distribution, there will be proportionally more authors with one or two papers. The resulting regression line will produce a larger slope than if co-authors are excluded. A larger value for n will form a different theoretical distribution for testing. It is important to emphasize that different values of n produce substantially different values of the constant C. Consequently, the ensuing theoretical distribution with which the empirical distribution is to be fitted would differ. Table 2 presents sample values of the exponent n with their corresponding values of the constant C. One notes that appreciable variations can occur with a small change in the first decimal place in the exponent, particularly when n is smaller than 2. To illustrate, Table 3 compares two sets of values for n and C for the data recast from Murphy's article. Co-authors have been omitted. We plotted the logarithmic values of x, the number of contributions, and y, the number of authors with x papers. Visually, all the points could be included in the computation of n. Table 4 shows the least-square method in the calculation of n, which was found to be 2.6164. Substituting n = 2.6164 and P = 20 into eqn (7),

    Σ 1/x^2.6164 = 1.30005
    C = 1/1.30005 = 0.7692.
Table 2. Sample values of the exponent n and constant C

    Exponent n    Constant* C    Theoretical % of authors with one contribution
    1.75          0.509605       50.96
    1.80          0.531288       53.13
    1.85          0.551911       55.19
    1.90          0.571515       57.15
    1.95          0.590173       59.02
    2.00          0.607930       60.79
    2.05          0.624837       62.48
    2.10          0.640939       64.09
    2.15          0.656289       65.63
    2.20          0.670898       67.09
    2.40          0.722888       72.29
    2.60          0.766004       76.60
    2.80          0.801905       80.19
    3.00          0.831908       83.19
    3.20          0.857065       85.71
    4.00          0.923938       92.39

    * The constant C is calculated with the following:
      C = 1 / [ Σ[x=1 to 19] 1/x^n + 1/[(n−1)(20)^(n−1)] + 1/[2(20)^n] + n/[24(19)^(n+1)] ]
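Under the assumption that eqn (7) with P = 20 is used, the entries of Table 2 can be regenerated with a few lines of Python (a sketch of ours, not part of the original paper):

    def lotka_constant(n, P=20):
        head = sum(1.0 / x**n for x in range(1, P))
        tail = (1.0 / ((n - 1) * P**(n - 1)) + 1.0 / (2 * P**n)
                + n / (24.0 * (P - 1)**(n + 1)))
        return 1.0 / (head + tail)

    for n in (1.75, 2.00, 2.60, 4.00):
        C = lotka_constant(n)
        print(f"n = {n:.2f}   C = {C:.6f}   ({100 * C:.2f}% single-paper authors)")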
Table 3 shows the Kolmogorov-Smirnov test with n = 2.6164 and C = 0.7692 in the theoretical distribution. We found that the maximum difference, 0.0354, falls within the critical K-S value of 0.1273. Thus this set of data complies with Lotka's formulation:

    x^2.6164 * y_x = 0.7692.

In contrast with Coile, with the assumption that n must be 2 and thus C = 0.6079, we found that the maximum deviation 0.1543 exceeded the critical K-S value. We must reject the hypothesis that this set of data conforms to the formulation of:
    x^2 * y_x = 0.6079.

Namely, the inverse square law does not apply, but the inverse power law does. It follows that the proper determination of the values of the exponent and the constant is critical in testing Lotka's law. Another factor to be considered is the sample itself. The extent of the samples taken by Lotka has been unmatched by subsequent studies. His chemistry data came from a decennial index of Chemical Abstract, and his selected quality physics data included the entire subject up to 1900.

Table 3. Kolmogorov-Smirnov test of observed and expected distributions of senior authors in the history of technology

                 Observed                        Lotka's theoretical*           Inverse square†
    x    y_x     y_x/Σy_x   Σ(y_x/Σy_x)   f_e      Σf_e      D         f_e      Σf_e      D
    1    125     0.7622     0.7622        0.7692   0.7692    0.0070    0.6079   0.6079    0.1543
    2    21      0.1280     0.8902        0.1254   0.8946    0.0044    0.1520   0.7599    0.1303
    3    9       0.0549     0.9451        0.0434   0.9380    0.0071    0.0675   0.8274    0.1177
    4    8       0.0488     0.9939        0.0205   0.9585    0.0354    0.0380   0.8654    0.1285
    5    1       0.0061     1.0000        0.0114   0.9699    0.0301    0.0243   0.8899    0.1103

    * f_e = 0.7692 (1/x^2.6164); † f_e = 0.6079 (1/x^2). The maximum deviations are 0.0354 and 0.1543, respectively.
    Critical value at the 0.01 level of significance = 1.63/√164 = 0.1273.
Table 4. Calculation of n for Murphy's data

    1      2       3            4            5          6
    x      y       X = log x    Y = log y    XY         X^2
    1      125     0.000000     2.09691      0.000000   0.000000
    2      21      0.301030     1.32222      0.398018   0.090619
    3      9       0.477121     0.95424      0.455289   0.227635
    4      8       0.602060     0.90309      0.543714   0.362476
    5      1       0.698970     0.00000      0.000000   0.488559
    Σ              2.0792       5.2765       1.3970     1.1693

    n = [5(1.3970) − (2.0792)(5.2765)] / [5(1.1693) − (2.0792)^2] = −2.6164
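The computations behind Tables 3 and 4 can be reproduced with a short, self-contained script (ours; it assumes Murphy's senior-author counts exactly as listed in Table 4):

    import math

    # Murphy's history-of-technology data (senior authors only), from Table 4.
    data = [(1, 125), (2, 21), (3, 9), (4, 8), (5, 1)]
    total = sum(y for _, y in data)

    # Least-squares slope of log y on log x over all five points (eqn 2).
    X = [math.log10(x) for x, _ in data]
    Y = [math.log10(y) for _, y in data]
    N = len(data)
    n = -(N * sum(a * b for a, b in zip(X, Y)) - sum(X) * sum(Y)) / \
         (N * sum(a * a for a in X) - sum(X) ** 2)

    # Constant C from eqn (7) with P = 20.
    P = 20
    s = sum(1.0 / x**n for x in range(1, P)) + 1.0 / ((n - 1) * P**(n - 1)) \
        + 1.0 / (2 * P**n) + n / (24.0 * (P - 1)**(n + 1))
    C = 1.0 / s

    # Kolmogorov-Smirnov comparison against C / x**n.
    cum_obs = cum_exp = d_max = 0.0
    for x, y in data:
        cum_obs += y / total
        cum_exp += C / x**n
        d_max = max(d_max, abs(cum_obs - cum_exp))

    print(round(n, 4), round(C, 4), round(d_max, 4), round(1.63 / total**0.5, 4))
    # Expected output is close to the paper's figures: n = 2.6164, C = 0.7692,
    # D_max = 0.0354, critical value = 0.1273 at the 0.01 level.

Small differences in the last decimal place may arise from rounding of the intermediate sums.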
We recognize that the Kolmogorov-Smirnov test is distribution-free, and its critical value is dependent on the size of the observations made. For 6891 chemists, the critical value at a 0.01 level of significance is calculated at a low of 0.0196. On the other hand, a small sample of 164 authors in Murphy's data produces a large critical value of 0.1273 at the same level of significance. Thus, this test demands a very small deviation between the theoretical and experimental cumulative frequency functions from a large sample, and it tends to force a rejection of the null hypothesis. With a small sample, the same test can tolerate a much larger maximum difference. The implication in data collection is clear. The larger the sample size, the stronger is the test. Finally, Lotka used an unorthodox sampling technique in selecting his chemistry data. The assumption is that names under the letters A and B form a truly representative sample of the entire population. From our K-S test of the experimental distribution (see Appendix A), it is possible that chemists and their productivity may not follow the Lotka rule; however, his peculiar sampling technique is a strong suspect. We tested names starting with A in a separate K-S test, and we repeated the test for chemists whose names start with the letter B. Neither sample conforms to Lotka's law. Furthermore, there are significant differences in the values of the two parameters, n and C. We conclude that as far as authors in the chemical literature are concerned, further testing is needed with better sampling design.

CONCLUSION
The generalized inverse power relation existing between the frequency of authors and their contributions implied in Lotka's original paper has been known for quite some time[20]. That is, the frequency y of authors making x contributions will be inversely proportional to some exponential function of x, whereas the inverse square relation in which n equals 2 is but a special case of the general inverse power relationship. Although Rao has suggested that other theoretical distributions be investigated with regard to author productivity, we have shown that even Lotka's original inverse power law has not been properly tested[10]. Coile emphasizes that for statistical comparisons to be made to Lotka's work, Lotka's methodology should be followed as closely as possible[5]. To test the compliance of a group of authors to Lotka's inverse power law, we suggest the following steps:

1. Data collection: Collect adequate data on the number of contributions made by authors in a defined subject, crediting only the senior authors in each contribution. Co-authors are ignored.

2. Frequency distribution: Arrange data in a table with the first two columns containing values of x and y: y is the frequency of authors making x contributions, arranged in increasing order of productivity. Therefore, the first row would contain a large value of y_1, the number of authors with one single contribution to the subject.
3. Calculation of n: Expand the table into six columns. Columns 1 and 2 contain x and y. Columns 3 and 4 contain X and Y, where X = log x and Y = log y, respectively. Values for XY and X^2 are inserted into columns 5 and 6. Plot Y against X (y-axis) on log-log graph paper. Visually inspect to determine the approximate end of the straight line. Thus the region of cutoff of the high producers' points can be noted. Use the least-square method to compute the "best" value for the slope n, which is the exponent for Lotka's law [eqn (1)]. N is defined as the number of rows used in the data set. For the value of n, substitute the appropriate values into eqn (2), which is:

    n = [N ΣXY − ΣX ΣY] / [N ΣX^2 − (ΣX)^2]

For use in a statistical software package such as Minitab, perform the regression of Y of the data, using one predictor in X.

4. Calculation of C: Substitute the value of n obtained in Step 3, and P = 20, into the following:

    C = 1 / [ Σ[x=1 to P−1] 1/x^n + 1/[(n−1)(P^(n−1))] + 1/(2P^n) + n/[24(P−1)^(n+1)] ]

In other words, the first term in the denominator of the equation is obtained by summing the first 19 terms of 1/x^n, with x = 1, 2, 3, ..., 19.

5. Kolmogorov-Smirnov test of goodness-of-fit of Lotka's theoretical distribution: With the values of n and C obtained in Steps 3 and 4, this set of observed data may be tested for conformity with the predicted values computed by Lotka's law. Construct another table with seven columns with the following values: Columns 1 and 2 contain values of x and y, respectively. Column 3 contains values of f_o(y_x) = y_x/Σy_x, the fraction of authors making x contributions from observed data. Column 4 contains the cumulative values of f_o(y_x). Column 5 contains values computed by f_e(y_x) = C(1/x^n). Column 6 contains the cumulative values of f_e(y_x). Column 7 contains the differences of each pair of Σf_o(y_x) and Σf_e(y_x). Identify the maximum absolute value in column 7 as D_max. Calculate the critical value for this sample of data with

    1.63 / √(Σy_x)

for sample sizes greater than 35; otherwise, consult a table for critical values of D in the K-S test. If D_max is greater than the critical value, the null hypothesis that this set of data conforms to Lotka's law must be rejected at the 0.01 level of significance.

The Kolmogorov-Smirnov test has been applied to Lotka's data from Auerbach and Chemical Abstract in Appendix A, with the values of n and C derived from Steps 3 and 4. For the physics data, the first 17 points were used to compute the values of the parameters: n = 2.021 and C = 0.6151. For the chemistry data, the first 30 points were used: n = 1.888 and C = 0.5669. At the 0.01 level of significance, the maximum deviation for the physics data is 0.0237, which falls within the critical K-S value of 0.0448. This set of data does fit Lotka's law. However, the maximum deviation for the chemistry data is 0.0207, which exceeds the critical K-S value of 0.0196. Therefore, a maximum deviation of 0.0207 or greater has an associated probability of less than 0.01: we must reject the null hypothesis that the observed distribution is not different from the theoretical distribution of x^1.888 * y_x = 0.5669.
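The five steps can also be consolidated into a single routine. The sketch below is our own summary (none of the names come from the paper); it assumes the senior-author counts have been collected as (x, y) pairs and that the cutoff for Step 3 has already been chosen:

    import math

    def test_lotka(pairs, cutoff, P=20, alpha_factor=1.63):
        """Steps 2-5 of the suggested procedure: estimate n and C, then run the
        Kolmogorov-Smirnov comparison on the full observed distribution.

        pairs: list of (x, y_x) sorted by increasing x (Step 2).
        cutoff: number of leading points used in the least-squares fit (Step 3).
        Returns (n, C, D_max, critical_value) at the 0.01 level.
        """
        # Step 3: slope of log y on log x over the first `cutoff` points.
        fit = pairs[:cutoff]
        X = [math.log10(x) for x, _ in fit]
        Y = [math.log10(y) for _, y in fit]
        N = len(fit)
        n = -(N * sum(a * b for a, b in zip(X, Y)) - sum(X) * sum(Y)) / \
             (N * sum(a * a for a in X) - sum(X) ** 2)

        # Step 4: constant C from the P-term estimate of sum 1/x**n (eqn 7).
        s = sum(1.0 / x**n for x in range(1, P)) \
            + 1.0 / ((n - 1) * P**(n - 1)) + 1.0 / (2 * P**n) \
            + n / (24.0 * (P - 1)**(n + 1))
        C = 1.0 / s

        # Step 5: K-S comparison of cumulative observed and expected fractions.
        total = sum(y for _, y in pairs)
        cum_obs = cum_exp = d_max = 0.0
        for x, y in pairs:
            cum_obs += y / total
            cum_exp += C / x**n
            d_max = max(d_max, abs(cum_obs - cum_exp))
        return n, C, d_max, alpha_factor / math.sqrt(total)

Fitting the first 17 Auerbach points and testing the full distribution with such a routine should reproduce the figures quoted above; the chemistry data, tested the same way, fails at the 0.01 level, as shown in Appendix A.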
Acknowledgment: The author wishes to thank Professor David Singer of the Department of Mathematics and Statistics at Case Western Reserve University for his time and interest in working out the Riemann zeta function for the approximation of Σ 1/x^n. Without his help, this study would not have been possible.

REFERENCES
[1] A. J. LOTKA, The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences 1926, 16(12), 317-323.
[2] J. VLACHY, Evaluating the distribution of individual performance. Scientia Yugoslavica 1980, 6(1-4), 267-275.
[3] J. VLACHY, Time factor in Lotka's law. Probleme de Informare si Documentare 1976, 10(1), 44-87.
[4] W. G. POTTER, Lotka's law revisited. Library Trends 1981, 30(1), 21-39.
[5] R. C. COILE, Lotka's frequency distribution of scientific productivity. Journal of the American Society for Information Science 1977, 28(6), 366-370.
[6] J. VLACHY, Distribution patterns in creative communities. World Congress of Sociology, Toronto, 1974.
[7] J. VLACHY, Frequency distribution of scientific performance: A bibliography of Lotka's law and related phenomena. Scientometrics 1978, 1, 109-130.
[8] D. DE S. PRICE, Little Science, Big Science. Columbia University Press, New York (1963).
[9] D. DE S. PRICE, Some remarks on elitism in information and the invisible college phenomenon in science. Journal of the American Society for Information Science 1971, 22(2), 74-75.
[10] I. K. RAVICHANDRA RAO, The distribution of scientific productivity and social change. Journal of the American Society for Information Science 1980, 31(2), 111-122.
[11] F. J. MASSEY, The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association 1951, 46, 68-78.
[12] S. SIEGEL, Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York (1956).
[13] L. J. MURPHY, Lotka's law in the humanities? Journal of the American Society for Information Science 1973, 24(6), 461-462.
[14] A. E. SCHORR, Lotka's law and map librarianship. Journal of the American Society for Information Science 1975, 26(3), 189-190.
[15] A. E. SCHORR, Lotka's law and library science. Reference Quarterly 1974, 14(1), 32-33.
[16] A. E. SCHORR, Lotka's law and history of legal medicine. Research in Librarianship 1975, 5(30), 205-209.
[17] T. RADHAKRISHNAN and R. KERNIZAN, Lotka's law and computer science literature. Journal of the American Society for Information Science 1979, 30(1), 51-54.
[18] D. DE S. PRICE and D. DE B. BEAVER, Collaboration in an invisible college. American Psychologist 1966, 21(11), 1011-1018.
[19] D. DE B. BEAVER and R. ROSEN, Studies in scientific collaboration. Scientometrics 1979, 1(2), 133-149.
[20] S. NARANAN, Power law relations in science bibliography. Journal of Documentation 1971, 27(2), 83-97.
APPENDIX A

a. Calculation of n for Chemical Abstract data

    1      2        3           4           5          6
    x      y        X = log x   Y = log y   XY         X^2
    1      3991     0.00000     3.60108     0.00000    0.00000
    2      1059     0.30103     3.02490     0.91058    0.09062
    3      493      0.47712     2.69285     1.28481    0.22764
    4      287      0.60206     2.45788     1.47979    0.36248
    5      184      0.69897     2.26482     1.58304    0.48856
    6      131      0.77815     2.11727     1.64756    0.60552
    7      113      0.84510     2.05308     1.73505    0.71419
    8      85       0.90309     1.92942     1.74244    0.81557
    9      64       0.95424     1.80618     1.72353    0.91058
    10     65       1.00000     1.81291     1.81291    1.00000
    11     41       1.04139     1.61278     1.67954    1.08450
    12     47       1.07918     1.67210     1.80450    1.16463
    13     32       1.11394     1.50515     1.67665    1.24087
    14     28       1.14613     1.44716     1.65863    1.31361
    15     21       1.17609     1.32222     1.55505    1.38319
    16     24       1.20412     1.38021     1.66194    1.44990
    17     18       1.23045     1.25527     1.54455    1.51400
    18     19       1.25527     1.27875     1.60518    1.57571
    19     17       1.27875     1.23045     1.57344    1.63521
    20     14       1.30103     1.14613     1.49115    1.69268
    21     9        1.32222     0.95424     1.26172    1.74826
    22     11       1.34242     1.04139     1.39799    1.80210
    23     8        1.36173     0.90309     1.22976    1.85430
    24     8        1.38021     0.90309     1.24645    1.90498
    25     9        1.39794     0.95424     1.33397    1.95414
    26     9        1.41497     0.95424     1.35023    2.00215
    27     8        1.43136     0.90309     1.29265    2.04880
    28     10       1.44716     1.00000     1.44716    2.09327
    29     8        1.46240     0.90309     1.32068    2.13861
    30     7        1.47712     0.84510     1.24831    2.18189
    Σ               32.4236     46.9721     43.2991    38.9988

    n = [N ΣXY − ΣX ΣY] / [N ΣX^2 − (ΣX)^2]
      = [30(43.2991) − (32.4236)(46.9721)] / [30(38.9988) − (32.4236)^2]
      = −1.8878

b. Kolmogorov-Smirnov test of observed and expected distributions of senior authors in Chemical Abstract data
            Observed                                Theoretical*
    x       y_x     y_x/Σy_x   Σ(y_x/Σy_x)   f_e        Σf_e       D
    1       3991    0.579161   0.579161      0.566900   0.566900   0.0122612
    2       1059    0.153679   0.732840      0.153166   0.720066   0.0127741
    3       493     0.071543   0.804383      0.071236   0.791302   0.0130804
    4       287     0.041649   0.846031      0.041383   0.832685   0.0133464
    5       184     0.026701   0.872733      0.027155   0.859840   0.0128928
    6       131     0.019010   0.891743      0.019247   0.879086   0.0126564
    7       113     0.016398   0.908141      0.014387   0.893473   0.0146679
    8       85      0.012335   0.920476      0.011181   0.904654   0.0158220
    9       64      0.009287   0.929763      0.008952   0.913605   0.0161580
    10      65      0.009433   0.939196      0.007337   0.920942   0.0182538
    11      41      0.005950   0.945146      0.006129   0.927071   0.0180750
    12      47      0.006820   0.951966      0.005200   0.932271   0.0196954
    13      32      0.004644   0.956610      0.004471   0.936742   0.0198684
    14      28      0.004063   0.960673      0.003887   0.940629   0.0200446
    15      21      0.003047   0.963721      0.003412   0.944041   0.0196798
    16      24      0.003483   0.967204      0.003021   0.947062   0.0201418
    17      18      0.002612   0.969816      0.002694   0.949756   0.0200597
    18      19      0.002757   0.972573      0.002419   0.952175   0.0203984
    19      17      0.002467   0.975040      0.002184   0.954358   0.0206816
    20      14      0.002032   0.977072      0.001982   0.956341   0.0207309**
    21      9       0.001306   0.978378      0.001808   0.958148   0.0202292
    22      11      0.001596   0.979974      0.001656   0.959804   0.0201696
    23      8       0.001161   0.981135      0.001523   0.961327   0.0198080
    24      8       0.001161   0.982296      0.001405   0.962732   0.0195640
    25      9       0.001306   0.983602      0.001301   0.964033   0.0195693
    26      9       0.001306   0.984908      0.001208   0.965240   0.0196674
    27      8       0.001161   0.986069      0.001125   0.966365   0.0197035
    28      10      0.001451   0.987520      0.001050   0.967415   0.0201045
    29      8       0.001161   0.988681      0.000983   0.968398   0.0202825
    30      7       0.001016   0.989697      0.000922   0.969320   0.0203764
    31      3       0.000435   0.990132      0.000867   0.970187   0.0199452
    32      3       0.000435   0.990567      0.000816   0.971003   0.0195644
    33      6       0.000871   0.991438      0.000770   0.971773   0.0196650
    34      4       0.000580   0.992019      0.000728   0.972501   0.0195175
    36      1       0.000145   0.992164      0.000653   0.973154   0.0190092
    37      1       0.000145   0.992309      0.000620   0.973775   0.0185338
    38      4       0.000580   0.992889      0.000590   0.974365   0.0185240
    39      3       0.000435   0.993325      0.000562   0.974927   0.0183978
    40      2       0.000290   0.993615      0.000536   0.975462   0.0181525
    41      1       0.000145   0.993760      0.000511   0.975974   0.0177864
    42      2       0.000290   0.994050      0.000488   0.976462   0.0175882
    44      3       0.000435   0.994486      0.000447   0.976909   0.0175762
    45      4       0.000580   0.995066      0.000429   0.977378   0.0177278
    46      2       0.000290   0.995356      0.000411   0.977750   0.0176067
    47      3       0.000435   0.995792      0.000395   0.978145   0.0176471
    49      1       0.000145   0.995937      0.000365   0.978510   0.0174271
    50      2       0.000290   0.996227      0.000351   0.978861   0.0173659
    51      1       0.000145   0.996372      0.000339   0.979200   0.0171725
    52      2       0.000290   0.996662      0.000326   0.979526   0.0171363
    53      2       0.000290   0.996953      0.000315   0.979841   0.0171117
    54      2       0.000290   0.997243      0.000303   0.980145   0.0170980
    55      3       0.000435   0.997678      0.000294   0.980438   0.0172398
    57      1       0.000145   0.997823      0.000274   0.980713   0.0171105
    58      1       0.000145   0.997968      0.000266   0.980978   0.0169901
    61      2       0.000290   0.998259      0.000241   0.981220   0.0170389
    66      1       0.000145   0.998404      0.000208   0.981428   0.0169759
    68      2       0.000290   0.998694      0.000197   0.981624   0.0170695
    73      1       0.000145   0.998839      0.000172   0.981796   0.0170426
    78      1       0.000145   0.998984      0.000152   0.981948   0.0170359
    80      1       0.000145   0.999129      0.000145   0.982093   0.0170363
    84      1       0.000145   0.999274      0.000132   0.982225   0.0170497
    95      1       0.000145   0.999419      0.000105   0.982329   0.0170900
    107     1       0.000145   0.999565      0.000084   0.982413   0.0171516
    109     1       0.000145   0.999710      0.000081   0.982494   0.0172160
    114     1       0.000145   0.999855      0.000074   0.982568   0.0172869
    346     1       0.000145   1.000000      0.000009   0.982577   0.0174230

    Σy_x = 6891
    * Calculated with f_e(y_x) = 0.5669 (1/x^1.888).
    ** D_max = 0.0207. At the 0.01 level of significance, the critical value = 1.63/√(Σy_x) = 0.0196.
c. Calculation of n for Auerbach's data

    1      2       3           4           5          6
    x      y       X = log x   Y = log y   XY         X^2
    1      784     0.00000     2.89432     0.00000    0.00000
    2      204     0.30103     2.30963     0.69527    0.09062
    3      127     0.47712     2.10380     1.00377    0.22764
    4      50      0.60206     1.69897     1.02288    0.36248
    5      33      0.69897     1.51851     1.06140    0.48856
    6      28      0.77815     1.44716     1.12611    0.60552
    7      19      0.84510     1.27875     1.08067    0.71419
    8      19      0.90309     1.27875     1.15483    0.81557
    9      6       0.95424     0.77815     0.74255    0.91058
    10     7       1.00000     0.84510     0.84510    1.00000
    11     6       1.04139     0.77815     0.81036    1.08450
    12     7       1.07918     0.84510     0.91201    1.16463
    13     4       1.11394     0.60206     0.67066    1.24087
    14     4       1.14613     0.60206     0.69004    1.31361
    15     5       1.17609     0.69897     0.82205    1.38319
    16     3       1.20412     0.47712     0.57451    1.44990
    17     3       1.23045     0.47712     0.58707    1.51400
    Σ              14.55106    20.63372    13.79928   14.36586

    n = [17(13.79928) − (14.55106)(20.63372)] / [17(14.36586) − (14.55106)^2]
      = −2.0210
d. Kolmogorov-Smirnov test of observed and expected distributions of senior authors in Auerbach data

            Observed                                Theoretical*
    x       y_x     y_x/Σy_x   Σ(y_x/Σy_x)   f_e        Σf_e       D
    1       784     0.591698   0.591698      0.615100   0.615100   0.0234019
    2       204     0.153962   0.745660      0.151553   0.766653   0.0209925
    3       127     0.095849   0.841509      0.066786   0.833439   0.0080709
    4       50      0.037736   0.879245      0.037341   0.870779   0.0084660
    5       33      0.024906   0.904151      0.023786   0.894566   0.0095853
    6       28      0.021132   0.925283      0.016455   0.911021   0.0142623
    7       19      0.014340   0.939623      0.012050   0.923071   0.0165514
    8       19      0.014340   0.953962      0.009200   0.932271   0.0216908
    9       6       0.004528   0.958491      0.007251   0.939523   0.0189677
    10      7       0.005283   0.963774      0.005861   0.945384   0.0183901
    11      6       0.004528   0.968302      0.004834   0.950217   0.0180845
    12      7       0.005283   0.973585      0.004054   0.954272   0.0193132
    13      4       0.003019   0.976604      0.003449   0.957720   0.0188833
    14      4       0.003019   0.979623      0.002969   0.960690   0.0189331
    15      5       0.003774   0.983396      0.002583   0.963272   0.0201240
    16      3       0.002264   0.985660      0.002267   0.965539   0.0201214
    17      3       0.002264   0.987925      0.002005   0.967544   0.0203801
    18      1       0.000755   0.988679      0.001787   0.969331   0.0193481
    21      1       0.000755   0.989434      0.001308   0.970639   0.0187945
    22      3       0.002264   0.991698      0.001191   0.971830   0.0198676
    24      3       0.002264   0.993962      0.000999   0.972829   0.0211328
    25      2       0.001509   0.995472      0.000920   0.973749   0.0217224
    27      1       0.000755   0.996226      0.000787   0.974537   0.0216898
    30      1       0.000755   0.996981      0.000636   0.975173   0.0218082
    34      1       0.000755   0.997736      0.000494   0.975667   0.0220688
    37      1       0.000755   0.998491      0.000416   0.976083   0.022407
    48      2       0.001509   1.000000      0.000246   0.976329   0.023671**

    Σy_x = 1325
    * Calculated with f_e(y_x) = 0.6151 (1/x^2.021).
    ** D_max = 0.0237. At the 0.01 level of significance, the critical value = 1.63/√(Σy_x) = 0.0448.
APPENDIX B: ESTIMATION OF Σ 1/x^n

The infinite series 1 + 1/2^n + 1/3^n + 1/4^n + ... + 1/x^n + ... converges when n is greater than unity, and there is no precise formulation for its summation Σ[x=1 to ∞] 1/x^n except for n = 2 and n = 4. Hence we may only estimate the sum by calculating the sum of the first P terms. However, the sum of the remaining terms from (P + 1) to infinity would produce a residual error on the true summation. Therefore we shall attempt to minimize the error by estimating the area under the curve from P to infinity.
[Figure: the curve 1/x^n with circumscribed trapezoids on the intervals beginning at P, P+1, P+2, P+3]

1. Estimating the area by the trapezoid rule, we obtain

    ∫[P to ∞] 1/x^n dx < (1/2)[1/P^n + 1/(P+1)^n] + (1/2)[1/(P+1)^n + 1/(P+2)^n] + ...

(Note that the trapezoids are circumscribed because 1/x^n is concave upwards.) Transposing, we have the estimate

    Σ[x=P+1 to ∞] 1/x^n > ∫[P to ∞] 1/x^n dx − 1/(2P^n).    (B1)
2. Error in eqn (B1). In computing the area under a curve by the trapezoid rule, one derives the error estimate:

    0 < (1/2)[f(a) + f(b)](b − a) − ∫[a to b] f(x) dx < (1/12) M (b − a)^3

where M is the upper bound for |f''(x)| on [a, b]. In our case, f(x) = 1/x^n and f''(x) = n(n+1)/x^(n+2), so we may choose M = n(n+1)/x^(n+2) on [x, x+1]. Thus,

    0 < (1/2)[1/x^n + 1/(x+1)^n](x + 1 − x) − ∫[x to x+1] 1/x^n dx < n(n+1)/(12x^(n+2))

for x = P, P+1, P+2, .... Summing these inequalities,

    0 < [1/(2P^n) + 1/(P+1)^n + 1/(P+2)^n + ...] − ∫[P to ∞] 1/x^n dx < Σ[x=P to ∞] n(n+1)/(12x^(n+2)).

This may be rewritten as:

    0 < 1/(2P^n) + Σ[x=P+1 to ∞] 1/x^n − ∫[P to ∞] 1/x^n dx < Σ[x=P to ∞] n(n+1)/(12x^(n+2)).    (B2)
320
M.
3. Sivlplificclliorl B! cucumscribing
rectangles.
one derives
L. p.40
the crude
estimate:
tB3) Ustnp
eqn tB31 to estimate
x (H(N + I’
x 7
1)~12.\-“~~)]:
_ [II( 11 _& l)‘,,,,.-“~’
)] < [n(n
+
l).‘l?]
l/x”‘?
J-;,
= ntn 4 1)112~[.~-“‘+“/-tn
= The explicit
n’]?(fJ
dx +
I)] I;_,
1)“”
-
(B4)
integration I
I p 1:~” dr
= P-“+‘&I =
Substituting
eqns (B4).
(B5)
l’(n
-
-
1)
l)(P”-‘).
tB5)
eqn (B2).
into
I
[lj(rr
-
l)(P“-1)
-
1:2P”]
<
y_ P+,
Kr”
< [I/(,?
-
l)(P”_‘)
-
IW”]
+ [n/lZ(P
-
Since ue hake the sum x;_, 1x” trapped between an approximation A and A + E. where I)“- ‘. we can cut the error in half b! using the estimate A + E/2. Thus il
I” = *
I’s”
P-
1.r” + I’P’
the second
and the fourth r
P-
c
1:X”=
less than
n.‘24(P
-
I)“_
Testing
+ n/24(P
-
I)“-
?“-
c;
+
l)(P”-I)
-
l!?P”]
+ n’241P
-
I)“+‘.
(B6)
I!01
-
l)(P”-‘)
+
l/?P”
+ n:24(p
-
I)“_’
l/Y”
A n.241 P -
1 )” -
+ 2’24(20-
I)“’
tB7)
‘. Therefore
1 ‘s” +
darn
I.‘01 -
wirh
1 )( P ‘Z 1’) f
’1
(B8)
n = 2. and P = 20.
I
c 1,.P = 2 I I
summation
-
terms.
eqn (B7) OII rhe Auerbach I
The actual
l/?P”]
I
x
/[,
+ [I’(,7
I’X“
I
P-
1
=
4.
-
I
1
I errors
1)tP”-‘)
1‘.Y?t
=
1.593663’44
=
1.644925393
1 ‘.I-~is n’i6.
1112 -
+
1)(70’-‘)
+
1!2(20)’
Ii800
+
l;12(191)
Thus.
the error
1’20 +
or 1.644934067. c
=
is 0.0000086733
= 1~125000. Therefore,
l/l.644925393
= 0.6079 Thus 5.
the sum of the first
Tesring :
-
I
x
I
with
-
E = n;l?(P
/ =
Combining
+ [I’Oi
I)““]
20 terms
(i.e.
eqn (B7) on rlre Chemical ].‘.rI 8KX = 2%’
= -g
I
,;*I
R8R +
, Il. I *RR+
= 1.683468616
i
P = 20) would
Abstract
dara.
produce
x.irh n =
I!(!.888
-
1)(20’ssp-‘)
l/12.6978
+
l/571.9718
+
+
0.080518’47
= 1.763986863. Therefore.
C = VI.763986863 = 0.56689.
a close
approximation
in eqn tB7)
1.888 and P = 20
l/2(20’R*4)
+
1.888,‘118372.7239
1.888’24(20
-
I)““+’
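As a numerical cross-check (ours, not part of the appendix), the estimate of eqn (B7) can be compared with a brute-force partial sum of one million terms:

    def b7_estimate(n, P=20):
        """Right-hand side of eqn (B7): P-1 explicit terms plus corrections."""
        return (sum(1.0 / x**n for x in range(1, P))
                + 1.0 / ((n - 1) * P**(n - 1))
                + 1.0 / (2 * P**n)
                + n / (24.0 * (P - 1)**(n + 1)))

    def brute_force(n, terms=1_000_000):
        """Direct partial sum; still short of the true limit, but close for n > 1."""
        return sum(1.0 / x**n for x in range(1, terms + 1))

    for n in (2.0, 1.888, 2.6164):
        print(n, round(b7_estimate(n), 6), round(brute_force(n), 6))
    # For n = 2 the estimate is about 1.644925 against a limit of pi**2/6 = 1.644934,
    # matching the error discussed in Section 4.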