RANKING TECHNIQUES AND THE EMPIRICAL LOG LAW

BERTRAM C. BROOKES
64 Abbots Gardens, London N2 0JH, England

Abstract: Four empirical laws of bibliometrics (those of anomalous numbers, of Lotka, of Zipf and of Bradford), together with Laplace's notorious "law of succession" and de Solla Price's cumulative advantage distribution, are shown to be almost identical. Some of these laws are expressed as frequency distributions, some are frequency-ranked. A simple model which discriminates these various forms is described. It shows that the frequency forms conform with an inverse square law over the appropriate interval and that the equivalent rank distribution, the Log Law, has the Df

Q(r) = log_b (r + 1)

where b is the rank interval. It is further shown that frequency distributions discard empirical statistical information which the equivalent rank distributions retain for analysis, so that rank distributions offer theoretical advantages in this field. The paper concludes with comments on the analysis of the empirical hybrid forms which arise. The reduction of the above laws, empirical and hypothetical, to a single law is achieved by NOT equating the ordinals 1st, 2nd, 3rd, ... to the numbers 1, 2, 3, ... as is commonly done.

1. THE EMPIRICAL LAWS OF BIBLIOMETRICS

The following empirical laws are known:

(a) The 'anomalous' law of numbers (1906) which expresses the behaviour of numbers in social use as, for example, in the finance or sports pages of our newspapers. The numbers counted are the first digits of the numbers observed. Its Df is

P(m) = log (m + 1)/log (n + 1)   (A1)

where 1 <= m <= 9. A derivation of this law was given by FELLER [1] though it is not wholly convincing. In this unique case the frequency and the rank forms are identical, but I distinguish them by changing the notation to

Q(r) = log (r + 1)/log (n + 1) = log_b (r + 1)   (A2)

where b = n + 1. Because the rank order of the numbers is also their natural order, this distribution is useful in the sampling theory of rank distributions.

(b) Lotka's law (1926) which describes the performance of authors in publishing papers in some specific field though, as VLACHY [2] has shown, it can be found in many other phenomena also. LOTKA [3] expressed it as an inverse power law with pdf

p(m) = k/m^g   (L1)

where k is a constant and the exponent g is approximately 2.

(c) Zipf's law (1935) which describes the distributions of words or of morphemes in language texts. ZIPF [4] expressed the law in two ways:

p(m) = k/m^2 ... the frequency form   (Z1)

and

q(r) = K/r ... the rank form.   (Z2)
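As a concrete illustration of the rank form (a sketch of mine, not part of the original argument; the corpus file name is a placeholder), the product r * q(r) should stay roughly constant over the middle ranks of any sizeable text:

from collections import Counter

def zipf_rank_table(text, top=10):
    # Count word tokens, rank them by descending frequency and report
    # r * q(r), which should be roughly constant (= K) under (Z2).
    counts = Counter(text.lower().split())
    ranked = sorted(counts.values(), reverse=True)
    return [(r, q, r * q) for r, q in enumerate(ranked[:top], start=1)]

sample = open("corpus.txt").read()   # placeholder: any long plain-text file
for r, q, rq in zipf_rank_table(sample):
    print(f"rank {r:3d}   freq {q:6d}   r*q = {rq}")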

The Zipf distribution still attracts statistical debate to which MANDELBROT [5], SIMON [6] and HILL [7] have contributed, but problems remain. The law has been applied to many contexts outside linguistics also.

(d) Bradford's law (1935) which describes the rank distribution of papers over journals in scientific bibliographies (or data-base print-outs). BRADFORD [8] formulated his law somewhat ambiguously and it was first clearly expressed in analytical form by LEIMKUHLER [9]. The laws of Bradford and of Zipf are almost identical, but Bradford cumulated the ranked scores and Zipf did not. The following ranked form is formally identical with Leimkuhler's but more convenient for practical work:

Q(r) = k{log (a + r) - log a}   (B1)

where k and a are parameters estimated from the data. This formulation makes no reference to the "nucleus" which Bradford described and illustrated graphically. What is sometimes thought to be a nucleus disappears with the change of origin. If a nucleus remains after the change, I regard the distribution as a "hybrid" requiring further analysis in terms of (B1). It is therefore misleading to call (B1) by itself "the Bradford law" and I shall refer to it as the "empirical Log Law". The simplest way of checking whether a set of data conforms with any of these laws is to draw the log/log or semi-log graph which expresses the expected formulation as a linearity (Fig. 1).

[Fig. 1. The empirical laws of bibliometrics (panels A, B and C; Lotka and Bradford forms).]
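For readers who prefer a numerical stand-in for drawing the graph, the following sketch (my own, not from the original paper; the origin a must be scanned or estimated, as Section 6 discusses) measures how nearly linear the cumulated scores are on semi-log coordinates:

import math

def straightness_on_semilog(Q, a=1.0):
    # Pearson correlation between cumulated scores Q(r) and log(a + r):
    # a value near 1.0 means the semi-log plot is near-linear, i.e. the
    # data conform with the Log Law (B1).
    n = len(Q)
    xs = [math.log(a + r) for r in range(1, n + 1)]
    mx, my = sum(xs) / n, sum(Q) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, Q))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in Q)
    return sxy / math.sqrt(sxx * syy)

# Synthetic test: data generated exactly from (B1) should score ~1.0.
Q = [120.0 * (math.log(0.9 + r) - math.log(0.9)) for r in range(1, 201)]
print(straightness_on_semilog(Q, a=0.9))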

2. LAPLACE'S "LAW OF SUCCESSION"

This law, rule or principle, as it is variously described, was derived by LAPLACE [10] over 200 years ago. But the validity of his derivation, based on the "principle of indifference" and other theoretical dubieties, has long been disputed by probabilists and philosophers of induction. Essentially it concerns the categorisation of a set of A-things into subsets A_1, A_2, A_3, ..., A_r, and so on. The rule can be expressed thus: If, of n A-things examined, m of them can be assigned to subset A_r, then the probability that the next A-thing examined is also a member of the subset A_r is

(m + 1)/(n + 2).   (2.1)

Among those approving the rule were de Morgan and Karl Pearson; among those who thought it ill-founded were Boole, Charles Peirce and R. A. Fisher. One of the difficulties has been to contrive a working model from the limited hardware that pure probabilists, restrained by their theoretical principles, are compelled to rely on: dice, urns, coloured balls and roulette wheels. One of the critics, VON WRIGHT [11], a philosopher of induction who analyses the logical subtleties of the problem very clearly, doubts whether any example of the law can be found in "nature". But what is meant by "nature" here? The physical world only? Had the philosophers of induction asked Bradford to prepare a bibliography of the problem for them, they might have noticed a statistical regularity of some interest.

The version of the rule given by (2.1) has often been simplified and applied improperly to sunrises and ravens. It can also be adapted to our present problem: If, of n A-things examined, all n of them can be assigned to the subsets A_1, A_2, A_3, ..., so that m = n in (2.1), then the probability that the next A-thing examined can also be assigned to one of these subsets is (n + 1)/(n + 2). The probability that it must be assigned to some new subset is therefore

1/(n + 2).   (2.2)

One can use this modified form (improperly no doubt) to generate a probability distribution of A-things over an array of subsets by putting n = 0, 1, 2, 3, ..., r, ... in succession. This process generates the sequence 1/2, 1/3, ..., 1/r, ... and leads to the interesting conclusion that the pdf of the distribution is

p(r) = 1/r - 1/(r + 1) = 1/{r(r + 1)}.   (2.3)
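A short numerical check of (2.3) (a sketch, nothing more) confirms that the probabilities sum to 1 and that the tail behaves as an inverse square, anticipating the sections below:

# Check of (2.3): p(r) = 1/r - 1/(r + 1) = 1/{r(r + 1)}, so that
# r(r + 1) p(r) = 1 exactly and r^2 p(r) -> 1: an inverse square tail.
def p(r):
    return 1.0 / r - 1.0 / (r + 1)

print(sum(p(r) for r in range(1, 10**6)))   # -> 0.999999 (sums to 1)
for r in (1, 2, 10, 100):
    print(r, p(r), r * (r + 1) * p(r), r * r * p(r))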

This result will be familiar to those who have worked on theoretical aspects of the empirical distributions. Laplace's rule can be derived from first principles but only by making certain assumptions the purists do not accept and by using the calculus (of which Laplace was, of course, a grand master) but which, in this context, is forbidden. The fact that Laplace's hypothetical law is realised in the empirical distributions of bibliometrics, if nowhere else, re-opens questions about the basic principles of probability theory but also illuminates the search for an integrated theory of these empirical distributions.

3. SORTING OUT THE CONFUSION

Anyone who has worked seriously with these several distributions will have no doubt that beneath their confusions there lurks a simple distribution which embraces them all but which remains to be identified. Derek DE SOLLA PRICE [12] has tackled the problem on what I might call the macro-scale because he has been primarily concerned with large ensembles of data and has been seeking general conclusions. By contrast, I have been working on a micro-scale, seeking close fits to relatively small data-sets and exploring the goodness of fit of various laws in various contexts. My approach has been very empirical, seizing on small discrepancies which Price, with his wider sweep, could afford to ignore. The equations of my summary above are very mixed and ill-defined: some are


frequency distributions, some are ranked; some variables are discrete, some are continuous; some frequencies are cumulated, some are not. So my first task has been to reduce them all to the same basis, to frequency distributions, because they are then most easily compared. The main results of such work are:

(a) Lotka. I recently looked again at the data from which Lotka derived his inverse power law and found that the frequency form of the Bradford law, because of its parameter a, fitted the data (applying chi-squared tests) better than Lotka's own law. After testing it on other Lotka-type data, I see no further need for Lotka's formulation which, in any case, is very intractable to work with.

(b) Zipf. The original Zipf laws were also very stark but Zipf had large sets of data and was also seeking generalities. However, MANDELBROT [5] found it desirable to introduce a parameter which enabled the modified law to give better fits with the rank form (Z2). Nevertheless, when Zipf data are cumulated as Bradford cumulated them, I find that Zipf's law can be subsumed under the modified Bradford law (B1) with its own different parameter a.

(c) Bradford. The bibliometric laws can therefore all be covered by the Bradford law in its rank form (B1) or its frequency form obtained by converting (B1). The exact form of the frequencies is

f(1) = [1 - exp(-a/k)]^(-1) - [1 - exp{-(a + 1)/k}]^(-1)

but as a/k is very small this becomes

f(1) = k/a - k/(a + 1)  and  f(2) = k/(a + 1) - k/(a + 2)   (3.1)

and so on. These frequencies can be expressed in terms of the continuous function k/s^2 (shades of Lotka!) as in Fig. 2.
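The conversion can be carried out numerically as in the following sketch (the parameter values are illustrative only, not fitted to any real bibliography; the approximation used for the number of sources scoring at least m follows the Appendix):

import math

def Q(r, k, a):
    # Bradford rank form (B1): cumulated score of the top r sources.
    return k * (math.log(a + r) - math.log(a))

def freq_form(k, a, m_max=5):
    # Per (3.1): roughly k/(a + m - 1) sources score at least m, so
    # f(m) = k/(a + m - 1) - k/(a + m).
    return [(m, k / (a + m - 1) - k / (a + m)) for m in range(1, m_max + 1)]

k, a = 100.0, 0.5                       # illustrative parameters only
print(f"Q(10) = {Q(10, k, a):.1f}")
for m, f in freq_form(k, a):
    print(f"f({m}) = {f:.1f}")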

4. THE SCALE AND SHUTTER MODEL

The events with which these laws are concerned are those we happen to observe over some arbitrary time interval to collect our sample data, as with Lotka's or Bradford's law, or over some arbitrary sample length of text, as with Zipf's law. So I want to introduce the notion of extending time or of length into the model. From the beginning of our observations we begin to count events. And I have to emphasize that all such counts always begin at 1.

(a) The discrete model

Consider a linear scale marked 1, 2, 3, ..., r, ... at regular intervals. This scale is initially wholly covered by a movable shutter. At a given signal, the shutter begins to move at a uniform speed and so reveals in sequence the numbers on the scale. As soon as the shutter reveals the first number, 1, random sampling of the revealed scale numbers begins with "instantaneous replacement". The sampling observations continue at a steady rate also (not necessarily that of the moving shutter). While the number 1 only is visible, only 1's can be observed. But the probability of getting 1's falls abruptly in steps as further numbers are revealed: from 1 to 1/2, from 1/2 to 1/3, and so on. What is the distribution of the numbers in the sample? It can be seen (Fig. 2) that the expected number of 1's is given by the column over the interval 1-2, which is divided into rectangles of decreasing height. By the time the number n is revealed or, preferably, up to the moment just before the number (n + 1) is revealed, the expected number of 1's is

d(1) = k{(1 - 1/2) + (1/2 - 1/3) + ... + [1/n - 1/(n + 1)]} = k{1 - 1/(n + 1)}

where k is a constant depending on the ratio of shutter and sampling speeds. Similarly,

d(r) = k{1/r - 1/(n + 1)} -> k/r

as n increases.
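The expected counts are trivial to tabulate (a sketch; the values of k and n are illustrative) and show how quickly d(r) approaches k/r:

# Expected counts in the discrete scale-and-shutter model:
# d(r) = k(1/r - 1/(n + 1)), which tends to k/r as n increases.
def d(r, k, n):
    return k * (1.0 / r - 1.0 / (n + 1))

k = 1000
for n in (10, 100, 10000):
    print(n, [round(d(r, k, n)) for r in (1, 2, 3, 4, 5)])
# limiting values k/r: [1000, 500, 333, 250, 200]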

So this model, which simulates the continuous sampling of a continuously extending uniform distribution, also generates the law of succession and the cumulative advantage distribution.

[Fig. 2. The scale and shutter model; d(r) discrete, c(r) continuous.]


(b) The continuous model

We now regard the scale numbers as representing, say, 1000's of units with the necessary subdivisions also marked. We sample the numbers revealed as before but now the sampled numbers are of four figures: 1001, 1002, ..., 2001, 2002, and so on. But, as with the anomalous law, we take note only of the first digits. The successive rectangles are packed so closely that it is reasonable to work with their envelope, the continuous graph of k/x, instead of the many rectangles. By the time the n x 1000 point on the scale has been reached, we have, for the first digit 1,

c(1) = INTEGRAL from 1000 to 2000 of (k/x) dx - k/n = k ln 2 - k/n

and, similarly,

c(r) = k ln (r + 1) - k ln r - k/n

which, as r and n increase, tends to k ln (1 + 1/r), approximately k/r. For large enough values of r we therefore find that c(r) and d(r) are practically equal. But, putting k = 1000 for convenience, we have

r       1      2      3      4      5      10     100
c(r)    693    405    288    223    182    95     10
d(r)    1000   500    333    250    200    100    10
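The table can be reproduced directly from the two limiting forms (a sketch in which the -k/n terms are dropped, i.e. n is taken to be large):

import math

k = 1000
print(" r     c(r)    d(r)")
for r in (1, 2, 3, 4, 5, 10, 100):
    c = k * math.log(1 + 1.0 / r)   # continuous (first-digit) model
    d = k / r                       # discrete model, large n
    print(f"{r:4d}  {c:6.0f}  {d:6.0f}")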

The two distributions are therefore somewhat different for small values of r, as the diagram indicates, though theoretically the disparities can be resolved by considering the 4-figure numbers also individually. Note that probability density is not distributed uniformly over the graph of Fig. 2; it decreases on the vertical scale in the ratio 1/n as n increases.

(c) The model of the Log Law

The moving shutter is dispensed with. A logarithmic scale is substituted for the linear scale. The log scale is endowed with uniform probability density along its whole length. Any finite interval of interest is illuminated; the remainder is invisible. The range of numbers thus displayed by the interval on the log scale is then sampled at random with equal frequencies along its linear length. The Log Law is a distribution which has a uniform probability density over some finite interval of a logarithmic scale.

5. INFORMATION ASPECTS OF FREQUENCY AND RANK DISTRIBUTIONS

Bibliometric data sets typically relate to, say, N entities each of which carries an observed "score" and all such scores (no zeros) are positive integers. Given such a set of data, most analysts begin work by organising its frequency distribution. But Zipf and Bradford somewhat unconventionally ranked the entities, giving the entity with the top score the first rank. Frequency distributions are more familiar than rank distributions (except for the sports pages of newspapers) and, if needed, there is a sophisticated distribution and sampling theory to support the analyst. There is, however, no comparable support for the rank distribution analyst (and even rank correlation theory is, in my view, ill-founded). So bibliometric data continue to be analysed almost wholly by frequency distribution techniques. Yet rank distributions retain for analysis all the statistical information of the data set whereas frequency distributions discard much of it before analysis begins. That there is some loss is evident from the fact that it is always possible to construct the frequency distribution from the rank distribution but not the rank distribution from the frequency distribution. The loss can be measured: there are N! possible ways of ranking N entities and there


are N!/{f(1)! f(2)! ... f(r)! ...} ways of organising a frequency distribution with the given frequencies. The rank distribution therefore provides

I(R) = log_2 N! bits

for analysis, while the frequency distribution provides

I(F) = log_2 N! - SUM over r of log_2 f(r)! bits.

The fraction thus discarded is {SUM over r of log f(r)!}/log N!, which is often 50% and can be as high as 90%.
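These quantities are easily computed (a sketch; the frequencies are invented for the example and lgamma is used to avoid huge factorials):

import math

def log2_factorial(n):
    # log2(n!) via lgamma, to avoid huge integers.
    return math.lgamma(n + 1) / math.log(2)

def information(freqs):
    # I(R) and I(F) for frequencies f(1), f(2), ...; N = sum of freqs.
    N = sum(freqs)
    I_R = log2_factorial(N)
    I_F = I_R - sum(log2_factorial(f) for f in freqs)
    return I_R, I_F

freqs = [60, 20, 10, 6, 4]              # invented example: 100 entities
I_R, I_F = information(freqs)
print(f"I(R) = {I_R:.0f} bits, I(F) = {I_F:.0f} bits, "
      f"discarded = {100 * (1 - I_F / I_R):.0f}%")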

A second advantage of rank distributions is that they give priority to the highest scoring entities, which are usually also the most prominent and the easiest to count exactly. The entities with the lowest scores, on the other hand, represent the "also-rans" in the context of observation and are the most elusive to find and to count exactly. In frequency distributions the highest scoring entities become isolated items in the remote tail and may be grouped together for analysis. Such crude treatment of data means that differences discernible in the ranked data are lost in the noise inherent in empirical observations plus that generated by reducing the data to a frequency distribution. In this field of analysis frequency distributions trawl too coarse a mesh to capture the discriminations we seek.

6. ANALYTICAL TECHNIQUES AND HYBRID FORMS

The Laplacian categorisation of N entities conforms with the Inverse Square Law and that in turn with the rank Log Law. But as N is finite, the categorisation process ends when the Nth entity has been assigned to its subset. If the whole set thus conforms with the Log Law, I regard it as "homogeneous". Sets of data which are homogeneous in this sense pose no problems: I regard them as natural occurrences. In such cases, it seems to me, every member of the set has been exposed to the same combination of drives, pressures and other social forces though, since the observed effects are the outcome of human activities and interests, the responses are somewhat different. An example of a homogeneous set is depicted in the graph of Fig. 3. It is of the well-known ORSA bibliography (1958) of operations research which I recently analysed [13]. Note that this bibliography has no Bradford nucleus. A bibliography which has a nucleus cannot of course be homogeneous because it has at least two groups. The Log Law equation of the ORSA bibliography is:

G(r) = 294.12 ln (1 + r/0.8345)   (6.1)

where 1 <= r <= 352. It estimates a total of 352 journals and 1725 papers as against the 370 journals and 1763 papers actually listed. This bibliography could thus be said to be "more than 100% complete" but I interpret the excess as indicating that those responsible for compiling this authoritative list were "playing safe".

[Fig. 3. The ORSA bibliography (no nucleus) and the ranked Log Law.]

If one categorises the natural languages in which the papers of a scientific bibliography are first published, one does not find a homogeneous set. The set divides into two groups: Group A of those who publish in the few major scientific languages of the world, and Group B of those who publish in one of the remainder. Each of the two Groups conforms with its own Log Law but neither is complete, as the ORSA bibliography is complete. By comparing the actuality with the hypothetical homogeneous Log Law, one observes a net transfer of papers from Group B to Group A. Scientists whose first language is one of the major languages do not face the problem of, say, a Dutchman or Dane or Finn who seeks a wider readership than publication in his own first language offers him. But few scientists whose first language is in Group A choose to publish in the languages of Group B. So there is a net transfer of papers from Group B to Group A. This net transfer can be quantified on the basis of the technique and my assumptions.

The strategy I have now adopted is to fit the first Log Law, push it as far as it will go, then fit a second Log Law to the remainder, push that as far as it will go, and continue this process until I have captured all the data by Log Laws of different parameters. What this strategy is implicitly recognising is that in a data set there may be two or more levels of categorisation.

The graph of Fig. 4 is of the vocabulary of Biblical Hebrew conveniently organised for analysis by ranking techniques [14]. The text consists of the Books of Leviticus, Numbers and Deuteronomy in which the vocabulary of 2371 words generates some 60,000 tokens. Applying the above technique to this data set, I divided it into 4 groups:

Group                   A      B      C      D
No. of words            195    401    415    1360
Av. tokens per word     251    19     5.5    1.5
Tokens: % of total      81     12     4      3

[Fig. 4. The cumulated Zipf distribution (Biblical Hebrew) and its 4 subsets, A, B, C and D.]

I have suggested that Group A, the nucleus of 195 words most frequently used, which together constitute more than 80% of the text, could be regarded as the vocabulary of "basic Biblical Hebrew". Some sections of these Books are historical records, some are moral teachings, so it is not surprising that the total vocabulary does not constitute a set which is homogeneous and complete in the sense described above. A more detailed account of the analysis is given in a recent publication [15].

So the strategy I have adopted here is basically very simple. I regard conformity with the Log Law as the normal response of social entities wholly free to respond as they wish and I use this idea as an instrument of analysis. I can claim support for this strategy from the hypothetical Laplace law and from the empirical laws from which the Log Law was derived. The fact that I look for hybrid sub-groups when necessary implies that I regard the social world to which the empirical laws apply as too complex to be grasped by any all-embracing single formulation.

The particular idea which made it possible to reduce all the empirical laws to the


Inverse Square Law of frequencies and the Log Law of ranks was that, in the analysis of ranked data, it is dead wrong to begin by putting 1st = 1, 2nd = 2, 3rd = 3, and so on. The first need is to evaluate the parameter a which, as the origin of the rank scale, can take any finite positive value on the line of real numbers but which is very rarely 1 and certainly not zero, as is often implicitly assumed.
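One workable estimation procedure, offered as a sketch rather than as the fitting method actually used in the studies above, is to scan a over a grid and solve for k by least squares at each trial value; the ORSA parameters of (6.1) are used here only to synthesise test data:

import math

def log_law(r, k, a):
    # G(r) = k ln(1 + r/a), the form of eqn (6.1).
    return k * math.log(1 + r / a)

def fit_log_law(cum_scores, a_grid):
    # Least-squares fit of G(r) = k ln(1 + r/a) to cumulated ranked
    # scores, scanning a over a grid; k has a closed form given a.
    best = None
    for a in a_grid:
        xs = [math.log(1 + r / a) for r in range(1, len(cum_scores) + 1)]
        k = sum(x * y for x, y in zip(xs, cum_scores)) / sum(x * x for x in xs)
        sse = sum((y - k * x) ** 2 for x, y in zip(xs, cum_scores))
        if best is None or sse < best[2]:
            best = (k, a, sse)
    return best

true_k, true_a = 294.12, 0.8345          # the ORSA values of eqn (6.1)
data = [log_law(r, true_k, true_a) for r in range(1, 353)]
k, a, sse = fit_log_law(data, a_grid=[x / 100 for x in range(10, 301)])
print(f"recovered k = {k:.2f}, a = {a:.4f} (sse = {sse:.2e})")

For a hybrid set, the same fit would simply be repeated on the residual ranks, as the strategy above describes.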

Acknowledgement: The work reported here is part of an on-going study supported by a Leverhulme Emeritus Fellowship. B.C.B.

APPENDIX

Equivalence of the inverse square frequency and the Log Law ranked distributions

A given data set can be analysed either as a frequency distribution or as a ranked distribution. Here it is shown that a frequency distribution which conforms with the density function k/x^2 over some finite interval a to (a + n) of the x-axis transforms into the ranked distributions of Zipf and Bradford. To simplify the transform here, it is performed as though all the variables concerned were continuous.

From

f(x) = k/x^2,  x = 1, 2, 3, ..., n,

we have

F(x) = k/a - k/(a + x)   (A.1)

which is the distribution function of the inverse square law. In ranked distributions, the highest score, which corresponds to the highest value of x in the frequency distribution, is ranked 1st. All other scores are arranged in descending order and ranked 2nd, 3rd, ... and so on. If the highest value of x is n, the total number of scores, and therefore of ranks, is

F(n) = k/a - k/(a + n).   (A.2)

The relation between the rank r and the variable x is then given by

r = F(n) - F(x - 1), with F(0) = 0,
  = k/(a + x - 1) - k/(a + n)
  = v - A   (A.3)

where v = k/(a + x - 1) and A = k/(a + n). For the Bradford distribution, the ranked scores are successively cumulated to give the rank distribution function Q(r) where

Q(r) = INTEGRAL from A to v of (k/x) dx = k ln (v/A) = k ln {(A + r)/A},

which is the Log Law as required. As Zipf did not cumulate the scores, the corresponding Zipf distribution is

q(r) = dQ(r)/dr = k/(A + r).
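A Monte Carlo check of this equivalence (a sketch; the parameter values are arbitrary and the agreement is subject to sampling noise) draws scores from the inverse square density, ranks and cumulates them, and compares the result with k ln {(A + r)/A}:

import math, random

random.seed(1)
a, n, N = 1.0, 999.0, 100_000
C = a * (a + n) / n                 # normaliser of f(x) = C/x^2 on [a, a + n]
k = N * C                           # so that rank(x) ~ k/x - A
A = k / (a + n)

# Draw N scores by inverting the CDF of the inverse square density.
xs = sorted((1.0 / (1.0 / a - random.random() / C) for _ in range(N)),
            reverse=True)

Q, checkpoints = 0.0, {1, 10, 100, 1000, 10000}
for r, x in enumerate(xs, start=1):
    Q += x
    if r in checkpoints:
        theory = k * math.log((r + A) / A)
        print(f"r = {r:6d}: cumulated = {Q:10.0f}, "
              f"k ln((A+r)/A) = {theory:10.0f}")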

REFERENCES

[1] W. FELLER, An Introduction to Probability Theory and its Applications, Vol. 2. Wiley, New York (1948).
[2] J. VLACHY, A bibliography of Lotka's law and related phenomena. Scientometrics 1978, 1.
[3] A. J. LOTKA, The frequency distribution of scientific productivity. J. Washington Acad. Sci. 1926, 16, 317-323.
[4] G. K. ZIPF, The Psycho-Biology of Language. M.I.T. Press, Cambridge, Mass. (1965).
[5] B. MANDELBROT, An informational theory of the statistical structure of language. In Communication Theory (Edited by W. JACKSON). Butterworth, London (1953).
[6] H. A. SIMON, On a class of skew distribution functions. Biometrika 1955, 42, 425-440.
[7] B. M. HILL, The rank-frequency form of Zipf's law. J. Am. Stat. Assn. 1975, 70, 1017-1026.
[8] S. C. BRADFORD, Documentation. Crosby Lockwood, London (1948).
[9] F. F. LEIMKUHLER, The Bradford distribution. J. Docum. 1967, 23(3), 197-207.
[10] P.-S. LAPLACE, Mémoire sur la probabilité des causes par les événements. Paris (1774).
[11] G. H. VON WRIGHT, A Treatise on Induction and Probability. Routledge and Kegan Paul, London (1951).
[12] D. DE SOLLA PRICE, A general theory of bibliometric and other cumulative advantage processes. J. Am. Soc. Inform. Sci. 1976, 27(5), 292-307.
[13] B. C. BROOKES, A critical commentary on Leimkuhler's 'exact' formulation of the Bradford law. J. Docum. 1981, 37(2), 77-88.
[14] P. M. K. MORRIS and E. JAMES, A Critical Word-book of Leviticus, Numbers and Deuteronomy. The Computer Bible, Vol. VIII. University of Montana (1975).
[15] B. C. BROOKES, Quantitative analysis in the humanities: the advantage of ranking techniques. In Studies on Zipf's law, Quantitative Linguistics (Edited by H. GUITER and M. V. ARAPOV), Vol. 16. Brockmeyer, Bochum (1982).