Journal of Informetrics 9 (2015) 499–513
Modelling count response variables in informetric studies: Comparison among count, linear, and lognormal regression models

Isola Ajiferuke a,*, Felix Famoye b

a Faculty of Information and Media Studies, University of Western Ontario, London, Canada N6A 5B7
b Department of Mathematics, Central Michigan University, Mount Pleasant, MI 48859, USA
Article info

Article history: Received 7 January 2015; received in revised form 10 May 2015; accepted 11 May 2015.

Keywords: Count response variable; Linear regression model; Count regression models; Negative binomial regression model; Lognormal regression model; Informetric studies

Abstract

The purpose of the study is to compare the performance of count regression models to those of linear and lognormal regression models in modelling count response variables in informetric studies. Identified count response variables in informetric studies include the number of authors, the number of references, the number of views, the number of downloads, and the number of citations received by an article. Also of a count nature are the number of links from and to a website. Data were collected from the United States Patent and Trademark Office (www.uspto.gov), an open access journal (www.informationr.net/ir/), Web of Science, and Maclean's magazine. The datasets were then used to compare the performance of linear and lognormal regression models with those of Poisson, negative binomial, and generalized Poisson regression models. It was found that, due to overdispersion in most response variables, the negative binomial regression model often seems to be more appropriate for informetric datasets than the Poisson and generalized Poisson regression models. Also, the regression analyses showed that the linear regression model predicted some negative values for five of the nine response variables modelled, and for all the response variables it performed worse than both the negative binomial and lognormal regression models when either Akaike's Information Criterion (AIC) or Bayesian Information Criterion (BIC) was used as the measure of goodness of fit. The negative binomial regression model performed significantly better than the lognormal regression model for four of the response variables, while the lognormal regression model performed significantly better than the negative binomial regression model for two of the response variables, but there was no significant difference in the performance of the two models for the remaining three response variables.

© 2015 Elsevier Ltd. All rights reserved.
1. Introduction

In many studies, a linear regression model is often employed for parameter estimation, goodness of fit, or response variable prediction. In a linear regression model, the response variable follows a continuous normal distribution, which implies that predicted values could be any real value, negative or positive. However, for many experiments, surveys, observations, etc., the response variables are of a count nature, i.e. they follow a discrete (or count) distribution.
Using a linear regression model for such variables might lead to negative predicted values. So, how has this problem been addressed in the literature? The approach in many fields of study is to use a count regression model instead of a linear regression model. For example, in health-related studies, count regression models have been used to model the number of incidents of physical aggression or substance abuse (Gagnon, Doron-LaMarca, Bell, O'Farrell, & Taft, 2008), the number of malaria cases (Achcar, Martinez, Pires de Souza, Tachibana, & Flores, 2011), the number of medically attended childhood injuries (Karazsia & van Dulmen, 2008), the number of health benefits received per patient (Czado, Schabenberger, & Erhardt, 2014), and the number of sub-health symptoms (Xu, Li, & Chen, 2011). In other fields of study, they have also been used to estimate recreational trip demands (Wang, Li, Little, & Yang, 2009), the number of auto insurance claims (Meng, 2009), the number of roadway accidents (Nassiri, Najaf, & Amiri, 2014), and the number of hardware failures or occurrences of disease or death (Gulkema & Goffelt, 2008).

In informetric studies, especially in the subfields of citation analysis, patent analysis, webometrics, and altmetrics, possible count response variables include: the number of papers published by a scholar, institution, or country; the number of papers co-authored by scholars, institutions, or countries; the number of citations received by a paper; the number of times two papers are co-cited; the number of other patents referencing a patent; the number of inlinks to a website; the number of co-links to two websites; and the number of views or downloads received by an online paper. So, how have informetric studies modelled these count response variables? Many studies were only interested in the correlation between these variables and some other variables (Bornmann, Schier, Marx, & Daniel, 2012; Buter & van Raan, 2011; Guerrero-Bote & Moya-Anegón, 2014; Jamali & Nikzad, 2011; Kim, 1998; Kreider, 1999; Liu, Fang, & Wang, 2011; Moed, 2005; Nieder, Dalhaug, & Aandahl, 2013; Schlögl & Gorraiz, 2010; Schlögl, Gorraiz, Gumpenberger, Jack, & Kraker, 2014; Tsay, 1998; Yuan & Hua, 2011), while a couple of studies have employed logistic regression models by recoding the count variable into a dichotomous variable (Gargouri et al., 2010; Willis, Bahler, Neuberger, & Dahm, 2011). Many other studies have employed linear regression to model these count variables (Ajiferuke, 2005; Ayres & Vars, 2000; Bornmann & Daniel, 2007; Habibzadeh & Yadollahie, 2010; Landes & Posner, 2000; Lokker, McKibbon, McKinlay, Wilczynski, & Haynes, 2008; Peters & van Raan, 1994; Vaughan & Thelwall, 2005; Willis et al., 2011; Xia, Myers, & Wilhoite, 2011; Yoshikane, Suzuki, Arakama, Ikeuchi, & Tsuji, 2013; Yu, Yu, Li, & Wang, 2014). However, negative binomial regression models, a type of count regression model, have been used in only a few informetric studies (Baccini, Barabesi, Cioni, & Pisani, 2014; Chen, 2012; Didegah & Thelwall, 2013; Lee, Lee, Song, & Lee, 2007; McDonald, 2007; Thelwall & Maflahi, 2015; Walters, 2006; Yoshikane, 2013; Yu & Wu, 2014). These studies, except the one by Yoshikane, did not compare the performance of negative binomial regression models and linear regression models. In the case of Yoshikane, the negative binomial regression and logistic regression models were used to confirm the significant factors found in the linear regression model for patents' cited frequency.
Also, in a recent article by Thelwall and Wilson (2014), the abilities of negative binomial regression, lognormal regression and general linear regression models in detecting factors affecting citation scores were compared. Assuming that citation counts tend to follow a discrete lognormal distribution, the authors simulated discrete lognormal citation data and regressed it with one binary factor. The results of the study showed that "negative binomial regression applied to discrete lognormal data will identify non-existent factors at a higher rate than expected by the significance level" (p. 969), and the authors recommended the following strategy for citation data that follows a discrete lognormal distribution: take the logarithm of the citation data after discarding zeros and then apply the general linear model, or add 1 to the data before taking the logarithm and then use the general linear model. It should be noted that: not all informetric response variables follow a discrete lognormal distribution (in fact, not even all citation data follow the discrete lognormal distribution); the above study made use of simulated data, and not real data; and the study used only one simple factor, and no continuous covariates or multiple factors as is often the case in major studies. Furthermore, there are other reasons, apart from the ability to detect factors affecting the response variable, why one regression model may be preferred to another. Hence, the objectives of this paper are to:

• Illustrate the pitfalls of using linear regression models for count response variables with real data sets, especially given their frequent use in empirical studies by informetric researchers;
• Investigate the suitability of the lognormal regression model in modelling response variables with multiple covariates/factors using real data sets; and
• Compare the performance of linear, count, and lognormal regression models in modelling informetric count response variables.

2. Linear, lognormal, and count regression models

A brief review of linear, lognormal, and three count regression models will be given in this section.

2.1. Linear regression

A linear regression model stipulates that the response variable y can be written as

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + · · · + β_{p−1} x_{i,p−1} + ε_i,  i = 1, 2, 3, . . ., n,  (1)

where the x's are the predictors, the β's are the regression parameters and the error ε_i is assumed to have a normal distribution with mean 0 and constant variance σ². Hence, the response variable y has mean

E(Y_i | x_i) = μ(x_i) = β_0 + β_1 x_i1 + β_2 x_i2 + · · · + β_{p−1} x_{i,p−1}
and constant variance σ². The mean μ(x_i) is a linear function of the regression parameters (β_0, β_1, . . ., β_{p−1}) and its value is any real number, which can be negative or positive. The probability density function of Y_i is given by

f(y_i | x_i) = [1/(σ√(2π))] exp{−[y_i − μ(x_i)]²/(2σ²)}.  (2)
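For illustration, a hedged R sketch of fitting the linear model in Eq. (1) to a count response is given below. The data frame dat, the response y and the covariates x1 and x2 are hypothetical placeholders (not variables from this study); the same placeholders are reused in the later sketches. The last line counts how many fitted values are negative, which is the pitfall for count data discussed in Section 1.

    # Hypothetical data frame `dat` with a count response y and covariates x1, x2.
    lm_fit <- lm(y ~ x1 + x2, data = dat)
    summary(lm_fit)

    # A linear model places no constraint on the sign of its fitted values,
    # so some predicted "counts" may be negative:
    sum(fitted(lm_fit) < 0)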
2.2. Lognormal regression model

In general, a count response variable y is skewed. Many authors, including Thelwall and Wilson (2014), have suggested using a (discrete) lognormal distribution. To the best of our knowledge, a discrete lognormal distribution is yet to be defined in the literature. In this sub-section, we define the continuous lognormal distribution that has been used to approximate the discrete version. If Y_i has a lognormal distribution, its probability density function is given by

f(y_i | x_i) = [1/(σ y_i √(2π))] exp{−[log(y_i) − μ(x_i)]²/(2σ²)},  (3)

where the parameter σ² > 0 and μ(x_i) = β_0 + β_1 x_i1 + β_2 x_i2 + · · · + β_{p−1} x_{i,p−1}. The mean and variance of the response variable Y_i are respectively given by E(Y_i | x_i) = exp[μ(x_i) + σ²/2] and V(Y_i | x_i) = [exp(σ²) − 1] × exp[2μ(x_i) + σ²]. One problem with applying the lognormal regression model is that the model is not defined for Y = 0. If observations of zero are present in the data, the problem can be avoided by adding 1 to all values before evaluating Eq. (3).
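A minimal sketch of fitting this lognormal regression in R, with the same hypothetical dat, y, x1 and x2: because log(Y + 1) is normal under the model, ordinary least squares on the log scale gives the maximum likelihood estimates of the β's, and the lognormal log-likelihood of Eq. (3) is recovered by subtracting the Jacobian term Σ log(y_i + 1).

    # Hypothetical data as before; add 1 to handle zeros, as described above.
    dat$logy1 <- log(dat$y + 1)

    ln_fit <- lm(logy1 ~ x1 + x2, data = dat)   # ML estimates of the betas

    # Log-likelihood of the lognormal model for (y + 1), Eq. (3):
    # normal log-likelihood on the log scale minus the Jacobian term sum(log(y + 1)).
    ll_lognormal <- as.numeric(logLik(ln_fit)) - sum(dat$logy1)
    ll_lognormal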
2.3. Count regression models

The three count regression models to be considered in this paper are the Poisson regression model, the negative binomial regression model, and the generalized Poisson regression model. These are briefly described below.

2.3.1. Poisson regression model

The Poisson distribution is perhaps the most used discrete distribution because of its simplicity. The Poisson probability mass function is given by

P(Y = y) = λ^y e^{−λ}/y!,  y = 0, 1, 2, . . .

The mean and variance of the Poisson distribution are equal to λ. Thus, for Poisson regression, we have

E(Y_i) = λ = μ(x_i) = exp(β_0 + β_1 x_i1 + β_2 x_i2 + · · · + β_{p−1} x_{i,p−1}).  (4)

Hence, the Poisson regression model is

P(Y = y_i | x_i) = [μ(x_i)]^{y_i} e^{−μ(x_i)}/y_i!  for y_i = 0, 1, 2, . . .  (5)

The mean and variance of the Poisson regression model in Eq. (5) are equal to μ(x_i). Thus, the Poisson regression model is appropriate for data where the mean and variance are about equal. The values of a count response variable y are non-negative. Hence, the mean function μ(x_i) ensures that the predicted values of y are also non-negative. This is unlike the linear regression in Eq. (1), where the mean function is linear and can be negative. A drawback of using a linear regression model to fit a non-negative count response variable y is that a predicted value of y could be negative.
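In R, a Poisson regression with the log link of Eq. (4) can be fitted with glm(); a sketch with the hypothetical data used in the earlier snippets follows. The final line is a rough over-dispersion check: the Pearson chi-square divided by the residual degrees of freedom should be close to 1 if the Poisson assumption (mean equal to variance) is adequate.

    # Poisson regression with log link (Eqs. (4)-(5)); hypothetical data as before.
    pois_fit <- glm(y ~ x1 + x2, family = poisson(link = "log"), data = dat)
    summary(pois_fit)

    # Rough over-dispersion diagnostic: values well above 1 suggest over-dispersion.
    sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)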
2.3.2. Negative binomial regression model

While the Poisson distribution is often the first to be considered for fitting count data, if the mean is very much less than the variance of the data, then the data are over-dispersed with respect to the Poisson model. One way to correct for over-dispersion is to fit a negative binomial model as an alternative to the over-dispersed Poisson model. The probability mass function for the negative binomial distribution (NBD) is given by

P(Y = y) = [(1/a + y − 1) choose y] θ^y (1 − θ)^{1/a},  y = 0, 1, 2, . . .  (6)

The mean and variance of the above model are respectively given by μ = θ/[a(1 − θ)] and σ² = θ/[a(1 − θ)²]. From the mean and variance, we see that the NBD is over-dispersed, that is, the variance exceeds the mean. Suppose the mean of the NBD depends on some predictor variables x_i; then we can write μ(x_i) = θ/[a(1 − θ)], which gives θ = aμ(x_i)/[1 + aμ(x_i)]. Thus, the negative binomial regression (NBR) model can be written as

P(Y = y_i | x_i) = [(1/a + y_i − 1) choose y_i] [aμ(x_i)/(1 + aμ(x_i))]^{y_i} [1/(1 + aμ(x_i))]^{1/a},  y_i = 0, 1, 2, . . .  (7)

The mean and variance of the negative binomial regression model are respectively given by E(Y) = μ(x_i) and V(Y) = μ(x_i)[1 + aμ(x_i)]. The NBR model reduces to the Poisson regression model when the dispersion parameter a goes to zero.
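A hedged sketch of fitting the NBR model of Eq. (7) in R, with the hypothetical data as before, is to use glm.nb() from the MASS package; note that glm.nb() reports a parameter theta with Var(Y) = μ + μ²/theta, so theta corresponds to 1/a in the notation used here.

    library(MASS)

    # Negative binomial regression with log link (Eq. (7)); hypothetical data as before.
    nb_fit <- glm.nb(y ~ x1 + x2, data = dat)
    summary(nb_fit)

    # glm.nb() estimates theta, where Var(Y) = mu + mu^2/theta,
    # so the dispersion parameter a of Eq. (7) corresponds to 1/theta.
    1 / nb_fit$theta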
2.3.3. Generalized Poisson regression model

The probability mass function for the generalized Poisson distribution (GPD) is given by

P(Y = y) = θ^y (1 + ay)^{y−1} e^{−θ(1+ay)}/y!,  y = 0, 1, 2, . . .  (8)
The parameter θ > 0 and the parameter a can be negative or positive. The mean and variance of the model in Eq. (8) are respectively given by μ = θ/(1 − aθ) and σ² = θ/(1 − aθ)³. Suppose the mean of the GPD depends on some predictor variables x_i; then we can write μ(x_i) = θ/(1 − aθ), and this implies that θ = μ(x_i)/[1 + aμ(x_i)]. Thus the generalized Poisson regression (GPR) model can be written as

P(Y = y_i | x_i) = [μ(x_i)/(1 + aμ(x_i))]^{y_i} [(1 + ay_i)^{y_i−1}/y_i!] exp[−μ(x_i)(1 + ay_i)/(1 + aμ(x_i))],  y_i = 0, 1, 2, . . .  (9)

The mean and variance of the GPR model are respectively given by E(Y) = μ(x_i) and V(Y) = μ(x_i)[1 + aμ(x_i)]². Famoye (1993) and Wang and Famoye (1997) considered the GPR model for over-dispersed data. The GPR model reduces to the Poisson regression model when the dispersion parameter a = 0. If a > 0, then V(Y_i) > E(Y_i) and the GPR will model count data with over-dispersion. Similarly, when a < 0, then V(Y_i) < E(Y_i) and the GPR will model under-dispersed count data.
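Generalized Poisson regression is less widely packaged than the Poisson and negative binomial models. One transparent option, sketched below under the same hypothetical data assumptions (dat, y, x1, x2), is to maximize the log-likelihood implied by Eq. (9) directly; the sketch uses a log link for μ(x_i) and assumes a > 0 (over-dispersed data).

    # Negative log-likelihood of the GPR model in Eq. (9), with log link mu = exp(X beta)
    # and theta = mu/(1 + a*mu); hypothetical data as before.
    gpr_negloglik <- function(par, y, X) {
      beta  <- par[seq_len(ncol(X))]
      a     <- par[length(par)]                 # dispersion parameter (assumed > 0 here)
      mu    <- as.vector(exp(X %*% beta))
      theta <- mu / (1 + a * mu)
      ll <- y * log(theta) + (y - 1) * log(1 + a * y) -
            theta * (1 + a * y) - lgamma(y + 1)
      -sum(ll)
    }

    X <- model.matrix(~ x1 + x2, data = dat)
    start <- c(coef(glm(y ~ x1 + x2, family = poisson, data = dat)), a = 0.1)
    gpr_fit <- optim(start, gpr_negloglik, y = dat$y, X = X, hessian = TRUE)
    gpr_fit$par                                 # beta estimates and dispersion a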
2.4. Modified count data regression models

A modified count data regression model can be in the form of an inflated, truncated, censored, or grouped count data model. The first two types are described below.

2.4.1. Zero-inflated count regression models

Over-dispersion in, for example, a Poisson model, as observed in the previous sub-section, can also arise as a result of more occurrences of zeros than would normally be expected from a Poisson model. That is, there could be more zeros than can be assumed theoretically or expected under such a model. If this happens, we say that the Poisson model is, in this case, zero inflated. In some cases, these zeros can be "structural" zeros, for which it is impossible to observe an occurrence. For instance, if we are using a particular citation database to determine the number of citations received by a set of publications, there might be some publications not indexed in that database. Such publications will have their number of citations recorded as zero, but in actual fact such zeros are structural, since the citations cannot be determined in that database. On the other hand, there might be some other publications that are indexed in that database but have not been cited; such publications would have a count of zero, and the zeros in this case would be referred to as "sampling" zeros.

Consider a discrete distribution with mean (parameter) μ. The zero-inflated discrete distribution can be viewed as a mixed discrete distribution, comprising an inflated part of extra zeros and another part consisting of the counts, including zeros, that would normally be expected under the discrete model. A zero-inflated discrete model has a probability function of the form

P(Y_i = y_i) = ω + (1 − ω)f(0)  for y_i = 0,
P(Y_i = y_i) = (1 − ω)f(y_i)  for y_i > 0,  (10)
where f(y_i) may be the Poisson, negative binomial or generalized Poisson distribution and 0 ≤ ω < 1. Suppose we have E(Y_i) = μ_i and V(Y_i) = σ*² for the non-inflated model; then the mean and variance of the zero-inflated model can be written as E(Y_i) = (1 − ω)μ_i and V(Y_i) = (1 − ω)(σ*² + ωμ_i²), respectively. Hence, we have the following for the zero-inflated models:

Name                                    | Mean        | Variance
Zero-inflated Poisson model             | (1 − ω)μ_i  | (1 − ω)μ_i(1 + ωμ_i)
Zero-inflated negative binomial model   | (1 − ω)μ_i  | (1 − ω)μ_i(1 + aμ_i + ωμ_i)
Zero-inflated generalized Poisson model | (1 − ω)μ_i  | (1 − ω)μ_i[(1 + aμ_i)² + ωμ_i]

Since 0 ≤ ω < 1, the variances of the above zero-inflated models always exceed their corresponding means. Hence, a zero-inflated model may account for both extra zeros and over-dispersion simultaneously.
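A zero-inflated negative binomial model of the form (10) can be fitted, for example, with zeroinfl() in the pscl package; the sketch below keeps the inflation probability ω constant, which is how ω is treated in this paper (hypothetical data as before).

    library(pscl)

    # Zero-inflated negative binomial model (Eq. (10) with f = negative binomial).
    # The part after "|" models the zero-inflation probability; "1" keeps omega constant.
    zinb_fit <- zeroinfl(y ~ x1 + x2 | 1, dist = "negbin", data = dat)
    summary(zinb_fit)

    # Estimated omega (constant zero-inflation probability) on the probability scale:
    plogis(coef(zinb_fit, model = "zero"))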
2.4.2. Zero-truncated count models

Zero-truncated regression models arise in situations where there is no zero by the nature of the data. An example is the number of authors of a publication. Once a document is published, the authorship has to be attributed to at least one entity, whether a person or an institution; in a situation where the author is anonymous, the number of authors still has to be regarded as one. Thus zeros cannot be observed in this case for any publication. For a random variable Y with a discrete distribution, where the value Y = 0 cannot be observed, the zero-truncated random variable Y_t has the probability mass function

P(Y_t = y) = P(Y = y)/P(Y > 0),  y = 1, 2, 3, . . .  (11)
For the zero-truncated Poisson with parameter λ, P(Y > 0) = 1 − P(Y = 0) = 1 − e^{−λ}. Hence, the probability mass function of the zero-truncated Poisson random variable Y_t is

P(Y_t = y) = λ^y e^{−λ}/[y!(1 − e^{−λ})],  y = 1, 2, 3, . . .  (12)

Note that the mean of the zero-truncated Poisson model is not equal to λ; the mean is actually λ/(1 − e^{−λ}). Thus, if λ is the mean of an un-truncated model, the mean of the corresponding zero-truncated model is given by λ/P(Y > 0). The probability mass functions for the zero-truncated negative binomial model and the zero-truncated generalized Poisson model can be similarly defined. All three models can be used in count data regression when the data are truncated at zero. Here, only the zero-truncated Poisson regression (ZTPR) model is defined. It is given by

P(Y_t = y_i | x_i) = [μ(x_i)]^{y_i} e^{−μ(x_i)}/[y_i!(1 − e^{−μ(x_i)})],  y_i = 1, 2, 3, . . .  (13)
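Zero-truncated models can be fitted, for instance, with the VGAM package, which provides "positive" (zero-truncated) Poisson and negative binomial families; a hedged sketch with the hypothetical data (where the response is at least 1 by construction, as for a number-of-authors variable) follows.

    library(VGAM)

    # Zero-truncated Poisson regression (Eq. (13)); the response must be >= 1.
    ztp_fit <- vglm(y ~ x1 + x2, family = pospoisson(), data = dat)
    summary(ztp_fit)

    # Zero-truncated negative binomial regression is analogous:
    ztnb_fit <- vglm(y ~ x1 + x2, family = posnegbinomial(), data = dat)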
2.4.3. Hurdle regression models

The hurdle model (King, 1989; Mullahy, 1986) for a Poisson distribution is a mixture model like the zero-inflated Poisson model in Eq. (10). In the hurdle model, a binomial probability governs the binary outcome that a count response variable is zero or greater than zero (i.e. positive). This is often called the transition stage. If the value of the response variable is positive, the hurdle is crossed, and the conditional distribution of the positives is the zero-truncated Poisson distribution in Eq. (13). Thus, one distribution governs the zeros while another distribution governs the non-zero counts; in the zero-inflated model, the same distribution addresses all counts. For the hurdle Poisson model, we have

P(Y_i = y_i) = 1 − ω  for y_i = 0,
P(Y_i = y_i) = ω f_T(y_i)  for y_i > 0,  (14)

where f_T(y_i) is the zero-truncated Poisson model in Eq. (13). In a similar way, other hurdle models can be defined. The ω in Eq. (14) is taken as a constant in this paper because the ω for the zero-inflated model in Eq. (10) is a constant. One can assume that both ω's are functions of covariates by using ω(x_i).
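A hurdle negative binomial model of the form (14), with a constant ω as assumed here, can be fitted for example with hurdle() in the pscl package (hypothetical data as before).

    library(pscl)

    # Hurdle model (Eq. (14)): a binary hurdle part for zero vs. positive counts,
    # and a zero-truncated negative binomial part for the positive counts.
    # "| 1" keeps the hurdle probability constant, matching the constant omega above.
    hnb_fit <- hurdle(y ~ x1 + x2 | 1, dist = "negbin", data = dat)
    summary(hnb_fit)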
2.5. Estimation and goodness-of-fit statistics

Suppose we observe y_i (i = 1, 2, . . ., n), count response variables, each with predictor variables x_i1, x_i2, . . ., x_{i,p−1}. The likelihood function for the Poisson regression model is the product of the respective probabilities in Eq. (5), and this is given by

L(β_0, β_1, . . ., β_{p−1}) = ∏_{i=1}^{n} [μ(x_i)]^{y_i} e^{−μ(x_i)}/y_i!.  (15)
The log-likelihood function is obtained by taking the log of Eq. (15) to obtain

log(L) = log L(β_0, β_1, . . ., β_{p−1}) = Σ_{i=1}^{n} {y_i log[μ(x_i)] − μ(x_i) − log(y_i!)}.  (16)
The parameters in the count regression models are estimated by the method of maximum likelihood. In this method, the function (16) is maximized over the parameters by using statistical software such as SAS, SPSS or R. The log-likelihood functions for NBR and GPR can be obtained from Eqs. (7) and (9), respectively.

A common goodness-of-fit statistic for count regression models is the Pearson's χ² statistic defined by

χ² = Σ_{i=1}^{n} (y_i − μ̂_i)²/v(μ̂_i),  (17)

where v(μ̂_i) is the variance function evaluated at the estimated mean. One can also use the log-likelihood statistic in Eq. (16) as a goodness-of-fit statistic. However, both the log-likelihood statistic and the Pearson's chi-square in Eq. (17) do not take into consideration the number of estimated parameters in the model. For the linear regression, the lognormal regression, the NBR, and the GPR models, the number of estimated parameters is p* = p + 1; the extra 1 comes from the parameter σ² in the linear and lognormal regression models, and from the dispersion parameter in the NBR and GPR models. For goodness-of-fit statistics, we use the Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC), both of which take into consideration p*, the number of estimated parameters. In general, the higher the number of model parameters, the greater the log-likelihood. Both AIC and BIC penalize for the number of parameters; in addition, the BIC takes into consideration the number of observations in the data set. The AIC is defined as

AIC = −2 log(L) + 2p*,  (18)
while the BIC is defined as

BIC = −2 log(L) + p* log(n), where n is the sample size.  (19)

For both goodness-of-fit statistics in Eqs. (18) and (19), the smaller the value, the better the regression model. The method of maximum likelihood estimation is used to estimate the parameters in the linear regression model in Eq. (1), since the goodness-of-fit measures in Eqs. (18) and (19) depend on the log-likelihood statistic. Also, the parameters in the lognormal regression model are estimated by the method of maximum likelihood. The log-likelihood functions for the linear and lognormal regression models can be obtained from Eqs. (2) and (3), respectively.

2.6. Model comparisons

When two models are nested, one can use the likelihood ratio test to compare them. However, for non-nested models, one can apply a test proposed by Vuong (1989). The statistic for the Vuong test has an approximate standard normal distribution; for more details, see Vuong (1989). In this study, we apply the test to compare the NBR and the lognormal regression models. Our null hypothesis is that the NBR and lognormal regression models are equivalent. If the NBR model is better, the test statistic will be positive, and if it exceeds 1.96, then the NBR model is better at the 5% level of significance. On the other hand, if the test statistic is negative, it favours the lognormal regression model; if the test statistic is less than −1.96, the lognormal model is better at the 5% level of significance.
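The sketch below, based on the hypothetical fits nb_fit and ln_fit from the earlier snippets, shows how the log-likelihood, AIC and BIC of Eqs. (16), (18) and (19) are extracted in R, and a basic, uncorrected version of the Vuong statistic described above, computed from per-observation log-likelihood contributions; the Jacobian term makes the lognormal contributions comparable on the scale of the original counts.

    # Goodness-of-fit quantities for any fitted model object:
    logLik(nb_fit);  AIC(nb_fit);  BIC(nb_fit)

    # Hand-rolled Vuong statistic comparing the negative binomial fit with the
    # lognormal fit on log(y + 1); hypothetical fits nb_fit and ln_fit as before.
    ll_nb <- dnbinom(dat$y, size = nb_fit$theta, mu = fitted(nb_fit), log = TRUE)
    sigma_ml <- sqrt(mean(residuals(ln_fit)^2))          # ML estimate of sigma
    ll_ln <- dnorm(log(dat$y + 1), mean = fitted(ln_fit), sd = sigma_ml, log = TRUE) -
             log(dat$y + 1)                              # Jacobian term for the lognormal

    m <- ll_nb - ll_ln
    vuong_stat <- sqrt(length(m)) * mean(m) / sd(m)
    vuong_stat   # > 1.96 favours NBR, < -1.96 favours lognormal (5% level)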
3. Methodology

3.1. Data collection

Four datasets were collected for this study to represent four broad informetric subfields. The first dataset consisted of data on Google patents. We searched the patent database at the United States Patent and Trademark Office (USPTO) website for patents assigned to Google (assignee name) and issued between January 1, 2001 and December 31, 2010 (issue date). The dates were chosen to ensure that we have a manageable recent dataset as well as to allow enough time to have passed for the patents to attract citations. The number of patents retrieved was 560, and the following variables were collected on each patent: issue year, number of inventors, number of US patents cited, number of foreign patents cited, number of other references, number of claims, and number of patents citing it.

The second dataset partly represents the altmetrics subfield and was collected from the website of Information Research: an international electronic journal (http://www.informationr.net/ir/). The first issue of the journal was published in 1995/96, but the journal did not record the number of times an article was viewed prior to the year 2000. Hence, data were collected for the period 2000–2011 (2011 was chosen as the end date to allow the articles to attract a reasonable number of views and citations). Each volume contains 4 issues; 2 issues were randomly selected from each volume, and data were collected on all refereed papers (working papers and workshop papers excluded) contained in each selected issue. For each article, the following data were collected: year of publication, number of authors, number of references, number of views, number of citations from Google Scholar (a link to Google Scholar was provided on the website), and number of citations from Scopus (no link to Scopus was provided, but we collected this data to see the number of citations received by these articles from one of the proprietary citation databases; Scopus was chosen over Web of Science because it contains more of the articles than the latter). If an article was not found in Google Scholar (likewise Scopus), then the number of citations received from that database was recorded as zero.
We also collected the number of readers for each article from Mendeley.com, where the number of readers represents the total number of Mendeley users who have this article in their Mendeley personal library. Again, if an article was not found in Mendeley.com, the number of readers was assumed to be zero.

The third dataset was more on the general area of informetrics and was collected from the Web of Science. The database was searched for documents containing the phrase "knowledge management" in the title field and published between 2001 and 2010 (again, dates chosen to include recent documents with adequate time to attract citations). The retrieved set of 4075 documents was refined to include only articles (thus excluding meeting, review, editorial, and abstract document types). For each of the 1337 articles (1338 in fact, but with one duplicate) retrieved, the following data were collected: number of characters in the title, number of authors, year of publication, number of pages, number of references, and number of citations.

The fourth dataset represents the webometrics subfield and was on inlinks to Canadian university websites. Even though the Association of Universities and Colleges of Canada listed 97 institutions in its 2014 Directory of Canadian Universities (Association of Universities and Colleges, 2014), some of these institutions are regarded as having a strictly religious or specialized mission. Hence, only the 49 major universities listed in the latest Maclean's annual ranking of universities (Honey, 2014) were initially chosen for inclusion in the dataset. Maclean's was our source of university type classification (i.e. Medical Doctoral, Comprehensive, or Primarily Undergraduate), the amount of social sciences and humanities grants, the amount of medical/science grants, and total research dollars. The student size for each university was obtained from the Association of Universities and Colleges website, while the faculty sizes for universities in Ontario, Atlantic, and Quebec provinces were taken from Common University Data Ontario (http://www.cou.on.ca/statistics/cudo.aspx), the Atlantic Common University Data set (http://www.atlanticuniversities.ca/statistics/atlantic-common-university-data-set), and Common University Data-Quebec
(http://www.crepuq.qc.ca/EducQ/PCan.html), respectively. For the universities not belonging to any of the three regional bodies, each university website was visited to obtain the faculty size. The Google search engine was used to collect data on the number of web pages indexed from each university website and the number of inlinks to the website. However, we were unable to obtain the number of inlinks for five of the universities. Hence, only 44 universities were included in the dataset.

3.2. Data analysis

For each dataset having year of publication or issue as a variable, a new variable, the number of years since publication (Ysince), was computed by subtracting the year of publication from 2014. For the Google patent data, the following regression analysis was done:
(i) Response variable: Citations. Explanatory variables: Ysince, Inventor, US Patents, Foreign Patents, OtherRef, Claims.

For the Information research dataset, the following regression analyses were done:

(i) Response variable: Views. Explanatory variables: Ysince, Authors, and References.
(ii) Response variable: Readers. Explanatory variables: Ysince, Authors, References, and Views.
(iii) Response variable: Number of Citations from Google Scholar (Citation Google). Explanatory variables: Ysince, Authors, References, Views, and Readers.
(iv) Response variable: Number of Citations from Scopus (Citation Scopus). Explanatory variables: Ysince, Authors, References, Views, and Readers.

For the knowledge management dataset, the following regression analyses were done:

(i) Response variable: References. Explanatory variables: Title Characters, Authors, and Pages.
(ii) Response variable: Authors. Explanatory variables: Title Characters, References, and Pages.
(iii) Response variable: Citations. Explanatory variables: Title Characters, Authors, Ysince, Pages, and References.

For the Canadian Universities dataset, the following regression analysis was done:

(i) Response variable: Inlinks. Explanatory variables: Pages, Type (this is a categorical variable and was represented in the regression analyses by dummy variables Type1 and Type2), SocialScience Grants, Science Grants, Research Grants, Student Size, and Faculty Size.
SPSS was used to perform the linear regression analyses, while programs written in SAS were used to perform the lognormal and count model regression analyses. The backward selection method was used to obtain the best regression model; only variables that were significant at the 5% level were retained in the final best regression models.

4. Results

4.1. Model estimation and goodness of fit

4.1.1. Google patent dataset

For the Google patent dataset, the response variable being modelled is the number of citations received by a patent (i.e. the number of other patents citing the patent). From Appendix A, it can be seen that the variance for this variable is much greater than its mean. This implies over-dispersion, and the most appropriate count regression model to compare with the other two models is the negative binomial regression model. The results of the regression analyses showed that the number of years since the issue of a patent (Ysince) and the number of US patents cited were significant in the linear and lognormal regression models, while the number of other references along with those two variables were found significant in the negative binomial regression model (see Table 1). In addition, all the goodness of fit statistics for the negative binomial regression model were better than those for the linear regression model but slightly worse than those for the lognormal model.

4.1.2. Information research dataset

For the information research dataset, the response variables being modelled are the number of views, number of readers, number of citations from Google Scholar, and number of citations from Scopus. From Appendix B, we can see that the variance of each of these response variables is much greater than its mean, i.e. over-dispersion. Hence, the most appropriate count regression model to compare with the other two models is the negative binomial regression model. However, in the case of the number of citations from Google Scholar, we used the zero-inflated negative binomial regression model. The reason is that 74 out of 144 cases (i.e. about 51.39%) had zero values for the number of citations from Google Scholar. Inflated zeros were assumed because zero was recorded in certain cases where the paper was not found in Google Scholar, and not because the number of citations was zero. It should also be noted that for many of these cases, Scopus had a non-zero number of citations; in fact, there were only 13 cases with zero citations from Scopus. We also tried the hurdle negative binomial regression model for this response variable. The regression analyses showed that: for the number of views as
Table 1. Parameter estimates and fit statistics for response variable citations (Google patent dataset).

Significant parameters* | Linear regression model | NBR model     | Lognormal model***
Constant                | −52.765 (4.277)**       | 0.056 (0.182) | −0.181 (0.171)
Ysince                  | 13.495 (0.798)          | 0.467 (0.033) | 0.443 (0.032)
US Patents              | 0.098 (0.023)           | 0.004 (0.001) | 0.006 (0.001)
OtherRef                |                         | 0.006 (0.003) |
Dispersion/scale        | 24.900 (0.744)          | 0.991 (0.062) | 0.997 (0.03)
AIC                     | 5197.9                  | 4063.1        | 4056.9
BIC                     | 5215.2                  | 4084.8        | 4074.2

* Significant parameters refer to the ones statistically significant at 0.05 level.
** Standard error of the estimate in parenthesis.
*** 1 was added to the response variable count prior to fitting the lognormal model.
Table 2. Parameter estimates and fit statistics for response variable number of views (information research dataset).

Significant parameters | Linear regression model* | NBR model     | Lognormal model
Constant               | −12521.46 (7048.817)     | 6.653 (0.227) | 7.153 (0.178)
Ysince                 | 3494.583 (772.031)       | 0.285 (0.021) | 0.222 (0.019)
References             |                          | 0.008 (0.003) |
Dispersion/scale       | 31291.0 (1842.2)         | 0.619 (0.067) | 0.796 (0.047)
AIC                    | 3395.9                   | 2978.1        | 2950.4
BIC                    | 3404.8                   | 2990.0        | 2959.3

* 13 negative predicted values for the response variable.
Table 3. Parameter estimates and fit statistics for response variable number of readers (information research dataset).

Significant parameters | Linear regression model* | NBR model          | Lognormal model**
Constant               | 13.666 (3.882)           | 2.414 (0.248)      | 2.222 (0.232)
Ysince                 | −1.321 (0.45)            | −0.052 (0.029)     | −0.067 (0.027)
Views                  | 5.476E−5 (4.57E−5)       | 1.674E−5 (4.12E−6) | 1.38E−5 (2.736E−6)
Dispersion/scale       | 17.175 (1.012)           | 1.121 (0.147)      | 1.028 (0.061)
AIC                    | 1235.6                   | 960.7              | 968.3
BIC                    | 1247.4                   | 972.6              | 980.2

* 6 negative predicted values for the response variable.
** 1 was added to the response variable count prior to fitting the lognormal model.
Table 4. Parameter estimates and fit statistics for response variable number of citations from Google Scholar (information research dataset).

Significant parameters | Linear regression model* | Zero-inflated NBR model | Hurdle NBR model | Lognormal model**
Constant               | −17.603 (2.959)          | 1.707 (0.315)           | 1.392 (0.298)    | 1.078 (0.142)
Views                  | 9.81E−4 (1.07E−4)        | 6.53E−6 (3.75E−6)       |                  |
Readers                | 2.067 (0.147)            | 0.019 (0.006)           | 0.023 (0.006)    | 0.030 (0.005)
Ysince                 |                          | 0.137 (0.04)            | 0.187 (0.035)    |
Dispersion/scale       | 31.210 (1.839)           | 0.729 (0.145)           | 0.796 (0.162)    | 1.537 (0.091)
Omega                  |                          | 0.503 (0.043)           | 0.486 (0.006)    |
AIC                    | 1407.6                   | 800.0                   | 800.8            | 951.9
BIC                    | 1419.5                   | 817.8                   | 815.7            | 960.8

* 55 negative predicted values for the response variable.
** 1 was added to the response variable count prior to fitting the lognormal model.
response variable, Ysince was the only significant variable in the linear and lognormal regression models while the negative binomial regression model had Ysince and the number of references; for the number of readers, the three models had Ysince and number of views as significant variables; for the number of citations from Google Scholar, all the models had the number of readers as a significant variable, linear regression model had the number of views as an additional significant variable, hurdle negative binomial regression model had Ysince as an additional significant variable while the zero-inflated negative binomial regression model had both the number of views and Ysince as additional significant variables; and for the number of citations from Scopus, the linear model had Ysince, number of views and number of readers as significant variables while both the negative binomial and lognormal regression models had only Ysince and number of readers as significant variables (see Tables 2–5). For all the four response variables, some of the predicted values from the linear regression models were
Table 5. Parameter estimates and fit statistics for response variable number of citations from Scopus (information research dataset).

Significant parameters | Linear regression model* | NBR model     | Lognormal model**
Constant               | −4.864 (2.312)           | 0.961 (0.237) | 1.036 (0.209)
Ysince                 | 0.718 (0.265)            | 0.107 (0.025) | 0.078 (0.023)
Readers                | 0.691 (0.048)            | 0.028 (0.006) | 0.021 (0.003)
Views                  | 1.87E−4 (3.69E−5)        |               |
Dispersion/scale       | 9.813 (0.578)            | 0.829 (0.113) | 0.929 (0.055)
AIC                    | 1076.4                   | 944.6         | 955.9
BIC                    | 1091.2                   | 956.4         | 967.8

* 7 negative predicted values for the response variable.
** 1 was added to the response variable count prior to fitting the lognormal model.
Table 6. Parameter estimates and fit statistics for response variable number of references (knowledge management dataset).

Significant parameters | Linear regression model | NBR model     | Lognormal model*
Constant               | −0.697 (1.891)          | 1.905 (0.097) | 1.315 (0.099)
Title Characters       | 0.074 (0.019)           | 0.004 (0.001) | 0.004 (0.001)
Pages                  | 1.949 (0.085)           | 0.086 (0.005) | 0.102 (0.004)
Dispersion/scale       | 19.909 (0.385)          | 0.753 (0.033) | 1.048 (0.02)
AIC                    | 11801.0                 | 11431.0       | 11751.0
BIC                    | 11821.0                 | 11452.0       | 11771.0

* 1 was added to the response variable count prior to fitting the lognormal model.
Table 7. Parameter estimates and fit statistics for response variable number of authors (knowledge management dataset).

Significant parameters | Linear regression model | Zero-truncated GPR model | Zero-truncated NBR model | Lognormal model
Constant               | 2.299 (0.134)           | 0.660 (0.072)            | 0.661 (0.072)            | 0.596 (0.047)
Title Characters       | 0.005 (0.001)           | 0.002 (7.7E−4)           | 0.002 (7.66E−4)          | 0.002 (5.3E−4)
References             | −0.004 (0.002)          | −0.002 (9.62E−4)         | −0.002 (9.6E−4)          |
Dispersion/scale       | 1.519 (0.029)           | 0.043 (0.012)            | 0.086 (0.025)            | 0.55 (0.011)
AIC                    | 4919.2                  | 4411.9                   | 4412.8                   | 4315.8
BIC                    | 4940.0                  | 4432.7                   | 4433.6                   | 4331.4
negative, which shows that the linear models may not be appropriate for these count response variables. Furthermore, the AIC and BIC values showed that the negative binomial regression model performed better than the lognormal regression model in all cases except for the number of views as the response variable. In the case of the number of citations from Google Scholar, there was no significant difference in the performance of the zero-inflated and the hurdle negative binomial regression models. What is most striking about the results for the two models is the remarkable similarity of the coefficient estimates and their standard errors.

4.1.3. Knowledge management dataset

For the knowledge management dataset, the response variables being modelled are the number of references, the number of authors, and the number of citations. While the number of authors is not usually considered as a response variable in most informetric studies, it was included to showcase the use of a zero-truncated count regression model, as the minimum number of authors is 1; in the case of an anonymous author, the number of authors is assumed to be one. From Appendix C, the variance for the number of references or the number of citations is greater than the mean; hence the negative binomial regression model is used in comparison with the other two models in both cases. However, for the number of authors, the variance is slightly lower than the mean, i.e. there is a small under-dispersion. Hence, the zero-truncated generalized Poisson regression is the most appropriate count regression model, but we used it along with the zero-truncated negative binomial regression, linear regression and lognormal regression to model the number of authors. For the number of references as response variable, all three models included the number of title characters and the number of pages as significant variables (see Table 6). For the number of authors as response variable, three of the models considered (i.e. linear, zero-truncated generalized Poisson and zero-truncated negative binomial) included the number of title characters and the number of references as significant variables, while the lognormal regression model included only the number of title characters (see Table 7). Also, in the case of the number of citations, all three regression models included Ysince and the number of references as significant variables (see Table 8). The comparison of goodness of fit statistics revealed that, for all three response variables, the linear model performed worse than the other regression models, while the count regression model outperformed the lognormal in two cases, with the lognormal being superior in the remaining case. In addition, in the case of the number of citations as response
Table 8. Parameter estimates and fit statistics for response variable number of citations (knowledge management dataset).

Significant parameters | Linear regression model* | NBR model      | Lognormal model**
Constant               | −17.14 (1.936)           | −0.160 (0.141) | −0.056 (0.113)
Ysince                 | 2.203 (0.188)            | 0.168 (0.013)  | 0.106 (0.011)
References             | 0.347 (0.023)            | 0.031 (0.002)  | 0.028 (0.001)
Dispersion/scale       | 19.164 (0.371)           | 1.811 (0.078)  | 1.119 (0.022)
AIC                    | 11699.0                  | 8419.0         | 8525.5
BIC                    | 11720.0                  | 8439.8         | 8546.3

* 115 negative predicted values for the response variable.
** 1 was added to the response variable count prior to fitting the lognormal model.
Table 9. Parameter estimates and fit statistics for response variable number of inlinks (Canadian universities dataset).

Significant parameters | Linear regression model | NBR model         | Lognormal model
Constant               | 73.03 (56.634)          | 6.018 (0.135)     | 6.135 (0.125)
Research Grants        | 0.002 (7.94E−4)         | 2.32E−6 (6.79E−7) | 2.90E−6 (6.46E−7)
Student Size           | 0.013 (0.003)           | 1.12E−5 (3.56E−6) | 1.13E−5 (3.80E−6)
Science Grants         | 0.005 (0.002)           |                   | 0
SocialScience Grants   |                         | 3.03E−5 (1.39E−6) |
Type2                  |                         | −0.619 (0.122)    | −0.682 (0.125)
Dispersion/Scale       | 224.33 (23.91)          | 0.064 (0.014)     | 0.28 (0.03)
AIC                    | 611.2                   | 586.6             | 591.0
BIC                    | 620.1                   | 597.3             | 599.9
variable, the linear model predicted some negative values. Also, in the case of the number of authors as the response variable, the performance of the zero-truncated NBR model was comparable to that of the zero-truncated GPR model.

4.1.4. Canadian universities dataset

For the Canadian universities dataset, the only response variable being modelled is the number of inlinks to the university website. From Appendix D, it can be seen that the variance for this variable is much greater than its mean. So, again we had to use the negative binomial regression model for comparison with the other two models. The results of the regression analyses show that the amount of Science grants, the total amount of research grants and the student size were significant in the linear regression model; for the lognormal model, the significant variables were the total amount of research grants, the student size and university Type2; while university Type2, the amount of Social Science grants, the total amount of research grants and the student size were found significant in the negative binomial regression model (see Table 9). For the goodness of fit statistics comparison, the negative binomial model was the best, followed by the lognormal model, and then the linear model.

4.2. Model validation

To further compare the performance of the negative binomial and lognormal regression models, we employed the model validation technique of using a large part of a dataset to estimate the parameters of a model and then using the remaining part of the dataset for model evaluation. The mean square error (MSE) was then used to determine the closeness of the predicted values to the observed values. Given that only two datasets are large enough to be partitioned into two, we used the Google patent dataset for one response variable and the knowledge management dataset for three response variables. The sample size for the Google patent dataset is 560; 400 randomly selected data points were used for parameter estimation, while the remaining 160 data points were used for model evaluation. The model evaluation results showed that the negative binomial regression model has a smaller mean square error (see Table 10). For the knowledge management dataset, 1000 randomly selected data points were used for parameter estimation and the remaining 337 data points were used for model evaluation. In the case of two of the response variables, the MSE for the negative binomial regression model was smaller than that of the lognormal regression model, while for the third response variable, the MSE was approximately the same for the two models (see Tables 11–13).

5. Discussion

The response variables used in this study represent the major response variables normally investigated in the main subfields of informetrics, and almost all of them were found to be over-dispersed. This implies that the negative binomial regression model often seems to be more appropriate for informetric datasets than the Poisson and generalized Poisson regression models. In fact, for all the datasets used in this study, the Poisson regression model performed worse than the
Table 10. Model validation results for response variable citations from Google patent dataset.

Significant parameters from training data (n = 400) | NBR model      | Lognormal model
Constant                                            | −0.122 (0.209) | −0.394 (0.196)
Ysince                                              | 0.483 (0.038)  | 0.472 (0.037)
US Patents                                          | 0.004 (0.001)  | 0.006 (0.001)
OtherRef                                            | 0.007 (0.003)  |

Test data fit statistics (n = 160)
Raw mean                                            | 20.125         | 20.125
Predicted mean                                      | 15.921         | 18.484
MSE                                                 | 818.741        | 845.172
Log-likelihood statistic                            | −621.42        | −618.20
Table 11. Model validation results for response variable number of references from knowledge management dataset.

Significant parameters from training data (n = 1000) | NBR model     | Lognormal model
Constant                                             | 1.914 (0.113) | 1.369 (0.118)
Title Characters                                     | 0.004 (0.001) | 0.004 (0.001)
Pages                                                | 0.085 (0.006) | 0.100 (0.005)

Test data (n = 337)
Raw mean                                             | 29.982        | 29.982
Predicted mean                                       | 32.229        | 43.594
MSE                                                  | 724.174       | 2248.73
Log-likelihood statistic                             | −1438.42      | −1474.52
Table 12. Model validation results for response variable number of authors from knowledge management dataset.

Significant parameters from training data (n = 1000) | Zero-truncated NBR model | Lognormal model
Constant                                             | 0.667 (0.086)            | 0.577 (0.056)
Title Characters                                     | 0.002 (9.23E−4)          | 0.002 (6.28E−4)
References                                           | −0.003 (0.001)           |

Test data fit statistics (n = 337)
Raw mean                                             | 2.602                    | 2.602
Predicted mean                                       | 2.554                    | 2.554
MSE                                                  | 2.270                    | 2.264
Log-likelihood statistic                             | −559.26                  | −547.87
Table 13. Model validation results for response variable number of citations from knowledge management dataset.

Significant parameters from training data (n = 1000) | NBR model      | Lognormal model
Constant                                             | −0.173 (0.161) | −0.050 (0.13)
Ysince                                               | 0.169 (0.015)  | 0.103 (0.013)
References                                           | 0.031 (0.002)  | 0.029 (0.002)

Test data fit statistics (n = 337)
Raw mean                                             | 11.139         | 11.139
Predicted mean                                       | 12.569         | 13.036
MSE                                                  | 298.311        | 314.017
Log-likelihood statistic                             | −1068.42       | −1350.21
negative binomial regression model (for example, see Table 14). Also, all the standard errors of the parameter estimates from the Poisson model were underestimated because the model cannot account for the overdispersion in the response variable. It is equally noteworthy that almost all informetric studies (Chen, 2012; Didegah & Thelwall, 2013; McDonald, 2007; Thelwall & Maflahi, 2015; Walters, 2006; Yoshikane, 2013) that have applied a count regression model made use of one form or another of the negative binomial regression model.

Of all the nine response variables modelled from the four datasets, the linear regression model predicted some negative values for five of the response variables. These are the number of readers, number of views, number of citations from Google Scholar and number of citations from Scopus, which are all from the information research dataset, as well as the number of citations response variable from the knowledge management dataset. It is interesting to note that the linear regression model predicted some negative values for all three response variables pertaining to the number of citations received by a scholarly paper; the number of citations variable in the Google patent dataset refers to the number of citations received by a patent. The number of citations received by a scholarly paper is one of the most widely modelled variables in informetric studies, and if a study involves predicting the number of citations received by a paper, then it is not advisable to
Table 14. Comparison of the performance of Poisson regression and negative binomial regression models for the Google patent dataset.

Significant parameters | Poisson regression model | NBR model
Constant               | −0.141 (0.04)            | 0.056 (0.182)
Ysince                 | 0.472 (0.005)            | 0.467 (0.033)
US Patents             | 0.004 (1.68E−4)          | 0.004 (0.001)
OtherRef               | 0.004 (5.39E−4)          | 0.006 (0.003)
Inventor               | 0.071 (0.005)            |
Dispersion parameter   |                          | 0.991 (0.062)
AIC                    | 10382.0                  | 4063.1
BIC                    | 10403.0                  | 4084.8
Table 15. Results of the Vuong closeness test between negative binomial regression model and lognormal regression model.

Response variable                                                      | Vuong statistic | Significantly better model
Google patent dataset: number of citations                             | −0.307          | No difference
Information research dataset: number of views                          | −2.59           | Lognormal
Information research dataset: number of readers                        | 1.344           | No difference
Information research dataset: number of citations from Google scholar  | 7.845           | Negative binomial
Information research dataset: number of citations from Scopus          | 2.033           | Negative binomial
Knowledge management dataset: number of references                     | 11.82           | Negative binomial
Knowledge management dataset: number of authors                        | −5.459          | Lognormal
Knowledge management dataset: number of citations                      | 4.125           | Negative binomial
Canadian universities dataset: number of inlinks                       | 1.548           | No difference
Table 16. Number of significant explanatory variables included in the models.

Response variable                                                      | Linear model | NBR model | Lognormal model
Google patent dataset: number of citations                             | 2            | 3         | 2
Information research dataset: number of views                          | 1            | 2         | 1
Information research dataset: number of readers                        | 2            | 2         | 2
Information research dataset: number of citations from Google scholar  | 2            | 3         | 1
Information research dataset: number of citations from Scopus          | 3            | 2         | 2
Knowledge management dataset: number of references                     | 2            | 2         | 2
Knowledge management dataset: number of authors                        | 2            | 2         | 1
Knowledge management dataset: number of citations                      | 2            | 2         | 2
Canadian universities dataset: number of inlinks                       | 3            | 4         | 3
make use of a linear regression model, as it might predict negative citation values. In addition, the linear regression model performed worse than both the negative binomial and lognormal regression models for all response variables when it comes to goodness of fit. The negative binomial regression model outperformed the lognormal regression model in six of the cases, while the lognormal regression model was better in the other three cases. The Vuong closeness test showed the difference in performance of the two models to be insignificant in three cases, the negative binomial regression model to be significantly better in four cases, and the lognormal regression model to be significantly better in two cases (see Table 15). Model validation analysis for the two large datasets also showed the negative binomial regression model to be slightly better than the lognormal regression model.

In the study by Thelwall and Wilson (2014), which compared the abilities of negative binomial, lognormal, and linear regression models to detect factors affecting citation scores, the conclusion based on simulated data was that the negative binomial regression had the tendency to identify non-existent factors. However, from our own study with real data, this fear is not justifiable, as the difference in the number of significant explanatory variables detected by the negative binomial regression and the other two models for each response variable is minimal (see Table 16). For three of the response variables, all three models detected the same number of explanatory variables; for three other response variables, the negative binomial detected one more significant explanatory variable than the other two models; for one response variable, the negative binomial detected one more significant variable than the linear model, which in turn detected one more significant variable than the lognormal model; for another response variable, the lognormal detected one less significant variable than the other two models; while for one response variable, it was the linear model that detected one more significant explanatory variable than the other two models. Also, we noted that for all the response variables, the estimates of each significant explanatory variable that is common to the models have the same sign. Thus, there is agreement among the models on the type of relationship the explanatory variable (or predictor) has with the response variable.

Finally, the three major statistical packages, SPSS, SAS and R, can be used to perform negative binomial regression analysis, though the latter two might need some programming, while SPSS might not be able to perform the analysis for the zero-truncated or zero-inflated version of the model without programming. However, informetric researchers should not shy
away from applying the negative binomial regression model whenever appropriate because of the extra effort needed for the analysis.

6. Conclusions

Datasets from four main subfields of informetrics were collected in order to compare the performance of linear regression, lognormal regression and count regression models in modelling popular count informetric response variables. These include patent data collected from the United States Patent and Trademark Office database, altmetrics data collected from the open access journal Information Research, general informetrics data collected from Web of Science, and webometrics data on Canadian universities. Based on the findings of this study and those of a few other informetric studies that have applied count regression models, and the fact that most informetric response variables are over-dispersed, the negative binomial regression model in general seems to be the most appropriate of the three count regression models for modelling informetric response variables.

The main reason why some other fields of research have rejected linear regression models in favour of count regression models in modelling count response variables manifested itself in our regression analyses, as the linear regression model predicted some negative values for five of the nine response variables investigated. Also, the use of Akaike's Information Criterion (AIC) and Bayesian Information Criterion (BIC) as measures of goodness of fit showed that the linear regression model is worse than both the negative binomial and lognormal regression models for all the response variables. In addition, the negative binomial regression model performed significantly better than the lognormal regression model for four of the response variables, while the lognormal regression model performed significantly better than the negative binomial regression model for two of the response variables, and there was no significant difference in the performance of the two models for the remaining three response variables. Hence, we recommend the adoption of the negative binomial regression model in modelling informetric count response variables, especially in studies that involve making predictions for the count response variable.

Acknowledgements

The authors would like to thank Mary Aderayo Bamimore for collecting the knowledge management dataset from the Web of Science database. The authors are grateful to the Department of Mathematics, CMU for the Collaborative Research Grant that partially supported the research work. The authors are also grateful to the Editor-in-Chief of the Journal of Informetrics and the two anonymous reviewers for their constructive comments that greatly improved the quality of the paper.

Appendix A. Descriptive statistics for the Google patent dataset (n = 560)
Variable            Minimum    Maximum    Mean      Standard deviation
Ysince              4          11         4.966     1.344
Inventor            1          13         2.809     1.855
US Patents          0          498        31        46.577
Foreign Patents     0          58         1.807     5.151
OtherRef            0          214        12.13     19.258
Claims              1          109        25.434    13.572
Citations           0          321        17.293    30.698
Appendix B. Descriptive statistics for the Information Research dataset (n = 144)

Variable            Minimum    Maximum      Mean          Standard deviation
Ysince              3          14           8.472         3.415
Authors             1          10           1.979         1.265
References          7          180          40.208        25.877
Views               857        278,437      17,085.424    33,609.503
Readers             0          255          11.826        24.437
Citation Google     0          842          23.597        83.113
Citation Scopus     0          232          12.583        24.318
Appendix C. Descriptive statistics for the knowledge management dataset (n = 1337)

Variable            Minimum    Maximum    Mean      Standard deviation
Title Characters    20         249        84.18     28.412
Authors             1          16         2.57      1.526
Ysince              4          13         8.312     2.897
Pages               1          43         12.34     6.469
References          0          167        29.55     23.822
Citations           0          237        11.44     21.169
Appendix D. Descriptive statistics for the Canadian universities dataset (n = 44)

Variable                      Minimum    Maximum       Mean             Standard deviation
Pages                         15,700     15,300,000    1,833,261.364    2,784,564.849
Social science grants ($)     693        15,910        6525.818         4010.335
Science grants ($)            3179       186,236       52,751.091       40,900.445
Total research dollars        12,740     457,063       115,521.977      98,833.527
Student size                  2420       85,900        21,929.932       18,398.923
Faculty size                  113        6500          1173.58          1371.198
Inlinks                       137        2470          846.818          633.756
References

Achcar, J. A., Martinez, E. Z., Pires de Souza, A. D., Tachibana, V. M., & Flores, E. F. (2011). Use of Poisson spatiotemporal regression models for the Brazilian Amazon forest: Malaria count data. Revista Da Sociedade Brasileira De Medicina Tropical, 44(60), 749–754.
Ajiferuke, I. (2005). Inter-university collaboration in Canada. Canadian Journal of Information and Library Sciences, 29(4), 407–418.
Association of Universities and Colleges of Canada. (2014). Directory of Canadian Universities (48th ed.). Retrieved from http://www.aucc.ca/media-room/publications/directory-of-canadian-universities/
Ayres, I., & Vars, F. E. (2000). Determinants of citations to articles in elite law reviews. Journal of Legal Studies, 29(1), 427–450.
Baccini, A., Barabesi, L., Cioni, M., & Pisani, C. (2014). Crossing the hurdle: The determinants of individual scientific performance. Scientometrics, 101(3), 2035–2062.
Bornmann, L., & Daniel, H. D. (2007). Multiple publication on a single research study: Does it pay? The influence of number of research articles on total citation counts in biomedicine. Journal of the American Society for Information Science and Technology, 58(8), 1100–1107.
Bornmann, L., Schier, H., Marx, W., & Daniel, H. D. (2012). What factors determine citation counts of publications in chemistry besides their quality? Journal of Informetrics, 6(1), 11–18.
Buter, R. K., & van Raan, A. F. J. (2011). Non-alphanumeric characters in titles of scientific publications: An analysis of their occurrence and correlation with citation impact. Journal of Informetrics, 5(4), 608–617.
Chen, C. (2012). Predictive effects of structural variation on citation counts. Journal of the American Society for Information Science and Technology, 63(3), 431–449.
Czado, C., Schabenberger, H., & Erhardt, V. (2014). Non nested model selection for spatial count regression models with application to health insurance. Statistical Papers, 55(2), 455–476.
Didegah, F., & Thelwall, M. (2013). Determinants of research citation impact in nanoscience and nanotechnology. Journal of the American Society for Information Science and Technology, 64(5), 1055–1064.
Famoye, F. (1993). Restricted generalized Poisson regression model. Communications in Statistics – Theory and Methods, 22(5), 1335–1354.
Gagnon, D. R., Doron-LaMarca, S., Bell, M., O'Farrell, T. J., & Taft, C. T. (2008). Poisson regression for modelling count and frequency outcomes in trauma research. Journal of Traumatic Stress, 21(5), 448–454.
Gargouri, Y., Hajjem, C., Lariviere, V., Gingras, Y., Carr, L., Brody, T., et al. (2010). Self-selected or mandated, open access increases citation impact for higher quality research. PLoS ONE, 5(10), e13636.
Guerrero-Bote, V. P., & Moya-Anegón, F. (2014). Relationship between downloads and citations at journal and paper levels, and the influence of language. Scientometrics, 101(2), 1043–1106.
Gulkema, S. D., & Goffelt, J. P. (2008). A flexible count data regression model for risk analysis. Risk Analysis, 28(1), 213–223.
Habibzadeh, F., & Yadollahie, M. (2010). Are shorter article titles more attractive for citations? Cross-sectional study of 22 scientific journals. Croatian Medical Journal, 51(2), 165–170.
Honey, K. (Ed.). (2014). The Maclean's guide to Canadian universities. Toronto: Rogers Publishing.
Jamali, H. R., & Nikzad, M. (2011). Article title type and its relation with the number of downloads and citations. Scientometrics, 88(2), 653–661.
Karazsia, B. T., & van Dulmen, M. H. M. (2008). Regression models for count data: Illustrations using longitudinal predictors of childhood injury. Journal of Pediatric Psychology, 33(10), 1076–1084.
Kim, M. J. (1998). A comparative study of citations from papers by Korean scientists and their journal attributes. Journal of Information Science, 24(2), 113–121.
King, G. (1989). Event count models for international relations: Generalizations and applications. International Studies Quarterly, 33(2), 123–147.
Kreider, J. (1999). The correlation of local citation data with citation data from Journal Citation Reports. Library Resources & Technical Services, 43(2), 67–77.
Landes, W. M., & Posner, R. A. (2000). Citations, age, fame, and the Web. Journal of Legal Studies, 29(1), 319–344.
Lee, Y., Lee, J., Song, Y., & Lee, S. (2007). An in-depth empirical analysis of patent citation counts using zero-inflated count data model: The case of KIST. Scientometrics, 70(1), 27–39.
Liu, X. L., Fang, H. L., & Wang, M. Y. (2011). Correlation between download and citation and download-citation deviation phenomenon for some papers in Chinese medical journals. Serials Review, 37(3), 157–161.
Lokker, C., McKibbon, K. A., McKinlay, R. J., Wilczynski, N. L., & Haynes, R. B. (2008). Prediction of citation counts for clinical articles at two years using data available within three weeks of publication: Retrospective cohort study. British Medical Journal, 336(7645), 655–657.
McDonald, J. D. (2007). Understanding journal usage: A statistical analysis of citation and use. Journal of the American Society for Information Science and Technology, 58(1), 39–50.
Meng, S. W. (2009). Over-dispersed claim counts regression models and their applications in auto insurance. Recent advance in statistics application and related areas (Vols. I and II).
Moed, H. F. (2005). Statistical relationships between downloads and citations at the level of individual documents within a single journal. Journal of the American Society for Information Science and Technology, 56(10), 1088–1097.
Mullahy, J. (1986). Specification and testing of some modified count data models. Journal of Econometrics, 33(3), 341–365.
Nassiri, H., Najaf, P., & Amiri, A. M. (2014). Prediction of roadway accident frequencies: Count regressions versus machine learning models. Scientia Iranica, 21(2), 263–275.
Nieder, C., Dalhaug, A., & Aandahl, G. (2013). Correlation between article download and citation figures for highly accessed articles from five open access oncology journals. SpringerPlus, 2, 261.
Peters, H. P. F., & van Raan, A. F. J. (1994). On determinants of citation scores – A case study in chemical engineering. Journal of the American Society for Information Science, 45(1), 39–49.
Schlögl, C., & Gorraiz, J. (2010). Comparison of citation and usage indicators: The case of oncology journals. Scientometrics, 82(3), 567–580.
Schlögl, C., Gorraiz, J., Gumpenberger, C., Jack, K., & Kraker, P. (2014). Comparison of downloads, citations and readership data for two information systems journals. Scientometrics, 101(2), 1113–1128.
Thelwall, M., & Wilson, P. (2014). Regression for citation data: An evaluation of different methods. Journal of Informetrics, 8(4), 963–971.
Thelwall, M., & Maflahi, N. (2015). Are scholarly articles disproportionately read in their own country? An analysis of Mendeley readers. Journal of the Association for Information Science and Technology, 66(6), 1124–1135.
Tsay, M. Y. (1998). The relationship between journal use in a medical library and citation use. Bulletin of the Medical Library Association, 86(1), 31–39.
Vaughan, L., & Thelwall, M. (2005). A modeling approach to uncover hyperlink patterns: The case of Canadian universities. Information Processing and Management, 41(2), 347–359.
Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57(2), 307–333.
Wang, E., Li, Z. Z., Little, B. B., & Yang, Y. (2009). The economic impact of tourism in Xinghai Park, China: A travel cost value analysis using count data regression models. Tourism Economics, 15(2), 413–425.
Wang, W., & Famoye, F. (1997). Modeling household fertility decisions with generalized Poisson regression. Journal of Population Economics, 10(3), 273–283.
Walters, G. D. (2006). Predicting subsequent citations to articles published in twelve crime-psychology journals: Author impact versus journal impact. Scientometrics, 69(3), 499–510.
Willis, D. L., Bahler, C. D., Neuberger, M. M., & Dahm, P. (2011). Predictors of citations in the urological literature. BJU International, 107(12), 1876–1880.
Xia, J. F., Myers, R. L., & Wilhoite, S. K. (2011). Multiple open access availability and citation impact. Journal of Information Science, 37(1), 19–28.
Xu, T., Li, W., & Chen, T. (2011). Application of zero-inflated models on regression analysis of count data: A study of sub-health symptoms. Zhonghua Liuxingbingxue Zazhi, 32(2), 187–191.
Yoshikane, F. (2013). Multiple regression analysis of a patent's citation frequency and quantitative characteristics: The case of Japanese patents. Scientometrics, 96(1), 365–379.
Yoshikane, F., Suzuki, Y., Arakama, Y., Ikeuchi, A., & Tsuji, K. (2013). Multiple regression analysis between citation frequency of patents and their quantitative characteristics. Procedia Social and Behavioral Sciences, 73, 217–223.
Yu, F., & Wu, Y. (2014). Patent citations and knowledge spillovers: An analysis of Chinese patents registered in the USA. Asian Journal of Technology Innovation, 22(1), 86–99.
Yu, T., Yu, G., Li, P., & Wang, L. (2014). Citation impact prediction for scientific papers using stepwise regression analysis. Scientometrics, 101(2), 1233–1252.
Yuan, S. B., & Hua, W. N. (2011). Scholarly impact measurements of LIS open access journals: Based on citations and links. Electronic Library, 29(5), 682–697.