On residuals in generalized Johnson SB regressions


Accepted Manuscript

On residuals in generalized Johnson SB regressions

Artur J. Lemonte, Germán Moreno-Arenas

PII: S0307-904X(18)30506-7
DOI: https://doi.org/10.1016/j.apm.2018.10.015
Reference: APM 12502

To appear in: Applied Mathematical Modelling

Received date: 7 August 2018
Revised date: 7 October 2018
Accepted date: 10 October 2018

Please cite this article as: Artur J. Lemonte, Germán Moreno-Arenas, On residuals in generalized Johnson SB regressions, Applied Mathematical Modelling (2018), doi: https://doi.org/10.1016/j.apm.2018.10.015

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights

• We propose a new residual for the class of generalized Johnson SB regression models.
• The new residual is better than the usual residual considered in generalized Johnson SB regressions.
• Monte Carlo simulation experiments and empirical applications using real and simulated data are provided.


On residuals in generalized Johnson SB regressions

Artur J. Lemonte*
Departamento de Estatística, Universidade Federal do Rio Grande do Norte, Natal/RN, Brazil

Germán Moreno-Arenas
Escuela de Matemáticas, Universidad Industrial de Santander, Bucaramanga, Colombia

Abstract


An important challenge in the class of generalized Johnson SB regression models is to define residuals which are capable of identifying departures from the model assumptions, as well as of assessing the overall goodness-of-fit of the model. In this regard, we propose a new residual for this class of models, and numerically evaluate its behaviour relative to the deviance residual initially proposed for this class of regression models. Monte Carlo simulation experiments and empirical applications using real and simulated data are provided. Overall, the results favour the residual we propose.

Keywords: Deviance residual; Model checking; Proportions; Quantile residual; Rates.

Math. Subject Classification: 62J05; 62J20.

1 Introduction

[1] developed a system of transformations of the normal distribution which gained substantial popularity in the second half of the 20th century. He used the translation method to generate statistical distributions that assume a wide variety of shapes through the transformation

Y = h^{-1}((X − γ)/δ),   (1)

where X ∼ N(0, 1), h(·) is a monotone function (and h^{-1}(·) is its inverse), γ ∈ ℝ and δ > 0. By considering h(z) = log(z/(1 − z)) in (1), we obtain a distribution with bounded support, whose

Corresponding author: [email protected]


probability density function (PDF) takes the form f(y; γ, δ) = δφ(γ + δh(y))/[y(1 − y)], for y ∈ (0, 1), where φ(·) is the standard normal PDF. The cumulative distribution function (CDF) of Y is given by F(y; γ, δ) = Φ(γ + δh(y)), for y ∈ (0, 1), where Φ(·) is the standard normal CDF. The above distribution has been applied in several fields, such as meteorology, medicine, biology and forestry, among other sciences. The reader is referred to [1] and [2] for more details about this distribution. [3] introduced the generalized Johnson SB ('GJS' for short) distribution, with the initial assumption in (1) relaxed as follows: X ∼ S(0, 1; g), where X ∼ S(0, 1; g) means that the random variable X follows the standardized form of the symmetric family of distributions for some function g(·), typically known as the density generator. The reader is referred to [4] for discussions about symmetric distributions. Some distributions which belong to the symmetric family are the normal, Cauchy, Student-t, logistic, double exponential (Laplace) and power exponential (PE) distributions, among many others. The corresponding density generators are given, respectively, by: g(u) = (2π)^{-1/2} exp(−u/2); g(u) = [π(1 + u)]^{-1}; g(u) = ν^{ν/2} B(1/2, ν/2)^{-1} (ν + u)^{-(ν+1)/2}, where ν > 0 and B(·, ·) is the beta function; g(u) = c e^{−u}(1 + e^{−u})^{-2}, where c ≈ 1.484300029 is the normalizing constant obtained from ∫_0^∞ u^{-1/2} g(u) du = 1; g(u) = exp(−√u)/2; and g(u) = c(k) exp(−0.5 u^{1/(1+k)}), where −1 < k ≤ 1, c(k)^{-1} = Γ(1 + (k + 1)/2) 2^{1+(1+k)/2}, and Γ(·) is the gamma function. The main motivation for replacing X ∼ N(0, 1) with X ∼ S(0, 1; g) is the search for bounded-support distributions with greater or smaller kurtosis than that of the Johnson SB distribution, thus achieving a Johnson SB distribution that is more platykurtic or leptokurtic, among many other interesting properties.

The PDF of the GJS distribution (parameterized in a convenient way) can be expressed in the form

f(y; ξ, δ) = δ g({δ[h(y) − h(ξ)]}²) / [y(1 − y)],   y ∈ (0, 1),   (2)


where 0 < ξ < 1 is the median (i.e., the location parameter), and δ > 0 is the dispersion parameter. [3] pointed out that the PDF in (2) can display quite different shapes depending on the values of the two parameters ξ and δ. In particular, it can be symmetric (when ξ = 1/2) or asymmetric (when ξ ≠ 1/2), bimodal, among many other shapes. The CDF of the GJS distribution takes the form

F(y; ξ, δ) = ∫_{−∞}^{δ[h(y) − h(ξ)]} g(u²) du,   y ∈ (0, 1).   (3)
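As a quick numerical sanity check (a sketch, not part of the paper; parameter values are hypothetical), the density generators listed in the introduction carry unit mass when evaluated at u², which is exactly what makes (3) a proper CDF; moreover, for the normal generator, (3) reduces to Φ(δ[h(y) − h(ξ)]):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Density generators from the text: normal, logistic (with the quoted
# normalizing constant c ~ 1.484300029), and Laplace.
g_normal   = lambda u: (2*np.pi)**-0.5 * np.exp(-u/2)
g_logistic = lambda u: 1.484300029 * np.exp(-u) / (1 + np.exp(-u))**2
g_laplace  = lambda u: np.exp(-np.sqrt(u)) / 2

# Unit total mass: the integral of g(t^2) over the real line equals 1
# (equivalently, the integral of u^{-1/2} g(u) over (0, inf) equals 1).
masses = [2 * quad(lambda t: g(t*t), 0, np.inf)[0]
          for g in (g_normal, g_logistic, g_laplace)]   # each ~1

def gjs_cdf(y, xi, delta, g):
    """F(y; xi, delta) from (3), with the logit h, by numerical integration."""
    h = lambda z: np.log(z / (1 - z))
    w = delta * (h(y) - h(xi))
    F, _ = quad(lambda t: g(t*t), -np.inf, w)
    return F

# For the normal generator, this must agree with Phi(delta*[h(y) - h(xi)]):
y, xi, delta = 0.30, 0.55, 1.8
h = lambda z: np.log(z / (1 - z))
closed_form = norm.cdf(delta * (h(y) - h(xi)))
```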


The reader is referred to [3] for further details about the GJS distribution. Hereafter, if Y follows the GJS distribution, we write Y ∼ GJS(ξ, δ; g). On the basis of the GJS distribution, [3] proposed the GJS regression model for response variables restricted to the interval (0, 1). In this class of regression models, the median response is related to a linear predictor through a link function, and the linear predictor involves covariates and unknown regression parameters, thus allowing for parameter interpretation in terms of the response on its original scale and also allowing the investigator to choose a link function. Additionally, the GJS regression model

is also indexed by a dispersion parameter, which is not assumed to be constant across observations but is allowed to vary, leading to the variable dispersion regression model. [3] also provided a real data application in which the usefulness of the GJS regression model in practice is shown. In particular, they verified that the GJS regression model can be an interesting alternative to the well-developed beta regression model [5, 6], mainly due to the robustness with respect to outlying observations that the GJS regression model can exhibit.

It is worth emphasizing that residual analysis is an efficient way to detect various deficiencies in a fitted model, particularly in regression models, such as the presence of outlying observations, the absence of components in the systematic part of the model, and departures from the error and variance assumptions. However, finding appropriate residuals in non-normal regression models has been an important topic of research. In general, the preference is for residuals that are well-defined, easily computable and whose empirical distribution is close to normality (a unified reference distribution that facilitates the diagnosis and goodness-of-fit checking of regression models). Obviously, the standard normal reference distribution for residuals is not always mandatory. For example, the probability-scale residual [see, for example, 7] does not have a standard normal distribution, and its range of possible values is symmetric about zero in (−1, 1). Additionally, if the sample size is sufficiently large, the probability-scale residual under the true model is approximately uniformly distributed on (−1, 1), and hence a quantile-quantile (QQ) plot of the empirical quantiles of the probability-scale residual versus the theoretical quantiles of the uniform distribution on (−1, 1) can be used to detect lack of fit [7].
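A minimal illustration (a sketch with a stand-in N(0, 1) "model", not from the paper) of why the probability-scale residual is uniform on (−1, 1) under the true model: the probability integral transform makes F(Y) uniform on (0, 1), and 2F(Y) − 1 simply rescales it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.standard_normal(100_000)        # stand-in responses from a known N(0,1) model

# Probability-scale residual under the true model: 2*F(y) - 1, with values in (-1, 1)
r_ps = 2 * stats.norm.cdf(y) - 1

# Under the true model, r_ps should look uniform on (-1, 1): a QQ plot of its
# empirical quantiles against U(-1, 1) quantiles should hug the diagonal.
ks = stats.kstest(r_ps, 'uniform', args=(-1, 2))   # uniform on (-1, 1)
```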
On the other hand, it is worth emphasizing that the search for normally distributed residuals is an ongoing topic of research. This paper goes in this direction by considering a new residual whose empirical distribution is very well approximated by the standard normal distribution in the class of GJS regression models. Additionally, the lack of residual studies for the GJS regression model reinforces the need for the studies developed in this paper. Maximum likelihood inference in the class of GJS regression models was provided by [3], who also provided some guidelines for diagnostic analysis, including the use of a residual (namely, the deviance component residual), together with a very small Monte Carlo simulation experiment to study the empirical distribution of the deviance residual in this class of models. The chief goal of this paper is to propose a new residual for the class of GJS regression models, and to evaluate its merit relative to that proposed by [3]. Additionally, we provide extensive Monte Carlo simulation experiments to examine the empirical distribution of the proposed residual, as well as that of the deviance residual. In particular, from the Monte Carlo experiments, we found that the empirical distribution of the deviance residual does not follow a standard normal distribution for some GJS regression models, especially for the GJS-PE and GJS-Laplace regression models (i.e., when the PE and Laplace density generators are considered, respectively), whereas the empirical distribution of the residual we propose is very well approximated by the standard normal distribution in all cases. Overall, the numerical results favour the residual we propose, and the new residual allows practitioners to identify model misspecification, as well as to assess the overall goodness-of-fit of the model effectively. Additionally, the deviance residual fails to provide useful information for detecting misspecification in the response distribution, due to the lack of a unified reference distribution under the true model in some cases, while the quantile residual works very well in detecting misspecification in the response distribution. Finally, we present real and simulated data applications.

The paper unfolds as follows. The class of GJS regression models, as well as maximum likelihood estimation of the model parameters, is presented in Section 2. Section 3 presents the associated residuals, including our proposal. Monte Carlo simulation results can be found in Section 4. In Section 5, we present the results of two applications, one based on observed (real) data and the other based on simulated data. The paper ends with some concluding remarks in Section 6.

2 The GJS regression model

Let Y_i ∼ GJS(ξ_i, δ_i; g), for i = 1, ..., n. Also, consider the following functional relations for the median and dispersion parameters of Y_i:

d_1(ξ_i) = η_{1i} = x_i^⊤β,   d_2(δ_i) = η_{2i} = s_i^⊤τ,


where d_1(·) and d_2(·) are link functions, β = (β_1, ..., β_p)^⊤, τ = (τ_1, ..., τ_q)^⊤, η_{1i} and η_{2i} are the linear predictors, and x_i = (x_{i1}, ..., x_{ip})^⊤ and s_i = (s_{i1}, ..., s_{iq})^⊤ are vectors of p and q known covariates. The median link d_1 : (0, 1) → ℝ and the dispersion link d_2 : (0, ∞) → ℝ are assumed to be strictly monotonic and twice differentiable. For example, a useful link for the median is the logit (d_1(ξ) = log(ξ/(1 − ξ))), whereas a useful link for the dispersion is the logarithmic (d_2(δ) = log(δ)). Let X = [x_1, ..., x_n]^⊤ and S = [s_1, ..., s_n]^⊤ be the model matrices, with ranks p and q, respectively. The covariates in S are commonly, but not necessarily, a subset of the covariates in X. The log-likelihood function, except for a constant term, is given by

ℓ(β, τ) = Σ_{i=1}^n ℓ_i(ξ_i, δ_i),

where ℓ_i(ξ_i, δ_i) = log(δ_i) + log[g({δ_i[h(y_i) − h(ξ_i)]}²)], with ξ_i = d_1^{-1}(η_{1i}) and δ_i = d_2^{-1}(η_{2i}). The score function is given by the (p + q)-vector U(β, τ) = (U_β(β, τ)^⊤, U_τ(β, τ)^⊤)^⊤, with

U_β(β, τ) = X^⊤ΛW T_1 ξ*,   U_τ(β, τ) = S^⊤T_2 δ*,

where T_1 = diag{1/ḋ_1(ξ_i)}, T_2 = diag{1/ḋ_2(δ_i)}, W = diag{w_i W_g(w_i²)}, Λ = diag{δ_i}, ξ* = (ξ_1*, ..., ξ_n*)^⊤, δ* = (δ_1*, ..., δ_n*)^⊤, ξ_i* = −2/[ξ_i(1 − ξ_i)], δ_i* = δ_i^{-1} + 2δ_i^{-1}w_i² W_g(w_i²), and w_i = δ_i[h(y_i) − h(ξ_i)]. Here, ḋ_j(z) = dd_j(z)/dz for j = 1, 2, and diag{a_i} stands for a diagonal matrix with typical element a_i (i = 1, ..., n). Table 1 lists W_g(u) for some distributions. The maximum likelihood (ML) estimators β̂ = (β̂_1, ..., β̂_p)^⊤ and τ̂ = (τ̂_1, ..., τ̂_q)^⊤ of β = (β_1, ..., β_p)^⊤ and τ = (τ_1, ..., τ_q)^⊤, respectively, can be obtained by solving the equations U_β(β̂, τ̂) = 0_p and U_τ(β̂, τ̂) = 0_q, where 0_l denotes an l-dimensional vector of zeros. We refer to [3] for more details about point and interval estimation of β and τ.

Table 1: The function W_g(u) for some distributions.

  Normal:             −1/2
  Student-t:          −(ν + 1)/[2(ν + u)]
  Logistic:           −(1 − e^{−u})/(1 + e^{−u})
  Laplace:            −1/(2√u)
  Power exponential:  −u^{−k/(1+k)}/[2(1 + k)]

3 GJS regression residuals

In this section we consider two residuals for the class of GJS regression models. The first residual is given by

r_i^d = sign(y_i − ξ̂_i) {2[ℓ_i(y_i, δ̂_i) − ℓ_i(ξ̂_i, δ̂_i)]}^{1/2},   i = 1, ..., n,


where ξ̂_i = d_1^{-1}(x_i^⊤β̂) and δ̂_i = d_2^{-1}(s_i^⊤τ̂) are the ML estimates of ξ_i = d_1^{-1}(x_i^⊤β) and δ_i = d_2^{-1}(s_i^⊤τ), respectively, and sign(·) denotes the signum function. Note that r_i^d measures the discrepancy of a fit as twice the difference between the maximum log-likelihood achievable (saturated model) and that achieved by the model under investigation. [3] named r_i^d the i-th "deviance residual", since the idea for defining it comes from generalized linear models [see, for instance, 8]. Next, we propose the normalized quantile residual to check the adequacy of the GJS regression model fitted to the data. This residual is given by

r_i^q = Φ^{-1}(F(y_i; ξ̂_i, δ̂_i)),   i = 1, ..., n,

where F(y_i; ξ̂_i, δ̂_i) is the CDF of the GJS distribution (provided in equation (3)) evaluated at ξ̂_i = d_1^{-1}(x_i^⊤β̂) and δ̂_i = d_2^{-1}(s_i^⊤τ̂), and Φ^{-1}(·) is the quantile function of the standard normal distribution. Note that the quantile residual r_i^q can be easily computed for the class of GJS regression models. For example, the quantile residual for the GJS-Normal regression model (i.e., when the normal density generator is taken into account) is simply given by

r_i^q = ŵ_i,   i = 1, ..., n,


where ŵ_i = w_i(y_i; ξ̂_i, δ̂_i) = δ̂_i[h(y_i) − h(ξ̂_i)]. For the GJS-Student-t regression model it follows that

r_i^q = Φ^{-1}(1/2 + (sign(ŵ_i)/2)[1 − I_{m(ŵ_i)}(ν/2, 1/2)]),   i = 1, ..., n,

where ν > 0 is the degrees of freedom, m(z) = ν/(ν + z²), I_z(a, b) = B(z; a, b)/B(a, b) is the regularized incomplete beta function, and B(z; a, b) is the incomplete beta function. Also, for the GJS-Logistic regression model it becomes

r_i^q = Φ^{-1}(exp(ŵ_i)/(1 + exp(ŵ_i))),   i = 1, ..., n,

whereas for the GJS-PE regression model we have

r_i^q = Φ^{-1}(1/2 + [sign(ŵ_i)/(2Γ((k + 1)/2))] γ((k + 1)/2, (ŵ_i²)^{1/(k+1)}/2)),   i = 1, ..., n,

where −1 < k ≤ 1, and γ(s, x) is the (lower) incomplete gamma function. Finally,

r_i^q = Φ^{-1}(1/2 + (sign(ŵ_i)/2)(1 − e^{−|ŵ_i|})),   i = 1, ..., n,

for the GJS-Laplace regression model. The true normalized quantile residuals have a standard normal distribution if the model is correct [see, for example, 9]. However, it is very important to investigate their empirical distribution when the sample size is small. It is worth emphasizing that the quantile residual has never been considered in GJS regressions and, therefore, all numerical results regarding the residual r_i^q in GJS regressions are new. Additionally, we would like to point out that the empirical distribution of the deviance residual r_i^d was only briefly studied in [3], through a very small Monte Carlo simulation experiment. In the next section, we provide a much more complete Monte Carlo simulation study under several scenarios in order to shed some light on the empirical distribution of the deviance residual, as well as on that of the quantile residual.

4 Numerical results
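Before turning to the simulations, a quick numerical cross-check of the closed-form expressions in Section 3 (a sketch, not from the paper; the ŵ values and ν are hypothetical): the arguments of Φ^{-1} for the GJS-Student-t and GJS-Laplace models are just the standard Student-t and Laplace CDFs evaluated at ŵ_i:

```python
import numpy as np
from scipy import stats
from scipy.special import betainc

nu = 4.0
w = np.array([-2.3, -0.4, 0.7, 1.9])     # hypothetical w-hat values

# Student-t: 1/2 + sign(w)/2 * (1 - I_{m(w)}(nu/2, 1/2)), with m(z) = nu/(nu + z^2)
m = nu / (nu + w**2)
F_t = 0.5 + np.sign(w)/2 * (1 - betainc(nu/2, 0.5, m))

# Laplace: 1/2 + sign(w)/2 * (1 - exp(-|w|))
F_lap = 0.5 + np.sign(w)/2 * (1 - np.exp(-np.abs(w)))

# Both match the corresponding scipy CDFs, so the quantile residuals are
# simply Phi^{-1}(CDF(w-hat)) under each model:
r_q_t = stats.norm.ppf(F_t)
agree = (np.allclose(F_t, stats.t.cdf(w, df=nu))
         and np.allclose(F_lap, stats.laplace.cdf(w)))
```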

In what follows, all simulations were performed using the object-oriented matrix language Ox [10], and all graphics were produced using the statistical software R [11]. Our first Monte Carlo simulation experiment regarding the finite-sample behavior of r_i^d and r_i^q was carried out using

log(ξ_i/(1 − ξ_i)) = η_{1i} = β_1 + β_2 x_i,   log(δ_i) = η_{2i} = τ_1 + τ_2 s_i,   i = 1, ..., n,


where the covariate values were obtained as random draws from the standard uniform distribution U(0, 1), and were held constant throughout the simulations. The number of Monte Carlo replications is 10,000, and the sample size is n = 25. Following [12], we also measure the intensity of nonconstant dispersion by λ = δ_max/δ_min, and report results for λ = 20 (τ_1 = −0.5 and τ_2 = 3.0) and λ = 90 (τ_1 = −0.5 and τ_2 = 4.5). Additionally, it is also interesting to consider scenarios where the response values are close to one, close to zero, and scattered over the standard unit interval. So, there are three different scenarios, namely: (i) β_1 = −1.4 and β_2 = −2.5, which resulted in median response values close to zero, ξ ∈ (0.02, 0.20); (ii) β_1 = 1.2 and β_2 = −2.1, which yielded ξ ∈ (0.29, 0.77); and (iii) β_1 = 1.4 and β_2 = 1.7, which resulted in median response values close to one, ξ ∈ (0.80, 0.96). We also consider the following density generators in the Monte Carlo simulations: normal, logistic, Student-t, PE and Laplace. Finally, we also consider different values of ν and k in the GJS-Student-t and GJS-PE regression models, respectively.

The mean, standard deviation (sd), skewness and kurtosis of each residual (per observation) when ξ ∈ (0.29, 0.77) and λ = 90 are provided in Section 1 of the Online Supplementary Material for the GJS-Normal, GJS-Logistic, GJS-Student-t, GJS-PE and GJS-Laplace regression models. We assume ν = 4 for the GJS-Student-t, and k = −0.8 and 0.2 for the GJS-PE regression model. The numerical results show that there is good overall agreement between sample and population (standard normal) values for the normalized quantile residuals r_i^q in all cases considered; that is, the residuals r_i^q have approximately zero mean and unit standard deviation, have skewness close to zero, which indicates that these residuals are approximately symmetric, and have kurtosis near three.
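To make the design above concrete, here is a sketch (hypothetical seed; scenario (ii) with the λ = 20 setting) of how the covariates, λ, and one GJS-Normal sample arise; for the normal generator, w_i = δ_i[h(y_i) − h(ξ_i)] ∼ N(0, 1), so responses can be drawn by inversion:

```python
import numpy as np

rng = np.random.default_rng(123)
n = 25
x = rng.uniform(size=n)                     # covariate draws, held fixed across replications
s = rng.uniform(size=n)

beta1, beta2 = 1.2, -2.1                    # scenario (ii): xi in (0.29, 0.77)
tau1, tau2 = -0.5, 3.0                      # the lambda = 20 setting

xi = 1 / (1 + np.exp(-(beta1 + beta2*x)))   # inverse logit median link
delta = np.exp(tau1 + tau2*s)               # log dispersion link

# With s in (0, 1), delta ranges over (e^{tau1}, e^{tau1 + tau2}), so
# lambda = delta_max/delta_min = e^{tau2} (about 20.1 here; e^{4.5} ~ 90).
lam = np.exp(tau2)

# One GJS-Normal sample: invert w = delta*[h(y) - h(xi)] with w ~ N(0, 1).
z = rng.standard_normal(n)
y = 1 / (1 + np.exp(-(np.log(xi/(1 - xi)) + z/delta)))
```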
This indicates that the empirical distribution of the normalized quantile residuals r_i^q follows a standard normal distribution. On the other hand, note that the sample standard deviation, skewness and kurtosis of the deviance residuals r_i^d are far from the population (standard normal) values for the GJS-PE regression model (mainly when the shape parameter is not close to zero) and for the GJS-Laplace regression model; that is, it seems that the empirical distribution of the deviance residual r_i^d does not follow a standard normal distribution, at least for these two GJS regression models. Finally, notice that for the GJS-Normal regression model the residuals r_i^d and r_i^q are exactly the same, as expected.

The normal QQ-plots of the mean order statistics of r_i^d and r_i^q for the GJS-Normal, GJS-Logistic, GJS-Student-t, GJS-PE and GJS-Laplace regression models are presented in Section 1 of the Online Supplementary Material. Note that the empirical distribution of r_i^d moves away from the standard normal distribution for the GJS-PE regression model, mainly when the shape parameter k is not close to zero (see Figures 6 and 9 in Section 1 of the Online Supplementary Material, for example), as well as for the GJS-Laplace regression model (see Figure 10 in Section 1 of the Online Supplementary Material). However, the empirical distribution of the quantile residual r_i^q proposed here seems to be well approximated by the standard normal distribution in all GJS regression models considered. Again, these numerical results


confirm, at least for the GJS-PE and GJS-Laplace regression models, that the empirical distribution of the deviance residual is far from normality.

As suggested by an anonymous referee, there are other interesting scenarios that can be considered to verify the efficacy of r_i^d and r_i^q in detecting various forms of model inadequacy. The ideas for the next Monte Carlo scenarios come from [13]. Our second Monte Carlo simulation experiment consists of evaluating the performance of r_i^d and r_i^q in detecting non-linearity in the median covariate effect in the GJS-Laplace and GJS-Logistic regression models. We simulate a covariate x ∼ U(−1.5, 1.5) of size n = 1000. The response variable is simulated with

log(ξ_i/(1 − ξ_i)) = η_{1i} = 1.2 − 5.0 x_i².


Then, we consider fitting a wrong model assuming

log(ξ_i/(1 − ξ_i)) = η_{1i} = 1.2 − 5.0 x_i.
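These data can be generated by inversion: under the GJS-Laplace model the pivot w_i = δ_i[h(y_i) − h(ξ_i)] has density g(w²) = e^{−|w|}/2, i.e., a standard Laplace law. A sketch (hypothetical seed) of the true-model simulation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
n = 1000
x = rng.uniform(-1.5, 1.5, size=n)
s = rng.uniform(size=n)

eta1 = 1.2 - 5.0 * x**2                    # true (quadratic) median predictor
eta2 = -0.5 + 2.3 * s                      # dispersion predictor
xi, delta = 1/(1 + np.exp(-eta1)), np.exp(eta2)

# Invert w = delta*[h(y) - h(xi)] with w ~ standard Laplace:
w = stats.laplace.rvs(size=n, random_state=rng)
y = 1 / (1 + np.exp(-(np.log(xi/(1 - xi)) + w/delta)))

# Recovering w from (y, xi, delta) returns the Laplace draws.
w_back = delta * (np.log(y/(1 - y)) - np.log(xi/(1 - xi)))
```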


For both situations, we assume log(δ_i) = η_{2i} = −0.5 + 2.3 s_i, where the covariate s_i was obtained as a random draw from the standard uniform distribution U(0, 1). The results are presented in Section 2 of the Online Supplementary Material. Figures 11 and 12 display scatter plots of the deviance and quantile residuals versus the covariate x for the true and wrong models under the GJS-Laplace and GJS-Logistic regression models, respectively. Under the true model, note that the scatter plot of the deviance residual for the GJS-Laplace regression (Figure 11) exhibits a block pattern (i.e., with residual values below and above zero forming two separate blocks), whereas the quantile residual substantiates that the true model fits the data very well, with residuals randomly scattered around zero. Additionally, the quantile residual clearly indicates that the wrong model does not fit the data well, and the curved pattern suggests the need for a quadratic term in x in the median linear predictor. The results for the GJS-Logistic regression (Figure 12) suggest that both residuals work satisfactorily. We also present the corresponding QQ-plots of the deviance and quantile residuals in Figures 13 and 14 to examine the normality of these residuals for the GJS-Laplace and GJS-Logistic regression models, respectively. Note that the deviance residuals do not follow a normal distribution under either the true or the wrong model in the GJS-Laplace regression model (Figure 13), while the quantile residuals under the true model follow a standard normal distribution almost perfectly and, under the wrong model, deviate from the diagonal line and hence are not normally distributed. For the GJS-Logistic regression model (Figure 14), both residuals follow a standard normal distribution under the true model, and are not normally distributed under the wrong model. We also replicate the above experiment by simulating 1000 datasets from the true model.
For each replication, we fit the true and wrong models and compute both residuals. We then apply the Anderson-Darling test [see, for example, 14] to test the normality of the residuals. The results are presented in


Section 2 of the Online Supplementary Material. Figure 15 confirms that the deviance residual is not normally distributed in the GJS-Laplace regression model, since the observed values of the Anderson-Darling test statistic are greater than the critical point under both the true and the wrong model, whereas all observed values of the Anderson-Darling test statistic for the quantile residuals are less than the critical point under the true model (which confirms that the quantile residual is normally distributed), and greater than the critical point under the wrong model. Additionally, note that there are values of the Anderson-Darling test statistic for the deviance residuals which exceed the critical point in the GJS-Logistic regression model (Figure 16) even under the true model, while under the wrong model both residuals are not normally distributed.

Next, our third Monte Carlo experiment evaluates the performance of r_i^d and r_i^q in detecting misspecification of the density generator in the GJS regression model. A random sample of size n = 1000 is generated from the GJS-PE regression model with k = −0.5, log(ξ_i/(1 − ξ_i)) = η_{1i} = 1.2 − 2.1 x_i and log(δ_i) = η_{2i} = −0.5 + 4.5 s_i, but we assume a GJS-Logistic regression model to fit these data. The covariate values were obtained as random draws from the standard uniform distribution U(0, 1). The results are presented in Section 2 of the Online Supplementary Material. Figure 17 displays QQ-plots of the deviance and quantile residuals for the true and wrong models, indicating that the deviance residuals do not follow a standard normal distribution under either the true or the wrong model, while the quantile residuals under the true model follow a standard normal distribution and, under the wrong model, deviate considerably from the diagonal line in both the upper and lower tails, which clearly indicates a lack of fit.
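The Anderson-Darling comparisons above can be mimicked with `scipy.stats.anderson`; an illustrative sketch (not from the paper) with stand-in residual samples, one well-behaved normal and one clearly non-normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
res_true = rng.standard_normal(1000)        # stand-in residuals under a true model
res_wrong = rng.exponential(size=1000) - 1  # stand-in residuals under a wrong model

ad_true = stats.anderson(res_true, dist='norm')
ad_wrong = stats.anderson(res_wrong, dist='norm')

# critical_values pairs with significance_level = [15, 10, 5, 2.5, 1] percent;
# a statistic above the chosen critical point rejects normality.
crit_1pct = ad_true.critical_values[-1]
```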
We also replicate this experiment by simulating 1000 datasets from the true model. For each replication, we compute the Anderson-Darling test statistic to test the normality of both residuals. Figure 18 reveals that the deviance residual is not normally distributed under either the true or the wrong model, since the observed values of this statistic are greater than the critical point. On the other hand, all observed values of this statistic for the quantile residuals are less than the critical point under the true model, and greater than the critical point under the wrong model.

Finally, we generate a sample of size n = 1000 from the beta regression model [see, for example, 5, 6], say Y_i ∼ Beta(μ_i, φ_i) for i = 1, ..., n, with log(μ_i/(1 − μ_i)) = 1.2 − 2.1 x_i and log(φ_i) = −0.5 + 4.5 s_i, where the covariate values were obtained as random draws from the standard uniform distribution U(0, 1). However, we assume that the data come from the GJS-Laplace and GJS-Logistic regression models, where

log(ξ_i/(1 − ξ_i)) = η_{1i} = β_1 + β_2 x_i,   log(δ_i) = η_{2i} = τ_1 + τ_2 s_i.

Figure 19 in Section 2 of the Online Supplementary Material displays QQ-plots of the deviance and quantile residuals for the GJS-Logistic and GJS-Laplace regression models. Note that both residuals were capable of identifying lack of fit (i.e., misspecification in the response distribution) when considering the GJS-Logistic regression, since they deviate considerably from the diagonal line in both the upper and lower tails. Additionally, it is evident that the deviance residual suggests that the GJS-Laplace regression


model fits the data adequately, and hence it was not capable of identifying misspecification in the response distribution when the GJS-Laplace regression model is assumed to fit the data. Recall that the simulation results above indicated that the deviance residual is not normally distributed in the GJS-Laplace regression model, which means that using this reference distribution in such a case can lead to misleading conclusions. On the other hand, the quantile residual correctly indicates that the GJS-Laplace regression model is not suitable for these data, since it deviates considerably from the diagonal line in the lower tail.

In conclusion, after these extensive Monte Carlo simulation experiments, it is worth stressing that, besides the simplicity of its computation, the quantile residual r_i^q seems to be an excellent choice for checking the adequacy of the GJS regression model in practical applications, mainly because its empirical distribution follows the standard normal distribution (a unified/standard reference distribution) for all GJS regression models considered. Therefore, the standard normal distribution can indeed be used as a reference distribution for the quantile residual in the class of GJS regression models for model checking. On the other hand, as the Monte Carlo experiments make clear, care is needed when using the deviance residual in practice, since its empirical distribution can deviate considerably from the standard normal distribution, mainly for the GJS-Laplace and GJS-PE regression models. Misleading conclusions regarding the overall goodness-of-fit of the fitted GJS regression model can therefore be reached on the basis of the deviance residual when the standard normal distribution is used as its reference distribution. We therefore strongly recommend using the quantile residual we propose to detect lack of fit in the class of GJS regression models.

5 Applications

In what follows we present and discuss two empirical applications. The first application uses simulated data, whereas the second one employs real data. All computations were carried out using the Ox program, and all graphics were produced using the R program. Estimation was performed with the quasi-Newton optimization algorithm known as Broyden-Fletcher-Goldfarb-Shanno (BFGS), with analytic first derivatives, through the subroutine maxBFGS(...).

5.1 Application that uses simulated data

The main aim here is to assess diagnostic analysis based on the deviance and quantile residuals when the dispersion is incorrectly taken as constant. The true data-generating process has nonconstant dispersion:

log(ξ_i/(1 − ξ_i)) = η_{1i} = β_1 + β_2 x_i,   log(δ_i) = η_{2i} = τ_1 + τ_2 s_i,


where i = 1, ..., 40. Fixed-dispersion GJS-Normal, -Logistic, -Student-t (with ν = 2 and 4) and -PE (with k = −0.6, −0.2, 0.2 and 0.6) regression models are, nonetheless, estimated. The values of x_i were obtained as random draws from the standard uniform distribution U(0, 1), while s_i is an indicator (dummy) variable that equals 0 for the first 20 observations and 1 otherwise. Thus, dispersion is not constant over the dataset. Finally, we consider β_1 = 1.2 and β_2 = −2.1 (which yielded ξ ∈ (0.29, 0.77)), and λ = 30 (τ_1 = 5.0 and τ_2 = −3.4).

The index plots and normal QQ-plots of the deviance and quantile residuals are provided in Section 3 of the Online Supplementary Material. Figures 20 and 21 contain the index plots and normal QQ-plots for the deviance residual, while Figures 22 and 23 contain those for the quantile residual. The index plots indicate that the residual dispersion is not constant across observations. Indeed, there is noticeably more dispersion for the final 20 observations, which uncovers an important feature of the true data-generating process. It is also noteworthy that several observations are singled out as atypical, which is likely due to the incorrect assumption of constant dispersion. As suggested by [15], it is usual to add envelopes to the normal QQ-plots of the residuals [see, for instance, 16] to decide whether the observed residuals are consistent with the fitted model. Observations corresponding to absolute residuals outside the limits provided by the simulated envelope are worthy of further investigation. Additionally, if a considerable proportion of points falls outside the envelope, then one has evidence against the adequacy of the fitted model. Below we present an algorithm to build simulated envelopes for the deviance and quantile residuals in the GJS regression model:

Step 1. For each i = 1, ..., n, compute ξ̂_i and δ̂_i.

ED

Step 2. Generate n observations y˜i , where y˜i ∼ GJS(ξbi , δbi ; g).

PT

Step 3. Compute the regression coefficients (say β˜ and τ˜ ) from the regression of y˜i on the covariates X and S.

CE

Step 4. Compute the deviance and quantile residuals using y˜i , β˜ and τ˜ ; denote the resulting residuals by r˜id and r˜iq , respectively. q d Step 5. Repeat the previous steps K times, thus obtaining K residuals r˜ik and r˜ik , for i = 1, . . . , n and k = 1, . . . , K.

AC

q d in non-decreasing order, thus obtaining Step 6. For each k = 1, . . . , K, sort the n residuals r˜ik and r˜ik q d the ordered residuals r˜(i)k and r˜(i)k .

Step 7. For each i = 1, . . . , n, obtain the sample (γ/2)- and (1 − γ/2)-quantiles1 of the ordered residuals d[γ/2] d[1−γ/2] q[γ/2] q[1−γ/2] q d r˜(i)k over k, say r˜(i)k and r˜(i)k , and of the ordered residuals r˜(i)k over k, say r˜(i)k and r˜(i)k . 1

In this paper, all simulated envelopes were constructed by using K = 2000, and γ = 1% thus having a 99% confidence band for the residuals.

12

ACCEPTED MANUSCRIPT

d[γ/2]

Step 8. The lower and upper bounds for each residual rid of the original regression are given by r˜(i)k d[1−γ/2]

and r˜(i)k

, respectively. Also, the lower and upper bounds for each residual riq of the original q[γ/2]

regression are given by r˜(i)k

q[1−γ/2]

and r˜(i)k

, respectively.
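The steps above amount to a parametric bootstrap, and can be sketched generically. The sketch below is illustrative only: the `fit`, `simulate` and `residual` callables are hypothetical stand-ins (here instantiated with a simple i.i.d. normal model and its quantile residual), not the paper's GJS fitting routines, which would plug into the same slots.

```python
import numpy as np
from scipy import stats

def simulated_envelope(y, fit, simulate, residual, K=2000, gamma=0.01, seed=0):
    """Parametric-bootstrap envelope (Steps 1-8): simulate K datasets from the
    fitted model, refit each one, sort each replicate's residuals, and take
    pointwise (gamma/2)- and (1 - gamma/2)-quantiles of the ordered residuals."""
    rng = np.random.default_rng(seed)
    n = len(y)
    theta_hat = fit(y)                          # Step 1: parameters of the original fit
    R = np.empty((K, n))
    for k in range(K):                          # Steps 2-5: K bootstrap replicates
        y_tilde = simulate(theta_hat, rng)      # Step 2: data from the fitted model
        theta_tilde = fit(y_tilde)              # Step 3: refit on simulated data
        R[k] = np.sort(residual(y_tilde, theta_tilde))  # Steps 4 and 6
    lo = np.quantile(R, gamma / 2.0, axis=0)    # Step 7: pointwise quantiles over k
    hi = np.quantile(R, 1.0 - gamma / 2.0, axis=0)
    return lo, hi                               # Step 8: envelope bounds

# Illustrative stand-in model: i.i.d. normal, with the quantile residual
# r_q = Phi^{-1}(F(y; theta_hat)).
fit = lambda y: (y.mean(), y.std(ddof=1))
simulate = lambda th, rng: rng.normal(th[0], th[1], size=40)
residual = lambda y, th: stats.norm.ppf(stats.norm.cdf(y, loc=th[0], scale=th[1]))

y = np.random.default_rng(1).normal(0.5, 0.1, size=40)
lo, hi = simulated_envelope(y, fit, simulate, residual, K=200)
```

The same skeleton yields envelopes for either residual type by swapping the `residual` function.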

From Figures 20–23 (Section 3 of the Online Supplementary Material), note that some residuals fall outside the simulated envelopes in the normal QQ-plots, which can be taken as evidence of lack of fit (i.e., nonconstant dispersion). Additionally, it is worth emphasizing that the generated envelopes for the deviance residual in the GJS-PE regression model display asymmetry in some situations (see Figure 21 in Section 3 of the Online Supplementary Material), whereas the generated envelopes for the quantile residual in the GJS-PE regression model are symmetric in all cases (see Figure 23 in Section 3 of the Online Supplementary Material). Recall from the numerical results in the previous section that the deviance residual for the GJS-PE regression model is not well approximated by the standard normal distribution when the value of k is large in absolute value. In summary, we may conclude that the quantile residual is the most effective at revealing that the model is incorrectly specified.

5.2 Application that uses real data

ED

M

We are interested in modelling the proportion of blank votes (Y) in the 2006 Peruvian general election of an electoral district as a function of the Human Development Index (HDI). The voting data and the HDI data can be obtained from http://www.onpe.gob.pe and http://www.pnud.org.pe, respectively. We have 194 electoral districts, and the data are presented in Figure 1. We assume that Yi ∼ GJS(ξi, δi; g), with

log(ξi/(1 − ξi)) = η1i = β1 + β2 HDIi,    log(δi) = η2i = τ1 + τ2 HDIi,


where i = 1, . . . , 194. We shall assume GJS distributions with density generators of the normal, logistic, Student-t and PE distributions; that is, we have the GJS-Normal, -Logistic, -Student-t and -PE regression models. We would like to point out that some GJS regression models have additional unknown quantities (for instance, ν > 0 and k ∈ (−1, 1] in the GJS-Student-t and GJS-PE regression models, respectively). These extra quantities may be considered as known or fixed. Alternatively, we may use a profile log-likelihood procedure to choose the most appropriate values of such quantities in practice. This procedure corresponds to a conditional (indirect) method to estimate the model parameters β and τ [see, for example, 17]. It is worth emphasizing that inference under the conditional scenario for the additional quantities is easier to implement in our approach, because it corresponds to a particular GJS regression model with a fixed value for the additional quantity. The reader is referred to [17] for discussions about the cost of adding unknown extra quantities to a model. By using a profile log-likelihood procedure (see Figure 2), we obtain ν = 5.44 and k = 0.53.
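The profile log-likelihood procedure can be sketched as a grid search: fix the extra quantity, maximize over the remaining parameters, and keep the grid value with the largest maximized log-likelihood. The sketch below is illustrative only, using a plain Student-t location-scale model as a stand-in for the GJS-Student-t likelihood; the grid, starting values and toy data are invented for illustration.

```python
import numpy as np
from scipy import stats, optimize

def profile_loglik(y, nu_grid):
    """For each fixed value of the extra quantity (here nu, the Student-t
    degrees of freedom), maximize the log-likelihood over the remaining
    parameters (location and log-scale); return the profile values."""
    prof = []
    for nu in nu_grid:
        nll = lambda p: -np.sum(stats.t.logpdf(y, df=nu, loc=p[0],
                                               scale=np.exp(p[1])))
        res = optimize.minimize(nll, x0=[np.median(y), np.log(y.std())],
                                method="Nelder-Mead")
        prof.append(-res.fun)   # maximized log-likelihood at this nu
    return np.asarray(prof)

# Toy data from a t distribution; in the application the GJS fit would plug in.
y = stats.t.rvs(df=5, loc=0.2, scale=0.05, size=194, random_state=2)
grid = np.linspace(2, 10, 17)
prof = profile_loglik(y, grid)
nu_hat = grid[prof.argmax()]    # grid value maximizing the profile log-likelihood
```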

[Figure 1: HDI on proportion of blank votes — scatter plot of the proportion of blank votes (roughly 0.10–0.35) against the HDI (roughly 0.45–0.70) for the 194 electoral districts.]

[Figure 2: Profile log-likelihood function — left panel: GJS-Student-t regression, profiled over ν with maximum at ν = 5.44; right panel: GJS-PE regression, profiled over k with maximum at k = 0.53.]


Table 2: Parameter estimates.

GJS-Normal regression
Parameter   Estimate    SE       90% CI
β1           2.1220    0.2295   (1.745; 2.499)
β2          −6.4690    0.4151   (−7.152; −5.786)
τ1           2.3635    0.5202   (1.508; 3.219)
τ2          −1.9936    0.9227   (−3.511; −0.476)

GJS-Logistic regression
Parameter   Estimate    SE       90% CI
β1           2.2672    0.2080   (1.925; 2.609)
β2          −6.7349    0.3768   (−7.355; −6.115)
τ1           3.0638    0.6385   (2.014; 4.114)
τ2          −2.1065    1.1324   (−3.969; −0.244)

GJS-Student-t regression
Parameter   Estimate    SE       90% CI
β1           2.2984    0.2091   (1.954; 2.642)
β2          −6.7921    0.3787   (−7.415; −6.169)
τ1           2.6844    0.6713   (1.580; 3.789)
τ2          −2.1214    1.1905   (−4.080; −0.163)

GJS-PE regression
Parameter   Estimate    SE       90% CI
β1           2.2865    0.2408   (1.891; 2.683)
β2          −6.7729    0.4346   (−7.488; −6.058)
τ1           2.9449    0.6708   (1.842; 4.048)
τ2          −2.0895    1.1932   (−4.052; −0.127)

Table 2 lists the ML estimates, asymptotic standard errors (SE) and 90% asymptotic confidence intervals (CI) of the model parameters for the GJS-Normal, -Logistic, -Student-t and -PE regression models. From Table 2, we have a negative relationship between the response variable (proportion of blank votes) and the HDI; that is, we observe a high proportion of blank votes in electoral districts with low HDI, and this proportion decreases in electoral districts with high HDI. This result can be interpreted as distrust of the political party system, since a blank vote (also known as a protest vote) is a vote cast to demonstrate dissatisfaction with the choice of candidates or rejection of the current political system. Additionally, the dispersion decreases with the HDI, since the MLE of τ2 is negative in all cases.
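As a quick check of this interpretation, the fitted dispersion implied by the log link can be evaluated at the extremes of the observed HDI range. The numbers below use the GJS-Student-t estimates of τ1 and τ2 from Table 2; the HDI endpoints 0.45 and 0.70 are approximate values read off the scatter plot in Figure 1.

```python
import math

# log(delta_i) = tau1 + tau2 * HDI_i  (GJS-Student-t estimates from Table 2)
tau1, tau2 = 2.6844, -2.1214
delta = lambda hdi: math.exp(tau1 + tau2 * hdi)

low_hdi, high_hdi = delta(0.45), delta(0.70)
# Since tau2 < 0, the fitted dispersion shrinks as the HDI grows:
# delta(0.45) ≈ 5.64 versus delta(0.70) ≈ 3.32.
```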


Figure 3 displays the index plots of the deviance and quantile residuals for all regression models. Notice that the residuals are randomly scattered around zero. Also, both the deviance and quantile residuals reveal a large negative residual (case #20) and a large positive residual (case #147) for all regression models; that is, both residuals indicate that observations #20 and #147 are atypical. Case #20 corresponds to a district with HDI = 0.5088 and a proportion of blank votes of approximately 7.31%, whereas case #147 corresponds to a district with HDI = 0.6423 and a proportion of blank votes of approximately 35.6%. Figure 4 displays the normal QQ-plots (with generated envelopes) of the deviance and quantile residuals for all regression models. Note that the GJS-Normal, -Logistic and -PE regression models seem to be inappropriate for these data, since there are observations outside the envelope. On the other hand, the GJS-Student-t regression model seems to be appropriate, since no observations fall outside the envelope; that is, the generated envelope for the GJS-Student-t regression model does not present any unusual feature.

[Figure 3: Residuals index plots: deviance and quantile residuals. Eight panels (deviance and quantile residuals for the GJS-Normal, -Logistic, -Student-t and -PE regression models) plotted against the observation index; cases #20 and #147 are labelled in every panel.]

[Figure 4: Normal QQ-plots with generated envelopes: deviance and quantile residuals. Eight panels (deviance and quantile residuals for the GJS-Normal, -Logistic, -Student-t and -PE regression models) plotted against standard normal quantiles.]

We would like to point out that, as evidenced by our Monte Carlo simulations, the generated envelope for the deviance residual in the GJS-PE regression model can be asymmetric for large values of k (in absolute value). Since the value obtained for the real data was k = 0.53, the confidence bands for ri^d are (slightly) asymmetric, as expected. Hence, one needs to be careful when drawing conclusions about the adequacy of the GJS-PE model fitted to the data in such a case. Therefore, the better tool for assessing the adequacy of the GJS-PE regression model fitted to the real data is the quantile residual we propose. In what follows we further investigate cases #20 and #147. In order to reveal the impact of these two observations on the parameter estimates, Table 3 shows the absolute changes (AC) in the ML estimates after dropping each of the two cases with outstanding influence for the GJS-Normal, -Logistic, -Student-t and -PE regression models. The AC of each estimate is defined as AC_{θj} = |θ̂j − θ̂j(I)|, where θ̂j(I) denotes the ML estimate of θj, with j = 1, . . . , m (m being the total number of parameters), after the set I of observations has been removed. We also present the corresponding p-values in this table. From the figures in Table 3, note that observation #147 is highly influential for the estimate of τ2 in the GJS-Normal, -Logistic and -PE regression models; that is, the HDI becomes non-significant at the 10% nominal level when this observation is removed from the data for these regression models. On the other hand, this observation does not change the significance of this parameter in the GJS-Student-t regression model, confirming the robustness of the Student-t distribution against outlying observations. Hence, the GJS-Student-t regression model provides a good representation of the election data.
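The AC measure is simply a refit with the flagged case deleted. A minimal sketch, using an OLS fit as a hypothetical stand-in for the ML estimation (the design matrix, response and planted influential case below are invented for illustration):

```python
import numpy as np

def absolute_changes(X, y, drop):
    """AC_j = |theta_hat_j - theta_hat_j(I)|: absolute change in each
    coefficient after removing the observations in `drop`."""
    fit = lambda X_, y_: np.linalg.lstsq(X_, y_, rcond=None)[0]
    full = fit(X, y)                      # estimates from the full data
    keep = np.ones(len(y), dtype=bool)
    keep[drop] = False
    reduced = fit(X[keep], y[keep])       # estimates with case(s) I removed
    return np.abs(full - reduced)

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.uniform(size=50)])
y = X @ np.array([1.2, -2.1]) + rng.normal(scale=0.1, size=50)
y[20] += 2.0                              # plant an influential case
ac = absolute_changes(X, y, drop=20)      # large AC flags the planted case
```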


Table 3: AC and the corresponding p-values.

Dropped observation: #20
            Normal             Logistic           Student-t          PE
Parameter   AC      p-value    AC      p-value    AC      p-value    AC      p-value
β1          0.0417  < 0.01     0.0116  < 0.01     0.0069  < 0.01     0.0169  < 0.01
β2          0.0595  < 0.01     0.0169  < 0.01     0.0102  < 0.01     0.0254  < 0.01
τ1          0.8630  < 0.01     0.3797  < 0.01     0.2125  < 0.01     0.4957  < 0.01
τ2          1.3918  < 0.01     0.6162  0.01       0.3455  0.019      0.8032  0.01

Dropped observation: #147
            Normal             Logistic           Student-t          PE
Parameter   AC      p-value    AC      p-value    AC      p-value    AC      p-value
β1          0.1587  < 0.01     0.0585  < 0.01     0.0370  < 0.01     0.0710  < 0.01
β2          0.2973  < 0.01     0.1101  < 0.01     0.0697  < 0.01     0.1333  < 0.01
τ1          0.8323  < 0.01     0.4244  < 0.01     0.2797  < 0.01     0.5223  < 0.01
τ2          1.5810  0.333      0.8041  0.128      0.5294  0.088      0.9913  0.193


6 Concluding remarks


As remarked earlier, residuals carry important information concerning the appropriateness of the assumptions that underlie statistical models, and thereby play an important role in checking model adequacy. They are used to identify discrepancies between models and data, so it is natural to base residuals on the contributions made by individual observations to measures of model fit. The use of residuals for assessing the adequacy of fitted regression models is nowadays commonplace due to the widespread availability of statistical software, much of which can display residuals and diagnostic plots, at least for the more commonly used models. In short, residual analysis aims at identifying atypical observations and/or model misspecification. Recently, [3] introduced the generalized Johnson SB (GJS) regression model for limited-range response variables on the basis of the generalized Johnson SB distribution (parameterized in a convenient way). In this paper, we introduced a new residual for the class of GJS regression models, which uses the GJS cumulative distribution function in its computation. It was numerically evaluated relative to the deviance residual proposed by [3]. The Monte Carlo simulation results favour the residual we propose, since its finite-sample distribution is better approximated by the standard normal distribution than that of the deviance residual; therefore, this unified reference distribution can indeed be used for the quantile residual in the class of GJS regression models for model checking. Additionally, the quantile residual we propose is able to identify misspecification in the response distribution, while the deviance residual can fail to do so. Overall, the residual we propose is therefore a better choice than the deviance residual for identifying departures from the model assumptions and for assessing the overall goodness of fit of the GJS regression model.
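The claim that the standard normal is a unified reference distribution for the quantile residual follows from the probability integral transform: r_q = Φ⁻¹(F(y; θ)) is exactly N(0, 1) under the true model, whatever F is. A quick numerical illustration, using a Beta distribution as a stand-in for the GJS cdf (the shape values below are arbitrary):

```python
import numpy as np
from scipy import stats

# Under the true model, F(Y; theta) ~ U(0,1), so the quantile residual
# r_q = Phi^{-1}(F(Y; theta)) is exactly standard normal.
y = stats.beta.rvs(2.0, 5.0, size=5000, random_state=42)   # stand-in for GJS draws
r_q = stats.norm.ppf(stats.beta.cdf(y, 2.0, 5.0))          # quantile residuals

ks = stats.kstest(r_q, "norm")   # should be consistent with N(0, 1)
```

In contrast, a deviance residual's finite-sample distribution depends on the model, which is what the Monte Carlo simulations quantify for the GJS family.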
Finally, and no less important, it would be interesting to study how the probability-scale residual [7] behaves in the class of GJS regression models. An in-depth investigation is beyond the scope of the current paper, but it is certainly an interesting topic for future work.


Acknowledgments


The authors would like to express their deepest gratitude to the Associate Editor and an anonymous reviewer for their insightful comments and suggestions that greatly improved this paper. A.J. Lemonte was supported by grant 301808/2016–3 from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq-Brazil). G. Moreno–Arenas gratefully acknowledges grants from the Mobility Program of the Universidad Industrial de Santander (Colombia).

Supplementary material

Supplementary material associated with this article can be found in the online version of the manuscript.


References

[1] N.L. Johnson, Systems of frequency curves generated by the methods of translation. Biometrika. 36 (1949) 149–176.

[2] S. Kotz, J.R. van Dorp, Beyond Beta: Other Continuous Families of Distributions with Bounded Support and Applications. World Scientific Press, Singapore, 2004.

[3] A.J. Lemonte, J. Bazán, New class of Johnson SB distributions and its associated regression model for rates and proportions. Biometrical Journal. 58 (2016) 727–746.

[4] K.T. Fang, T.W. Anderson, Statistical Inference in Elliptical Contoured and Related Distributions. Allerton Press, New York, 1990.

[5] S.L.P. Ferrari, F. Cribari–Neto, Beta regression for modeling rates and proportions. Journal of Applied Statistics. 31 (2004) 799–815.

[6] F. Cribari–Neto, A. Zeileis, Beta regression in R. Journal of Statistical Software. 34 (2010) 1–24.

[7] B.E. Shepherd, C. Li, Q. Liu, Probability-scale residuals for continuous, discrete, and censored data. Canadian Journal of Statistics. 44 (2017) 463–479.

[8] P. McCullagh, J. Nelder, Generalized Linear Models, 2nd ed. Chapman & Hall, London, 1989.

[9] P.K. Dunn, G.K. Smyth, Randomised quantile residuals. Journal of Computational and Graphical Statistics. 5 (1996) 236–244.

[10] J.A. Doornik, An Object-Oriented Matrix Language – Ox 6. Timberlake Consultants Press, London, 2009.

[11] R Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016.

[12] P.L. Espinheira, E.G. Santos, F. Cribari–Neto, On nonlinear beta regression residuals. Biometrical Journal. 59 (2017) 445–461.

[13] C. Feng, A. Sadeghpour, L. Li, Randomized quantile residuals: an omnibus model diagnostic tool with unified reference distribution. arXiv preprint arXiv:1708.08527v1 (2017).

[14] G. Chen, N. Balakrishnan, A general purpose approximate goodness-of-fit test. Journal of Quality Technology. 27 (1995) 154–161.

[15] A.C. Atkinson, Plots, Transformation and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Oxford University Press, New York, 1985.

[16] J. Neter, M.H. Kutner, C.J. Nachtsheim, W. Wasserman, Applied Linear Statistical Models, 4th ed. Irwin, Chicago, 1996.

[17] J.M.G. Taylor, A.L. Siqueira, R.E. Weiss, The cost of adding parameters to a model. Journal of the Royal Statistical Society B. 58 (1996) 593–607.