Testing the hypothesis of increasing hazard ratio in two samples

Computational Statistics and Data Analysis xx (xxxx) xxx–xxx
Shyamsundar Sahoo (a,*), Debasis Sengupta (b)

(a) Department of Statistics, Haldia Government College, Haldia 721657, India
(b) Applied Statistics Unit, Indian Statistical Institute, 203 B.T. Road, Kolkata 700108, India

Article history: Received 14 March 2016; received in revised form 9 April 2017; accepted 25 April 2017.

Keywords: Proportional hazards model; Increasing failure rate; Greatest convex minorant; Increasing hazard ratio; Two-sample problem

Abstract: Tests designed to detect increasing hazard ratio against the proportional hazards hypothesis are generally consistent against other alternatives as well. This article provides a test of the null hypothesis of increasing hazard ratio. The test is based on the separation between an empirical version of the relative trend function and its greatest convex minorant. The proportional hazards model, the least favorable null model for large samples, is used to produce simulation-based cutoffs. A simulation study shows reasonable performance of the proposed test in small samples. The analytical test, together with a graphical version, is illustrated through two real-life examples. © 2017 Elsevier B.V. All rights reserved.

1. Introduction

The proportional hazards (PH) assumption has been used widely for modeling and analysis of survival data. A test of this assumption is not only an important two-sample problem, it is also relevant as a diagnostic check for Cox's regression model (Cox, 1972). Several graphical and analytical tests have been proposed for this purpose; see Kay (1977), Lee and Pirie (1981), Wei (1984), Breslow et al. (1984), Gill and Schumacher (1987), Dabrowska et al. (1989), Dabrowska et al. (1992), Deshpande and Sengupta (1995), Sengupta et al. (1998) and Sahoo and Sengupta (2016).

In many real life situations, the hazard ratio of two samples is observed to vary with time. Pocock et al. (1982) observed the phenomenon of crossing hazards while studying long-run prognostic relevance in the treatment of breast cancer patients. Champlin et al. (1983) and Begg et al. (1984) found the superiority of one treatment over another only in the short term. Ng'andu (1997) observed a non-constant hazard ratio while comparing the survival probabilities of heroin addicts treated in two different methadone clinics, and also reported an instance of crossing hazards among gastric carcinoma patients in a clinical trial that compared chemotherapy with chemotherapy combined with radiation therapy. The phenomenon of crossing hazards is sometimes modeled with an increasing hazard ratio (IHR).

Several authors (see Gill and Schumacher, 1987; Deshpande and Sengupta, 1995) have proposed analytical tests of the PH hypothesis in two samples against the monotone hazard ratio alternative. Sengupta et al. (1998) proposed a test procedure against the weaker alternative of increasing cumulative hazard ratio. In case the hazard ratio in the two samples is neither constant nor monotonically increasing, a procedure for testing the PH hypothesis against the IHR alternative would produce a meaningless decision.
Therefore, whether or not the PH hypothesis is rejected in favor of the IHR alternative for a particular data set, one cannot conclude that the hazard ratio is increasing.



Corresponding author. E-mail address: [email protected] (S. Sahoo).

http://dx.doi.org/10.1016/j.csda.2017.04.009 0167-9473/© 2017 Elsevier B.V. All rights reserved.


A definite conclusion can be reached if rejection of the PH hypothesis by such a test is complemented by non-rejection by another test, for which the null hypothesis is IHR. There is no such test in the existing literature. In this paper, we propose an omnibus goodness-of-fit test to check the validity of the IHR assumption. We develop the procedure for complete data (with no censoring), and later indicate how it can be extended to the case of randomly right-censored data.

Consider two absolutely continuous distributions FX and FY over the support [0, ∞), having positive density functions fX and fY, cumulative hazard functions (CHF) ΛX = −ln(1 − FX) and ΛY = −ln(1 − FY), and hazard functions λX and λY, respectively. The IHR model for the ordered pair of distributions (FX, FY) means that the ratio λY/λX is a monotone increasing function over [0, ∞), while the PH model is such that λY/λX is equal to some positive constant for all t > 0. Let X1, X2, …, Xm and Y1, Y2, …, Yn be sets of independent samples from the distributions FX and FY, respectively. We assume that each set of samples is arranged in increasing order, and denote them by the vectors X and Y, respectively. A typical test of PH vs. IHR concerns the hypotheses

Null hypothesis,        H0: λY/λX is a positive constant,
Alternative hypothesis, H1: λY/λX is an increasing function.    (1)

The testing problem considered here concerns the hypotheses

Null hypothesis,        H1: λY/λX is an increasing function,
Alternative hypothesis, H2: λY/λX is not an increasing function.    (2)

This article is organized as follows: In Section 2, we present a solution to the hypothesis testing problem (2) on the basis of data Y, assuming that FX is known. In Section 3, we develop a procedure for the testing problem (2), on the basis of data X and Y. In Section 4, we present a graphical version of this test. In Section 5, we present the results of a detailed simulation study on the small sample performance of the proposed tests. In Section 6, we illustrate the use of the proposed tests through the analysis of two data sets. Some concluding remarks are given in Section 7.

2. A test for the IHR hypothesis: FX is known

Let Zi = ΛX(Yi) for i = 1, 2, …, n, and Z = (Z1, Z2, …, Zn)^T. The vector Z may be said to contain a set of n ordered samples from the distribution FZ = FY ∘ ΛX⁻¹, having CHF ΛZ = ΛY ∘ ΛX⁻¹. Sengupta and Deshpande (1994) showed that λY/λX is a monotonically increasing function if and only if ΛY ∘ ΛX⁻¹ is convex. The convexity of ΛZ, in turn, is equivalent to the increasing failure rate (IFR) property of the distribution function FZ. Therefore, the testing problem (2) reduces to the simplified problem

Null hypothesis,        H1*: FZ is an IFR distribution,
Alternative hypothesis, H2*: FZ is not an IFR distribution,    (3)

based on the data Z. A solution to the above testing problem was proposed by Tenga and Santner (1984). We show the key steps in the development of a modification of that test, with the intention of following a similar route in Section 3, when FX is unknown and represented by the data X.

Let us assume Y1 < Y2 < ··· < Yn, i.e., the elements of Y are distinct. Note that, under the assumed model, this assumption holds with probability 1. Let Λ̂Z be the Nelson–Aalen estimator based on Z, defined as

Λ̂Z(t) = Σ_{i=1}^{n} [ I(Zi ≤ t) / Σ_{j=1}^{n} I(Zi ≤ Zj) ],

which can also be written as

Λ̂Z(t) = Σ_{i=1}^{n} [ I(Yi ≤ ΛX⁻¹(t)) / Σ_{j=1}^{n} I(Yi ≤ Yj) ] = Λ̂Y ∘ ΛX⁻¹(t).

Let ĜZ denote the greatest convex minorant (GCM) of Λ̂Z, i.e., the largest convex function dominated by Λ̂Z. It can be shown that ĜZ is the piecewise linear function obtained by linearly interpolating in between its values at Z1, Z2, …, Zn, which are

gj = ĜZ(Zj) =
  h1,                                                                               j = 1,
  min{ hj, min_{D(j)} [ ((Zk − Zj)/(Zk − Zi)) hi + ((Zj − Zi)/(Zk − Zi)) hk ] },    1 < j < n,
  hn,                                                                               j = n,

where hj = Λ̂Z(Zj−) for j = 1, 2, …, n and D(j) = {(i, k) : 1 ≤ i < j < k ≤ n} for j = 1, 2, …, n. This fact is established by following the arguments used by Tenga and Santner (1984) to prove a similar result (their Theorem 2.1), where they used

the negative logarithm of the empirical survival function instead of the Nelson–Aalen estimator. The function ĜZ is zero over [0, Z1) (where Λ̂Z is zero) and undefined over (Zn, ∞) (where Λ̂Z is constant). The domain of ĜZ is [0, Zn]. The Kolmogorov–Smirnov (K-S) distance statistic for testing (3) is the maximum separation between the graphs of a left-continuous version of Λ̂Z and ĜZ, which simplifies to

KSDn(Z) = max_{1≤j≤n} {wj (hj − gj)},    (4)

where wj is a weight function with wj > 0 for 1 ≤ j ≤ n. At least three observed lifetimes are needed for the statistic to be computable. The weight function may be suitably chosen to counter the instability of the unweighted difference, arising from large variability of Λ̂Z in the right tail because of smaller risk sets at large t (Gill and Schumacher, 1987).

It can be shown along the lines of Tenga and Santner (1984, Theorem 3.1) that KSDn(Z) is stochastically smaller than KSDn(U), where U = (ΛZ(Z1), ΛZ(Z2), …, ΛZ(Zn))^T. Since the latter vector consists of ordered samples from the unit exponential distribution, the small sample distribution of KSDn(U) can be obtained through simulations, and be used for obtaining a conservative test.

3. A test for the IHR hypothesis: FX is unknown

Let us denote the Nelson–Aalen estimator based on the X-sample by Λ̂X(t). Then, for fixed Y1 < Y2 < ··· < Yn, define Zmj = Λ̂X(Yj) for j = 1, 2, …, n, and Zm = (Zm1, Zm2, …, Zmn)^T. Note that the elements of Zm may not be unique. Let Λmn be the Nelson–Aalen estimator computed from the data Zm.

The function Λmn is related to the relative trend function (RTF), defined by Lee and Pirie (1981) as ΛY ∘ ΛX⁻¹. Gill and Schumacher (1987) had worked with the empirical RTF (ERTF), Λ̂Y ∘ Λ̂X⁻¹, where Λ̂X⁻¹ is the left-continuous function defined by Λ̂X⁻¹(u) = inf{t : t > 0, Λ̂X(t) ≥ u}. It can be verified that Λmn(t) = Λ̂Y ∘ Λ̂X⁻¹(t+) for t ∈ [0, Zmn). If one uses the right-continuous inverse defined by Λ̂X⁻¹(u) = sup{t : t > 0, Λ̂X(t) ≤ u}, then the function Λ̂Y ∘ Λ̂X⁻¹ coincides with Λmn over the set [0, Zmn).

Let us denote the GCM of Λmn by Gmn. It can be shown that this function has a form similar to that of ĜZ defined in Section 2, viz. it is the piecewise linear function obtained by linearly interpolating in between its values at Zm1, Zm2, …, Zmn, which are

gmj =
  hm1,                                                                                               1 ≤ j ≤ cmn,
  min{ hmj, min_{D(j)} [ ((Zmk − Zmj)/(Zmk − Zmi)) hmi + ((Zmj − Zmi)/(Zmk − Zmi)) hmk ] },          cmn < j ≤ n − dmn,
  hmn,                                                                                               n − dmn < j ≤ n,

where

cmn = Σ_{j=1}^{n} I(Zmj = Zm1),    dmn = Σ_{j=1}^{n} I(Zmj = Zmn),
hmj = Λmn(Zmj−),    1 ≤ j ≤ n,
D(j) = {(i, k) : cmn ≤ i < j < k ≤ n − dmn + 1; Zmi < Zmj < Zmk},    cmn < j ≤ n − dmn.

Note that linear interpolation is needed only when gmj ≠ gml, which happens only when Zmj ≠ Zml. Moreover, the function Gmn is defined on [0, Zmn] as described in Section 2. For testing the hypotheses in (2), we define a statistic analogous to the maximum distance statistic mentioned in Section 2, as

KSDn(Zm) = max_{cmn ≤ j ≤ n} {wj (hmj − gmj)},    (5)

where the weight wj is positive for each j. We now prove a result that enables us to approximate conservatively the null distribution of the statistic KSDn(Zm) by a universal distribution when the size m of the first sample is large.

Theorem 1. Let Zm = (Zm1, Zm2, …, Zmn) be the vector of n order statistics generated from two independent samples X and Y of sizes m and n from the distributions FX and FY, respectively, as described above, such that the ordered pair of distributions (FX, FY) is IHR and FY ∘ FX⁻¹(1) = 1. Then, for fixed n,

lim_{m→∞} P{KSDn(Zm) ≥ t} ≤ P{KSDn(U) ≥ t}    for all t ≥ 0,    (6)

where U = (U1, U2, …, Un) is the vector of order statistics of a sample of size n from the unit exponential distribution. The above result holds with equality if FX and FY have proportional hazards.

Proof. It can be shown that Λ̂X converges uniformly with probability 1 to ΛX as m → ∞ on [0, τ], for any positive number τ strictly less than FX⁻¹(1) (Shorack and Wellner, 1986). For such a number τ, let Aτ denote the event that Yn ≤ τ and Bτ denote the event that Λ̂X converges to ΛX uniformly on [0, τ] as m → ∞. Then, P(Aτ ∩ Bτ) = P(Aτ)P(Bτ) = P(Aτ) = FYn(τ), where FYn denotes the distribution function of Yn.

Let C denote the event that Zm → Z as m → ∞. It is easy to see that Aτ ∩ Bτ implies

∥Zm − Z∥₂ ≤ √n max_{1≤j≤n} |Zmj − Zj| = √n max_{1≤j≤n} |Λ̂X(Yj) − ΛX(Yj)| ≤ √n sup_{0≤y≤τ} |Λ̂X(y) − ΛX(y)| → 0 as m → ∞.
Thus, (Aτ ∩ Bτ) ⊂ (Aτ ∩ C). Hence, P(Aτ ∩ Bτ) ≤ P(Aτ ∩ C) ≤ P(Aτ), i.e., P(Aτ ∩ C) = FYn(τ). Let D denote the event that there is a number M such that the vector Zm has n distinct elements for m > M. We shall show that (Aτ ∩ C) ⊂ D. To see this implication, suppose the event Aτ ∩ C has taken place. It follows that there is a number M such that for m > M, we have ∥Zm − Z∥₂ < ϵ, where ϵ = (1/3) min_{1≤i<j≤n} (Zj − Zi). Then, for m > M and 1 ≤ i < j ≤ n, we can write

|Zmj − Zmi| = |(Zmj − Zj) − (Zmi − Zi) + (Zj − Zi)| ≥ |Zj − Zi| − |Zmj − Zj| − |Zmi − Zi| > 3ϵ − ϵ − ϵ = ϵ > 0,

i.e., the event D has also occurred. Thus, P(Aτ ∩ C ∩ D) = P(Aτ ∩ C) = FYn(τ). If the event D takes place, then the vectors (hm1, hm2, …, hmn)^T and (gm1, gm2, …, gmn)^T are both continuous functions of Zm for sufficiently large m. Therefore, KSDn(Zm) is also a continuous function of Zm for sufficiently large m. Thus, Aτ ∩ C ∩ D implies that KSDn(Zm) → KSDn(Z), an event that we denote by E. Thus, P(E) ≥ FYn(τ). This inequality holds for all τ < FX⁻¹(1). By allowing τ to go arbitrarily close to FX⁻¹(1), we can arrange FYn(τ) to be arbitrarily close to {FY ∘ FX⁻¹(1)}ⁿ, which is equal to 1. We conclude that P(E) = 1. Since almost sure convergence implies convergence in distribution, we can write

lim_{m→∞} P{KSDn(Zm) ≥ t} = P{KSDn(Z) ≥ t}    for all t ≥ 0.
Since Z has an IFR distribution, Theorem 3.1 of Tenga and Santner (1984) implies that P{KSDn(Z) ≥ t} ≤ P{KSDn(U) ≥ t} for all t ≥ 0, with equality holding for all t ≥ 0 if Z has the unit exponential distribution. If FX and FY have proportional hazards, Z has an exponential distribution (not merely IFR), with mean equal to the ratio of the hazards of FX and FY. Further, it follows from Remark 2.2 of Tenga and Santner (1984) that KSDn(Z) is invariant under the linear scale change θZ with θ > 0. In particular, the mean of the exponential distribution does not matter. The last statement of the theorem follows. □

The above theorem shows that when m is large, the PH model gives rise to the least favorable configuration under H1 in (2) for testing this hypothesis against H2 through the statistic KSDn(Zm). A test of H1 vs. H2 is then given by the critical function

φ(Zm) = 1 if KSDn(Zm) > cα,n, and 0 otherwise,    (7)

where cα,n is chosen to satisfy P{KSDn(U) > cα,n} = α, U = (U1, U2, …, Un) being a vector of n order statistics of a sample of size n from the unit exponential distribution. According to Theorem 1, the level of the test (7) goes to α as m → ∞. The value cα,n may be obtained through simulation as the (1 − α) quantile of the distribution of KSDn(U). One can also bypass Theorem 1 and use a variation of the test (7) with critical function

φ*(Zm) = 1 if KSDn(Zm) > c*α,m,n(θ), and 0 otherwise,    (8)

where the alternative cutoff c*α,m,n(θ) is chosen as the (1 − α) quantile of the simulated distribution of KSDn(Zm), with X sampled from the unit exponential distribution and Y sampled from the exponential distribution with parameter θ (i.e., mean 1/θ), θ being a suitable hazard ratio that leads to a conservative cut-off. Though the exponential distribution is known to play a central role in the analysis of lifetime data, a cut-off with wider applicability may be obtained by exploring other families of distributions, as shown in the first experiment of Section 5.2. The above test can be extended to the case of randomly right-censored data by using the censored-data versions of the Nelson–Aalen estimators,

Λ̂X(t) = Σ_{i∈{1,…,m}: Xi uncensored} [ I(Xi ≤ t) / Σ_{j=1}^{m} I(Xi ≤ Xj) ],

Λmn(t) = Σ_{i∈{1,…,n}: Yi uncensored} [ I(Zmi ≤ t) / Σ_{j=1}^{n} I(Zmi ≤ Zmj) ].

While Theorem 1 cannot be used for censored data, a cut-off determined by simulation of two samples can be used.
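The computation behind the statistic in (4) can be sketched in a few lines. The following Python/NumPy code is an illustrative implementation under our own naming (the paper itself prescribes no code): it computes the Nelson–Aalen left limits hj for a complete sample, the GCM values gj by the chord formula, and the maximum weighted separation.

```python
import numpy as np

def nelson_aalen_at_points(z):
    """Sort the complete sample and return (sorted z, h), where
    h[j] = Lambda_hat(Z_j -) is the left limit of the Nelson-Aalen
    estimator: the cumulative sum of 1/(risk-set size) over points below Z_j."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    jumps = 1.0 / (n - np.arange(n))          # jump 1/(n - i) at the i-th smallest point
    h = np.concatenate(([0.0], np.cumsum(jumps)[:-1]))
    return z, h

def gcm_values(z, h):
    """Values g_j of the greatest convex minorant at the points z, via the
    chord formula of Tenga and Santner (1984): g_j is the minimum of h_j and
    of all chords through (z_i, h_i), (z_k, h_k) with i < j < k."""
    n = len(z)
    g = h.astype(float).copy()
    for j in range(1, n - 1):
        for i in range(j):
            for k in range(j + 1, n):
                chord = ((z[k] - z[j]) * h[i] + (z[j] - z[i]) * h[k]) / (z[k] - z[i])
                if chord < g[j]:
                    g[j] = chord
    return g

def ksd(z, w=None):
    """Maximum weighted separation max_j w_j (h_j - g_j) for a complete sample."""
    z, h = nelson_aalen_at_points(z)
    g = gcm_values(z, h)
    if w is None:
        w = np.ones_like(h)                   # the "no weighting" choice
    return float(np.max(w * (h - g)))
```

The triple loop applies the O(n³) chord formula directly; for large n the same gj can be obtained faster by a lower-convex-hull pass over the points.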

4. A related graphical test

Sahoo and Sengupta (2016) introduced acceptance bands for a constrained estimate of the relative trend function under the PH assumption, by making use of the supremum of the difference between this estimate and an unconstrained estimate. The idea was that if the hazards are indeed proportional, then the constrained estimate should lie within a neighborhood of the unconstrained estimate (an acceptance band). Non-containment within the band corresponds to rejection of the null hypothesis.

Using the above idea, the test proposed in Section 3 can be redesigned as a graphical test of the IHR assumption. Let w(t) be a piecewise constant left-continuous function defined over the interval [Zm1, Zmn) with the value wj at Zmj. Then, under the null hypothesis, the GCM of Λmn should lie, with probability 1 − α, above Λmn − c*α,m,n(θ)/w or Λmn − cα,n/w, depending on which cut-off is used. Therefore, this curve may be plotted as a lower acceptance band for the GCM of Λmn. The GCM strays beyond the acceptance band if and only if the hypothesis is rejected. If there is rejection, the plot gives a visual indication of where the GCM crosses the band, i.e., where it deviates most from the IHR assumption.

5. Simulation studies

We run an initial set of simulations to determine the thresholds for the proposed test, and examine whether to use ∗ the threshold cα,n or the threshold cα, m,n (θ) with a suitable θ . We then perform further simulations to study the empirical size and power of the tests. Since there is no competing test in the existing literature, we compare different versions of the proposed test (i.e., with different weight functions) with one another. We use five choices of the weights wj in the statistic (5),

wj^(1) = 1 / (n hmj),    (9)

wj^(2) = 1 / [ n (hmj − max{hmi : hmi < hmj}) ],    (10)

wj^(3) = wj^(1) + wj^(2),    (11)

wj^(4) = wj^(1) · wj^(2),    (12)

wj^(5) = 1.    (13)

The first two choices of weights are meant to compensate for large values and large fluctuations of the CHF, respectively (which happen due to the larger variance of the estimators of the cumulative hazard function in the right tail; see Section 2). Tenga and Santner (1984) had used these weight functions. The next two weight functions combine these effects in additive and multiplicative ways, respectively. The last choice amounts to no weighting at all. We use α = 0.05, and consider varying sample sizes of 50, 100, 200, 400 and 500 in the two groups. We investigate the properties of the tests for both complete and censored samples. A total of 10,000 replicates are used for obtaining each entry in Tables 1–10.

5.1. Determining conservative threshold

In order to determine cα,n, we simulate the random vector U as an ordered set of n samples from the unit exponential distribution, compute KSDn(U), and then use many replicates to determine the 1 − α quantile. In order to determine c*α,m,n(θ), we simulate the random vector X as an ordered set of m samples from the unit exponential distribution and the random vector Y as an ordered set of n samples from the exponential distribution with mean 1/θ. Subsequently, we compute KSDn(Zm), and then use 10,000 replicates to determine the 1 − α quantile. We consider values of θ from the interval [1/2, 4]. The results for the weight functions (9)–(13) are summarized in Tables 1–5, respectively.

It is observed from Tables 1 and 4 that as θ increases, the quantile c*0.05,m,n(θ) decreases gradually. For the range of values of θ considered here, the values of c*0.05,m,n(θ) are found smaller than c0.05,n for every choice of m and n in the table. This finding indicates that the cut-off c0.05,n, meant for large m, is conservative for the choices of m and n used here. On the other hand, since the highest cut-off occurs at one end of the range of θ used in the study, the size of the test (8) may not be protected when θ is unknown. Therefore, we ignore these weight functions, (9) and (12), in our further study.

When the weight functions wj^(2), wj^(3) and wj^(5) are used, the corresponding results reported in Tables 2, 3 and 5 show that the quantile c*0.05,m,n(θ) initially increases and then decreases with θ. The highest point is generally reached at a θ in the interval [1.5, 2.5]. However, the maximum value is larger than the corresponding quantile c0.05,n for the same value of n. Therefore, for all three weight functions given in (10), (11) and (13), we decide to use the test (8) with cut-off c*0.05,m,n(2). The test (7) can be used when m is very large.
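The threshold simulation described above can be sketched as follows. This is an illustrative Python version under our own naming, with far fewer replicates than the 10,000 used in the study; it uses the unweighted version of the statistic.

```python
import numpy as np

def ksd_unit_exp(u):
    """Unweighted KSD_n for a sorted complete sample u: maximum separation
    between the Nelson-Aalen left limits h_j and their greatest convex
    minorant g_j, the latter via the chord formula."""
    n = len(u)
    h = np.concatenate(([0.0], np.cumsum(1.0 / (n - np.arange(n)))[:-1]))
    g = h.copy()
    for j in range(1, n - 1):
        for i in range(j):
            for k in range(j + 1, n):
                chord = ((u[k] - u[j]) * h[i] + (u[j] - u[i]) * h[k]) / (u[k] - u[i])
                if chord < g[j]:
                    g[j] = chord
    return float(np.max(h - g))

def simulated_cutoff(n, alpha=0.05, reps=500, seed=0):
    """Approximate c_{alpha,n}: the (1 - alpha) quantile of KSD_n(U) over
    ordered samples U of size n from the unit exponential distribution."""
    rng = np.random.default_rng(seed)
    stats = [ksd_unit_exp(np.sort(rng.exponential(size=n))) for _ in range(reps)]
    return float(np.quantile(stats, 1.0 - alpha))
```

By construction the cutoff is monotone in α for a fixed seed: a smaller α yields a larger threshold.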

Table 1. Empirical thresholds c0.05,n and c*0.05,m,n(θ) for test statistic (5) with weight (9).

(m, n):        (50,50)  (100,100) (200,100) (100,200) (200,200) (400,200) (200,400) (500,500)
c0.05,n        0.0193   0.0096    0.0096    0.0048    0.0048    0.0048    0.0024    0.0019
c*(θ = 0.5)    0.0171   0.0085    0.0089    0.0041    0.0042    0.0044    0.0020    0.0017
c*(θ = 1.0)    0.0167   0.0083    0.0085    0.0040    0.0041    0.0043    0.0020    0.0016
c*(θ = 1.5)    0.0165   0.0082    0.0084    0.0040    0.0041    0.0042    0.0020    0.0016
c*(θ = 2.0)    0.0163   0.0081    0.0083    0.0040    0.0040    0.0041    0.0020    0.0016
c*(θ = 2.5)    0.0162   0.0081    0.0082    0.0040    0.0040    0.0041    0.0020    0.0016
c*(θ = 3.0)    0.0160   0.0080    0.0082    0.0039    0.0040    0.0040    0.0020    0.0016
c*(θ = 4.0)    0.0159   0.0079    0.0081    0.0039    0.0039    0.0040    0.0019    0.0016

Table 2. Empirical thresholds c0.05,n and c*0.05,m,n(θ) for test statistic (5) with weight (10).

(m, n):        (50,50)  (100,100) (200,100) (100,200) (200,200) (400,200) (200,400) (500,500)
c0.05,n        0.2166   0.1881    0.1881    0.1694    0.1694    0.1694    0.1515    0.1456
c*(θ = 0.5)    0.1770   0.1593    0.1567    0.1459    0.1394    0.1346    0.1309    0.1190
c*(θ = 1.0)    0.2031   0.1825    0.1818    0.1684    0.1622    0.1603    0.1496    0.1380
c*(θ = 1.5)    0.2126   0.1987    0.1978    0.1839    0.1831    0.1766    0.1683    0.1608
c*(θ = 2.0)    0.2139   0.2001    0.1952    0.1893    0.1822    0.1774    0.1743    0.1598
c*(θ = 2.5)    0.2101   0.1989    0.1965    0.1870    0.1812    0.1755    0.1713    0.1578
c*(θ = 3.0)    0.2065   0.1964    0.1970    0.1828    0.1797    0.1773    0.1702    0.1584
c*(θ = 4.0)    0.1957   0.1934    0.1951    0.1805    0.1806    0.1750    0.1686    0.1576

Table 3. Empirical thresholds c0.05,n and c*0.05,m,n(θ) for test statistic (5) with weight (11).

(m, n):        (50,50)  (100,100) (200,100) (100,200) (200,200) (400,200) (200,400) (500,500)
c0.05,n        0.2284   0.1932    0.1932    0.1717    0.1717    0.1717    0.1525    0.1464
c*(θ = 0.5)    0.1875   0.1639    0.1612    0.1480    0.1414    0.1365    0.1318    0.1196
c*(θ = 1.0)    0.2150   0.1876    0.1869    0.1709    0.1645    0.1626    0.1506    0.1388
c*(θ = 1.5)    0.2247   0.2046    0.2034    0.1865    0.1856    0.1792    0.1695    0.1616
c*(θ = 2.0)    0.2263   0.2056    0.2007    0.1919    0.1847    0.1799    0.1756    0.1607
c*(θ = 2.5)    0.2223   0.2045    0.2020    0.1896    0.1838    0.1781    0.1725    0.1587
c*(θ = 3.0)    0.2187   0.2021    0.2026    0.1854    0.1822    0.1798    0.1714    0.1592
c*(θ = 4.0)    0.2077   0.1992    0.2008    0.1832    0.1832    0.1775    0.1699    0.1585

Table 4. Empirical thresholds c0.05,n and c*0.05,m,n(θ) for test statistic (5) with weight (12).

(m, n):        (50,50)  (100,100) (200,100) (100,200) (200,200) (400,200) (200,400) (500,500)
c0.05,n        0.0193   0.0096    0.0096    0.0048    0.0048    0.0048    0.0024    0.0019
c*(θ = 0.5)    0.0149   0.0074    0.0084    0.0033    0.0037    0.0042    0.0017    0.0015
c*(θ = 1.0)    0.0127   0.0065    0.0075    0.0029    0.0034    0.0038    0.0015    0.0013
c*(θ = 1.5)    0.0115   0.0061    0.0070    0.0028    0.0031    0.0035    0.0014    0.0013
c*(θ = 2.0)    0.0105   0.0058    0.0066    0.0026    0.0030    0.0034    0.0014    0.0012
c*(θ = 2.5)    0.0099   0.0054    0.0063    0.0024    0.0028    0.0032    0.0013    0.0012
c*(θ = 3.0)    0.0091   0.0051    0.0060    0.0023    0.0027    0.0031    0.0012    0.0011
c*(θ = 4.0)    0.0079   0.0046    0.0056    0.0020    0.0025    0.0029    0.0011    0.0011

5.2. Empirical size and power: uncensored case

Since the cut-off c*0.05,m,n(2) for each of the weight functions (10), (11) and (13) was determined after considering several combinations of exponential distributions for the two samples, we now check the empirical size by considering some other combinations of distributions under the null hypothesis. Let W(γ, δ) denote the Weibull distribution having hazard rate γδt^(δ−1).

Table 5. Empirical thresholds c0.05,n and c*0.05,m,n(θ) for test statistic (5) with weight (13).

(m, n):        (50,50)  (100,100) (200,100) (100,200) (200,200) (400,200) (200,400) (500,500)
c0.05,n        1.2852   1.4339    1.4339    1.5697    1.5697    1.5697    1.7134    1.7285
c*(θ = 0.5)    0.5687   0.5859    0.6316    0.5665    0.5983    0.6438    0.5812    0.6158
c*(θ = 1.0)    0.9818   1.0756    1.2069    1.0868    1.2004    1.2914    1.1605    1.3070
c*(θ = 1.5)    1.1753   1.3853    1.4429    1.4568    1.5526    1.5912    1.6348    1.7678
c*(θ = 2.0)    1.2482   1.4296    1.4438    1.5698    1.6054    1.5923    1.7405    1.7628
c*(θ = 2.5)    1.2203   1.4383    1.4375    1.5853    1.6004    1.6086    1.7099    1.7514
c*(θ = 3.0)    1.2091   1.3923    1.4336    1.5424    1.5829    1.5952    1.6942    1.7503
c*(θ = 4.0)    1.1778   1.4102    1.4248    1.5439    1.5515    1.5770    1.6802    1.7264

Table 6. Empirical size of test (8) with weights (10) and (13) under PH models with distributions W(1, 2) and W(θ, 2). Entries are percent significant at the 0.05 level.

Weight wj^(2):
(m, n):    (50,50)  (100,100) (200,100) (100,200) (200,200) (400,200) (200,400) (500,500)
θ = 1.0    2.8      1.8       2.2       1.4       1.3       1.4       0.7       0.8
θ = 1.5    4.7      4.8       5.4       4.0       5.1       5.0       3.5       5.3
θ = 2.0    4.9      5.0       5.0       4.7       5.0       4.5       4.8       4.5
θ = 2.5    4.5      4.8       5.0       4.3       4.9       4.9       4.6       4.5
θ = 3.0    3.9      4.4       5.1       4.1       4.7       4.6       4.2       4.4
θ = 5.0    1.4      3.4       5.1       2.7       4.8       5.0       3.8       5.0

Weight wj^(5):
(m, n):    (50,50)  (100,100) (200,100) (100,200) (200,200) (400,200) (200,400) (500,500)
θ = 1.0    0.5      0.4       0.9       0.1       0.2       0.7       0.0       0.2
θ = 1.5    3.3      3.9       4.8       3.2       4.3       5.4       3.3       5.2
θ = 2.0    4.4      5.1       5.1       5.0       4.8       4.9       4.8       4.5
θ = 2.5    4.5      4.7       4.8       4.8       4.9       4.8       4.4       5.0
θ = 3.0    4.5      4.8       4.8       4.8       4.7       4.8       4.4       4.7
θ = 5.0    3.5      4.1       4.5       4.0       4.6       4.7       4.2       4.8

In our first experiment, we choose W(1, 2) and W(θ, 2) as the distributions for the first and second samples, respectively. Here, the hazard ratio is constant and is equal to θ. Note that if the samples are transformed by the cumulative hazard function of the first distribution, which is a strictly increasing function, Zm remains unchanged, and therefore the transformed samples would produce the same value of the test statistic KSDn. The transformed version of the first sample may be regarded as a sample from the unit exponential distribution. If the two samples have proportional hazards with λY/λX = θ, the transformed version of the second sample may be regarded as a sample from the exponential distribution with mean 1/θ. Therefore, for all combinations of FX and FY with λY/λX = θ, the distribution of the test statistic is the same. In this experiment, the values of θ are chosen in the range [1, 5] and the results are given in Table 6. The reported quantity is the empirical size, expressed as a percentage, as estimated by the fraction of simulation runs rejecting the null hypothesis. The rejection is based on the conservative threshold c*0.05,m,n(2), i.e., the least favorable scenario involving θ = 2, obtained from the simulation results reported in Tables 2, 3 and 5. The reported size in each case is based on 10,000 simulation runs. For the nominal significance level of 5%, this number of runs corresponds to a standard error of [0.05 × (1 − 0.05)/10,000]^(1/2) = 0.0022. Thus, 95% of the empirical sizes reported in Table 6 should be less than 0.05 + 1.645 × 0.0022, that is, 5.36%. All the empirical sizes are indeed found to be smaller than this threshold. Further, the empirical size is found to be smaller than the nominal size when θ is away from the worst case value, 2 (as expected). For weight function (11), the pattern of empirical sizes is almost identical to that for weight function (10), and the results are omitted for brevity.

In our second experiment, we choose W(1, 1) and W(γ, δ) as the distributions for the first and second samples, respectively, where δ is chosen as 1.2, 1.4, 1.6, 1.8 and 2, and γ is chosen to ensure that the two distributions have the same median. Note that in this case, the hazard ratio is γδt^(δ−1), which is an increasing function of t. The results are summarized in Table 7, in terms of the percentage of rejections of the null hypothesis, for weights (10) and (13). It is seen that for both weight functions, the empirically observed size is smaller than the nominal size. For the test with weight wj^(2), the size becomes smaller as δ moves away from 1, i.e., as the hazard ratio moves away from the PH model. The results of the test corresponding to weight (11) are very similar to those of the test with weight function (10), and accordingly, they are not reported in the table for the sake of brevity.

In our third experiment, we study the power properties of the tests for the decreasing hazard ratio (DHR) alternative, obtained through the distributions W(1, 1) and W(γ, δ). The parameter δ is varied in steps from 0.8 to 0.3, producing progressive departure from the PH assumption (i.e., departure from the null hypothesis). The parameter γ is adjusted so that W(γ, δ) has the same median as W(1, 1).
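The invariance argument of the first experiment is easy to check numerically. A minimal sketch (assuming NumPy; not code from the paper): the cumulative hazard of W(1, 2) is ΛX(t) = t², so transforming W(1, 2) and W(θ, 2) samples by ΛX yields, respectively, unit exponential samples and exponential samples with mean 1/θ.

```python
import numpy as np

# NumPy's Generator.weibull(a) draws from survival exp(-t^a), i.e. W(1, a).
# Dividing a W(1, 2) draw by sqrt(theta) gives survival exp(-theta * t^2),
# i.e. W(theta, 2). Squaring (applying Lambda_X(t) = t^2) should then give
# unit exponential and mean-1/theta exponential samples, respectively.
rng = np.random.default_rng(1)
theta = 2.0
x = rng.weibull(2.0, size=100_000)                    # first sample, W(1, 2)
y = rng.weibull(2.0, size=100_000) / np.sqrt(theta)   # second sample, W(theta, 2)
print(np.mean(x**2), np.mean(y**2))                   # close to 1 and to 1/theta
```

Any statistic computed from the transformed values therefore depends on the PH pair only through the hazard ratio θ, which is what justifies simulating the cutoffs from exponential pairs.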

Table 7. Empirical size of test (8) with weights (10) and (13) under IHR models with equi-median distributions W(1, 1) and W(γ, δ). Entries are percent significant at the 0.05 level.

Weight wj^(2):
(m, n):    (50,50)  (100,100) (200,100) (100,200) (200,200) (400,200) (200,400) (500,500)
δ = 1.2    0.7      0.3       0.3       0.1       0.1       0.0       0.0       0.0
δ = 1.4    0.2      0.0       0.0       0.0       0.0       0.0       0.0       0.0
δ = 1.6    0.1      0.0       0.0       0.0       0.0       0.0       0.0       0.0
δ = 1.8    0.0      0.0       0.0       0.0       0.0       0.0       0.0       0.0
δ = 2.0    0.0      0.0       0.0       0.0       0.0       0.0       0.0       0.0

Weight wj^(5):
(m, n):    (50,50)  (100,100) (200,100) (100,200) (200,200) (400,200) (200,400) (500,500)
δ = 1.2    1.1      1.2       1.8       0.8       1.4       1.9       0.9       1.8
δ = 1.4    1.0      1.2       1.2       1.1       0.9       1.2       1.1       0.8
δ = 1.6    0.6      0.8       0.6       0.7       0.4       0.6       0.5       0.6
δ = 1.8    0.4      0.4       0.2       0.3       0.2       0.3       0.3       0.2
δ = 2.0    0.3      0.2       0.2       0.2       0.2       0.2       0.2       0.2

Table 8. Empirical power of test (8) with weight (10) under DHR models with equi-median distributions W(1, 1) and W(γ, δ). Entries are percent significant at the 0.05 level.

Weight wj^(2):
(m, n):    (50,50)  (100,100) (200,100) (100,200) (200,200) (400,200) (200,400) (500,500)
δ = 0.8    11.7     16.0      22.0      16.0      24.2      31.4      26.9      46.6
δ = 0.7    22.5     35.9      49.2      42.6      59.5      73.5      69.6      92.9
δ = 0.6    39.5     64.0      78.9      75.1      90.5      97.4      96.4      99.9
δ = 0.5    58.6     87.2      95.5      95.4      99.4      99.9      100.0     100.0
δ = 0.4    77.4     97.4      99.4      99.8      100.0     100.0     100.0     100.0
δ = 0.3    88.6     99.8      100.0     100.0     100.0     100.0     100.0     100.0

Table 8 shows the empirical power (reported as a percentage, as before) of the test with weight wj^(2) given by (10). For this test, the empirical power increases with decreasing δ, as one moves further away from the null hypothesis. The increase in power is faster when the sample size is larger. Powers of the test with weight wj^(3) given by (11) are not reported here, as the pattern is similar to that of the test with weight wj^(2). The test with weight wj^(5) given by (13) is found to have much smaller power (results not reported in the table), and consequently, we drop this weight function from our subsequent study.

7

5.3. Empirical size and power: censored case

4 5

8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

To generate randomly right censored data with different censoring fractions, we choose the Weibull and the exponential distributions as the lifetime and censoring distributions, respectively, for each sample. We use weight functions (10) and (11) in test (8), since the corresponding tests were found to have reasonable power as well as dependable size (discussed in the foregoing subsection) in the case of complete samples. To study the empirical size of the tests in the presence of censoring, we consider the pairs of distributions W (1, 2) and W (θ, 2) with θ = 1, 2, 3 and 4. The parameter of the censoring (exponential) distribution for each sample is adjusted to produce a specified censoring fraction, i.e., a theoretically specified probability φ that a particular observation is censored. We use equal censoring fractions of 0.1, 0.25 and 0.50 in each sample, and also unequal censoring fractions of 0.75 and 0.25 in the first and second sample, respectively. As for the decision threshold, we adopt a conservative approach: we select thresholds for complete data with smaller sample sizes, since a lower sample size involves a higher threshold (see Tables 2 and 3). If φ1 and φ2 are the censoring fractions in the first and second samples of sizes m and n, respectively, the expected numbers of complete data points in the two samples are [m(1 − φ1)] and [n(1 − φ2)], respectively. We have used complete data thresholds for this combination of sample sizes in order to calculate the empirical size of the censored data tests. The simulation results (not reported here for the sake of brevity) indicate that the empirical sizes of the tests with weights wj^(2) and wj^(3) are smaller than the nominal size, for all sample sizes and for every pattern of censoring fractions in the two samples. In particular, the sizes are much smaller than the nominal size for all sample sizes when the censoring fractions in both samples are at least 0.25. We investigate the empirical power of the tests by simulating censored samples from DHR models. Survival times for the first and second samples are generated from the distributions W (1, 1) and W (γ , δ), respectively, with δ = 0.7, 0.5 and 0.3, and the corresponding values of γ are chosen so that both lifetime distributions have the same median. The parameter of the censoring distribution (taken as exponential) for each sample is chosen so as to achieve a specified censoring fraction. The determination of thresholds and the pattern of censoring fractions in the two samples remain exactly the same as in the study of empirical size.
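For Weibull lifetimes there is no closed form for the exponential censoring rate that achieves a given censoring fraction, but the calibration reduces to one-dimensional root finding: with T having survival exp(−γ t^δ) and C ~ Exp(λ), the censoring fraction P(C < T) = ∫ λ e^(−λc) exp(−γ c^δ) dc is increasing in λ. A hypothetical sketch of this calibration, assuming SciPy is available (the function names are illustrative, not from the paper):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def censoring_fraction(lam, gamma, delta):
    """P(C < T) when T has survival exp(-gamma * t**delta) and C ~ Exp(lam):
    integral over c of lam * exp(-lam*c) * S_T(c)."""
    val, _ = quad(lambda c: lam * np.exp(-lam * c - gamma * c**delta), 0.0, np.inf)
    return val

def calibrate_rate(phi, gamma, delta):
    """Find lam so that the theoretical censoring fraction equals phi.
    The fraction is increasing in lam, so a wide bracket suffices."""
    return brentq(lambda lam: censoring_fraction(lam, gamma, delta) - phi, 1e-8, 1e4)

# e.g. a 25% censoring fraction for W(1, 2) lifetimes
lam = calibrate_rate(0.25, 1.0, 2.0)
print(round(censoring_fraction(lam, 1.0, 2.0), 3))  # ~0.25 by construction
```

As a sanity check, for exponential lifetimes (γ = δ = 1) and target fraction 0.5, the calibrated rate is 1, since two independent unit exponentials are censored with probability one half.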


Table 9
Empirical power of test (8) with weight (10) for censored data from DHR models with equi-median distributions W (1, 1) and W (γ , δ). Entries are percentages significant at the 0.05 level, for sample sizes (m, n).

φ1    φ2    δ     (50,50)  (100,100)  (200,100)  (100,200)  (200,200)  (400,200)  (200,400)  (500,500)
0.10  0.10  0.7    15.6     21.2       31.5       26.1       41.8       54.3       52.6       81.8
0.10  0.10  0.5    47.4     76.2       88.1       88.4       97.8       99.5       99.8      100.0
0.10  0.10  0.3    84.2     99.0       99.8      100.0      100.0      100.0      100.0      100.0
0.25  0.25  0.7     5.7      6.1       10.0        8.0       11.5       19.8       17.6       45.3
0.25  0.25  0.5    28.4     47.1       63.3       65.9       80.0       93.4       94.7       99.9
0.25  0.25  0.3    71.3     95.0       98.3       99.5      100.0      100.0      100.0      100.0
0.50  0.50  0.7     0.1      0.0        0.1        0.0        0.0        0.1        0.0        0.0
0.50  0.50  0.5     1.6      1.5        2.1        2.2        3.6        6.8        6.1       24.4
0.50  0.50  0.3    11.6     28.4       34.5       52.4       67.2       81.6       92.6       99.8
0.75  0.25  0.7     6.0      2.7        1.9        3.4        1.9        2.7        2.4        4.9
0.75  0.25  0.5    20.2     18.0       15.4       25.6       20.3       26.1       26.7       44.4
0.75  0.25  0.3    52.0     65.3       68.1       80.0       82.5       91.5       92.0       99.6
Table 10
p-values of the various tests for the GCD and PBC data sets.

            Tests of PH                          Tests of IHR
Data set    TGL       TPL       Uc              KSDmn, weight (10)   KSDmn, weight (11)
GCD         <0.0001   <0.0001   0.004           0.597                0.603
PBC         <0.0001   <0.0001   <0.0001         0.021                0.024
Table 9 summarizes the empirical power of the test corresponding to weight function (10) for censored data, reported as the percentage of rejections of the null hypothesis. As expected, the power of the test increases with decreasing δ, that is, as the underlying model moves further away from the worst case null model (the PH model). The power also increases with the sample sizes, and decreases gradually with increasing amount of censoring. Very similar empirical performance is observed for the test with weight function (11) (results not reported here) for all sample sizes and all censoring situations. Additional simulation studies, in which the censoring time for each sample follows a uniform distribution, give similar results to those for exponential censoring; these results are omitted for the sake of brevity.

6. Data analysis

In this section, we illustrate the utility of the proposed tests by analyzing two real data sets. The first data set consists of the survival times (in days) of gastric cancer patients in a clinical trial that compared chemotherapy and chemotherapy combined with radiation therapy in the treatment of locally advanced unresectable gastric carcinoma. The trial was conducted by the Gastrointestinal Tumor Study Group (Schein et al., 1982); 45 patients were randomized to each of the two treatment groups. There are two cases of censoring in the first group and six in the second. The data have been analyzed by Stablein and Koutrouvelis (1985) and by Klein and Moeschberger (2003); these studies reveal a crossing of the hazards of the two groups. We refer to this data set as the gastric cancer data (GCD). The second data example concerns patients with primary biliary cirrhosis (PBC) of the liver.
The Mayo Clinic conducted a randomized clinical trial of PBC patients between January 1974 and May 1984, to compare the drug D-penicillamine (DPCA) with a placebo. There were 312 participants for whom complete information was available; of these randomized patients, 125 had died by July 1986. The data set is given, with detailed analysis, in Fleming and Harrington (1991, Appendix D.1). While DPCA produced no significant improvement over the placebo, the covariate protime (prothrombin time) was found to have a bearing on the survival time. Graphical checks indicated a non-monotone hazard ratio over time for groups of patients stratified by this covariate (see Fleming and Harrington, 1991; Therneau and Grambsch, 2000). For the present analysis, we use the threshold 11 for protime to split the 312 subjects into two groups with approximately equal numbers of observed deaths. Group 1, with protime less than or equal to 11, has 229 subjects including 66 deaths, while Group 2, with protime greater than 11, has 83 subjects, of whom 59 were observed to die. The p-values of three analytical tests of the PH assumption, all of which are consistent against the IHR alternative, are reported in Table 10. The test statistics TGL and TPL are those proposed by Gill and Schumacher (1987), with the Gehan versus log-rank and the Prentice versus log-rank weight functions, respectively. The statistic Uc is the U-statistic proposed by Deshpande and Sengupta (1995), with the standard deviation used for scaling determined from 5000 simulation runs. The table shows that the one-sided p-values of all three tests indicate rejection of the PH hypothesis for both data sets.


Fig. 1. Plots of RTF, its GCM and 95% lower acceptance bands of GCM, computed with two weight functions, for the GCD data.

Fig. 2. Plots of RTF, its GCM and 95% lower acceptance bands of GCM, computed with two weight functions, for the PBC data.
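The GCM curves shown in Figs. 1 and 2 can be computed directly: the GCM of a piecewise linear curve such as the empirical RTF, evaluated at its jump points, is the lower convex hull of those points. Setting aside the weight functions and acceptance bands of the actual procedure, a minimal sketch of the GCM and of the supremum-distance separation between the curve and its GCM might be:

```python
import numpy as np

def gcm(x, y):
    """Greatest convex minorant of points (x_i, y_i), x strictly increasing:
    the lower convex hull of the points, evaluated back at each x_i."""
    hull = []
    for xi, yi in zip(x, y):
        hull.append((xi, yi))
        # drop the middle point whenever the last three fail convexity,
        # i.e. slope(p0, p1) >= slope(p1, p2)
        while len(hull) >= 3:
            (x0, y0), (x1, y1), (x2, y2) = hull[-3:]
            if (y1 - y0) * (x2 - x1) >= (y2 - y1) * (x1 - x0):
                del hull[-2]
            else:
                break
    hx, hy = zip(*hull)
    return np.interp(x, hx, hy)  # linear interpolation along the hull

x = np.linspace(0, 1, 6)
y = np.array([0.0, 0.5, 0.4, 0.9, 0.7, 1.0])
g = gcm(x, y)
print(np.max(y - g))  # KS-type separation between the curve and its GCM
```

The separation max(y − g) is zero exactly when the curve is already convex, which is the ideal shape of the RTF under the IHR null hypothesis.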


The last two columns of Table 10 show the p-values of the proposed test (8) with weight functions (10) and (11). The p-values are computed by simulation: for each data set, simulated values of KSDn(Zm) were generated from uncensored samples drawn from exponential distributions with rate parameters 1 and θ for the first and second groups, respectively. For reasons explained in Section 5.1, the cut-off was chosen as c*_{0.05,m,n}(2) for both weight functions (10) and (11). The sample sizes of the simulated data were selected to match the numbers of uncensored observations in the actual data set. The proposed test with each weight function clearly indicates acceptance of the IHR hypothesis for the GCD data and rejection of that hypothesis for the PBC data. This conclusion complements the findings of the existing analytical tests, which reject the PH hypothesis but cannot distinguish between the IHR and non-IHR situations. As a follow-up study, we use the graphical test described in Section 4. The test is based on the function Λmn, an empirical version of the relative trend function (see the discussion in the second paragraph of Section 3); for simplicity, we refer to this function as the RTF in the present discussion. Fig. 1 shows, for the GCD data, the plot of the RTF, its GCM and two lower acceptance bands of the GCM computed at level α = 0.05 by using the weight functions wj^(2) and wj^(3) given by (10) and (11), respectively. The latter curves are referred to as lower acceptance bands 2 and 3, respectively. Fig. 2 shows the corresponding plot for the PBC data. The GCM for the GCD data stays well above the lower acceptance bands throughout. For the PBC data, however, the GCM crosses the acceptance bands. For both data sets, the two acceptance bands nearly overlap with one


another. Overall, the crossings of the bands occur in a region where the RTF is by and large concave. These are indeed regions of clear departure from the ideal (convex) shape of the RTF under the null hypothesis.

7. Concluding remarks

It can be shown that the IFR property of a distribution corresponds to concavity of its total time on test (TTT) plot (Klefsjö, 1982). This plot, which is confined to the unit square, provides an opportunity to study the difference between the empirical TTT plot and its geometrically constrained counterpart. Tenga and Santner (1984) used this plot to obtain an alternative goodness-of-fit test for the IFR model. A similar test for the IHR model in two samples may be sought. Apart from the test proposed by Tenga and Santner (1984) for the IFR class, Srivastava et al. (2012) provided analogous tests for the IFRA and NBU classes, based on the supremum of the difference between an empirical version of the cumulative hazard function and its (suitably) geometrically constrained version. All these tests can readily be converted to graphical tests by constructing an acceptance band for the constrained estimator, in the manner of the test proposed in Section 4.

Acknowledgments

This research was partially sponsored by the project "Optimization and Reliability Modeling" funded by the Indian Statistical Institute, Kolkata. The authors thank the associate editor and the two anonymous referees for their detailed comments, which led to many improvements of the paper.

References

Begg, B.C., McGlave, P.B., Bennet, J.M., Cassileth, P.A., Oken, M.M., 1984. A critical comparison of allogeneic bone marrow transplantation and conventional chemotherapy as treatment for acute non-lymphocytic leukemia. J. Clin. Oncol. 2, 369–378.
Breslow, N.E., Elder, L., Berger, J., 1984. A two-sample censored-data rank test for acceleration. Biometrics 40, 1049–1062.
Champlin, R., Mitsuyasu, R., Elashoff, R., Gale, R.P., 1983.
Recent advances in bone-marrow transplantation. In: Gale, R.P. (Ed.), UCLA Symposia on Molecular and Cellular Biology, vol. 7. Alan R. Liss, New York, pp. 141–158.
Cox, D.R., 1972. Regression models and life tables (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 34, 187–202.
Dabrowska, D.M., Doksum, K.A., Feduska, N.J., Husing, R., Neville, P., 1992. Methods for comparing cumulative hazard functions in a semi-parametric hazard model. Stat. Med. 11, 1465–1476.
Dabrowska, D.M., Doksum, K.A., Song, J.K., 1989. Graphical comparison of cumulative hazards for two populations. Biometrika 76, 763–773.
Deshpande, J.V., Sengupta, D., 1995. Testing the hypothesis of proportional hazards in two populations. Biometrika 82 (2), 251–261.
Fleming, T.R., Harrington, D.P., 1991. Counting Processes and Survival Analysis. John Wiley, New York.
Gill, R., Schumacher, M., 1987. A simple test of the proportional hazards assumption. Biometrika 74, 289–300.
Kay, R., 1977. Proportional hazards regression models and the analysis of censored survival data. Appl. Stat. 26, 227–237.
Klefsjö, B., 1982. On ageing properties and total time on test transforms. Scand. J. Stat. 9, 37–41.
Klein, J.P., Moeschberger, M.L., 2003. Survival Analysis: Techniques for Censored and Truncated Data. Springer-Verlag, New York.
Lee, L., Pirie, W.R., 1981. A graphical method for comparing trends in series of events. Comm. Statist. Theory Methods 10, 827–848.
Ng'andu, N.H., 1997. An empirical comparison of statistical tests for assessing the proportional hazards assumption of Cox's model. Stat. Med. 16, 611–626.
Pocock, S.J., Gore, S.M., Kerr, G.R., 1982. Long-term survival analysis: The curability of breast cancer. Stat. Med. 1, 93–104.
Sahoo, S., Sengupta, D., 2016. On graphical tests for proportionality of hazards in two samples. Stat. Med. 35 (6), 942–956.
Schein, P.D., Stablein, D.M., Bruckner, H.W., Douglass, H.O.R., Mayer, et al., 1982.
A comparison of combination chemotherapy and combined modality therapy for locally advanced gastric carcinoma. Cancer 49, 1771–1777.
Sengupta, D., Bhattacharjee, A., Rajeev, B., 1998. Testing for the proportionality of hazards in two samples against the increasing cumulative hazard ratio alternative. Scand. J. Stat. 25, 637–647.
Sengupta, D., Deshpande, J.V., 1994. Some results on the relative aging of two life distributions. J. Appl. Probab. 31, 991–1003.
Shorack, G.R., Wellner, J.A., 1986. Empirical Processes with Applications to Statistics. John Wiley, New York.
Srivastava, R., Li, P., Sengupta, D., 2012. Testing for membership to the IFRA and the NBU classes of distributions. In: Lawrence, N., Girolami, M. (Eds.), Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. JMLR, vol. 22, pp. 1099–1107.
Stablein, D.M., Koutrouvelis, I.A., 1985. A two-sample test sensitive to crossing hazards in uncensored and singly censored data. Biometrics 41, 643–652.
Tenga, R., Santner, T.J., 1984. Testing goodness of fit to the increasing failure rate family. Nav. Res. Logist. 31, 617–630.
Therneau, T.M., Grambsch, P.M., 2000. Modelling Survival Data: Extending the Cox Model. Springer-Verlag, New York.
Wei, L.J., 1984. Testing goodness of fit for proportional hazards model with censored observations. J. Amer. Statist. Assoc. 79, 649–652.
