Testing current status data for dependent censoring

Statistics & Probability Letters 48 (2000) 213–216

Daniel Rabinowitz
Department of Statistics, Columbia University, Mail Code 4403, Broadway and 120th St., New York, NY 10027, USA

Received September 1998; received in revised form July 1999

Abstract

An approach to testing current status data for dependence between the examination times and the event times is presented. The approach is based on a rank statistic that detects decreasing trends, as a function of the examination time, in the probability that the event occurs before the examination. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Cross-sectional data; Non-parametric maximum likelihood; Survival analysis

This work was supported by grant GM55978 from the National Institute of General Medical Sciences.
0167-7152/00/$ - see front matter. PII: S0167-7152(99)00202-3

1. Introduction

Current status data occur when the times at which study subjects experience an event are not observed exactly; instead, each subject undergoes an examination at which it is determined whether the event has yet occurred. A common assumption underlying most methods of analyzing current status data is that subjects' event times and examination times are independent (or conditionally independent given available covariates). See, for example, Groeneboom and Wellner (1992) for the one-sample problem and Klein and Spady (1993) or Finkelstein (1986) for regression problems.

When the examination times and the event times are independent, the conditional probability, given the examination time, that a subject's event occurs before the subject's examination is the marginal distribution function of the event time evaluated at the subject's examination time. When the examination times are not independent of the event times, the indicators of whether the events occur before the examinations do not have conditional expectations equal to the marginal distribution function, and estimates of the marginal distribution function that rely on that equality can be biased. The purpose here is to present an approach to detecting situations where relying on the equality leads to bias.

When the marginal distribution functions of the event times and the examination times are unspecified, the sole restriction imposed on the observed data by the null hypothesis is that the conditional probability that the event precedes the examination is a non-decreasing function of the examination time. The foundation for the approach presented here is a test statistic that reflects local decreasing trends, as a function of the
examination time, in the conditional probability. Not every form of dependence between the examination times and the event times results in local decreasing trends. Absent decreasing trends, however, the distribution of the observed data is compatible with the null hypothesis.

Previous authors have considered dependence between censoring times and event times with right-censored data. See, for example, Rotnitzky and Robins (1995) or Williams and Lagakos (1976). Methods that use auxiliary covariates to adjust for dependent censoring with interval-censored and current status data have also been developed. See, for example, van der Laan and Hubbard (1997).

In the next section, notation for current status data and the non-parametric maximum likelihood estimator are presented. In the third section, the test statistic is presented and an approach to computing p-values is discussed. The approach to computing p-values involves conditioning on the non-parametric maximum likelihood estimator of the marginal distribution of the event times.

2. Notation

Let $n$ denote the number of subjects in the sample. For the $i$th subject, let $X_i$ denote the examination time and let $Y_i$ denote the indicator that the event time occurs before the examination time. The $n$ pairs of event and examination times are assumed independent and identically distributed. Let $F$ denote the marginal distribution function of the event times. With current status data, only the $X_i$ and $Y_i$ are observed. The null hypothesis leaves unspecified the marginal distribution of the examination times and the marginal distribution of the event times, but specifies that the examination times and the event times are independent.

Let $\hat{F}$ denote the non-parametric maximum likelihood estimate of $F$. The estimate is a non-decreasing right-continuous step function whose jump points are a subset of the examination times. See, for example, Groeneboom and Wellner (1992) or Ayer et al. (1955).
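As an illustration of how such an estimate might be computed, the following minimal sketch applies the pool-adjacent-violators algorithm, a standard method for isotonic fits, to the indicators sorted by examination time. The function name `npmle_blocks`, the assumption of distinct examination times, and the convention of merging blocks with equal averages are choices made here for illustration and are not taken from the paper.

```python
def npmle_blocks(x, y):
    """Pool adjacent violators on the indicators ordered by examination time.

    x: examination times (assumed distinct); y: 0/1 indicators.
    Returns a list of (indices, average) pairs, one per maximal run of
    examinations on which the fitted step function is constant. Indices
    refer to positions in x and y and are listed in order of examination time.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each entry: [indices, sum of indicators, count]
    for i in order:
        blocks.append([[i], y[i], 1])
        # Merge while the previous block's average is >= the last block's
        # average (compared by cross-multiplication to avoid division).
        while (len(blocks) > 1
               and blocks[-2][1] * blocks[-1][2] >= blocks[-1][1] * blocks[-2][2]):
            idx, s, c = blocks.pop()
            blocks[-1][0].extend(idx)
            blocks[-1][1] += s
            blocks[-1][2] += c
    return [(idx, s / c) for idx, s, c in blocks]


x = [1, 2, 3, 4, 5, 6]
y = [0, 1, 0, 1, 1, 0]
fit = npmle_blocks(x, y)
```

For these six hypothetical observations the sketch returns three blocks, with averages 0, 1/2 and 2/3. Because blocks with equal averages are merged, the averages are strictly increasing from one block to the next.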
Let $M$ denote the number of jump points and, for $j$ from 1 to $M$, let $L_j$ denote the $j$th ordered jump point. Adopt the convention that $L_0$ is $-\infty$ and $L_{M+1}$ is $\infty$. Let $n_j$ denote the number of examinations that fall in the interval $[L_{j-1}, L_j)$. Then, for $t$ in the interval $[L_{j-1}, L_j)$, $\hat{F}(t)$ is the interval-specific average $\sum_{i: X_i \in [L_{j-1}, L_j)} Y_i / n_j$.

The jump points are characterized by two properties: first, the interval-specific averages are increasing in the examination times; and second, for any division of an interval into two sub-intervals, the interval-specific averages in the sub-intervals are not increasing (that is, the average in the earlier sub-interval is at least as large as the average in the later one). The second property plays an important role in the approach to the computation of p-values described in the next section.

Finally, for $j$ from 1 to $M$, for $i$ such that $X_i \in [L_{j-1}, L_j)$, let $r(i)$ denote the rank of the $i$th subject's examination time in inverse order among the examination times in $[L_{j-1}, L_j)$. That is, if $X_i$ is the first examination time in the interval, then $r(i)$ is $n_j$, and if $X_i$ is the last examination time in the interval, then $r(i)$ is 1.

3. Tests

Since the sole restriction imposed on the observed data by the null hypothesis is that the conditional probability that the event precedes the examination is a non-decreasing function of the examination time, tests for dependence should be sensitive to settings where the conditional expectations are not a non-decreasing function of the examination times. In this section a rank-based statistic is presented that may be used in a given interval to detect a decreasing trend in the conditional expectation. It is proposed that the rank-based statistic be computed for each of the intervals where $\hat{F}$ is constant and that the results of the computations be summed to form a test statistic.

The statistic advocated here for testing for decreasing trends in a given interval is essentially the Wilcoxon statistic. See, for example, Randles and Wolfe (1991). To compute the statistic, the subjects with examination
times in the given interval are first classified according to whether or not their event times precede their examination times. Then, the subjects are ranked in inverse order of their examination times. Finally, the sum of the inverse ranks over the subjects with event times preceding examination is calculated. A decreasing trend in the conditional expectation of the $Y_i$ is indicated by a predominance of subjects whose event time precedes examination among the subjects with higher inverse ranks. That is, large values of the rank-based statistic are evidence for dependence between the examination times and the event times.

Only in the situation where the conditional expectation is decreasing over the whole range of the examination times would computing the rank-based statistic over the whole range be most efficacious. In a situation with an overall increasing trend in the conditional expectation, but with sub-intervals with decreasing trends, it would be advantageous to apply the rank-based statistic separately within the sub-intervals. The sum of the values of the statistics from the sub-intervals could then be used as a test statistic.

A systematic approach to choosing the sub-intervals may be based on the non-parametric maximum likelihood estimate of the distribution of the event times. The intervals where $\hat{F}$ is constant, the $[L_{j-1}, L_j)$, are the largest intervals where the trend in the $Y_i$ is decreasing. This suggests the $[L_{j-1}, L_j)$ for $j$ from 1 to $M$ as a data-driven choice of sub-intervals. The sum of the rank-based test statistics for this choice may be written as

$$\sum_{j=1}^{M} \sum_{i: X_i \in [L_{j-1}, L_j)} r(i)\, Y_i.$$

Approximate (conditional) p-values for the rank-based statistic may be calculated by conditioning on $\hat{F}$. The joint conditional distribution of the $Y_i$ for subjects with examinations in an interval $[L_{j-1}, L_j)$ depends on $F$, modulo a multiplicative constant, evaluated at the subjects' examination times in the interval. Although the distribution function $F$ is not known and therefore cannot be used to compute the conditional distribution function, it may be estimated. The non-parametric maximum likelihood estimator is constant on the intervals, suggesting that $F$ be estimated as constant there also. For distribution functions that are constant on the intervals, the conditional distribution is generated by assigning zeros and ones to the indicators so as to maintain equality of their sum to the observed sum and so as to satisfy the second of the two conditions that characterize the jump points of $\hat{F}$.

The characterization of the approximate conditional distribution of the indicators may be used in Monte Carlo simulations in order to compute p-values. An easily programmed approach to carrying out the simulations is, in each iteration and in each interval, to repeatedly assign at random the appropriate number of the indicators to be one and the remainder to be zero, until the second of the two conditions that characterize the jump points is satisfied.
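The statistic and the simulation scheme described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the author's implementation: the input `blocks` is assumed to hold, for each interval on which the estimate is constant, the indicators in order of examination time; all function names are hypothetical; and the p-value is taken to be the plain proportion of simulated statistics at least as large as the observed one, a convention the paper does not specify.

```python
import random


def rank_statistic(blocks):
    """Sum over blocks of r(i) * Y_i, where r(i) is the inverse rank."""
    total = 0
    for ys in blocks:  # ys: 0/1 indicators in order of examination time
        n_j = len(ys)
        # The first examination in the block has inverse rank n_j, the last 1.
        total += sum((n_j - k) * y for k, y in enumerate(ys))
    return total


def satisfies_split_condition(ys):
    """True if, for every division of the block into two sub-intervals,
    the earlier average is at least as large as the later average."""
    n, total, prefix = len(ys), sum(ys), 0
    for k in range(1, n):
        prefix += ys[k - 1]
        # prefix / k >= (total - prefix) / (n - k), cross-multiplied
        if prefix * (n - k) < (total - prefix) * k:
            return False
    return True


def simulate_block(ys, rng):
    """Shuffle the indicators until the split condition is satisfied."""
    ys = list(ys)
    while True:
        rng.shuffle(ys)
        if satisfies_split_condition(ys):
            return ys


def monte_carlo_p_value(blocks, n_sim=2000, seed=0):
    """Proportion of simulated statistics >= the observed statistic."""
    rng = random.Random(seed)
    observed = rank_statistic(blocks)
    hits = sum(
        rank_statistic([simulate_block(ys, rng) for ys in blocks]) >= observed
        for _ in range(n_sim)
    )
    return hits / n_sim


observed = rank_statistic([[0], [1, 0], [1, 1, 0]])
p_value = monte_carlo_p_value([[1, 1, 0, 0]])
```

For the hypothetical blocks `[[0], [1, 0], [1, 1, 0]]` the statistic is 0 + 2 + 5 = 7. For the single block `[1, 1, 0, 0]`, only the arrangements `[1, 1, 0, 0]` and `[1, 0, 1, 0]` satisfy the split condition, so the simulated p-value should be close to one half. The rejection step always terminates in principle because the observed arrangement itself satisfies the condition, though it may be slow for long blocks.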

4. Discussion

There is a graphical interpretation of the rank-based statistic that provides intuition into the sensitivity of the test statistic to decreasing trends in the conditional expectation of the $Y_i$ given the $X_i$. For $i$ from 1 to $n$, let $Y_{(i)}$ denote the indicator of whether or not the event time precedes the examination time for the subject whose examination time is the $i$th order statistic among the examinations. Let $S(t)$ denote the increasing step function on $[1, n)$ that makes a unit jump at each $i$ where $Y_{(i)}$ is equal to 1. For an interval $[L_{j-1}, L_j)$ where $\hat{F}$ is constant, let $i_1$ and $i_2$ denote the ranks of the examinations that make up the lower and upper endpoints of the interval, respectively. Then, the contribution of the interval to the test statistic is the area between the graph of $S(t)$ and the horizontal line at level $S(i_1)$ over the interval from $i_1$ to $i_2$. The function $S(t)$ is described by Groeneboom and Wellner (1992) in the context of computing $\hat{F}$.
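The area interpretation can be checked numerically with a small sketch. The paper does not spell out the treatment of the endpoints, so the code below makes two assumptions: the baseline is the value of the step function just before rank $i_1$, and each rank contributes a column of unit width. The function names and the example data are hypothetical.

```python
def interval_contribution(y_sorted, i1, i2):
    """Sum of inverse ranks over subjects with Y = 1 at ranks i1..i2 (1-based)."""
    n_j = i2 - i1 + 1
    return sum((n_j - k) * y_sorted[i1 - 1 + k] for k in range(n_j))


def area_above_baseline(y_sorted, i1, i2):
    """Area between the cumulative count S and its level just before i1,
    accumulated in unit-width columns over ranks i1..i2."""
    S = [0]
    for y in y_sorted:
        S.append(S[-1] + y)  # S[k] = number of ones among the first k ranks
    baseline = S[i1 - 1]
    return sum(S[k] - baseline for k in range(i1, i2 + 1))


y_sorted = [0, 1, 0, 1, 1, 0]  # indicators in order of examination time
contribution = interval_contribution(y_sorted, 4, 6)
area = area_above_baseline(y_sorted, 4, 6)
```

Under these conventions the two quantities agree, because a unit jump at rank $k$ adds one unit of height to every column from $k$ to $i_2$, which is $i_2 - k + 1$ columns, the inverse rank of that subject. For ranks 4 to 6 of the example both are equal to 5.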

If the subjects with examination times in $[L_{j-1}, L_j)$ and with event times preceding their examinations are interspersed evenly among all of the subjects with examination times in $[L_{j-1}, L_j)$, then the graph of $S(t)$ on the interval would approximate the line segment that connects $(i_1, S(i_1))$ to $(i_2, S(i_2))$. However, if the subjects for whom the event time precedes the examination time were over-represented among the subjects with the earlier examination times, then the graph of $S(t)$ would have the same end-points as the segment, but would be roughly concave. When $S(t)$ is concave, the area and therefore the rank statistic would be larger. In this way, the rank statistic is sensitive to locally decreasing trends in the $Y_{(i)}$.

The Monte Carlo approach to computing p-values relies on the assumption that $F$ is only negligibly non-constant on the intervals where $\hat{F}$ is constant. When the assumption is incorrect, since $F$ is a distribution function and hence non-decreasing, outcomes in which the subjects with earlier examination times have event times preceding their examinations would be more common in the simulations than under the null hypothesis. The over-represented outcomes correspond to larger values of the test statistic. It follows that the p-values that result from the simulations would be conservative.

The approach here is based on detecting regions where the conditional expectation is decreasing. Although it appears that no other kind of non-parametric information is available, it is certainly possible for there to be dependent censoring in the absence of any regions where the conditional expectation is decreasing. It would seem that, in such settings, without auxiliary information, no approach could have power to detect the dependence.

References

Ayer, M., Brunk, H.D., Ewing, G.M., Reid, W.T., Silverman, E., 1955. An empirical distribution function for sampling with incomplete information. Ann. Math. Statist. 26, 641–647.
Finkelstein, D.M., 1986. A proportional hazards model for interval-censored failure time data. Biometrics 42, 845–854.
Groeneboom, P., Wellner, J.A., 1992. Information Bounds and Nonparametric Maximum Likelihood Estimation. Birkhäuser, Basel.
Klein, R.W., Spady, R.H., 1993. An efficient semiparametric estimator for binary response models. Econometrica 61, 387–421.
van der Laan, M.J., Hubbard, A., 1997. Estimation with interval censored data and covariates. Lifetime Data Anal. 3, 77–91.
Randles, R.H., Wolfe, D.A., 1991. Introduction to the Theory of Nonparametric Statistics. Krieger Publishing Company, Malabar, FL.
Rotnitzky, A., Robins, J.M., 1995. Semiparametric regression estimation in the presence of dependent censoring. Biometrika 82, 805–820.
Williams, J.S., Lagakos, S.W., 1976. Independent and dependent censoring mechanisms. Proceedings of the International Biometric Conference 9, 408–428.