Statistical Considerations in Radiation Oncology


CHAPTER 12

DEFINITION OF STATISTICAL TERMS

Mean and Mode

• The mean (average) is the arithmetic sum of observations, divided by the number of observations.
• The mode is the value that occurs most frequently among a set of numbers. There may be multiple modes in a sample. Bimodal refers to a distribution of data that cluster around two distinct values. The incidence of certain tumors, such as craniopharyngioma, is bimodal with respect to age at onset (i.e., with peaks in children and in older adults).

Sample

Inferential statistics relies on identifying a representative and manageable group of research subjects (a sample) from which quantitative attributes of a larger population can be inferred. A sample, then, should ideally mirror the research population of interest. A sampling frame is the accessible population for research, which may differ from the idealized population of interest. Bias is likely to contaminate research in which the sampling frame differs substantially from the desired population.

Variance

• Variance is a statistic on the squared unit scale that measures the dispersion of a data set. Small variances indicate predictable, repeatable experiments, whereas large variances indicate scattered, unexpected outcomes with repeated sampling.
• The sample variance is calculated by the following [1]:

$$S^2 = \frac{\sum (X - M)^2}{N - 1}$$

where X is each observation in the sample, M is the mean of the sample, and N is the number of observations in the sample.

Copyright © 2019 Elsevier Inc. All rights reserved.

Fundamentals of Radiation Oncology https://doi.org/10.1016/B978-0-12-814128-1.00012-X

Standard Deviation

• Standard deviation is the square root of the variance, as shown below [1,2]:

$$S = \sqrt{\frac{\sum (X - M)^2}{N - 1}}$$

• The standard deviation is on the same unit scale as the sample. Like variance, small standard deviations indicate that the data are repeatable and predictable when the experiment is conducted again.

Confidence Intervals

The confidence interval (CI) is an estimated set of boundaries that likely include the population mean for a specified degree of probability. CIs are generally reported at the 95% level, that is, upon repeated sampling, 95% of the bounds calculated will include the population mean. The CI is calculated by the following [2]:

$$95\%\ \mathrm{CI} = M - 1.96 \times \frac{S}{\sqrt{N}} \quad \text{to} \quad M + 1.96 \times \frac{S}{\sqrt{N}},$$

where M is the sample mean and S/√N is the sample standard deviation divided by the square root of the sample size N. Note that S/√N is referred to as the standard error of the mean, abbreviated SEM, and is often reported as error bounds. It is common to see “M ± SEM” in descriptive statistical tables in journal articles.
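The definitions of variance, standard deviation, SEM, and the 95% CI given above can be sketched in a few lines of Python. This is only an illustration: the function name `describe` and the five data values are made up, and the interval uses the 1.96 multiplier from the text rather than a small-sample correction.

```python
import math

def describe(sample, z=1.96):
    """Return mean, sample variance, SD, SEM, and the z-based CI for a sample."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(sample) / n
    # Sample variance uses the N - 1 denominator, as in the text.
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    sd = math.sqrt(variance)
    sem = sd / math.sqrt(n)  # standard error of the mean
    ci = (mean - z * sem, mean + z * sem)
    return {"mean": mean, "variance": variance, "sd": sd, "sem": sem, "ci": ci}

# Hypothetical data: five made-up measurements.
stats = describe([2.0, 4.0, 4.0, 6.0, 9.0])
```

For these values the mean is 5 and the sample variance is 7, so the interval is centered on 5 with a half-width of 1.96 × SEM.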

Null and Alternative Hypotheses

In inferential statistics, sample data are extrapolated to the research population. It is rare to observe an entire population. In practice, assumptions are made about the data from populations, including how the values are spread and where the observations tend to occur. Samples allow these assumptions to be refined and made more reasonable. However, to draw conclusions about the population from samples, the scientific method suggests taking an objective approach.

• Hypothesis testing in statistics allows an investigator to decide between two mutually exclusive choices based only on data. By crafting the decision-making as objectively as possible, there is less risk of bias or incorrect conclusions. The two mutually exclusive choices are called the null hypothesis and the alternative hypothesis.

II. TECHNIQUES AND MODALITIES OF RADIATION ONCOLOGY


• The null hypothesis states that the observed outcomes are due to chance alone; it assumes that there is no association between experimental outcomes beyond what pure randomness can explain.
• The alternative hypothesis states that the observed outcomes are unlikely to be due to chance alone, that is, there may be an association between experimental outcomes.

Statistical inference involves specifying and refining these two choices so that they may be applied directly to the outcomes in a given experiment.

P-value

The P-value may be considered a measure of evidence or likelihood. It is a statistic used for decision-making in inferential statistics. Its utility in interpreting the results of a hypothesis test stems from the idea that it is loosely related to a probability statement about the null hypothesis: it is the probability, computed under the assumption that the null hypothesis is true, of obtaining data at least as extreme as those observed. The P-value is derived from observed data and assumptions made about the population. If the P-value is small (meaning close to zero), one may conclude that there is evidence suggesting that the null hypothesis is unlikely.

Unfortunately, P-values are often misunderstood and misused. The P-value by itself carries no information about the size of an effect. For example, consider a population on which two experiments are performed. A large experiment with a small effect size may have the same P-value as a small experiment with a large effect size in the same population. Some investigators have published in peer-reviewed papers the misinterpretation that P-values between 0.05 and 0.20 constitute a “trend,” although statements like these have no statistical basis.
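The point that a large experiment with a small effect can share a P-value with a small experiment with a large effect can be illustrated numerically. The sketch below assumes a one-sample z-test with a known population standard deviation, which is a simplification chosen for the example; the function names and all numbers are hypothetical.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided P-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def one_sample_z(effect, sigma, n):
    """z statistic for a mean difference `effect`, known SD `sigma`, n subjects."""
    return effect * math.sqrt(n) / sigma

# Hypothetical numbers: same population SD, very different designs.
z_large_trial = one_sample_z(effect=0.1, sigma=1.0, n=400)  # small effect, many subjects
z_small_trial = one_sample_z(effect=1.0, sigma=1.0, n=4)    # large effect, few subjects
p_large_trial = two_sided_p_from_z(z_large_trial)
p_small_trial = two_sided_p_from_z(z_small_trial)
```

Both designs give a z statistic of 2 and therefore the same P-value, even though the effect sizes differ tenfold.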

Type I (α) and Type II (β) Error

• Type I error occurs when the null hypothesis is true yet rejected.
• Type II error occurs when the null hypothesis is false yet not rejected.
• Minimizing the chances of either error is ideal. Large sample sizes and efficient study designs directly reduce type II errors.
• In general, a value of 0.05 is chosen for type I (α) error and a value of 0.2 or less is chosen for type II (β) error.

Power

The power of a study is the probability of rejecting the null hypothesis when it is indeed false. Power is defined as 1 − β. Biomedical studies with power less than 80% are of questionable value with regard to statistical significance.
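To make the relationship between α, β, and sample size concrete, the following sketch uses the common normal approximation for a two-sided comparison of two means with equal arms and a known standard deviation. That design, the function names, and the effect size of 0.5 standard deviations are assumptions made for illustration; a real trial design would normally be worked out with a statistician.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, SD 1

def power_two_sample(delta, sigma, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test (equal arms, known SD)."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(delta) / (sigma * math.sqrt(2.0 / n_per_arm))
    # Probability of landing in the upper rejection region; the tiny
    # contribution of the opposite tail is ignored.
    return _STD_NORMAL.cdf(shift - z_crit)

def n_per_arm_for_power(delta, sigma, power=0.80, alpha=0.05):
    """Smallest per-arm sample size reaching the requested power (same approximation)."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Hypothetical design: detect a difference of half a standard deviation.
n_needed = n_per_arm_for_power(delta=0.5, sigma=1.0)
achieved = power_two_sample(delta=0.5, sigma=1.0, n_per_arm=n_needed)
```

Under these assumptions, about 63 subjects per arm are needed for 80% power at α = 0.05; halving the detectable difference would roughly quadruple that number.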

Parametric and Nonparametric Methods

A parametric method refers to the use of statistical models that explicitly assume mathematical properties (such as the shape function, location, and dispersion) of the population of interest. On the other hand, when one uses a nonparametric method, fewer assumptions regarding the population’s mathematical structure are made. A familiar example of a parametric test is the t-test. The t-test analyzes differences between the means of two samples and assumes that the populations have a Gaussian (normal) distribution. The t-test has a nonparametric analog: the Wilcoxon test also analyzes differences between two samples, but it does not require that the data be sampled from a Gaussian distribution.

Using a parametric test for data that are not sampled from the assumed distribution results in biased tests, and those results (especially the P-value) should be considered questionable or uninterpretable. Although nonparametric tests require larger sample sizes to retain the same inferential power as parametric tests, they are more robust and less biased than an inappropriately applied parametric test or method. With the advent of computers, nonparametric methods are gaining popularity in biomedical research and should be considered when such methods are available.

Sensitivity, Specificity, and Predictive Value

The sensitivity is the proportion of patients with the disease who are correctly identified by a diagnostic test. The specificity is the proportion of patients without the disease who are correctly identified by a diagnostic test. The positive predictive value is the proportion of patients with a positive test result who have the disease. The negative predictive value is the proportion of patients with a negative test result who do not have the disease. True positives and negatives are determined by an irrefutable standard. Tests with less than or equal to 50% sensitivity are generally not useful because they have the same or less utility than the random flip of a coin. A 2 × 2 table (Table 12.1) of a diagnostic test result is used to calculate the sensitivity, specificity, and predictive values.
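The four quantities defined above follow directly from the cell counts of the 2 × 2 table. A minimal sketch, using the cell labels a, b, c, and d from Table 12.1; the function name and the counts are hypothetical:

```python
def diagnostic_metrics(a, b, c, d):
    """Compute test characteristics from 2 x 2 counts laid out as in Table 12.1.

    a: true positives, b: false positives, c: false negatives, d: true negatives.
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "ppv": a / (a + b),
        "npv": d / (c + d),
    }

# Hypothetical counts, not taken from any study.
metrics = diagnostic_metrics(a=90, b=30, c=10, d=170)
```

With these counts the sensitivity is 0.90 and the specificity is 0.85, while the positive predictive value is only 0.75 because the false positives are numerous relative to the true positives.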

TABLE 12.1 A 2 × 2 Table Showing Definitions of Sensitivity, Specificity, and Predictive Value

| Diagnostic Test Result | Disease Present | Disease Absent |
|---|---|---|
| Positive | a (true positive) | b (false positive) |
| Negative | c (false negative) | d (true negative) |

$$\text{Sensitivity} = \frac{a}{a + c}$$

$$\text{Specificity} = \frac{d}{b + d}$$

$$\text{Positive predictive value} = \frac{a}{a + b}$$

$$\text{Negative predictive value} = \frac{d}{d + c}$$

Clinical Trials: Phase I, II, III, IV

Phase I trials use the preclinical (i.e., animal study) dose estimates. In a phase I trial, the first patients are enrolled with the objective of estimating a tolerable dose that is not overly toxic in humans. A common design for a phase I trial is the “3 + 3” dose escalation scheme, wherein cohorts of three patients are assigned to ever-increasing doses until excessive toxicity is observed. The sample size for a phase I trial is generally less than 20 subjects.

Phase II trials use the dosing scheme derived from the phase I study. A phase II trial assesses the agent’s efficacy in humans. Phase II studies may be blinded, comparative, randomized trials, or single-agent open-label trials. Phase II objectives are to assess efficacy in a small number of patients and to prepare for a pivotal trial, if justified. In oncology, it is common to use response rate as an endpoint for phase II trials, particularly for diseases with multiyear expected survival. The sample size for a phase II trial is often between 40 and 100 subjects.

Phase III trials are large, randomized pivotal trials in which a new therapeutic agent is compared with an accepted standard, either alone or in combination. Subjects are randomized to one of the arms, often with randomization that depends on their geographic location and their disease characteristics. These trials are generally multicenter studies with endpoints such as overall survival or progression-free survival. Randomization and day-to-day trial administration are handled by a single center or a cooperative group, such as the Radiation Therapy Oncology Group (RTOG). The sample size for a phase III trial is often well over 300 subjects. An incomplete list of currently open RTOG trials is given in Table 12.2 [3].

Phase IV trials are used for long-term surveillance of a drug or a device after it has regulatory approval for public use. These trials are also known as postmarketing or confirmatory trials. Phase IV trials typically run for 2 years or more to detect any long-term adverse effects of the drug or device.
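The “3 + 3” scheme mentioned above for phase I trials is usually described with a simple decision rule per dose level. The sketch below encodes the conventional version of that rule (escalate after 0 of 3 toxicities, expand to six patients after 1 of 3, stop after two or more); the text does not spell these thresholds out, so they are an assumption here, and individual protocols vary. Function names and the example data are hypothetical.

```python
def three_plus_three_decision(n_treated, n_dlt):
    """Decide the next step for one dose level under the conventional 3 + 3 rule.

    n_treated: patients treated at this dose so far (3 or 6).
    n_dlt: how many of them had a dose-limiting toxicity (DLT).
    """
    if n_treated == 3:
        if n_dlt == 0:
            return "escalate"
        if n_dlt == 1:
            return "expand"  # enroll three more patients at the same dose
        return "stop"        # two or more DLTs: dose is too toxic
    if n_treated == 6:
        return "escalate" if n_dlt <= 1 else "stop"
    raise ValueError("3 + 3 cohorts are evaluated after 3 or 6 patients")

def run_escalation(dlt_counts_by_dose):
    """Walk through dose levels given observed DLT counts.

    dlt_counts_by_dose: list of (dlts_in_first_3, dlts_in_next_3_or_None).
    Returns the index of the highest dose level that passed, or None.
    """
    highest_tolerated = None
    for level, (first, second) in enumerate(dlt_counts_by_dose):
        decision = three_plus_three_decision(3, first)
        if decision == "expand":
            if second is None:
                break  # expansion cohort not yet observed
            decision = three_plus_three_decision(6, first + second)
        if decision != "escalate":
            break
        highest_tolerated = level
    return highest_tolerated

# Hypothetical trial: level 0 clean, level 1 expanded and passed, level 2 too toxic.
mtd_index = run_escalation([(0, None), (1, 0), (2, None)])
```

In this made-up example the second dose level (index 1) is the highest level that passed, using 12 patients in total, which is consistent with the small sample sizes typical of phase I trials.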

Survival Analysis

• Life table method: This is a table of the proportion of patients surviving over time. This method is useful for data when the exact times of death are not known, although it is rarely used outside of epidemiological research.
• Kaplan-Meier method: This is a plot of the cumulative probability of survival of patients, in which the survival estimates are recalculated whenever there is an event. It accommodates censoring more efficiently and intuitively than the life table method.

Retrospective Chart Reviews

Retrospective chart reviews assemble data from patients who have finished their course of treatment and are useful for quality control and hypothesis-generating studies. They are also useful when a randomized prospective trial is too expensive or not feasible, but they are prone to selection bias, confounding, and sampling errors. Many investigators do not recognize their results as being sufficient evidence to affect clinical practice.

Meta-analyses

Meta-analyses are formal statistical models for pooling randomized studies to reach a combined effect size. Meta-analyses have the advantage of providing estimates based on large sample sizes but are also prone to selection bias, heterogeneity among studies, and publication bias.


TABLE 12.2 A List of Currently Open RTOG Trials

| Study | Name | Status | Phase |
|---|---|---|---|
| 0538 | CALGB 30610/Endorsed Study: Phase III Comparison of Thoracic Radiotherapy Regimens in Patients with Limited Small Cell Lung Cancer Also Receiving Cisplatin and Etoposide | Open | III |
| 0631 | Phase II/III Study of Image-Guided Radiosurgery/SBRT for Localized Spine Metastasis - RTOG CCOP Study | Open | II/III |
| 0724 | Phase III Randomized Study of Concurrent Chemotherapy and Pelvic Radiation Therapy with or without Adjuvant Chemotherapy in High-Risk Patients with Early-Stage Cervical Carcinoma Following Radical Hysterectomy | Open | III |
| 0848 | A Phase IIR and A Phase III Trial Evaluating Both Erlotinib (Ph IIR) And Chemoradiation (Ph III) As Adjuvant Treatment For Patients With Resected Head Of Pancreas Adenocarcinoma | Open | II/III |
| 0920 | A Phase III Study of Postoperative Radiation Therapy (IMRT) +/- Cetuximab for Locally-Advanced Resected Head and Neck Cancer | Open | III |
| 0926 | A Phase II Protocol for Patients with Stage T1 Bladder Cancer to Evaluate Selective Bladder Preserving Treatment by Radiation Therapy Concurrent with Radiosensitizing Chemotherapy Following a Thorough Transurethral Surgical Re-Staging | Open | II |
| 0924 | Androgen Deprivation Therapy and High Dose Radiotherapy With or Without Whole-Pelvic Radiotherapy in Unfavorable Intermediate or Favorable High Risk Prostate Cancer: A Phase III Randomized Trial | Open | III |
| 0973 | GOG-0238/Endorsed Study: “A Randomized Trial of Pelvic Irradiation With or Without Concurrent Weekly Cisplatin in Patients With Pelvic-Only Recurrence of Carcinoma of the Uterine Corpus” | Open | II R |
| 0974 | NSABP B-43/Endorsed Study: “A Phase III Clinical Trial Comparing Trastuzumab Given Concurrently with Radiation Therapy and Radiation Therapy Alone for Women with HER2-Positive Ductal Carcinoma In Situ Resected by Lumpectomy” | Open | III |
| 1008 | A Randomized Phase II/Phase III Study of Adjuvant Concurrent Radiation and Chemotherapy versus Radiation Alone in Resected High-Risk Malignant Salivary Gland Tumors | Open | II R |
| 1071 | NCCTG N0577/Endorsed Study: Phase III Intergroup Study of Radiotherapy versus Temozolomide Alone versus Radiotherapy with Concomitant and Adjuvant Temozolomide for Patients with 1p/19q Codeleted Anaplastic Glioma | Open | III |
| 1073 | GOG-0258/Endorsed Study: “A Randomized Phase III Trial of Cisplatin and Tumor Volume Directed Irradiation Followed by Carboplatin and Paclitaxel vs. Carboplatin and Paclitaxel for Optimally Debulked, Advanced Endometrial Carcinoma” | Open | III |
| 1112 | Randomized Phase III Study of Sorafenib versus Stereotactic Body Radiation Therapy followed by Sorafenib in Hepatocellular Carcinoma | Open | III |
| 1119 | Phase II Randomized Study of Whole Brain Radiotherapy/Stereotactic Radiosurgery in Combination With Concurrent Lapatinib in Patients With Brain Metastasis From HER2-Positive Breast Cancer: A Collaborative Study of RTOG and KROG | Open | II R |
| 1171 | GOG-0263/Endorsed Study: “Randomized Phase III Clinical Trial of Adjuvant Radiation Versus Chemoradiation in Intermediate Risk, Stage I/IIA Cervical Cancer Treated With Initial Radical Hysterectomy and Pelvic Lymphadenectomy” | Open | III |
| 1172 | COG AEWS1031/Endorsed Study: “A Phase III Randomized Trial of Adding Vincristine-topotecan-cyclophosphamide to Standard Chemotherapy in Initial Treatment of Non-metastatic Ewing Sarcoma” | Open | III |
| 1173 | ECOG E2108/Endorsed Study: “A Randomized Phase III Trial of the Value of Early Local Therapy for the Intact Primary Tumor in Patients with Metastatic Breast Cancer” | Open | III |
| 1175 | CALGB 80803/Endorsed Study: Randomized Phase II Trial of PET Scan-Directed Combined Modality Therapy in Esophageal Cancer | Open | II R |
| 1270 | NCCTG N107C/Endorsed Study: A Phase III Trial of Post-Surgical Stereotactic Radiosurgery (SRS) Compared With Whole Brain Radiotherapy (WBRT) for Resected Metastatic Brain Disease | Open | III |
| 1271 | N1048/Endorsed Study: A Phase II/III trial of Neoadjuvant FOLFOX, with Selective Use of Combined Modality Chemoradiation versus Preoperative Combined Modality Chemoradiation for Locally Advanced Rectal Cancer Patients Undergoing Low Anterior Resection with Total Mesorectal Excision | Open | II/III |
| 1272 | NSABP B-47/Endorsed Study: A Randomized Phase III Trial of Adjuvant Therapy Comparing Chemotherapy Alone (Six Cycles of Docetaxel Plus Cyclophosphamide or Four Cycles of Doxorubicin Plus Cyclophosphamide Followed by Weekly Paclitaxel) to Chemotherapy Plus Trastuzumab in Women with Node-Positive or High-Risk Node-Negative HER2-Low Invasive Breast Cancer | Open | III |
| 1304 | NSABP B-51/Endorsed Study: A Randomized Phase III Clinical Trial Evaluating Post-Mastectomy Chestwall and Regional Nodal XRT and Post-Lumpectomy Regional Nodal XRT in Patients with Positive Axillary Nodes Before Neoadjuvant Chemotherapy who Convert to Pathologically Negative Axillary Nodes After Neoadjuvant Chemotherapy | Open | III |
| HN001 | Randomized Phase II and Phase III Studies of Individualized Treatment for Nasopharyngeal Carcinoma Based on Biomarker Epstein Barr Virus (EBV) Deoxyribonucleic Acid (DNA) | Open | II/III |
| 1306 | A Randomized Phase II Study of Individualized Combined Modality Therapy for Stage III Non-Small Cell Lung Cancer (NSCLC) | Open | II R |
| 1308 | Phase III Randomized Trial Comparing Overall Survival After Photon Versus Proton Chemoradiotherapy for Inoperable Stage II-IIIB NSCLC | Open | III |
| BR001 | A Phase 1 Study of Stereotactic Body Radiotherapy (SBRT) for the Treatment of Multiple Metastases | Open | I |
| GI001 | Randomized Phase III Study of Focal Radiation Therapy for Unresectable, Localized Intrahepatic Cholangiocarcinoma | Open | III |
| BN002 | Phase I Study of Ipilimumab, Nivolumab, and the Combination in Patients With Newly Diagnosed Glioblastoma | Open | I |
| BN001 | Randomized Phase II Trial of Hypofractionated Dose-Escalated Photon IMRT or Proton Beam Therapy Versus Conventional Photon Irradiation With Concomitant and Adjuvant Temozolomide in Patients With Newly Diagnosed Glioblastoma | Open | II R |
| 1470 | Alliance A071101/Endorsed Study: A Phase II Randomized Trial Comparing the Efficacy of Heat Shock Protein-Peptide Complex-96 (HSPPC-96) (NSC #725085, ALLIANCE IND# 15380) Vaccine Given With Bevacizumab Versus Bevacizumab Alone in the Treatment of Surgically Resective Recurrent Glioblastoma Multiforme (GBM) | Open | II R |
| 1471 | SWOG S1400/Endorsed Study: Phase II/III Biomarker-Driven Master Protocol for Second Line Therapy of Squamous Cell Lung Cancer | Open | II/III |

MODEL FITTING

Significance and Tests of Significance

In fitting a model to data, not only are the parameter values estimated, but it is also important to determine whether the parameters should be included in the model at all. This is accomplished by testing the hypothesis that, for the parameter β, β = 0 (vs. β ≠ 0). This tests whether the presence of β in the model contributes additional predictive utility. Based on the estimated value of β and its variance, it is relatively straightforward to calculate the probability that the true value of β lies within a given range. If 1 − α is the amount of assurance or “confidence,” expressed as a probability, that β is in a certain range that does not include zero, then the probability that the true value of β is zero is less than α, and the above hypothesis is rejected with a confidence of 1 − α. The shorthand for this is usually written β (p < α). Very small values of p are not particularly useful because they are very sensitive to the assumed distribution of the parameter.

The convention for statistical significance has traditionally been α = 5% or 0.05. This is the cutoff generally used by regulatory agencies, and the 0.05 cutoff appears in the International Conference on Harmonization E9 document, Statistical Principles for Clinical Trials [4], which governs evidence presented for clinical treatment approvals.

In model validation, tests of significance go in the opposite direction, that is, a good fit is not indicated by small P-values. In this case, the P-value is a measure of the probability of obtaining the actual data randomly given the correctness of the model. Thus a small P-value indicates that the observed data would be unlikely if the model were correct, that is, that the model fits poorly.
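The test of β = 0 described above is often carried out by dividing the estimate by its standard error and referring the ratio to a normal distribution. The sketch below assumes that normal (Wald-type) approximation, which is one common choice rather than the only one; the function name and the numerical estimate are hypothetical.

```python
from statistics import NormalDist

def wald_summary(beta_hat, std_err, alpha=0.05):
    """z statistic, two-sided P-value, and (1 - alpha) interval for one coefficient."""
    normal = NormalDist()
    z = beta_hat / std_err
    p_value = 2.0 * (1.0 - normal.cdf(abs(z)))
    half_width = normal.inv_cdf(1.0 - alpha / 2.0) * std_err
    interval = (beta_hat - half_width, beta_hat + half_width)
    return z, p_value, interval

# Hypothetical estimate and standard error.
z, p_value, interval = wald_summary(beta_hat=0.8, std_err=0.3)
excludes_zero = interval[0] > 0.0 or interval[1] < 0.0
```

Under this approximation the two views agree: the 1 − α interval excludes zero exactly when the two-sided P-value is below α.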

Survival Analysis and Its Pitfalls

With statistical software being widely available, it is often possible to apply a procedure to data for which the procedure is totally inappropriate. It is not the misuse that is the ultimate problem, but rather the erroneous conclusions drawn from this misuse and the potential application of these conclusions to clinical decision-making.

• Survival analysis curves are used as an indication of the response to treatment. The simplest version of survival analysis is crude survival. A number of patients are identified as the cohort of interest at a time t = 0. At some later time, the number remaining alive is determined, and the crude survival is simply the number surviving divided by the initial number. Two values of crude survival can be compared as ratios using the chi-squared statistic. Although this may be a legitimate method to compare the two ratios, the comparison itself may be irrelevant if a significant number of patients were lost to follow-up, removed from the study, died of complications of treatment, or died of causes unrelated to the disease being investigated. In that case, the conclusions from the survival comparison are misleading because of the bias inherent in the data.
• Because patients are put on studies throughout the study period and because they may leave the study for many reasons other than dying of disease, it is necessary to account for the effects of less-than-complete follow-up when trying to describe the survival experience of the study cohort.
• Assume that for each patient, one can identify a time, $t_i$, which is either the time of death from the disease being studied (the failure time) or the last time the patient was known to be alive. The latter case includes the following situations: the last time the patient was seen or contacted, or the time the patient died of intercurrent disease or some other unrelated event. These patients are said to be censored at time $t_i$.

There are many possible reasons for censoring, but mathematically they all have the same effect. The most common method of estimating the survival function for a patient population is the Kaplan-Meier or product limit method. Patients are ordered by their survival (failure or censoring) times. Intervals are defined by the failure times only. Thus the ith interval extends from the time $t_{i-1}$, when patient i − 1 failed, to $t_i$, when patient i failed. If the survival at time $t_{i-1}$ was $S_{i-1}$, then the survival at time $t_i$ is

$$S_i = S_{i-1} \times \frac{N_i - d_i}{N_i},$$


where $N_i$ denotes the number of patients alive and under follow-up just before time $t_i$ and $d_i$ is the number whose failure time is $t_i$. The number of patients who are censored between failure times does not directly appear in the calculation, except that these censored patients serve to reduce N from one interval to the next. It is sufficient to understand that any failure will cause the survival curve to drop vertically (parallel to the y-axis). The probability of surviving to any given time is thus the product of the probabilities of surviving all preceding time intervals.

Death from disease is not always the failure event. Other events that can define failure are death from any cause, recurrence of disease, occurrence of metastases, local recurrence, etc. Table 12.3 shows the failure event and the name for the survival associated with that event.

TABLE 12.3 Common Definitions of Time-To-Event-Related Outcomes

| Outcome | Failure Occurs | Censoring Occurs |
|---|---|---|
| Overall survival | Death from any cause | Alive at last contact |
| Disease-specific survival | Death from disease | Death unrelated to the disease or alive at last contact (a) |
| Recurrence-free survival, NED survival, disease-free survival | First evidence of recurrence, or death from disease if recurrence is not documented | Death unrelated to the disease or alive and progression-free at last contact |
| Local recurrence-free failure rate | Local recurrence | Death unrelated to the disease or alive and free of local recurrence at last contact |
| Event-free failure rate | First evidence of “event” of interest, however defined | Death unrelated to the event or alive with no “event” at last contact |

NED, No evidence of disease.
(a) Often, cause of death is completely unknown, as in the Social Security Death Index. In these cases, death may be treated either as an event or as a censoring event. The issue is controversial.

Sometimes the failure event can be complicated to define. Consider the use of biochemical failure as a surrogate for recurrence in prostate cancer. The American Society for Therapeutic Radiology and Oncology (ASTRO) definition of biochemical failure is three consecutive rises in prostate-specific antigen (PSA), with the time of failure defined as the midpoint between the last nonrising PSA and the first rise. No matter how one defines failure time for recurrence, the patient who recurs has never been free of disease, and in that sense, the treatment has failed.

To compare the survival curves for two (or more) groups, the groups must be appropriately similar. Furthermore, within a single group, the patients either fail or are censored. The patients who are censored must have the same expectation of failure as the patients who actually fail. In other words, the patient who is censored at time t must have the same probability of failing as the patient who is followed well beyond time t. Censoring must be a random event, and not one that is correlated in any way with the outcome. If this assumption fails, there may be a statistical bias, and hence the conclusions drawn from inferences may be spurious.
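The product limit calculation described in this section can be sketched as follows. This is a bare illustration: the function name and the six follow-up times are hypothetical, patients censored at a failure time are counted as still at risk at that time (a common convention that the text does not specify), and no confidence limits are computed.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time for each patient (failure or censoring time).
    events: True if the patient failed at that time, False if censored.
    Returns a list of (failure_time, survival_estimate) pairs.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    failure_times = sorted({t for t, failed in zip(times, events) if failed})
    survival = 1.0
    curve = []
    for t in failure_times:
        # N_i: patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # d_i: patients whose failure time is exactly t.
        failures = sum(1 for u, failed in zip(times, events) if failed and u == t)
        survival *= (at_risk - failures) / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical follow-up in months; False marks a censored patient.
curve = kaplan_meier(
    times=[2, 3, 3, 5, 8, 9],
    events=[True, False, True, True, False, True],
)
```

Note how the two censored patients never trigger a step in the curve; they only shrink the number at risk for later failure times, as described above.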

LOGISTIC REGRESSION AND ITS PITFALLS

In radiation oncology, one often wishes to assess the effects of various factors on the outcome of a binary event where time is not an explicit variable. The most popular method for such an assessment is multiple logistic regression.

• In this case the probability P of the event is given by

$$P = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots)}$$

where $(x_1, x_2, \ldots)$ is a vector of parameters putatively related to the event and $(\beta_0, \beta_1, \beta_2, \ldots)$ is a vector of coefficients. The values of $x_i$ can be continuous or discrete. For the case where x is a continuous variable with values from −∞ to ∞, P(x) is a sigmoid function in x.
• If we consider the case where variable $x_j$ takes on only two values, it is possible without loss of generality to assign these values 0 and 1. It is a straightforward exercise to show that

$$\exp(\beta_j) = \frac{P(x_j = 1)/[1 - P(x_j = 1)]}{P(x_j = 0)/[1 - P(x_j = 0)]}.$$

P/(1 − P) is the probability of occurrence divided by the probability of nonoccurrence, or the odds. The expression on the right of the equation is the odds ratio for $x_j$ = 1 compared with $x_j$ = 0. For small values of P, the odds ratio is approximately equal to the relative risk, and like relative risk, odds ratios significantly different from unity indicate that $x_j$ = 1 is significantly associated with the event. Thus, logistic regression may be used to determine whether the presence of a condition ($x_j$ = 1) significantly increases the likelihood of the occurrence of the event.

• For radiation oncology, a typical application would be where a number of discrete variables are assumed to be associated with a binary event. For example, one may be investigating the occurrence of interstitial pneumonitis (IP) in a large cohort of patients undergoing total body irradiation (TBI) as part of the conditioning regimen in hematopoietic stem cell transplant (HCT). Potential variables might include the age of the patient, the dose rate, the type of transplant, etc. These variables could be dichotomized as follows: pediatric ($x_1$ = 1) versus adult ($x_1$ = 0); 10 cGy/min or less ($x_2$ = 1) versus greater than 10 cGy/min ($x_2$ = 0); autologous ($x_3$ = 1) versus allogeneic ($x_3$ = 0). Once the analysis was complete, $e^{\beta_j}$ would give the odds ratio, i.e., it would approximate the relative risk, for pediatric patients, low dose rate, and autologous transplants for j = 1, 2, or 3, respectively.

As stated above, logistic regression should be used on data that are time independent. However, outcome data generally are time dependent. This problem is usually obviated by including in the analysis only patients who survive a certain minimum time and evaluating patients for the endpoint at that minimum time. Thus in the example, one could include only patients who survive 12 months posttransplant and ignore any cases of IP that occur after 12 months. For acute effects, this treatment of the data is not necessary.

The potential pitfalls of multiple logistic regression are many. The first issue that the investigator should address is whether the model fits the data. This is a parametric analysis, in which a mathematical model is being deployed and there are underlying assumptions about the distributions of the parameters in the model. Thus, there are normative procedures for examining how well the model fits (describes) the data [4]. In other words, before the investigator starts thinking about what the odds ratios for the variables in the model mean, she or he should first consider whether the model fits the data. If it does not, further efforts would be a waste of time until the fit is examined further. Consultation with a statistician is important throughout any statistical analysis of clinical data, but it is especially important when it comes to interpretation of goodness-of-fit statistics and diagnostics.

On a very basic level, an error may occur, either by accident or by manipulation, in the defining of dichotomous covariates.
This is often trivial, as in the case of sex or other unambiguous and naturally dichotomous variables. However, searching for just the right cut point at which to dichotomize a continuous variable is potentially problematic. Also, one should keep in mind that the odds ratio will apply to the group for which x = 1 compared with the group for which x = 0. In other words, each group is compared relative to a single standard group, often referred to as the “baseline” group. Therefore, grouping is important but often unstated in the literature. In one example, a relative risk was reported for second malignancies after TBI. However, the x = 0 group was not the group who underwent HCT without TBI but a sibling control group that did not have cancer at all. Clearly this does not speak to the added risk associated with TBI in an HCT setting. Another potential pitfall is the application of logistic regression to a population that is either too small itself or, more likely, that has too few events. Typically, statisticians will quote that there should be approximately 10 events for every parameter that is being estimated. This is an ideal that may not always be achievable, but one should avoid overparameterizing the analysis.
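The logistic model, the odds ratio, and the events-per-parameter guideline discussed in this section can be tied together in a short sketch. The coefficient values below are invented for illustration and are not estimates from any data set; the function names are likewise hypothetical.

```python
import math

def logistic_probability(intercept, coefficients, covariates):
    """P = exp(eta) / (1 + exp(eta)) with eta = b0 + sum(b_j * x_j)."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)

def max_parameters(n_events, events_per_parameter=10):
    """Rule-of-thumb ceiling on the number of estimated parameters."""
    return n_events // events_per_parameter

# Hypothetical coefficients for three dichotomized covariates.
b0, b = -2.0, [0.7, -0.4, 0.2]
p_exposed = logistic_probability(b0, b, [1, 0, 0])    # x1 = 1
p_reference = logistic_probability(b0, b, [0, 0, 0])  # x1 = 0
odds_ratio_x1 = odds(p_exposed) / odds(p_reference)   # should match exp(0.7)
```

The odds ratio recovered from the two probabilities equals the exponential of the first coefficient, and under the quoted guideline a cohort with 45 events would support roughly four estimated parameters.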

In that vein, a reader of the literature should be on the lookout for hidden parameters. In many radiation oncology papers, biological effects are handled by what the authors may feel are standard models, with what they state are reasonable parameters.


For example, a biologically effective dose, $D_{eq}$, may be used to account for fractionation effects using a “standard” α/β ratio for the linear-quadratic model. Then this dose along with other parameters may be included in the logistic regression. Just because the parameters in the model used to determine the biologically effective dose are simply stated and not determined by the statistical procedure does not mean that they do not count as statistical variates.

One should also consider whether an explanatory parameter is time dependent, that is, whether its value may change over time if measured repeatedly. For example, new diagnostic techniques may be introduced that allow for better staging of patients. If stage is significantly related to outcome, then one may experience the “Will Rogers effect” in the analysis of the data. For example, consider patients on the borderline between stage II and stage III of some hypothetical disease. Those patients with slightly more disease, who were previously categorized as stage II, may with better diagnostic tests begin to be categorized as stage III. Presumably, these would have been among the poorer performing stage II patients, but among the better performing stage III patients. Thus, by moving these patients from stage II to stage III based on the better diagnostic test, the survival for both stages would appear to improve even though no overall change actually occurs.

Among the factors that can lead to poor estimates of the coefficients or erroneous confidence limits is “confounding.” The relationship between a variable and the outcome is confounded when that variable is correlated with both the outcome and another variable. For example, if we also included in our example of IP analysis a variable for whether the patient exhibited acute graft-versus-host disease (GVHD), confounding could occur because GVHD is correlated with the type of transplant and possibly with the outcome (IP) as well.
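The stage-migration (“Will Rogers”) effect described above can be verified with simple arithmetic. The following Python sketch uses invented survival times, in months, for a hypothetical disease; none of the numbers come from real patients.

```python
from statistics import mean

# Hypothetical survival times in months (invented for illustration).
stage_ii = [60, 55, 50, 30, 28]
stage_iii = [20, 15, 10]

# Patients restaged from II to III by a more sensitive diagnostic test:
# the poorest performers in stage II, yet better than typical stage III.
migrated = [30, 28]

stage_ii_after = [t for t in stage_ii if t not in migrated]
stage_iii_after = stage_iii + migrated

mean_ii_before, mean_iii_before = mean(stage_ii), mean(stage_iii)
mean_ii_after, mean_iii_after = mean(stage_ii_after), mean(stage_iii_after)

# The overall cohort is unchanged; only the stage labels have moved.
overall_before = mean(stage_ii + stage_iii)
overall_after = mean(stage_ii_after + stage_iii_after)

print("Stage II mean survival: ", mean_ii_before, "->", mean_ii_after)
print("Stage III mean survival:", mean_iii_before, "->", mean_iii_after)
print("Overall mean survival:  ", overall_before, "->", overall_after)
```

In this toy data set the mean survival of each stage rises after restaging, while the mean survival of the whole cohort is identical before and after, because no patient's outcome has changed.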
One should be attentive to any error or warning messages that appear when fitting logistic regression models. Maximum likelihood estimates in logistic regression are typically obtained with the iterative Newton-Raphson method. However, this estimation method can fail when a categorical independent variable has too few observations in any group, which may happen when there is an excessive amount of missing data or when the model is overparameterized. In these cases, many current software packages will report a warning that the logistic regression “failed to converge” or that the estimates are “unstable.” This means that the model should be refitted with fewer explanatory factors or after groups with small numbers of observations are dropped or combined.
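To make the convergence problem concrete, the following is a minimal Python sketch of a Newton-Raphson fit for a logistic model with an intercept and a single dichotomous covariate. It is a teaching sketch under stated assumptions, not the algorithm of any particular statistics package: the function names, the iteration limit, the tolerance, and both data sets are hypothetical.

```python
import math


def fit_logistic_newton(xs, ys, max_iter=25, tol=1e-8):
    """Newton-Raphson fit of logit P(y = 1) = b0 + b1 * x.

    Returns (b0, b1, converged).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            return b0, b1, False  # information matrix is singular
        # Newton step: solve the 2 x 2 linear system explicitly.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            return b0, b1, True
    return b0, b1, False


def make_data(events_x0, n_x0, events_x1, n_x1):
    """Build covariate and outcome lists from a 2 x 2 table."""
    xs = [0] * n_x0 + [1] * n_x1
    ys = ([1] * events_x0 + [0] * (n_x0 - events_x0)
          + [1] * events_x1 + [0] * (n_x1 - events_x1))
    return xs, ys


# Well-behaved hypothetical data: 2/10 events when x = 0, 6/10 when x = 1.
b0, b1, converged = fit_logistic_newton(*make_data(2, 10, 6, 10))
print("converged:", converged, "odds ratio:", round(math.exp(b1), 3))

# Problem data: every patient with x = 1 has the event (no non-events in
# that group), so the coefficient grows at every iteration.
b0_sep, b1_sep, converged_sep = fit_logistic_newton(*make_data(2, 10, 10, 10))
print("converged:", converged_sep, "b1 after 25 iterations:", round(b1_sep, 1))
```

In the second data set one cell of the table is empty, so no finite maximum likelihood estimate exists and the coefficient keeps growing until the iteration limit is reached. Combining or dropping sparse groups, as recommended above, removes the empty cell.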

PROPORTIONAL HAZARDS AND THEIR PITFALLS

A further topic in survival analysis that combines the multivariate aspect of logistic regression with data whose outcome statistic is a failure time is the proportional hazards model.

• This is also frequently termed Cox regression. The hazard function, $h(t)$, is the instantaneous or age-specific failure rate. If $F(t)$ denotes the survival function, then

$h(t) = -\dfrac{dF(t)/dt}{F(t)}.$


The hazard function gives the instantaneous rate of failure at time $t$, given that failure has not occurred before time $t$. In the proportional hazards model, it is assumed that the hazard function can be separated into a baseline hazard function $h_0(t)$, depending only on $t$, and a link function $\varphi$, which depends on observable covariates but not on $t$. Thus

$\varphi = \varphi(b_0 + b_1 x_1 + \cdots)$, and $h(t) = h_0(t)\,\varphi(b, x)$.

The most commonly used form of $\varphi$ is the exponential,

$\varphi = \exp(b_0 + b_1 x_1 + \cdots)$,

which makes $\exp(b_i)$ the hazard ratio, that is, the ratio of the hazard function for $x_i = 1$ compared with $x_i = 0$.

• The same pitfalls exist for proportional hazards analysis as for logistic regression. In addition, the assumption of proportionality of hazards may not apply, in which case the analysis is not valid.
• The most straightforward way of inspecting the assumption of proportionality is to compare the Kaplan-Meier curves. If the curves are parallel and do not cross, that is evidence that the proportional hazards assumptions hold. However, if the curves cross or do not follow an exponential decay function, this indicates that the Cox model’s assumptions do not hold and therefore the parameter estimates are likely biased (Fig. 12.1).
• There are also formal hypothesis tests for the proportional hazards assumption, but these complex tests are best left to a statistician. A statistician may also suggest alternative models to the Cox model when the proportional hazards assumptions are not met.
• It may be known or reasonably suspected that patients having different values of the variables being investigated will be likely to have different outcomes. The most obvious example is in the comparison of two different treatment arms. It is reasonable to suspect that outcomes for different treatments will be different and therefore patients should be randomly allocated to the two arms. However, even within the context of a randomized clinical trial, it is quite easy to achieve a nonrandom allocation.
  1. First, patients who are offered participation in the trial may be somehow different from patients who are not offered participation.
  2. Also, patients who select participation may be different from those who decline. Either of these situations may be called “selection bias.”
  3. Even with a randomized cohort, biases may still occur because of confounding. Confounding variables may exist in the data. That is, there may be factors that are related to outcome but are not evenly distributed between treatment arms. If these factors are unsuspected and not included in the analysis, biased results will certainly occur. Even if they are examined, random fluctuations may make a variable that is merely correlated with the confounding factor appear to be the more significant variable, so that the fitting algorithm fails to identify the confounding factor as the truly significant one.
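As a worked check on the exponential form of the link function introduced earlier in this section, consider two patients who differ only in covariate $x_i$. The baseline hazard and all other terms cancel in the ratio of their hazards, which is why the ratio does not depend on $t$:

```latex
\[
\frac{h(t \mid x_i = 1)}{h(t \mid x_i = 0)}
  = \frac{h_0(t)\,\exp(b_0 + \cdots + b_i \cdot 1 + \cdots)}
         {h_0(t)\,\exp(b_0 + \cdots + b_i \cdot 0 + \cdots)}
  = \exp(b_i)
\]
```

For instance, a hypothetical fitted coefficient of $b_i = 0.693$ would correspond to a hazard ratio of approximately 2 at every time point, which is exactly the proportionality that the model assumes.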


FIGURE 12.1 Proportional hazards assumptions are met on the survival curve in (A) but not met on the survival curve in (B).
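The visual comparison of Kaplan-Meier curves can be supplemented with a crude numerical check. The Python sketch below computes the product-limit (Kaplan-Meier) estimate for two groups and reports whether the estimated curves cross. The function names and the follow-up data are hypothetical, and the crossing check is only an informal screen under these assumptions; it is not one of the formal hypothesis tests for proportionality.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (t, S(t)) at each distinct event time.

    times: follow-up time per patient; events: 1 = failure, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        leaving = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            leaving += 1
            i += 1
        if failures > 0:
            survival *= 1.0 - failures / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def survival_at(curve, t):
    """Value of the step function S(t) described by a Kaplan-Meier curve."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s


def curves_cross(curve_a, curve_b):
    """True if curve A lies above curve B at some event time and below at another."""
    grid = sorted({t for t, _ in curve_a} | {t for t, _ in curve_b})
    signs = set()
    for t in grid:
        diff = survival_at(curve_a, t) - survival_at(curve_b, t)
        if diff > 0:
            signs.add(1)
        elif diff < 0:
            signs.add(-1)
    return len(signs) == 2


# Hypothetical follow-up data in months (illustration only).
arm_a = kaplan_meier([2, 4, 6, 8, 10], [1, 1, 0, 1, 1])
arm_b = kaplan_meier([1, 2, 3, 5, 7], [1, 1, 1, 1, 1])
arm_c = kaplan_meier([1, 1, 1, 9, 10], [1, 1, 1, 1, 1])

print("A versus B cross:", curves_cross(arm_a, arm_b))
print("A versus C cross:", curves_cross(arm_a, arm_c))
```

With so few patients, crossing may reflect nothing more than sampling noise, so a result from a check like this should prompt a conversation with a statistician rather than settle the question.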

RETROSPECTIVE STUDIES AND THEIR PITFALLS

If the above problems exist for properly randomized data, it is easy to see why nonrandom, retrospectively collected data are viewed with such a jaundiced eye. Confounding, false correlations, and selection bias may invalidate all conclusions drawn from the data. A classic example is the survival analysis of prostate cancer patients by treatment: radiation versus surgery. Because of the possibility of incomplete resection of the tumor, most urological surgeons agree that it is inappropriate to operate on patients with advanced disease. These patients, who are poorer performers, would then receive radiation. In addition, only those patients considered surgical candidates would undergo surgery. Clearly, an unbiased analysis is not possible with these data.

METAANALYSIS AND ITS PITFALLS

When one is faced with many randomized trials testing the same hypotheses, a natural instinct is to combine all of the studies to determine their “bottom line.” The field of metaanalysis formalizes the combination of statistics reported in randomized studies. There are many philosophies regarding when it is appropriate to combine studies. One consideration is whether the studies are too qualitatively or quantitatively heterogeneous for a metaanalysis; the consequence of combining heterogeneous studies is that the results are not easily interpretable or, worse, are misleading. For example, if a study reports a result in pediatric patients but the hypothesis of the metaanalysis concerns adults only, it is difficult to justify using a pediatric effect size statistic in a metaanalysis of adult trials.

• One difficulty with metaanalysis is that the raw data are rarely available, and so metaanalyses are performed on summary statistics. In this case, the investigator must decide which statistics and which assumptions are most appropriate for measuring the intended effect size. An example of this dilemma is whether it is appropriate to combine eight studies when some report the raw odds ratio from a case-control design and others report the odds ratio from a logistic regression adjusted for multiple factors. These two types of odds ratio statistics have a similar interpretation, but they are not reasonably combinable in a formal metaanalysis.
• Another difficulty with metaanalysis is selection bias, and one type of selection bias is unique to metaanalysis. If journals tend to publish significant (positive) studies, and an investigator uses the literature to search for studies for a metaanalysis, there is an implicit selection effect that tends to bias the results of the metaanalysis toward significance. This is called “publication bias” and is often difficult to quantify [5].
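To illustrate what combining summary statistics involves, the Python sketch below pools log odds ratios with inverse-variance (fixed-effect) weights and computes Cochran's Q as a simple heterogeneity statistic. The text above does not prescribe a pooling method, so the choice of this common approach is an assumption, and the three study results and the function name are hypothetical.

```python
import math


def pool_fixed_effect(log_odds_ratios, standard_errors):
    """Inverse-variance pooled log odds ratio, its standard error, and Cochran's Q."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, log_odds_ratios)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_odds_ratios))
    return pooled, pooled_se, q


# Hypothetical summary statistics from three studies (illustration only).
odds_ratios = [1.5, 2.0, 0.9]
standard_errors = [0.30, 0.25, 0.40]  # standard errors of the log odds ratios

log_ors = [math.log(value) for value in odds_ratios]
pooled, pooled_se, q = pool_fixed_effect(log_ors, standard_errors)

lower = math.exp(pooled - 1.96 * pooled_se)
upper = math.exp(pooled + 1.96 * pooled_se)
print(f"pooled odds ratio: {math.exp(pooled):.2f} "
      f"(95% CI {lower:.2f} to {upper:.2f})")
print(f"Cochran's Q: {q:.2f} on {len(odds_ratios) - 1} degrees of freedom")
```

A large Q relative to its degrees of freedom signals heterogeneity of the kind discussed above, in which case a single pooled estimate may be misleading. The calculation also presumes that all inputs measure the same quantity, so raw and adjusted odds ratios should not be mixed in one call.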
Metaanalyses that do not consider problems with study heterogeneity and publication bias do not provide evidence of an objective test of hypotheses. Testing hypotheses appropriately is an integral part of applying the scientific method to radiation oncology data. When clinical decision-making is dependent on poorly designed studies or biased analyses, patients may receive suboptimal treatment and the clinical literature may become muddled. Clinicians should collaborate with experienced investigators and statisticians to avoid these errors. The investment of time into understanding these potential problems and the thoughtful application of these statistical methods will result in improved quality in radiation oncology research.


References

1. Petrie A, Sabin C. Medical statistics at a glance. 2nd ed. Cambridge (MA): Blackwell Scientific Publishers Inc.; 2005.
2. Harris M, Taylor G. Medical statistics made easy. 2nd ed. Bloxham, Oxfordshire (UK): Scion Publishing; 2008.
3. International Conference on Harmonisation. Statistical principles for clinical trials (ICH E9). Stat Med 1999;18(15):1905-42.
4. Hosmer DW, Lemeshow S. Applied logistic regression (Wiley Series in probability and statistics). 2nd ed. New York: John Wiley & Sons, Inc.; 2000.
5. Givens GH, Smith DD, Tweedie RL. Publication bias in meta-analysis: a Bayesian data-augmentation approach to account for issues exemplified in the passive smoking debate (with discussion). Stat Sci 1997;12(4):221-50.
