Methodology for quality-of-life assessment: a critical appraisal


Thorac Surg Clin 14 (2004) 305 – 315

Benny Chung-Ying Zee, PhD (a,b,*), Tony S.K. Mok, MD, FRCPC (c)

(a) Centre for Clinical Trials, School of Public Health, Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, NT, Hong Kong SAR, China
(b) Comprehensive Cancer Trials Unit, Department of Clinical Oncology, Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, NT, Hong Kong SAR, China
(c) Department of Clinical Oncology, Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, NT, Hong Kong SAR, China

* Corresponding author. E-mail address: [email protected] (B.C.-Y. Zee).

Technologic advancements in surgery and medicine have transformed diseases such as cancer from a usually fatal disease into a curable illness for some patients and a chronic condition for many more. This change has led to a growing appreciation of the health-related quality of life (HRQOL) of patients diagnosed with cancer and of the quality of care they receive. Knowledge of HRQOL helps primary care providers, specialists, other health care providers, patients, and families to understand and explore their roles in symptom management and to find appropriate means of providing special care for patients throughout the course of cancer. Partly because of this need, academic research on quality of life (QOL) has been active since the 1980s and 1990s. HRQOL is widely considered a multidimensional construct describing how illness and treatment affect patients' ability to function and the potential burden of symptoms from treatment [1–3]. HRQOL has become an important end point in cancer clinical trials, and it has raised many methodologic issues that are areas of active research. This article discusses several of these issues.

One issue in HRQOL research is the definition and clinical meaning of the QOL domains [4,5]. In contrast to symptoms such as pain, dyspnea, and fatigue, which can be clearly defined, HRQOL sometimes refers to abstract constructs, such as social functioning. Instead of measuring depression, HRQOL assesses emotional functioning, which may incorporate a much wider scope of emotional distress due to the disease and treatment. Within the clinical trial context, methods of symptom assessment have been studied extensively, with a long history of methodologic development, and have been accepted by regulatory agencies in the drug approval process for specific symptoms. Some successful examples and future directions of development for other HRQOL constructs are discussed.

Because HRQOL data usually are obtained through patient self-administered questionnaires, missing values are common [6]. The way the data are processed may make a difference in the analysis and may affect the study conclusion. To deal with missing data, most statistical software uses the case-deletion method: cases with missing data are deleted, and standard complete-data methods are then applied. A disadvantage is that case deletion reduces the sample size and produces less efficient and possibly biased estimates, which can lead to erroneous conclusions. This article discusses several commonly used approaches to summarizing QOL scores with missing data.

Finally, even when the HRQOL outcome has been defined clearly and an optimal method of dealing with missing data has been chosen, the way the data are analyzed may affect the conclusion. The difficulty

1547-4127/04/$ – see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/S1547-4127(04)00028-3


encountered in the analysis is due to the longitudinal nature of the HRQOL data coupled with practical issues, such as appointment scheduling. The analysis methods of using growth curve modeling are discussed and compared with the results of the analysis of using a HRQOL response variable to summarize longitudinal HRQOL data [7,8]. Throughout this article, a cancer-specific questionnaire, the European Organization for Research and Treatment of Cancer (EORTC) Core Quality of Life Questionnaire (QLQ-C30) is used for illustration purposes [9]. This is a patient self-administered questionnaire with 30 items. It has been psychometrically validated and used in many studies. Embedded in the questionnaire are dimensions assessing physical functioning, role functioning, emotional functioning, cognitive functioning, social functioning, and overall or global QOL. In addition, there are three symptom domains, including fatigue, nausea and vomiting and pain, and six single items on dyspnea, insomnia, appetite loss, constipation, diarrhea, and financial difficulties. The whole QLQ-C30 questionnaire normally takes about 10 to 15 minutes to complete.

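[Figure 1: bar chart. Vertical axis: Spearman correlation (−0.6 to 1). Horizontal axis: Phys, Role, Emot, Cogn, Glob QOL, Naus, Appe. Bar labels shown in the figure: 0.78, 0.38, −0.01, −0.06, −0.32, −0.27, −0.3.]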

Fig. 1. Spearman correlations between average diary nausea score and selected European Organization for Research and Treatment of Cancer (EORTC) Core Quality of Life Questionnaire (QLQ-C30) domains and symptoms. Phys, physical functioning; Role, role functioning; Emot, emotional functioning; Cogn, cognitive functioning; Glob QOL, global quality of life; Naus, nausea and vomiting; Appe, appetite loss.

Symptom assessment and health-related quality of life

Clinical trials on symptom control have been successful in the past. One example is antiemetic drug development. Pharmaceutical companies and academic institutions have conducted many well-designed studies assessing the antiemetic effect of treatments such as ondansetron and dexamethasone [10–13]. One reason for these successes is that nausea and vomiting, although a subjective response, can be understood easily from a clinical point of view, which facilitates the regulatory approval process. In a study assessing chemotherapy-induced nausea and vomiting [14], 5-day average nausea scores captured by patient diary on a 100-mm visual analogue scale were significantly correlated with the QLQ-C30 nausea and vomiting domain (P < .01), appetite loss (P < .05), physical functioning (P < .05), and global QOL (P < .05) (Fig. 1). The high correlation between the 5-day average of diary nausea scores and the one-time HRQOL assessment at the end of day 5 implies that nausea and vomiting can be assessed at a single time point, although daily assessment has been believed to be more accurate and the daily diary has been a preferred end point in antiemetic studies.

Kaizer et al [11] captured daily measurements of nausea and vomiting for 5 days and a one-time measurement of HRQOL at day 8 after starting chemotherapy. There was a significant difference between the maintenance ondansetron arm and the no-maintenance arm with respect to complete response (59.6% versus 42.1%; P = .012, one-sided test). The 5-day average severity of nausea also was significantly in favor of the ondansetron maintenance arm, with a mean difference of 9.4 mm (P = .002). Similarly, the comparison of the EORTC QLQ-C30 nausea and vomiting domain was significant in favor of the maintenance ondansetron arm (17.8 versus 29.2; P < .001). The benefit in nausea and vomiting for the maintenance arm was not strong enough, however, to translate into a global QOL benefit (53.9 versus 49.4; P = .217).

These observations have been noted repeatedly in other studies. In a phase II, prospective randomized study, patients aged 18 to 75 with stage IIIb/IV, histologically confirmed non–small cell lung carcinoma and bidimensionally measurable disease were randomized to receive either gemcitabine, 1 g/m², on days 1, 8, and 15 and cisplatin, 75 mg/m², on day 15 (GP) or gemcitabine, 1 g/m², on days 1, 8, and 15 and oral etoposide, 50 mg, on days 1 to 14 (GE) [15].


Patient HRQOL was assessed using the EORTC QLQ-C30 at baseline and at the beginning of each cycle. After the first cycle of treatment, nausea and vomiting (5.1 for GE versus 16.7 for GP; P = .004) was significantly worse in the GP arm, and alopecia (52.5 for GE versus 11.5 for GP; P < .001) was significantly worse in the GE arm. Nausea and vomiting as assessed by the common toxicity criteria was not significantly different (38% versus 39% grade 2 or greater; P = .93). These results showed that symptoms measured by a patient self-administered tool such as the QLQ-C30 are more sensitive than the common toxicity criteria. The QOL questionnaire can be used as a valid clinical trial end point, and the amount of information required to show improvement is dramatically reduced compared with a patient diary. Other symptoms of interest in these patients are cough, hemoptysis, sore mouth, and pain. Interventions that could alleviate these symptoms may have a greater impact on global QOL.

Summarizing health-related quality-of-life scores with missing data

The ease of interpretation of the symptom domains within a HRQOL questionnaire and the evidence that a less frequent measurement schedule may be sufficient do not come without a price. The most obvious problem with a patient self-administered questionnaire is missing data. In HRQOL studies, data may be missing because patients inadvertently skip an item or choose not to answer it for a specific reason. Another type of nonresponse arises because patients with poor QOL may be more likely to miss items or may be incapable of answering the whole questionnaire. In a study on palliative therapy for patients with poor-prognosis small cell lung cancer, the proportion of patients completing the QOL assessments was 92% among patients whom physicians rated as normal but only 31% among patients who were confined to bed or a chair [16]. This type of missing data may generate serious bias. To choose an optimal method of handling missing data in a HRQOL study, the authors formally evaluated several imputation approaches through simulation.

Methods of imputation

Six methods were examined in this study: (1) case deletion, (2) subscale mean, (3) subscale mean 50%, (4) item mean, (5) single imputation, and (6) matched item mean.

The most primitive method of handling missing data is the case-deletion method. In this method, a complete data set is generated by deleting all subjects with incomplete items.

The subscale-mean method can be applied to instruments with domains or subscales that have multiple items. The imputed value of a missing item is the average of all the completed items in that subscale for a particular subject. The major characteristic of this method is that it does not depend on the availability of data from other subjects.

The subscale-mean 50% method is similar to the subscale-mean method except that it requires at least 50% of the items within the imputing domain to be answered before an average value can be computed and used as the imputed value.

The item-mean method imputes a subject's missing item by the average of that item over the subjects who answered it. This method does not depend on the availability of data in each domain or subscale; rather, it depends on the availability of data from other subjects and is suitable only for large studies.

The next two methods use sets of similar subjects to model the missing data. The criteria for defining the set of similar subjects are discussed subsequently. The single-imputation method imputes, for the missing item, a randomly selected value from the corresponding item in the set of similar subjects. The matched item-mean method also uses the set of similar subjects but, like the item-mean method, imputes the missing item by the mean of the corresponding items in that set rather than by a randomly selected value.

For the single-imputation and matched item-mean methods, the selection criteria include a series of comparisons between pairs of subjects.
The sum of the absolute differences between all nonmissing items, within each domain, is calculated as follows:

$$0 \le d_{ijk} = \frac{\sum_{l=1}^{N_k} |I_{ikl} - I_{jkl}|}{n_{ik}(a_k - b_k)} \le 1; \quad k = 1, \ldots, 15; \; i = 1, \ldots, n; \; j = 1, \ldots, n; \; i \ne j$$

where $I_{ikl}$ and $I_{jkl}$ are the lth item values of the kth domain for the ith and jth individuals, $n_{ik}$ is the number of nonmissing items in the kth domain for the ith individual, and $a_k$ and $b_k$ are the maximum and minimum of the item value for the kth domain. $N_k$ is the total number of items in the kth domain, and n is the total number of subjects in the study. After the standardized distances ($d_{ijk}$) for each


Table 1. Average deviations for random missing data (N = 177)

Imputation method      10% data missing     20% data missing     30% data missing
                       SSD       PMD        SSD       PMD        SSD       PMD
Case deletion          Nil       .0107      Nil       .0176      Nil       .0701
Subscale mean          .0401     .0004b     .0567b    .0004a     .0855b    .0061b
Subscale mean 50%      .0348a    .0001a     .0559a    .0014b     .0705a    .0071
Item mean              .0610     .0046      .1151     .0024      .1597     .0007a
Single imputation      .0377b    .0039      .0871     .0092      .1004     .0343
Matched item mean      .0703     .0025      .1195     .0113      .1474     .0354

Abbreviations: PMD, population mean deviation; SSD, sum of squared deviation.
a Refers to the smallest p-value among the imputation methods.
b Refers to the second smallest p-value among the imputation methods.

of the domains are calculated, an average is taken, as follows:

$$0 \le D_{ij} = \left( \sum_{k=1}^{15} d_{ijk} \right) / 15 \le 1; \quad i = 1, \ldots, n; \; j = 1, \ldots, n; \; i \ne j$$

where $D_{ij}$ is the average standardized distance (ASD). Based on ASD, a probability (prob = 1 − ASD) was assigned to the compared subject, and this probability represents the chance the subject will be included in the set of similar subjects. Subjects with a small ASD value have a high probability, which represents a higher likelihood of being included in the set of similar subjects.

Simulation study

A data set was used to evaluate the performance of the imputation methods. The data set was obtained

from samples of two symptom control studies (SC.8 and SC.9) of the National Cancer Institute of Canada Clinical Trials Group [11,17]. In this data set, a total of 820 subjects were obtained, and 673 had complete QOL data. A sample of 177 patients and a larger set of 452 patients were chosen from the complete QOL data set for performing the evaluation. Among the patients with complete QOL data, a comparison of the day 8 QOL data between two treatment arms was used. To evaluate the performance of various imputation methods, missing data were generated from the complete data set using random and nonrandom processes. In a workshop on ‘‘Missing Data in Quality of Life Research in Cancer Clinical Trials,’’ most cooperative groups reported that the baseline compliance rates were greater than 90% [18]. The compliance rate while patients were receiving treatment was about 80% and after patients completed treatment was in the 70% range. Based on these generally acceptable ranges of missing data during different phases, the proportion of missing items

Table 2. Average deviations for random missing data (N = 452)

Imputation method      10% data missing     20% data missing     30% data missing
                       SSD       PMD        SSD       PMD        SSD       PMD
Case deletion          Nil       .0114      Nil       .0192      Nil       .0255
Subscale mean          .0372     .0001a     .0675b    .0046b     .1144b    .0023
Subscale mean 50%      .0368b    .0005b     .0648a    .0058      .1027a    .0014b
Item mean              .0411     .0021      .1052     .0020a     .1562     .0010a
Single imputation      .0282a    .0009      .0748     .0070      .1148     .0121
Matched item mean      .0806     .0008      .1469     .0053      .1816     .0090

Abbreviations: PMD, population mean deviation; SSD, sum of squared deviation.
a Refers to the smallest p-value among the imputation methods.
b Refers to the second smallest p-value among the imputation methods.


Table 3. Average deviations for nonrandom missing data (N = 177)

Imputation method      10% data missing     20% data missing     30% data missing
                       SSD       PMD        SSD       PMD        SSD       PMD
Case deletion          Nil       .0136      Nil       .0235      Nil       .0282a
Subscale mean          .0120     .0015b     .0375     .0012b     .0579b    .0035b
Subscale mean 50%      .0116b    .0012a     .0369b    .0027      .0535a    .0052
Item mean              .0206     .0031      .0465     .0045      .0864     .0090
Single imputation      .0113a    .0015b     .0294a    .0008a     .0604     .0052
Matched item mean      .0357     .0046      .0988     .0071      .1465     .0086

Abbreviations: PMD, population mean deviation; SSD, sum of squared deviation.
a Refers to the smallest p-value among the imputation methods.
b Refers to the second smallest p-value among the imputation methods.

generated for this study was set at 10%, 20%, and 30%. For the nonrandom process, missing data were generated using a mechanism such that the probability of a nonresponded item is a function of the patient's global QOL score, which ranged from 1 to 7, and a predefined rate of nonresponse (10%, 20%, or 30%). The percentage of nonresponse for the ith subject is $(NR)_i$, as follows:

$$(NR)_i = R(8 - G_i)/7; \quad i = 1, \ldots, n$$

where R is the predefined rate of nonresponse, and $G_i$ is the global QOL score for the ith patient among a total of n patients. Because a small $G_i$ is associated with poor global QOL, the percentage of nonresponse is higher for a patient with poorer global QOL.

Evaluation criteria

In assessing the accuracy of the imputation methods, the precision of the imputed individual total score is evaluated by the average sum of squared deviation (SSD) between the imputed values and the actual values, as follows:

$$SSD = \sum_{i=1}^{N} \left( \sum_{j=1}^{K} \frac{(\hat{Y}_{ij} - Y_{ij})^2}{n_j} \right); \quad i = 1, \ldots, N; \; j = 1, \ldots, K$$

where $\hat{Y}_{ij}$ is the sum of the jth subscale of the ith subject from the imputed data set, $Y_{ij}$ is the total for the individual subscale of complete data, $n_j$ is the number of complete cases in the jth subscale, N is the number of subjects in the data, and K is the number of subscales.

Another evaluation index is the population mean deviation (PMD) between the imputed data set and the actual complete data set. The accuracy of the imputation methods is measured by the absolute difference between the sample mean

Table 4. Average deviations for nonrandom missing data (N = 452)

Imputation method      10% data missing     20% data missing     30% data missing
                       SSD       PMD        SSD       PMD        SSD       PMD
Case deletion          Nil       .0135      Nil       .0184      Nil       .0246
Subscale mean          .0188b    .0026a     .0175a    .0036b     .0298a    .0034a
Subscale mean 50%      .0188b    .0026a     .0177b    .0057      .0298a    .0051b
Item mean              .0231     .0046      .0451     .0063      .0856     .0100
Single imputation      .0141a    .0029b     .0256     .0027a     .0537b    .0081
Matched item mean      .0351     .0048      .0503     .0061      .1078     .0077

Abbreviations: PMD, population mean deviation; SSD, sum of squared deviation.
a Refers to the smallest p-value among the imputation methods.
b Refers to the second smallest p-value among the imputation methods.


Table 5. Health-related quality-of-life response rates

Domain                   Arm   Worsened    Stable      Improved    Fisher's P (b)   MH χ² P (c)
Physical functioning     GE    11 (32%)    12 (35%)    11 (32%)    .346             .477
                         GP    11 (32%)    7 (21%)     16 (47%)
Role functioning         GE    10 (29%)    4 (12%)     20 (59%)    1.000            .785
                         GP    9 (26%)     4 (12%)     21 (62%)
Emotional functioning    GE    9 (26%)     25 (74%)    0           .369             .234
                         GP    5 (15%)     29 (85%)    0
Cognitive functioning    GE    13 (38%)    8 (24%)     13 (38%)    .321             .154
                         GP    7 (21%)     10 (29%)    17 (50%)
Social functioning       GE    14 (41%)    7 (21%)     13 (38%)    .574             .277
                         GP    10 (29%)    7 (21%)     17 (50%)
Global QOL               GE    13 (38%)    11 (32%)    10 (29%)    .215             .783
                         GP    15 (44%)    5 (15%)     14 (41%)
Fatigue                  GE    14 (41%)    2 (6%)      18 (53%)    .409             .516
                         GP    10 (29%)    5 (15%)     19 (56%)
Nausea and vomiting      GE    15 (44%)    15 (44%)    4 (12%)     .022a            .032a
                         GP    26 (76%)    5 (15%)     3 (9%)
Pain                     GE    12 (35%)    7 (21%)     15 (44%)    .810             .496
                         GP    10 (29%)    6 (18%)     18 (53%)
Dyspnea                  GE    13 (38%)    13 (38%)    8 (24%)     .106             .036a
                         GP    7 (21%)     11 (32%)    16 (47%)
Insomnia                 GE    13 (38%)    9 (26%)     12 (35%)    .136             .049a
                         GP    7 (21%)     7 (21%)     20 (59%)
Appetite loss            GE    21 (62%)    13 (38%)    0           1.000            .806
                         GP    20 (59%)    14 (41%)    0
Constipation             GE    13 (38%)    14 (41%)    7 (21%)     1.000            1.000
                         GP    13 (38%)    14 (41%)    7 (21%)
Diarrhea                 GE    5 (15%)     29 (85%)    0           .003a            .180
                         GP    7 (21%)     19 (56%)    8 (24%)
Financial difficulties   GE    8 (24%)     19 (56%)    7 (21%)     .003a            .001a
                         GP    2 (6%)      12 (35%)    20 (59%)

Abbreviations: GE, gemcitabine plus etoposide; GP, gemcitabine plus cisplatin; MH, Mantel-Haenszel.
a Refers to p < 0.05.
b Fisher's exact test with 2 degrees of freedom.
c Mantel-Haenszel test with 1 degree of freedom.

of the questionnaire when there are no missing data, as follows:

$$\mu = \sum_{i=1}^{N} \sum_{j=1}^{n_j} (Y_{ij}/n_j) / N$$

where N is the total number of subjects, $n_j$ is the number of items in the jth subscale, and $Y_{ij}$ is the total for the individual and subscale of complete data. The sample mean of the questionnaire when the missing data are handled by the imputation methods is as follows:

$$\hat{\mu} = \sum_{i=1}^{N} \sum_{j=1}^{n_j} (\hat{Y}_{ij}/n_j) / N$$

where $\hat{Y}_{ij}$ is the total of the completed and imputed values for subjects with missing items. The absolute deviation of the population means, $|\hat{\mu} - \mu|$, is PMD.
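As an illustration only, the nonrandom missingness rate (NR)_i = R(8 − G_i)/7 and the PMD criterion described above can be sketched in Python. The function names, the data layout (one row of subscale totals per subject), and the choice to sum over subscales are assumptions made for this sketch; the article does not give the code used in the simulation.

```python
def nonresponse_rate(base_rate, global_qol):
    """Nonresponse percentage (NR)_i = R(8 - G_i)/7 for a global QOL score in 1..7."""
    if not 1 <= global_qol <= 7:
        raise ValueError("global QOL score must lie between 1 and 7")
    return base_rate * (8 - global_qol) / 7


def questionnaire_mean(subscale_totals, items_per_subscale):
    """Average over subjects of the summed per-item subscale means (Y_ij / n_j).

    subscale_totals: one list per subject holding the total score of each subscale.
    items_per_subscale: number of items n_j in each subscale (assumed layout).
    """
    per_subject = [
        sum(total / n_items for total, n_items in zip(row, items_per_subscale))
        for row in subscale_totals
    ]
    return sum(per_subject) / len(per_subject)


def population_mean_deviation(complete, imputed, items_per_subscale):
    """PMD: absolute difference between the imputed and complete-data sample means."""
    return abs(
        questionnaire_mean(imputed, items_per_subscale)
        - questionnaire_mean(complete, items_per_subscale)
    )
```

With a predefined rate R = 0.3, a patient with the poorest global QOL (G = 1) is assigned the full rate of 0.3, whereas a patient with G = 7 is assigned 0.3/7, which reproduces the intended pattern of more nonresponse among patients with poorer QOL.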

Comparisons of results among methods

In the simulation, missing rates of 10%, 20%, and 30% were used. The precision of the imputed values was evaluated using the average SSD method of individual items and the method of PMD. In SSD and PMD approaches, values that are closer to 0 indicate smaller differences between the imputed and actual values. The smaller the value, the better the performance of the corresponding imputation method. Results for missing items generated using a random process are shown in Tables 1 and 2. Results for missing items generated using a nonrandom process are shown in Tables 3 and 4. The results of average deviations all were close to zero in the average SSD and the PMD; multiple imputation has shown its accuracy and ability to minimize potential biases caused by the missing data.


The results in Tables 1 and 2 show that when missing data are ignorable and the sample size is small, the subscale-mean 50% method is superior in the SSD measure but not in the PMD measure. As the sample size increases (see Table 2), subscale mean 50% still outperforms the other methods most of the time. When the missing data were generated by a nonrandom process, the average SSD favored subscale mean 50% (Tables 3 and 4). As the sample size increases for nonrandom missing data, the subscale-mean, subscale-mean 50%, and single-imputation methods showed similar performance. Overall, the subscale-mean 50% approach to summarizing HRQOL data outperforms the other methods. When the HRQOL data are not missing at random, the subscale-mean 50% method still performs well, but other methods, such as single imputation or more complicated methods, may be needed. The case-deletion, item-mean, and matched item-mean methods are not recommended.
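A minimal sketch of the subscale-mean 50% rule, as described in the methods section, is given here for one subject and one subscale. The function name and the use of None to mark a missing item are assumptions for illustration, not part of the original study.

```python
def impute_subscale_mean_50(items):
    """Fill missing items (None) in one subject's subscale with the subject's own mean.

    The mean is taken over the answered items of the same subscale. Imputation is
    done only when at least 50% of the items were answered; otherwise the items
    are returned unchanged and the subscale remains incomplete.
    """
    answered = [score for score in items if score is not None]
    # Require at least half of the subscale items to be present.
    if not items or 2 * len(answered) < len(items):
        return list(items)
    subject_mean = sum(answered) / len(answered)
    return [subject_mean if score is None else score for score in items]
```

For example, a four-item subscale answered as 1, missing, 3, 2 has three of four items present, so the missing item is replaced by the subject's mean of 2. A three-item subscale with only one answer is left as it is, because fewer than 50% of its items are available.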

Analysis of health-related quality of life data

In the analysis stage, in addition to using imputation methods to deal with missing data and applying conventional cross-sectional analysis to the completed data, two proposed approaches are discussed. The first summarizes the longitudinal HRQOL data into a HRQOL response variable [8]. The second models the data using the growth curve method [7].

The HRQOL response variable method takes into account the longitudinal nature of the data. It is defined based on the change scores over time and a prespecified, clinically significant difference Δ for the change score, and it summarizes the longitudinal data into a single variable. The longitudinal HRQOL data for a particular patient are categorized as "improved" if at least one postbaseline change score is greater than Δ. They are categorized as "worsening" if all the postbaseline change scores are less than or equal to Δ but at least one is less than −Δ. The response is defined as "stable" if all postbaseline change scores are between −Δ and Δ. A simple χ² test or Fisher's exact test with 2 degrees of freedom, or a Mantel-Haenszel (MH) χ² test with 1 degree of freedom, can be used to compare treatment arms.

The growth curve modeling approach is designed to handle dropout and missing data in QOL data [19]. It requires only a missing at random (MAR) assumption and can use all data without artificially forcing patient HRQOL data into fixed time intervals. It also describes the average HRQOL patterns over time for the two treatment arms in a clinical trial. The graphic representation of the various HRQOL domains provides comparative information visually to evaluate the likely impact of treatments.

The randomized, phase II non–small cell lung cancer trial mentioned earlier [15] is used to illustrate the methods. Table 5 shows the HRQOL response rates for the QLQ-C30 domains. The GP arm has a significantly higher proportion of patients with worsening of nausea and vomiting and a lower proportion

[Figure 2: third-degree growth curve model. Vertical axis: QOL score. Horizontal axis: cycles (0 to 4). Arms: GE, GP. P values annotated on the plot: p = 0.0001, p < 0.0001, p < 0.0001, p = 0.0013. Treatment-by-time interaction p = 0.001.]

Fig. 2. Growth curve models for nausea and vomiting between gemcitabine plus etoposide (GE) arm and gemcitabine plus cisplatin (GP) arm. QOL, quality of life.


[Figure 3: third-degree growth curve model. Vertical axis: QOL score. Horizontal axis: cycles (0 to 4). Arms: GE, GP. P values annotated on the plot: p = 0.0291, p = 0.0726, p = 0.0356, p = 0.0431. Treatment-by-time interaction p = 0.0125.]

Fig. 3. Growth curve models for dyspnea between gemcitabine plus etoposide (GE) arm and gemcitabine plus cisplatin (GP) arm. QOL, quality of life.

of patients with stable response and improvement (P = .022, 2 degrees of freedom). Dyspnea and insomnia are significant only in the Mantel-Haenszel χ² test (P = .036 and P = .049, 1 degree of freedom) and not in Fisher's exact test (P = .106 and P = .136, 2 degrees of freedom), showing that GP has more improvement in both of these symptoms and fewer patients in the stable and worsening categories. In general, the Mantel-Haenszel χ² test is more powerful at detecting linear trends and less powerful at detecting nonlinear relationships.
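The HRQOL response categories used in this analysis can be sketched as follows. This is a minimal illustration under stated assumptions: change scores are oriented so that positive values mean improvement (for symptom scales, where higher scores mean worse symptoms, the sign would have to be reversed first), and the threshold of 10 points in the example is hypothetical rather than a value taken from the article.

```python
def hrqol_response(change_scores, delta):
    """Summarize one patient's postbaseline change scores into a response category.

    improved: at least one change score is greater than delta.
    worsened: no change score exceeds delta, and at least one is below -delta.
    stable:   every change score lies between -delta and delta.
    """
    if not change_scores:
        raise ValueError("at least one postbaseline change score is required")
    if any(change > delta for change in change_scores):
        return "improved"
    if any(change < -delta for change in change_scores):
        return "worsened"
    return "stable"


# Hypothetical patient with three postbaseline assessments and delta = 10.
print(hrqol_response([5, 12, -20], 10))  # improved
```

Note that improvement takes precedence: a patient with one change score above the threshold is classified as improved even if a later score falls below the negative threshold, which follows the definition given earlier.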


When the results are assessed using growth curve models, nausea and vomiting at baseline is close to 0, indicating that there is no nausea and vomiting before patients receive chemotherapy. There are significant differences in each of the first four cycles, all of them in favor of the GE arm after treatment started (Fig. 2). For dyspnea, the sample size is relatively small for a randomized phase II study with only 34 patients in each of the two arms. The dyspnea score at baseline for GE is 23.5 (SD = 25.33) and 46.1 (SD = 30.72) for GP (P = .0026). The growth curve shows an

[Figure 4: second-degree growth curve model. Vertical axis: QOL score. Horizontal axis: cycles (0 to 4). Arms: GE, GP. P values annotated on the plot: p = 0.8850, p = 0.7285, p = 0.4701, p = 0.3397. Treatment-by-time interaction p = 0.117.]

Fig. 4. Growth curve models for insomnia between gemcitabine plus etoposide (GE) arm and gemcitabine plus cisplatin (GP) arm. QOL, quality of life.

[Figure 5: second-degree growth curve model. Vertical axis: QOL score. Horizontal axis: cycles (0 to 4). Arms: GE, GP. P values annotated on the plot: p = 0.0255, p = 0.1066, p = 0.2294, p = 0.3537. Treatment-by-time interaction p = 0.8996.]

Fig. 5. Growth curve models for diarrhea between gemcitabine plus etoposide (GE) arm and gemcitabine plus cisplatin (GP) arm. QOL, quality of life.

improvement for the GP arm but a slight worsening and then a gradual improvement for the GE arm (Fig. 3). The treatment-by-time interaction is significant (P = .0125). The improvement in the GP arm is significant, which agrees with the Mantel-Haenszel χ² test from the HRQOL response analysis. For insomnia, there is no difference at early time points; the GE arm shows a slight worsening and the GP arm a slight improvement, but the effect is not large enough to be significant (Fig. 4).

For diarrhea, as shown in Fig. 5, the GP arm has a higher proportion of patients with improvement and a lower proportion of patients with stable responses, but a higher proportion of patients with worsening (P = .003, 2 degrees of freedom). The opposing proportions of improvement and worsening may cancel out the linear trend in the treatment effect for diarrhea. This phenomenon appears as a significant P value in Fisher's exact test but a nonsignificant P value in the Mantel-Haenszel χ² test.

Lastly, both tests show a significant result for financial difficulties (P = .003 and P = .001). The average baseline scores of 28.4 (SD = 37.72) in the GE arm and 48 (SD = 36.87) in the GP arm differed significantly (P = .0204) due to chance in a small study. The growth curve

[Figure 6: second-degree growth curve model. Vertical axis: QOL score. Horizontal axis: cycles (0 to 4). Arms: GE, GP. P values annotated on the plot: p = 0.2268, p = 0.7165, p = 0.9314, p = 0.3397. Treatment-by-time interaction p = 0.003.]

Fig. 6. Growth curve models for financial difficulties between gemcitabine plus etoposide (GE) arm and gemcitabine plus cisplatin (GP) arm. QOL, quality of life.


shows a clear difference in favor of the GE arm at baseline; the GP arm improves quickly to the same level as the GE arm after starting treatment (Fig. 6). The difference in the rates of change between the two arms is shown by the treatment-by-time interaction test (P = .003).

In general, the Mantel-Haenszel χ² test for the HRQOL response tests for a trend between the two treatment arms with respect to the corresponding domain. Fisher's exact test provides an indication of any differences between arms. The HRQOL response approach is simple to understand and to carry out. It lacks the visual presentation of the growth curve, however, which provides more detail to explain some of the average patterns and any disparity in HRQOL scores at baseline.
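To make the trend test concrete, the sketch below computes a Mantel-Haenszel χ² statistic with 1 degree of freedom in its linear-by-linear association form, (N − 1)r², where r is the correlation between row and column scores. The equally spaced scores (0, 1, 2 for worsened, stable, improved) and the function name are assumptions of this sketch; the article does not state which software or scoring the authors used. The counts are the nausea and vomiting row of Table 5.

```python
import math


def mh_trend_test(table, row_scores=None, col_scores=None):
    """Mantel-Haenszel chi-square test of linear trend (1 degree of freedom).

    table: list of rows of counts, for example treatment arms by response category.
    Scores default to 0, 1, 2, ... for rows and columns (an assumption).
    Returns the statistic (N - 1) * r**2 and its P value.
    """
    row_scores = row_scores or list(range(len(table)))
    col_scores = col_scores or list(range(len(table[0])))
    total = sum(sum(row) for row in table)
    mean_row = sum(row_scores[i] * sum(row) for i, row in enumerate(table)) / total
    mean_col = sum(
        col_scores[j] * count for row in table for j, count in enumerate(row)
    ) / total
    s_rr = s_cc = s_rc = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            d_row = row_scores[i] - mean_row
            d_col = col_scores[j] - mean_col
            s_rr += count * d_row * d_row
            s_cc += count * d_col * d_col
            s_rc += count * d_row * d_col
    statistic = (total - 1) * s_rc ** 2 / (s_rr * s_cc)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Nausea and vomiting, Table 5: rows GE and GP; columns worsened, stable, improved.
statistic, p_value = mh_trend_test([[15, 15, 4], [26, 5, 3]])
print(round(statistic, 2), round(p_value, 3))  # 4.58 0.032
```

Under these assumptions the result agrees with the MH χ² P value of .032 reported for nausea and vomiting in Table 5. Because the statistic depends only on the linear association between arm and ordered category, opposing shifts toward both improvement and worsening, as seen for diarrhea, can leave it small even when the distributions differ.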

Summary

The methodology for QOL assessment covers a wide range of topics. It involves a proper choice of instruments with appropriate psychometric properties, the administration of these instruments, the frequency of measurements, missing data problems, and the method of analysis. There is ongoing debate on the meaning and interpretation of the HRQOL domains, centered on how to define a minimal clinically meaningful difference and whether it can be used in regulatory approval for drug development [20]. From a practical point of view, the authors propose that a disease-specific checklist or symptom domains incorporated within an HRQOL questionnaire may be a middle ground on which academic institutions, manufacturers, and regulatory agencies can agree: a specific symptom checklist or domain serves as the primary end point for clinical trials, with other HRQOL domains as ancillary data for the study. An antiemetic trial with HRQOL assessments is an example. Most would agree, however, that no matter what HRQOL domains or symptoms are being studied, the assessment should be based on a patient self-administered questionnaire, as shown by the lack of sensitivity in the example in this article [15].

Missing data are a problem in data collection and handling. The authors have examined a few commonly used approaches and performed simulations to study their properties. The subscale-mean method generally reflects the true values when more than 50% of the information on a subscale is available. In practice, one still would have missing data that cannot be handled completely by imputation. The method of analysis must be flexible enough to accommodate the nature of these data. Two approaches have been discussed, and both are flexible in that they use all available information obtained longitudinally, with variable visit schedules and potential missing data. The HRQOL response variable approach is simple and easy to understand. The growth curve model approach provides more detailed information on average trends between treatment arms. In general, these two methods agree on the results of the example. They can be used to report clinical trial results using HRQOL data as end points.
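The subscale-mean rule mentioned in this summary can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `subscale_mean`, the use of `None` for an unanswered item, and the example responses are assumptions. The cut-off follows the wording of the text (more than 50% of the items answered).

```python
from typing import Optional, Sequence


def subscale_mean(items: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the answered items on one subscale.

    Averaging the answered items is equivalent to replacing each missing
    item with the mean of the answered ones. The score is returned only
    when more than 50% of the items were answered; otherwise the subscale
    is left missing (None).
    """
    answered = [x for x in items if x is not None]
    if 2 * len(answered) <= len(items):
        return None
    return sum(answered) / len(answered)


# Hypothetical responses to a four-item subscale (None = item not answered)
print(subscale_mean([1, 2, None, 3]))     # three of four items answered
print(subscale_mean([1, None, None, 4]))  # only half answered: left missing
```

Subscales that stay missing under this rule are the data that imputation cannot handle, so the analysis method still has to accommodate them.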

References

[1] Ware Jr JE. Measuring functioning, well-being, and other generic health concepts. In: Osoba D, editor. Effect of cancer on quality of life. Boca Raton, FL: CRC Press; 1991. p. 7–23.
[2] Kaasa S. Measurement of quality of life in clinical trials. Oncology 1992;49:288–94.
[3] Bruner D. In search of the "quality" in quality-of-life research. Int J Radiat Oncol Biol Phys 1995;31:191–2.
[4] Osoba D, Aaronson NK, Till JE. A practical guide for selecting quality-of-life measures in clinical trials and practice. In: Osoba D, editor. Effect of cancer on quality of life. Boca Raton, FL: CRC Press; 1991. p. 89–104.
[5] Beitz J, Gnecco C, Justice R. Quality of life endpoints in cancer clinical trials: the Food and Drug Administration perspectives. J Natl Cancer Inst Monogr 1996;20:7–9.
[6] Osoba D, Zee B. Completion rates in health-related quality-of-life assessment: approach of the National Cancer Institute of Canada Clinical Trials Group. Stat Med 1998;17:603–12.
[7] Zee B. Growth curve model analysis for quality of life data. Stat Med 1998;17:757–66.
[8] Tu D, Liu J, Pater J. Analysis of the ordinal quality of life response data: whither simple chi-square tests? Controlled Clinical Trials 2003;24:229S.
[9] Fayers P, Aaronson N, Bjordal K, et al. EORTC QLQ-C30 Scoring Manual. Brussels: EORTC Quality of Life Study Group; 1999.
[10] Levitt M, Warr D, Yelle L, et al. A comparison of ondansetron versus dexamethasone and metoclopramide as antiemetics in the chemotherapy of breast cancer with cyclophosphamide, methotrexate, and 5-fluorouracil: a study by the Clinical Trials Group of the National Cancer Institute of Canada. N Engl J Med 1993;328:1081–4.
[11] Kaizer L, Warr D, Hoskins P, et al. The effect of schedule and maintenance on the antiemetic efficacy of ondansetron combined with dexamethasone in acute and delayed nausea and emesis in patients receiving moderately emetogenic chemotherapy: a phase II trial by the National Cancer Institute of Canada Clinical Trials Group. J Clin Oncol 1994 May;12(5):1050–7.
[12] Pater J, Lofters WS, Zee B, et al. The role of 5HT3 antagonists in the control of delayed onset of nausea and vomiting in patients receiving moderately emetogenic chemotherapy. Ann Oncol 1997;8:181–5.
[13] Lofters WS, Pater J, Zee B, et al. A phase III double-blind comparison of dolasetron mesylate and ondansetron, and an evaluation of the additive role of dexamethasone in the prevention of acute and delayed nausea and vomiting due to moderately emetogenic chemotherapy. J Clin Oncol 1997;15:2966–73.
[14] Zee B, Yeo W, Lai M, et al. Chemotherapy-induced nausea affects patients' quality of life (QOL) more than vomiting. Presented at ISOQOL, November 12–5, 2003.
[15] Mok TS, Zee B, Nguyen B, et al. A prospective randomized study comparing toxicity and quality of life of two gemcitabine-based regimens (with or without cisplatin) for treatment of advanced non-small-cell lung cancer. ASCO; 2002.
[16] Hopwood P, Stephens RJ, Machin D. Approaches to the analysis of quality of life data: experiences gained from a Medical Research Council Lung Cancer Working Party palliative chemotherapy trial. Qual Life Res 1994;3:339–52.
[17] Latreille J, Pater J, Johnston D, et al. Use of dexamethasone and granisetron in the control of delayed emesis for patients who receive highly emetogenic chemotherapy. J Clin Oncol 1998;16:1174–8.
[18] Bernhard J, Gelber RD, editors. Workshop on missing data in quality of life research in cancer clinical trials: practical and methodological issues. Stat Med 1998;17:511–796.
[19] Fairclough D, Peterson H, Cella D, Bonomi P. Comparison of several model-based methods for analysing incomplete quality of life data in clinical trials. Stat Med 1998;17:781–96.
[20] Osoba D, Rodrigues G, Myles J, Zee B, Pater J. Interpreting the significance of changes in health-related quality-of-life scores. J Clin Oncol 1998;16:139–44.