Impact of Different Criteria for Defining Hypopneas in the Apnea-Hypopnea Index


Renee L. Manser, MBBS; Peter Rochford, BSc; Robert J. Pierce, MD, FCCP; Graham B. Byrnes, BSc, PhD; and Donald A. Campbell, MD*

Objectives: To explore the effect of using different scoring criteria for hypopneas in the scoring of polysomnographic studies: (1) by estimating the level of agreement between apnea-hypopnea index (AHI) scores derived from different scoring methods, and (2) by examining the effect on the point prevalence of disease using different threshold values of the AHI.
Design: Retrospective analysis of 48 diagnostic polysomnographic records.
Setting: Tertiary-hospital sleep-disorders clinic.
Measurements: AHIs were derived from three different methods for scoring hypopneas. The hypopnea definitions used incorporated different combinations and threshold values of respiratory signal changes in addition to differences in the requirement for associated oxygen desaturation or arousal. The level of agreement between different scoring methods was assessed by constructing Bland-Altman plots and calculating intraclass correlation coefficients (ICCs). κ statistics were used to assess agreement between the different methods using varying thresholds of AHI to categorize sleep apnea (AHI > 5, AHI > 15, and AHI > 20).
Results: The random-effects ICC for the three methods was 0.89, suggesting that the different scoring methods tended to rank patients fairly consistently. However, the point prevalence of disease estimated by using different thresholds of AHI was found to vary depending on the method used to score sleep studies (κ, 0.30 to 0.95).
Conclusions: These findings have implications for case finding, population-prevalence estimates, and grading of disease severity for access to government-funded continuous positive airway pressure services. Guidelines for standardizing the measurement and reporting of sleep studies in clinical practice should be implemented.
(CHEST 2001; 120:909-914)
Key words: diagnosis; hypopnea; polysomnography; sleep apnea syndromes
Abbreviations: AHI = apnea-hypopnea index; CI = confidence interval; CPAP = continuous positive airway pressure; ICC = intraclass correlation coefficient

*From the Clinical Epidemiology and Health Service Evaluation Unit (Drs. Manser, Byrnes, and Campbell), Royal Melbourne Hospital, Parkville, Victoria; and Department of Respiratory Medicine (Mr. Rochford and Dr. Pierce), Austin and Repatriation Medical Center, Heidelberg, Victoria, Australia. The study was funded by the Victorian Department of Human Services. This study was performed at the Austin and Repatriation Medical Center, Heidelberg, Victoria, Australia. Manuscript received December 1, 2000; revision accepted April 12, 2001. Correspondence to: Renee L. Manser, MBBS, Clinical Epidemiology and Health Service Evaluation Unit, Ground Floor, Charles Connibere Building, Royal Melbourne Hospital, Parkville, Victoria 3050; e-mail: [email protected]

A recent survey of sleep laboratories in the Australian state of Victoria identified many sources of variability in the measurement of hypopneas.1 Similar results have been reported from surveys2 in North America. In general, the measurement of hypopneas is very poorly standardized.3 There are variations in the types of sensors used to measure particular signals. For example, differences in the response characteristics of different airflow signals have been shown4 to be of practical importance. Furthermore, sleep laboratories may use different combinations of signals and apply different threshold values when scoring hypopneas. The requirement for hypopnea definitions to include associated changes in oxygen saturation and/or arousals also varies between laboratories.1,2 Despite these limitations, the apnea-hypopnea index (AHI) remains the primary measurement of sleep-disordered breathing.

The problem of standardization affects sleep apnea diagnosis, grading of disease severity, treatment decisions, and research. Importantly, in Victoria and elsewhere, reimbursement for government-funded continuous positive airway pressure (CPAP) services is linked to the severity of sleep-disordered breathing as defined by the AHI. This study was therefore designed to assess the effect of "real-world" variations in hypopnea measurement on the AHI. We report the level of agreement between AHIs derived from three different hypopnea scoring methods by rescoring a random sample of diagnostic sleep studies. The methods used were chosen to reflect those currently used by Victorian sleep laboratories, and include different combinations of respiratory signals and thresholds and differences in the requirement for associated changes in oxygen saturation or arousal.

In this study, we aimed to explore the effect of using different scoring criteria for hypopneas in the scoring of polysomnographic studies (1) by estimating the level of agreement between AHIs derived from different scoring methods, and (2) by examining the point prevalence of disease using different scoring methods and different AHI thresholds.

Materials and Methods

Forty-eight diagnostic sleep study records were retrospectively selected at random from a sleep study database. This database included all sleep studies performed on patients at a tertiary sleep center during a 6-month period. The random-number sequence was computer generated. A large proportion of subjects in the database had AHI scores at the lower end of the spectrum. In order to improve resolution near the threshold of interest, random selection of sleep study records was stratified according to disease severity (based on previously reported results from our laboratory). Eight sleep study records were selected from each of six categories (AHI 0 to 10, AHI 10 to 20, AHI 20 to 30, AHI 30 to 40, AHI 40 to 50, and AHI > 50); therefore, sampling was not by equal probability. Each sleep study record was rescored by the same staff member in a random sequence using three different criteria for scoring hypopneas. The scorer was blinded to the previous scoring results. The scoring methods are outlined below.
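The stratified selection described above (eight records drawn at random from each of six AHI categories) can be illustrated with a minimal sketch. This is not the authors' procedure: the function names, the `(record_id, ahi)` input format, the fixed seed, and the half-open category boundaries (for example, an AHI of exactly 10 falling in the 10-to-20 category) are all assumptions made for illustration, since the text does not say how boundary values were handled.

```python
import random

# Hypothetical strata mirroring the six AHI categories named in the text.
# Half-open intervals [low, high) are an assumption; the paper does not
# state how boundary values were assigned.
STRATA = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, float("inf"))]


def stratum_index(ahi):
    """Return the index of the AHI category that contains `ahi`."""
    for i, (low, high) in enumerate(STRATA):
        if low <= ahi < high:
            return i
    raise ValueError(f"AHI out of range: {ahi}")


def stratified_sample(records, per_stratum=8, seed=0):
    """Draw `per_stratum` record ids at random from each AHI category.

    `records` is a list of (record_id, previously_reported_ahi) pairs.
    Raises ValueError if a category holds fewer than `per_stratum` records.
    """
    rng = random.Random(seed)
    buckets = [[] for _ in STRATA]
    for record_id, ahi in records:
        buckets[stratum_index(ahi)].append(record_id)
    selected = []
    for bucket in buckets:
        selected.extend(rng.sample(bucket, per_stratum))
    return selected
```

Because every category contributes the same number of records regardless of its size in the database, records from sparsely populated categories are more likely to be chosen, which is why the text notes that sampling was not by equal probability.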
The study was approved by the Austin and Repatriation Medical Center Ethics Committee.

Exclusion Criteria for Sleep Studies

Studies were excluded if the sleep efficiency was < 50% or if poor signals were noted for > 10% of the total sleep time.

Sleep Staging and Scoring

Sleep stages were scored according to the criteria of Rechtschaffen and Kales.5 For each method, apneas were defined as cessation of oronasal airflow (such that no breaths were discernible in the airflow signal) for ≥ 10 s in association with a ≥ 2% oxygen desaturation. Arousals were defined according to the American Sleep Disorders Association criteria.6 The three different methods for scoring hypopneas are as follows:

Method A: A ≥ 50% reduction in one or more of three respiratory signals (airflow, thoracic, or abdominal respiratory inductive plethysmography) compared with baseline breathing level for > 10 s and associated with an oxygen desaturation ≥ 2% compared with baseline.

Method B: Either 1 and 3 or 2 and 3 of the following: (1) a reduction of ≥ 50% from baseline in the amplitude of respiratory inductive plethysmography signals; the reduction must be in both the thoracic and abdominal movement channels (which were recorded on separate channels for this study); (2) a clear amplitude reduction that does not reach the above-mentioned criterion but is associated with either an oxygen desaturation of ≥ 3% or an arousal; (3) event lasting ≥ 10 s.

Method C: Any discernible reduction in airflow lasting ≥ 10 s associated with a ≥ 3% oxygen desaturation (with or without arousal).

The methods were selected to reflect the "extremes" in the range of different methods used by Victorian sleep laboratories.1 Method B is based on published guidelines.7 Some of the Victorian laboratories were using slight variations of method B.

Intrarater Reliability

In order to assess intraobserver variability for each scoring method, six sleep studies were randomly selected and rescored by the same technologist. The time interval between rescores was 3 months, and the technologist was blinded to the previous results.

Equipment and Instrumentation

For each sleep study, polysomnographic signals were recorded using a computerized polysomnographic system (Sleepwatch; Compumedics; Melbourne, Australia). The montage consisted of the following signals: central EEG (C3/A2), chin electromyogram, eye electro-oculographic activity (right and left electro-oculograms) and ECG, ribcage and abdominal excursions by uncalibrated inductance plethysmography, nasal-oral airflow with thermistors, and arterial oxygen saturation by pulse oximetry (Radiometer OXI; Radiometer; Copenhagen, Denmark) using finger probes.

Statistical Analysis

Statistical analysis was performed using statistical software (Stata version 6.0; Stata Corporation; College Station, TX).8 Repeated-measures analysis of variance was used to assess whether the mean AHIs derived from each of the three methods were significantly different.
Pearson's correlation coefficient was used to assess the correlation between AHIs derived from different hypopnea definitions. The Pearson correlation coefficient tests the strength of the linear relationship between the variables and is not a measure of agreement. It may not detect systematic differences between the methods.9

An intraclass correlation coefficient (ICC) was calculated using both the random-effect and fixed-effect methods described by Streiner and Norman.9 Bias-corrected confidence intervals (CIs) were estimated by using 1,000 bootstrap repetitions. The ICC is a ratio of the true variance between patients to the observed variance of scores and is equivalent to a weighted κ with quadratic weights.9

Agreement between the different scoring methods was further evaluated using Bland-Altman plots and limits of agreement.10 Bland and Altman10 have developed a method for measuring the absolute level of agreement between two measurement techniques. This method is useful for comparing measures on the same scale. Agreement was assessed by calculating the limits of agreement. The limits of agreement are equal to the mean difference plus or minus twice the SD of the differences. The precision of the estimated limits of agreement was determined by calculating CIs. Bland-Altman plots were constructed for each pair of the three methods. These plots were constructed by plotting the difference between two of the methods against the mean of the two methods.
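The Bland-Altman calculation described above can be sketched in a few lines. The sketch assumes two equal-length lists of AHI scores for the same patients. It takes the limits as the mean difference plus or minus twice the SD of the paired differences, which is the spread that the limits reported in the Results correspond to. The function name and the returned dictionary keys are hypothetical, and the original analysis was done in Stata rather than Python.

```python
import statistics


def bland_altman(scores_x, scores_y):
    """Summarise agreement between two methods scored on the same patients.

    Returns the mean difference (bias), the SD of the differences, the
    limits of agreement (bias +/- 2 SD), and the (mean, difference) pairs
    that would be plotted on a Bland-Altman plot.
    """
    if len(scores_x) != len(scores_y):
        raise ValueError("both methods must score the same patients")
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the paired differences
    points = [((x + y) / 2, x - y) for x, y in zip(scores_x, scores_y)]
    return {
        "bias": bias,
        "sd": sd,
        "lower": bias - 2 * sd,
        "upper": bias + 2 * sd,
        "points": points,
    }
```

Plotting the second element of each pair in `points` against the first, with horizontal lines at `bias`, `lower`, and `upper`, gives the usual Bland-Altman display.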

κ statistics were used to assess agreement between the different methods using varying thresholds of AHI to categorize sleep apnea (AHI > 5, AHI > 15, and AHI > 20). In clinical practice, thresholds for defining disease vary between > 5 events per hour and > 15 events per hour. In Victoria, patients (without comorbidities) are eligible for government-funded CPAP services for sleep apnea if they have an AHI of > 20 events per hour.
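As an illustration of how a κ statistic follows from thresholding two sets of AHI scores, the sketch below computes the observed agreement and the unweighted Cohen κ for one threshold. The function name and input format are assumptions. The example scores in the usage note are invented and are not data from the study.

```python
def cohen_kappa(ahi_x, ahi_y, threshold):
    """Return (observed agreement, kappa) for 'AHI > threshold' by two methods."""
    n = len(ahi_x)
    pos_x = [a > threshold for a in ahi_x]
    pos_y = [b > threshold for b in ahi_y]
    both_pos = sum(1 for a, b in zip(pos_x, pos_y) if a and b)
    both_neg = sum(1 for a, b in zip(pos_x, pos_y) if not a and not b)
    observed = (both_pos + both_neg) / n
    p_x = sum(pos_x) / n
    p_y = sum(pos_y) / n
    # Agreement expected by chance if the two classifications were independent.
    expected = p_x * p_y + (1 - p_x) * (1 - p_y)
    if expected == 1:
        # Every patient falls in the same category under both methods,
        # so kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)
```

For example, `cohen_kappa([2, 8, 18, 25, 40, 3], [6, 4, 16, 22, 35, 1], 5)` compares the two invented score lists at the AHI > 5 threshold. Because the chance-expected agreement depends on how many patients fall above the threshold, the same raw agreement can yield quite different κ values at different thresholds.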

Results

There were a total of 48 patient records included in the analysis. Only one patient record was excluded due to poor quality signals, and this was replaced by another randomly selected patient record. Of the patients included in the study, 92% were men (mean age, 52 years; range, 20 to 86 years). The mean body mass index was 32 kg/m² (range, 24 to 42 kg/m²).

Summary of Polysomnographic Findings

Some variables were nonnormally distributed (for example, the arousal index). Descriptive data therefore include median and range values, where appropriate. Tests of normality (Kolmogorov-Smirnov) were performed on the AHI measures derived from each of the three scoring methods. The distributions of these variables were not significantly different from normal (p = 0.20). The mean ± SD non-rapid eye movement sleep time was 234 ± 44 min (range, 120 to 331 min), and the mean rapid eye movement sleep time was 53 ± 25 min (range, 8 to 107 min). The median sleep efficiency was 74% (range, 50 to 96%). The median arousal index was 32 (range, 5 to 87).

An average AHI was calculated by averaging the score from each of the three scoring methods. The mean AHI was 36.7 ± 22 (range, 1.6 to 90). The mean (SD; range) hypopnea indexes for methods A, B, and C were 30 (21; 0 to 81), 35 (19; 4 to 83), and 24 (20; 0 to 83), respectively. The AHIs for methods A, B, and C were 37.5 (23; 0 to 91), 42 (21; 4 to 92), and 30.5 (22; 0 to 87), respectively. The differences in mean AHI scores among the three methods for scoring hypopneas were statistically significant using repeated-measures analysis of variance (p < 0.0005).

Correlation Between Different Scoring Methods

Pearson correlation coefficients were calculated for paired comparisons between the different AHIs derived from three methods for scoring hypopneas. The correlation between method A and method B was 0.96. The correlation between method A and method C was 0.97. The correlation between method B and method C was 0.93.

ICC

ICCs were calculated using the three different methods for scoring hypopneas. When the scoring method was treated as a random factor, the ICC was 0.89 (95% CI, 0.85 to 0.9). When the scoring method was treated as a fixed factor, the ICC was 0.95 (95% CI, 0.94 to 0.95). The result suggests that 89% of the variance in scores arises from true variance in the patients.

Limits of Agreement and Bland-Altman Plots

The mean ± SD difference between method A and method B was −4.55 ± 6.48 (95% CI, −6.43 to −2.67). The lower limit of agreement was −17.5 (95% CI, −20.77 to −14.25), and the upper limit of agreement was 8.41 (95% CI, 5.15 to 11.67). The mean difference between method A and method C was 6.97 ± 5.9 (95% CI, 5.26 to 8.69), the lower limit of agreement was −4.83 (95% CI, −7.8 to −1.86), and the upper limit of agreement was 18.77 (95% CI, 15.8 to 21.74). The mean difference between method B and method C was 11.53 ± 8.38 (95% CI, 9.09 to 13.96), the lower limit of agreement was −5.23 (95% CI, −9.44 to −1.01), and the upper limit of agreement was 28.29 (95% CI, 24.07 to 32.5).

The Bland-Altman plots were constructed for each of the paired comparisons. Figure 1 is a Bland-Altman plot comparing method A and method B. This shows that on average, method A produced an AHI about five events per hour lower than method B, but there is considerable scatter at any average level of AHI. Similar plots were constructed for the comparisons between method B and method C, and method A and method C (not displayed). Method B produced an AHI, on average, about 11 events per hour higher than method C, and method A produced an AHI, on average, about 7 events per hour higher than method C, but again there was considerable scatter at any average level of AHI.

Figure 1. Bland-Altman plot illustrating the agreement between AHIs derived from scoring method A and method B.

For each of the paired comparisons, the square of the difference between the scores was tested for association with the mean AHI score using regression analysis and was found not to be significant (p = 0.20). This suggests that the difference in AHI scores between scoring methods was not related to the severity of sleep apnea.
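The reported limits of agreement and their CIs can be approximately reproduced from the published mean differences and SDs alone. The sketch below is a consistency check under stated assumptions, not the authors' computation: it uses the standard Bland-Altman approximations (SE of the mean difference = SD/√n; SE of a limit ≈ √(3·SD²/n)) and a hard-coded t quantile of about 2.012 for 47 degrees of freedom. The function name and dictionary keys are hypothetical.

```python
import math

N_PATIENTS = 48
T_CRIT = 2.012  # assumed two-sided 95% t quantile for 47 degrees of freedom


def limits_from_summary(mean_diff, sd_diff, n=N_PATIENTS, t_crit=T_CRIT):
    """Rebuild limits of agreement and approximate 95% CIs from summary values."""
    lower = mean_diff - 2 * sd_diff
    upper = mean_diff + 2 * sd_diff
    half_bias = t_crit * sd_diff / math.sqrt(n)
    # Bland-Altman approximation for the standard error of each limit.
    half_limit = t_crit * math.sqrt(3 * sd_diff ** 2 / n)
    return {
        "bias_ci": (mean_diff - half_bias, mean_diff + half_bias),
        "lower": lower,
        "lower_ci": (lower - half_limit, lower + half_limit),
        "upper": upper,
        "upper_ci": (upper - half_limit, upper + half_limit),
    }


# Mean difference and SD of the differences, as reported in the Results.
REPORTED = {"A vs B": (-4.55, 6.48), "A vs C": (6.97, 5.9), "B vs C": (11.53, 8.38)}

for label, (mean_diff, sd_diff) in REPORTED.items():
    summary = limits_from_summary(mean_diff, sd_diff)
    print(label, {key: value for key, value in summary.items()})
```

Small discrepancies in the second decimal place are expected, because the published means and SDs are themselves rounded.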

κ Statistics and Disease Prevalence With Different AHI Thresholds

κ statistics for method A vs method B were 0.66 (agreement, 98%), 0.48 (agreement, 85%), and 0.63 (agreement, 88%) for AHI thresholds of > 5, > 15, and > 20, respectively. κ statistics for method A vs method C were 0.54 (agreement, 94%), 0.95 (agreement, 98%), and 0.77 (agreement, 90%) for AHI thresholds of > 5, > 15, and > 20, respectively. κ statistics for method B vs method C were 0.3 (agreement, 89%), 0.44 (agreement, 81%), and 0.44 (agreement, 77%) for AHI thresholds of > 5, > 15, and > 20, respectively. Thus, the agreement between methods varies considerably depending on the methods and threshold values applied (fair to very good agreement); however, the value of κ statistics may be influenced by the proportion of subjects in each category (prevalence), and it may be misleading to make comparisons between them.11

The clinical significance of these results may be assessed by examining the frequency table (Table 1), showing the number of patients (of 48 patients total) identified as having disease when different scoring methods and AHI thresholds are applied to the polysomnographic results. It is clear from Table 1 that method B gives a higher frequency of disease, regardless of the threshold value of AHI, compared with method A or method C, and method A gives a higher frequency of disease compared with method C. For example, in this sample of 48 subjects, when method B is used instead of method C, 11, 9, and 4 additional subjects receive a diagnosis of sleep apnea using AHI thresholds of 20, 15, and 5, respectively. The differences between the methods are less when the threshold value of AHI is lower.

Table 1. No. of Patients Classified as Having Disease by Method and AHI Threshold (of a Total of 48 Patients)

Methods    AHI > 5    AHI > 15    AHI > 20
A          46         35          35
B          47         43          41
C          43         34          30
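The comparison drawn from Table 1 is simple arithmetic on the counts, sketched below. The counts are transcribed from Table 1. The dictionary layout and function names are illustrative assumptions.

```python
# Counts transcribed from Table 1: method -> {AHI threshold: patients classified}.
TABLE_1 = {
    "A": {5: 46, 15: 35, 20: 35},
    "B": {5: 47, 15: 43, 20: 41},
    "C": {5: 43, 15: 34, 20: 30},
}
TOTAL_PATIENTS = 48


def extra_diagnoses(method_x, method_y, threshold):
    """Net additional patients classified as diseased by method_x vs method_y."""
    return TABLE_1[method_x][threshold] - TABLE_1[method_y][threshold]


def prevalence(method, threshold):
    """Proportion of the 48 patients classified as diseased."""
    return TABLE_1[method][threshold] / TOTAL_PATIENTS


for threshold in (20, 15, 5):
    print(f"AHI > {threshold}: B classifies "
          f"{extra_diagnoses('B', 'C', threshold)} more patients than C")
```

These are net differences in counts. Table 1 alone does not show which individual patients change category between methods.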

Table 2. Intrarater Reliability

Methods                       ICC
All three methods (pooled)    0.990
A                             0.995
B                             0.990
C                             0.995

Intrarater Reliability

For each of the different scoring methods, intrarater reliability was assessed by calculating an ICC for those studies that had been rescored. A pooled ICC was also calculated combining the rescored studies for each of the three different methods. The results are presented in Table 2. The results are consistent with excellent reliability.

Discussion

This study has compared three different methods of scoring hypopneas that represent the extremes of criteria set in Victoria for analyzing polysomnographic results. Overall, our findings indicate that the method used for scoring hypopneas may influence both the diagnosis of sleep apnea and the rating of disease severity. The limits of agreement presented suggest that AHIs derived from different scoring methods for hypopneas differ to a clinically relevant extent. For example, method A may give an AHI that is 19 events per hour higher or 5 events per hour lower than method C, while method B may give an AHI that is 18 events per hour higher or 8 events per hour lower than method A, or an AHI that is 28 events per hour higher or 5 events per hour lower than method C. The CIs presented for these limits of agreement suggest that, even with a more favorable interpretation of the data, the level of agreement between the methods is poor.

Agreement was further assessed by calculating an ICC for the three scoring methods. An ICC of 0.89 is relatively high; however, this indicates that there is consistent ranking of subjects by the different scoring methods but does not necessarily imply good absolute agreement. Furthermore, the adequacy of agreement is a practical matter; for some tests, it may need to be very high.12 The value of the ICC may be influenced by the selection of subjects over which it is defined. Where subjects are highly variable, the value of the ICC will tend to be high.13 Nevertheless, in the context of this study, it is useful to contrast the ICC obtained by comparing the different methods for scoring hypopneas with the intrarater reliability. The variation between methods is noted to be greater than the intrarater variability.

This study was designed specifically to examine the impact of varying hypopnea definitions on the evaluation of sleep apnea. In order to limit other sources of variation, a single, experienced sleep technologist undertook all the scoring of polysomnographic records. The scorer was blinded to previous scores, and the studies were scored in a random sequence to avoid potential bias from an order effect. This approach is somewhat artificial, and the results cannot necessarily be generalized to clinical practice. Indeed, the very high levels of intrarater reliability compared with other studies reflect the idealized conditions under which this study was conducted.14 Variability between methods or between scorers is likely to be greater in practice.

A further possible limitation of this study is that the subjects were highly selected and selection was retrospective. Our exclusion and inclusion criteria, however, were developed in advance. Subject selection was stratified according to previously determined disease severity; therefore, subjects with more severe sleep apnea are overrepresented in the sample. This method was chosen so that we could specifically examine the effect of variations in hypopnea definitions on the assessment of sleep-disordered breathing in participants with moderate sleep apnea. The results of studies15,16 examining other sources of measurement error in the assessment of sleep apnea hypopnea syndrome suggest that variability tends to be greater in the midrange of disease. In the clinical setting, variability in the classification of patients with mild-to-moderate sleep apnea has implications for assessing treatment options and, in some circumstances, eligibility for government-funded CPAP services, such as the Victorian CPAP program.
It is likely that if these scoring methods were applied to a more heterogeneous population, the level of agreement would improve.9 The methods of defining hypopneas chosen for this study were based on those currently used by laboratories in Victoria and may not be in widespread use elsewhere. Method B was adapted from published guidelines.7 These guidelines acknowledge the limitation to the evidence base for this proposed approach. Where possible, they recommend that changes in thoracoabdominal signals (respiratory inductive plethysmography) be based on the sum signal. Victorian sleep laboratories do not currently use the sum signal; therefore, we based our hypopnea scoring criteria for method B on changes in dual thoracic and abdominal signals.

Previous researchers have also examined the agreement between different approaches to determining AHIs. Tsai et al17 compared a hypopnea definition that incorporated a 4% desaturation with one that included corroborative changes in either desaturation (4%) or arousal. They found that the addition of arousal-based scoring criteria for hypopnea caused only small changes in AHI. Redline et al18 recently reported prevalence data from the Sleep Heart Health Study, in which they examined the effect of using 11 different criteria for scoring hypopneas on the prevalence of disease in a large community-based sample. Redline et al18 concluded that different approaches for measuring AHI could result in substantial variability in identifying and classifying sleep-disordered breathing. There were a large number of methods assessed in this study, and limits of agreement were not presented. To our knowledge, the present study is the first to examine the impact of altering methods of assessing respiratory signal changes on hypopnea scoring.

The findings of this study suggest that based on current practices in Victoria, the AHIs reported by different laboratories may not be comparable. The study highlights the need to standardize the methods used to evaluate and define sleep-disordered breathing in clinical practice. This is an issue that needs to be addressed at both the national and international levels. Consideration should be given to adopting the currently available guidelines for research in clinical practice.7 In the absence of such guidelines, polysomnographic reports should include a description of the measurement techniques and methods used to derive AHIs. Adjustments could then be made to the scores depending on the method used. The results of this study suggest that the agreement between the methods could be improved if a correction factor is applied to adjust for the systematic differences between the methods. Thus, when the scoring method is treated as a fixed factor, the ICC is 0.95. Although this represents an improvement, a small proportion of patients would still be misclassified even after adjusting the scores. We have examined only a few of the methods used by sleep laboratories in Victoria.
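The point about correction factors can be made concrete by computing both kinds of ICC from a patients-by-methods table of scores. The sketch below uses one common single-measure formulation based on two-way analysis of variance mean squares. It is assumed, not confirmed by the text, that this corresponds to the random-effect and fixed-effect variants the authors took from Streiner and Norman. The function name is hypothetical and the scores in the usage note are invented.

```python
def icc_two_way(scores):
    """Return (random-effects ICC, fixed-effects ICC) for a patients x methods table.

    `scores` is a list of rows, one per patient, each holding one score per method.
    """
    n = len(scores)      # patients
    k = len(scores[0])   # scoring methods
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_patients = k * sum((m - grand) ** 2 for m in row_means)
    ss_methods = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_patients - ss_methods

    ms_patients = ss_patients / (n - 1)
    ms_methods = ss_methods / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Random effects: systematic differences between methods count as error.
    icc_random = (ms_patients - ms_error) / (
        ms_patients + (k - 1) * ms_error + k * (ms_methods - ms_error) / n
    )
    # Fixed effects: systematic differences between methods are ignored.
    icc_fixed = (ms_patients - ms_error) / (ms_patients + (k - 1) * ms_error)
    return icc_random, icc_fixed
```

With invented scores in which the second method always reads 5 higher and the third always 7 lower than the first, for example `icc_two_way([[10, 15, 3], [25, 30, 18], [40, 45, 33], [55, 60, 48], [70, 75, 63]])`, the fixed-effects ICC is essentially 1 while the random-effects ICC is lower. This mirrors the argument in the text: a constant correction removes the systematic offset between methods, but not the patient-level scatter that remains in real data.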
Further research evaluating different approaches to diagnosing sleep apnea is required. The best method for validating diagnostic criteria is to determine which methods of defining respiratory events predict adverse health outcomes. Some longitudinal studies, such as the Sleep Heart Health Study,19 are currently being conducted, but given the diversity of approaches to diagnosing sleep-disordered breathing in current practice, further studies are required.

Conclusion

The absolute level of agreement between AHIs derived from some of the different scoring methods used in current clinical practice was poor. However, the different scoring methods tended to rank patients reasonably consistently; for a large proportion of patients, the scoring method may be irrelevant. When different thresholds of AHI used in practice are applied, however, the inconsistency in classification becomes more apparent. These findings have implications for case finding, population prevalence estimates, and grading of disease severity for access to government-funded CPAP services. Guidelines for standardizing the measurement and reporting of sleep studies should be adopted in clinical practice.

References

1 Manser R, Rochford P, Roebuck T, et al. Survey of Victorian sleep laboratories: equipment, staging and scoring criteria (abstract). Respirology 2000; 5(suppl):A68
2 Moser N, Phillips BA, Berry DT, et al. What is hypopnea anyway? Chest 1994; 105:426-428
3 Redline S, Sanders M. Hypopnea, a floating metric: implications for prevalence, morbidity estimates and case finding. Sleep 1997; 20:1209-1217
4 Series F, Marc I. Nasal pressure recording in the diagnosis of sleep apnoea hypopnoea syndrome. Thorax 1999; 54:506-510
5 Rechtschaffen A, Kales A. A manual of standardized terminology and scoring system for sleep stages of human subjects. Los Angeles, CA: Brain Information Service, Brain Research Institute, 1968
6 Atlas Task Force of the American Sleep Disorders Association. EEG arousals: scoring rules and examples; a preliminary report. Sleep 1992; 15:174-184
7 American Academy of Sleep Medicine Task Force. Sleep-related breathing disorders in adults: recommendations for syndrome definition and measurement techniques in clinical research. Sleep 1999; 22:667-669
8 StataCorp. Stata statistical software, release 6.0. College Station, TX: Stata Corporation, 1999
9 Streiner D, Norman GR. Reliability. In: Health measurement scales: a practical guide to their development and use. 2nd ed. New York, NY: Oxford University Press, 1995; 104-127
10 Bland J, Altman DG. Statistical methods for assessing agreement between 2 methods of clinical measurement. Lancet 1986; 1:307-310
11 Altman D. Some common problems in medical research. In: Practical statistics for medical research. 1st ed. London, UK: Chapman and Hall, 1991; 396-439
12 Morton A, Dobson AJ. Assessing agreement. Med J Aust 1989; 150:384-387
13 Armitage P, Berry G. Further experimental designs. In: Statistical methods in medical research. 3rd ed. Cambridge, UK: Blackwell Science, 1994; 237-282
14 Whitney C, Gottlieb DJ, Redline S, et al. Reliability of scoring respiratory disturbance indices and sleep staging. Sleep 1998; 21:749-757
15 Stradling J, Mitchell J. Reproducibility of home oximetry tracings. J Ambul Monit 1989; 2:203-208
16 Dakin K, Baldwin DR, Britton JR, et al. A comparison of sleep study interpretation by clinicians using different sleep screening systems (abstract). Thorax 1998; 53(suppl 4):A42
17 Tsai W, Flemons WW, Whitelaw WA, et al. A comparison of apnea-hypopnea indices derived from different definitions of hypopnea. Am J Respir Crit Care Med 1999; 159:43-48
18 Redline S, Kapur VK, Sanders MH, et al. Effects of varying approaches for identifying respiratory disturbances on sleep apnea assessment. Am J Respir Crit Care Med 2000; 161:369-374
19 Quan S, Howard BV, Iber C, et al. The Sleep Heart Health Study: design, rationale and methods. Sleep 1997; 20:1077-1085