Multicenter assessment of reliability of cranial MRI


Neurobiology of Aging 27 (2006) 1051–1059

M. Ewers a, S.J. Teipel a,∗, O. Dietrich b, S.O. Schönberg b, F. Jessen c, R. Heun c, P. Scheltens d, L. van de Pol d, N.R. Freymann c, H.-J. Moeller a, H. Hampel a

a Department of Psychiatry, Dementia and Neuroimaging Section, Alzheimer Memorial Center D2, Ludwig Maximilian University of Munich, Nussbaumstr. 7, 80336 Munich, Germany
b Institute of Clinical Radiology-Großhadern, Ludwig Maximilian University of Munich, Marchioninistr. 15, 81377 Munich, Germany
c Department of Psychiatry, Rheinische Friedrich-Wilhelms-University Bonn, Sigmund-Freud-Str. 25, 53105 Bonn, Germany
d Department of Neurology, VU Medical Center, De Boelelaan 1117, 1081 HV Amsterdam, The Netherlands

Received 24 September 2004; received in revised form 6 April 2005; accepted 23 May 2005. Available online 15 September 2005.

Abstract

Clinical utility of magnetic resonance imaging (MRI) for the diagnosis and assessment of neurodegenerative diseases may depend upon the reliability of MRI measurements, especially when applied within a multicenter context. In the present study, we assessed the reliability of MRI through a phantom test at a total of eleven clinics. Performance and entry criteria were defined liberally in order to support generalizability of the results. For manual hippocampal volumetry, automatic segmentation of brain compartments and voxel-based morphometry, multicenter variability was assessed on the basis of MRIs of a single subject scanned at ten of the eleven sites. In addition, cranial MRI scans obtained from 73 patients with Alzheimer's disease (AD) and 76 patients with mild cognitive impairment (MCI) were collected at a subset of six centers to assess differences in grey matter volume. Results show that nine out of the eleven centers tested met the reliability criteria of the phantom test, while two centers showed aberrations in spatial resolution, slice thickness and slice position. The coefficient of variation was 3.55% for hippocampus volumetry, 5.02% for grey matter, 4.87% for white matter and 4.66% for cerebrospinal fluid (CSF). The coefficient of variation was 12.81% (S.D. = 9.06) for the voxel intensities within grey matter and 8.19% (S.D. = 6.9) within white matter. Power analysis for the detection of a difference in the volumes of grey matter between AD and MCI patients across centers (d = 0.42) showed that the total sample size needed is N = 180. In conclusion, despite minimal inclusion criteria, the reliability of MRI across centers was relatively good.

© 2005 Elsevier Inc. All rights reserved.

Keywords: Multicenter; Reliability; Phantom test; Magnetic resonance imaging; Competence network dementia

1. Introduction

MRI-based assessment of cerebral atrophic changes has been proposed as a useful tool to aid in the early detection and tracking of Alzheimer's disease (AD) [22]. Numerous studies have examined the diagnostic accuracy of volumetric measurement and rating of selected brain regions such as the medial temporal lobe structures, corpus callosum, and grey matter [4,9,10,13,17,24,25,29,31,32,34]. Recently, in order to assess the clinical utility of neuroanatomical markers within a large sample across different clinical sites, several multicenter studies have been reported [12,16,20].

∗ Corresponding author. Tel.: +49 89 5160 5860; fax: +49 89 5160 5808. E-mail address: [email protected] (S.J. Teipel).


A reliable application of MRI-based diagnostic criteria is pivotal for the utility of MRI for diagnostic purposes. Variability in MRI-based measurement between clinical sites may potentially influence the accuracy of biological measures [6,7,33], and thus compromise the applicability of MRI-based diagnostic criteria across sites. The assessment of MRI reliability with regard to volumetric measures is therefore important not only for internal quality assurance within a multicenter network but also with regard to the applicability of MRI volumetry for clinical diagnostics. In the present study, we aimed to assess the center-induced variability in MRI-based volumetric measures, such as grey matter volume and hippocampal volume, that are relevant for the diagnosis of neurodegenerative disorders.


First, the accuracy and reliability of MRI was assessed via a phantom test. Second, the variability of volumetric measures, including manual volumetry of the hippocampus and segmentation of brain compartments, was tested.

2. Methods

The inclusion criteria for the clinical centers consisted of the presence of a 1.5 T scanner and the opportunity to establish at least a 3D T1-weighted (Philips scanners) or MP-RAGE (Siemens scanners) acquisition sequence. For standardization of MRI acquisition across centers, acquisition parameters were provided as a guideline to all centers. The phantom test of the American College of Radiology (ACR) MRI Accreditation Program [1] was conducted at 11 German clinical sites of the Competence Network on Dementia. The phantom consisted of a 148 mm long cylinder filled with a solution of 10 mM nickel chloride and 75 mM sodium chloride in 3.8 l of distilled water (Fig. 1). The MRI scans were acquired according to a standardized protocol, including an axial T1-weighted 3D sequence and an axial T2-weighted double echo (DE) sequence on 1.5 T scanners, where the hardware and sequence parameters varied between centers. Siemens scanners (Siemens Sonata or Siemens Magnetom Vision) were used at eight centers and Philips scanners (Philips Gyroscan and Philips Intera) at the remaining centers. For the T1 3D sequence, repetition time (TR) varied between 450 and 662 ms and echo time (TE) between 13 and 20 ms. The slice thickness was 5 mm and the slice gap 5 or 10 mm across centers. For the T2-weighted double echo sequence, TR was fixed at 2000 ms and TE at 20/80 ms for all centers, with the slice thickness being 5 mm and the slice gap 5 or 10 mm. Geometric accuracy was, in addition, assessed on the basis of a sagittal single-slice spin-echo acquisition.

Accuracy of MRI was assessed according to the following seven ACR criteria [1]; the uniformity and ghosting computations in criteria (5) and (6) are illustrated in the code sketch after Section 2.1:

(1) Geometric accuracy pertained to the measurement of the length (criterion = 148 ± 2 mm) and diameter (criterion = 190 ± 2 mm) of the phantom.
(2) High contrast spatial resolution was assessed to measure the scanner's ability to resolve small objects when the signal-to-noise ratio is high enough not to be the limiting factor (criterion: objects of size <1 mm must be discernible).
(3) Slice thickness accuracy included the measurement of the actual slice thickness, with the prescribed slice thickness being 5 mm (criterion = 5 ± 0.7 mm).
(4) Slice position accuracy tested the actual location of the slice. A shift in the slice position was determined by the difference in length between two adjacent bars, which are of equal length if the prescribed slice position is accurate. The criterion is a maximal difference in bar length of 5 mm, corresponding to an actual shift in the slice position of 5 mm.
(5) Image intensity uniformity refers to the homogeneity of the voxel intensities across a water-only region of the phantom. The image uniformity was assessed according to the following formula: percent integral uniformity (PIU) = 100 × (1 − (high − low)/(high + low)), where low and high correspond to the low and high extreme voxel intensities, which had to be present in at least 100 voxels (size: 1 mm2) within the phantom image. The criterion was that PIU must be greater than or equal to 87.5%.
(6) Signal ghosting refers to the artifact in which a faint copy (ghost) of the imaged object is visible next to the object. For the computation of the ghosting ratio, four small regions of interest (ROI = 10,000 mm2) were placed next to and above the phantom within the air region of the MRI image. A large ROI (about 20,000 mm2) was placed within a water-only region of the phantom. The percent signal ghosting ratio was computed according to: percent ghosting ratio = |((top + bottom) − (left + right))/(2 × (large ROI))|, where the variables left, right, top, bottom and large ROI correspond to the mean intensity values of the respective ROIs. The criterion was a ghosting ratio ≤0.025.
(7) Low contrast object detectability refers to the detectability of a maximum of 40 objects displayed at varying size and contrast values. The criterion for the minimum number of discernible objects was nine.

2.1. In vivo MRI

A single healthy male volunteer (31 years old, MMSE = 30) was scanned at a subset of 10 centers. MRI scans were conducted with a sagittal magnetization-prepared rapid gradient echo (MP-RAGE) sequence on Siemens scanners (n = 8) and a 3D fast T1-weighted gradient echo sequence on Philips scanners (n = 2). Among all centers, TR varied between 9.3 and 20 ms and TE between 3.93 and 4.38 ms.
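For illustration only, the uniformity (criterion 5) and ghosting (criterion 6) computations can be written directly from the ROI statistics defined above. The following Python sketch is not part of the original analysis; the ROI values passed in at the bottom are hypothetical numbers standing in for measurements read from a phantom image.

```python
# Illustrative sketch (not the authors' code): PIU and percent ghosting ratio
# as defined in criteria (5) and (6) of the ACR phantom test described above.

def percent_integral_uniformity(high: float, low: float) -> float:
    """PIU = 100 * (1 - (high - low) / (high + low)); criterion: PIU >= 87.5%."""
    return 100.0 * (1.0 - (high - low) / (high + low))

def percent_ghosting_ratio(top: float, bottom: float, left: float, right: float,
                           large_roi: float) -> float:
    """|((top + bottom) - (left + right)) / (2 * large ROI)|; criterion: <= 0.025."""
    return abs(((top + bottom) - (left + right)) / (2.0 * large_roi))

# Hypothetical ROI statistics (mean intensities) taken from a phantom image:
piu = percent_integral_uniformity(high=1520.0, low=1395.0)
ghost = percent_ghosting_ratio(top=2.1, bottom=1.9, left=2.0, right=1.8,
                               large_roi=1450.0)
print(f"PIU = {piu:.1f}%  (pass: {piu >= 87.5})")
print(f"Ghosting ratio = {ghost:.4f}  (pass: {ghost <= 0.025})")
```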

Fig. 1. The phantom developed by the American College of Radiology is shown. For scanning, the phantom is placed inside the scanner in the position that a patient's head would occupy (as indicated by the nose and chin marks visible on the phantom). Axial slices produce images perpendicular to the long axis of the phantom. Inside the phantom are structures for the assessment of the different criteria.

2.2. Hippocampus volumetry


One scan was manually reformatted (using GE Radworks) coronally in a plane perpendicular to the long axis of the (left) hippocampus (slice thickness 2 mm). All scans were then registered using AIR [35], and sinc interpolation was applied for reslicing. The software package Show Images 3.7.0 (developed at VU Medical Center) was used for manual segmentation of the hippocampus. The rater uses the mouse to trace the boundaries of the region of interest. The hippocampus was measured according to criteria published by Jack et al. [14]. In short, the slice on which the hippocampal formation first becomes visible ventral to the amygdala was the most anterior slice measured. The ventral border is formed by the white matter of the parahippocampal gyrus. The dorsal border is formed by the amygdala in the anterior slices; more posteriorly, cerebrospinal fluid and the choroid plexus form the dorsal border. The slice in which the crus of the fornix is visible over its total length was the most posterior slice measured. The dentate gyrus, cornu ammonis, subiculum, fimbria and alveus were measured (referred to as hippocampus). One rater performed all segmentations. To assess intra-rater reliability, each hippocampus was measured a second time on a separate occasion. The hippocampal volume (in mm3) was computed by multiplying the surface of each region of interest by the slice thickness. All measurements were performed on the right hippocampus only; hence, all results in this paper pertain to the right hippocampus.

2.3. Reliability of voxel-based analysis

The in vivo MRI scans of the single subject scanned at 10 out of 11 centers were spatially coregistered, using an affine transformation with 12 parameters (SPM2, Wellcome Department, London, UK). The brain volume was isolated from the skull, using the program Brain Extraction Tool (BET) included in the Oxford Centre for Functional Magnetic Resonance Imaging of the Brain (FMRIB) Software Library 3.1 [30]. Subsequently, the brain was segmented into grey matter, white matter and cerebrospinal fluid (CSF), using FMRIB's Automated Segmentation Tool (FAST). For the segmentation, we chose not to apply a priori probabilities for the voxel classification, in order to increase the sensitivity to potential variation in the volumes of the brain compartments between centers, and thus to achieve a conservative test of multicenter reliability. For the quantification of the variability in the voxel intensities between centers, voxel intensity values were first normalized to the mean voxel intensity per image and scaled by a factor of 100. Subsequently, coefficients of variation across all centers were computed for each voxel of the coregistered and skull-stripped brains. For the visualisation of the spatial distribution of the voxel-based coefficients of variation, the coefficients of variation were mapped onto a brain with a customized program written in Matlab 5.3 (The Mathworks Inc., 1999). The resulting brain map of coefficients of variation was subsequently used for manual tracing of the hippocampus within one slice that allowed for an easy identification of the borders of the hippocampus.


The mean value of the coefficients of variation within the outlined region of the hippocampus was computed as an estimate of the variability of the voxel intensities within the hippocampus. For an estimate of the variability of voxel intensities within the grey and white matter regions, the brain map of coefficients of variation was multiplied with a white and, respectively, grey matter mask, using SPM2. In a final step, the mean coefficients of variation within the white and grey matter regions were computed, using Matlab 5.3. Extreme values (>60%) that resulted from effects of misregistration, and that could be visually located within a discrete band along the border of the brain, were excluded from the analysis.

2.3.1. Statistics

For the phantom test, a bias in the measurements across centers was determined using a two-sided one-sample t-test (α = 0.05) to test for a directional deviation of the observed measurement from the true measurement of the phantom. In addition, the 95% confidence intervals of the deviations of the observed measurements were computed. For the measurement of the image intensity homogeneity and the ghosting artefact, the 95% confidence intervals of the deviation between the observed measurement and the criterion value (instead of the "true" value) were employed, since no actual "true" measurement of the phantom was known for these two variables. The coefficient of variation (V) was computed by dividing the standard deviation of the measurements (S.D.) by the mean value of the measurements across all centers (X), multiplied by 100:

V = (S.D. / X) × 100

The coefficient of variation was computed for the hippocampus volumetry, for the segmentation of each brain compartment, and for the intensities of each voxel of the in vivo MRI brain scans (an illustrative sketch of this computation follows Section 2.4).

2.4. Grey matter volume in AD and MCI patients

A total of 73 AD and 76 amnestic MCI patients were available from six of the 11 centers, where each center contributed at least n = 10 per group (one center had n = 9, Table 2). AD patients were diagnosed according to the NINCDS-ADRDA criteria [18] and the MCI patients according to the Petersen criteria [26]. For the AD patients (39 of female gender), the mean MMSE score was 24.2 (S.D. = 3.1). For the MCI patients (37 of female gender), the mean MMSE score was 26.7 (S.D. = 2.5). The difference in MMSE between groups was statistically significant (p < 0.0001). MRI scans were preprocessed by setting the origin of the image to the anterior commissure. Subsequently, the MRI scans positioned in native space were segmented, using the SPM2 standard T1 template and the grey matter, white matter and CSF priors. This segmentation step involves an affine transformation of each scan to the template and subsequent retransformation into native space.
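The voxel-wise variability computation described in Sections 2.3 and 2.3.1 can be sketched as follows. This is an illustrative Python reimplementation, not the original Matlab/SPM2 pipeline; the file names are hypothetical, nibabel is assumed as the image reader, and the per-image normalization over brain voxels is an interpretation of "normalized to the mean voxel intensity per image".

```python
# Illustrative sketch (assumptions: coregistered, skull-stripped scans of the same
# subject from each of the 10 centers, stored as NIfTI files; numpy and nibabel).
import numpy as np
import nibabel as nib  # assumed I/O library, not named in the paper

paths = [f"center_{i:02d}_brain.nii.gz" for i in range(1, 11)]  # hypothetical names
vols = []
for p in paths:
    data = nib.load(p).get_fdata()
    brain_mean = data[data > 0].mean()          # mean intensity over brain voxels
    vols.append(data / brain_mean * 100.0)      # normalize and scale by 100

stack = np.stack(vols)                          # shape: (n_centers, x, y, z)
mean_map = stack.mean(axis=0)
cov_map = np.where(mean_map > 0,
                   stack.std(axis=0, ddof=1) / np.where(mean_map > 0, mean_map, 1.0) * 100.0,
                   0.0)                         # coefficient of variation in percent

# Exclude extreme values (>60%) attributed to misregistration along the brain
# border, then summarize within a tissue mask (e.g. a grey matter mask).
valid = (cov_map > 0) & (cov_map <= 60.0)
gm_mask = nib.load("grey_matter_mask.nii.gz").get_fdata() > 0.5  # hypothetical mask
print("mean CoV within grey matter: %.2f%%" % cov_map[valid & gm_mask].mean())
```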


The effect size d [3] of the difference in the volume of grey matter between AD and MCI patients was estimated according to the following formula:

d = (X − Y) / δxy

where X − Y is the difference in the mean volume of grey matter between AD and, respectively, MCI patients, and δxy is the pooled standard deviation of the volume of grey matter obtained from AD and MCI patients. The two-tailed independent samples t-test of the group comparison and the corresponding effect size were determined for each center separately (henceforth referred to as "monocenter effect size"). In addition, the effect size of the group difference on the basis of the grey matter volumes pooled across centers was estimated (henceforth referred to as "pooled effect size"). In order to compare the monocenter effect sizes to a "multicenter effect size", n = 10 data points were randomly sampled each from the MCI group (N = 76) and the AD group (N = 73) pooled among centers, and the effect size of the group difference was calculated. The random sampling of grey matter volumes from each group and the calculation of the respective effect size was iterated 1000 times. The mean effect size of the resulting sampling distribution along with the 95% confidence interval (95% CI) was calculated, using a customized program written in Matlab 5.6. We chose to sample n = 10 in each sampling trial applied to the pooled data since this sample size approximates the one collected at each single center, thus allowing for a comparison between the monocenter and multicenter effect sizes. A one-sample t-test was used to test whether the multicenter effect size significantly deviated from the value 0. The sample size required to detect the pooled effect size with a statistical power of 0.80 [3], at α = 0.05 for a two-tailed t-test, was estimated using the computer program GPOWER [5]. The intraclass correlation coefficient adjusted for the mean was computed with SPSS 11.5.1, using a two-way random model.
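The multicenter effect-size resampling described above can be illustrated with the following Python sketch. It is not the authors' Matlab program; `ad_gm` and `mci_gm` stand for the pooled grey matter volumes of the 73 AD and 76 MCI patients and are only placeholders here (drawn to match the pooled means and S.D.s of Table 2), and the percentile-based 95% CI is an assumption, since the paper does not state how the CI was derived.

```python
# Illustrative sketch of the multicenter effect-size resampling (assumption:
# ad_gm and mci_gm are 1-D arrays of pooled grey matter volumes in ml).
import numpy as np

def cohens_d(x, y):
    """Effect size d = (mean(x) - mean(y)) / pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                        / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

def multicenter_effect_sizes(ad_gm, mci_gm, n=10, iterations=1000, seed=0):
    """Draw n AD and n MCI volumes from the pooled samples (without replacement
    within each draw), compute d per draw, and summarize the distribution."""
    rng = np.random.default_rng(seed)
    ds = np.array([
        cohens_d(rng.choice(ad_gm, size=n, replace=False),
                 rng.choice(mci_gm, size=n, replace=False))
        for _ in range(iterations)
    ])
    ci = np.percentile(ds, [2.5, 97.5])
    return ds.mean(), ci

# Usage with placeholder data (replace with the actual pooled volumes):
ad_gm = np.random.default_rng(1).normal(601.3, 84.6, size=73)
mci_gm = np.random.default_rng(2).normal(636.2, 80.9, size=76)
mean_d, ci = multicenter_effect_sizes(ad_gm, mci_gm)
print(f"mean multicenter d = {mean_d:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```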

3. Results

3.1. Phantom test

Descriptive statistics are presented in Table 1. Two out of the eleven centers did not meet all criteria. For these two centers, the criterion of slice thickness was exceeded by more than 2.69 S.D. At one of these two centers, in addition, the slice position exceeded the criterion (0.7 mm) by more than 2.53 S.D. and the low contrast object detectability by 0.95 S.D. (criterion: at least nine objects). The criterion of high contrast resolution was not met at the other center (Table 1). Across centers, the 95% confidence intervals of deviations in the measurement of the length and diameter (Fig. 2) and in the low contrast object detectability (Fig. 3) were within the limits of the ACR criteria. In contrast, the upper boundary of the confidence intervals of deviations in the slice position and thickness exceeded the limits of the ACR criteria (Fig. 2). After exclusion of those two centers that did not meet the criteria, the upper boundary of the confidence interval of measurement deviations was confined within the limits of the criterion of slice position (0.65 mm for T1 and DE, criterion = ±0.7 mm) but still slightly exceeded the upper limit of the criterion of slice thickness (5.58 mm for T1, 5.37 mm for DE, criterion = 5 mm). The confidence intervals of the absolute measurements of the ghosting artifact and intensity uniformity (Fig. 4) fell within the limits of the ACR criteria.

Fig. 2. Mean values (filled symbols) and 95% confidence intervals (bars) of the deviations in the measurement of the length, diameter, slice thickness and position for the T1 and the DE sequence. For all variables, the lower and upper cut-off values of the criteria are indicated (blank circles); for the slice position, only the upper limit is defined.

Fig. 3. Mean value of the measurement deviation in the low contrast object detectability (filled circles) and corresponding 95% confidence interval (bars) for the T1 and DE sequences. The cut-off value of the criterion indicates the maximally acceptable deviation (31 objects) from the total number of detectable objects (40), i.e. at least nine objects need to be detectable in order to meet the criterion.


Table 1
Criteria and descriptive statistics for each phantom measurement

Dimension | Sequence | Criterion | Arithmetic mean (S.D.)
Length (in mm) | T1 | 148 ± 2 | 147.08 (0.78)
Diameter (in mm) | T1 | 190 ± 2 | 189.8 (1.14)
Spatial resolution (in mm) | T1 | ≤1 | HD: 0.95 (0.05)a; VD: 0.95 (0.05)b
 | DE | ≤1 | HD: 0.96 (0.05)a; VD: 0.99 (0.07)b
Slice thickness (in mm) | T1 | 5 ± 0.7 | 5.86 (1.3)
 | DE | 5 ± 0.7 | 5.92 (1.29)
Slice position (in mm) | T1 | ≤5 | 3.07 (4.39)
 | DE | ≤5 | 3.16 (4.52)
Intensity uniformity index (in %) | T1 | ≥87.5 | 94.08 (1.9)
 | DE | ≥87.5 | 93.01 (2.05)
Ghosting (in %) | T1 | ≤0.025 | 0.003 (0.002)
Low contrast object (number) | T1 | ≥9 | 27 (10.63)
 | DE | ≥9 | 20.55 (9.43)

a HD: objects to be discerned were arrayed in the horizontal direction.
b VD: objects to be discerned were arrayed in the vertical direction.

Across all centers, a bias was observed in the geometric measurement, with the length (p = 0.003) but not the diameter (p = 0.57) being underestimated. The slice thickness was significantly higher than the prescribed slice thickness (p = 0.04 for T1 and p = 0.05 for DE). No bias was determined for the other phantom measurements, since by definition only a unidirectional deviation in these measurements was possible (e.g. the spatial resolution could only be too low but not too high).

Fig. 4. Mean value and 95% confidence interval (bars) of the index of homogeneity of intensities for the T1 and DE sequence (filled circles). The criterion cut-off value indicates the minimal degree of intensity homogeneity required.

3.2. Hippocampus volumetry

The volume of the right hippocampus was measured twice by the same rater. The mean hippocampus volume was 3362.05 mm3 (S.D. = 138.94) for the first measurement and 3273.14 mm3 (S.D. = 102.99) for the second measurement. The coefficient of variation was 4.13% and 3.15% for the first and second measurement, respectively. For the mean hippocampal volume averaged across both time points of measurement, the coefficient of variation was 3.55%. The intraclass correlation coefficient was 0.86, indicating a high reliability of the manual hippocampus measurement across centers.

3.3. Segmentation of brain compartments and voxel-based morphometry

The mean volume was 758.94 cm3 (S.D. = 38.07) for cortical grey matter, 599.93 cm3 (S.D. = 29.20) for white matter and 248.99 cm3 (S.D. = 11.61) for CSF (Fig. 5). The coefficient of variation was 5.02% for grey matter, 4.87% for white matter, and 4.66% for CSF. For the voxel-based morphometric analysis, the mean coefficient of variation averaged across voxels was 12.81% (S.D. = 9.06) within the grey matter, 8.19% (S.D. = 6.9) within the white matter, and 7.97% (S.D. = 3.28) within the hippocampus (Fig. 6).

3.4. Multicenter assessment of difference in grey matter volume between MCI and AD patients

Mean volumes of grey matter in AD and MCI patients and the monocenter effect sizes of the group differences for each center are displayed in Table 2. The mean of the sampling distribution of the multicenter effect sizes was d = −0.41 (p < 0.0001, Fig. 7; a negative effect size indicates larger grey matter volume in MCI patients). All monocenter effect sizes, except for those of two centers (d = −1.31 and d = 0.27), were within the 95% CI of the multicenter sampling distribution (95% CI = −1.03 to 0.25). The two centers whose effect sizes exceeded the 95% CI did not correspond to the centers that deviated from the reliability criteria of the phantom test reported above.


Table 2
Comparison of grey matter volume (ml) between AD and MCI patients at each center

Center | Sample size AD | Sample size MCI | Mean volume of grey matter, AD (S.D.) | Mean volume of grey matter, MCI (S.D.) | Effect size d | p-Value
I | 13 | 8 | 619.38 (77.29) | 635.11 (90.37) | −0.20 | 0.68
II | 9 | 11 | 629.4 (64.8) | 672.57 (66.5) | −0.69 | 0.16
III | 11 | 17 | 611.71 (58.57) | 642.49 (62.31) | −0.51 | 0.2
IV | 11 | 11 | 578.48 (67.0) | 680.29 (93.53) | −1.31 | 0.008
V | 10 | 9 | 645.30 (82.38) | 624.22 (77.15) | +0.27 | 0.57
VI | 19 | 20 | 559.75 (105.35) | 592.51 (78.0) | −0.36 | 0.28
Pooled | 73 | 76 | 601.33 (84.63) | 636.22 (80.85) | −0.42 | 0.01

Fig. 5. Box-plot of the volumes of the segmented brain compartments including grey matter (GM), white matter (WM), and cerebrospinal fluid (CSF). Outliers (blank circles) are defined as deviating by 1.5–3 box-lengths (interquartile range) from the upper or lower edge of the box.

Fig. 7. Sampling distribution of multicenter effect sizes of the group difference in grey matter volume between AD and MCI patients. Samples of n = 10 were randomly drawn from a total of 73 AD and 76 MCI patients each, where sampling without replacement was iterated 1000 times. The blue line indicates the mean of the sampling distribution. For purposes of comparison between multicenter and monocenter effect sizes, the effect sizes of each monocenter analysis (red stars) were projected onto the plot (note that the height of the stars was chosen arbitrarily and does not indicate frequency).

A t-test confirmed that the monocenter effect sizes did not differ in either direction from the mean multicenter effect size (p = 0.8), suggesting that the monocenter effect sizes were not superior to the multicenter effect sizes. When pooling the data of patients with MCI (N = 76) and AD (N = 73) across centers, the pooled effect size was d = −0.42 (p = 0.01). The total sample size needed to detect the pooled effect size with sufficient statistical power was N = 180.
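As a plausibility check, not part of the original analysis, the pooled effect size and the required sample size can be reproduced from the values in Table 2. The sketch below uses statsmodels' power routines in place of GPOWER, which is an assumption about tooling rather than the authors' method; with the Table 2 values it yields d ≈ −0.42 and about 90 patients per group, consistent with the N = 180 reported above.

```python
# Sketch: reproduce the pooled effect size from Table 2 and the sample size
# needed for 80% power at alpha = 0.05 (two-tailed independent-samples t-test).
import math
from statsmodels.stats.power import TTestIndPower  # substitute for GPOWER

n_ad, n_mci = 73, 76
mean_ad, sd_ad = 601.33, 84.63      # pooled AD grey matter volume (ml), Table 2
mean_mci, sd_mci = 636.22, 80.85    # pooled MCI grey matter volume (ml), Table 2

pooled_sd = math.sqrt(((n_ad - 1) * sd_ad**2 + (n_mci - 1) * sd_mci**2)
                      / (n_ad + n_mci - 2))
d = (mean_ad - mean_mci) / pooled_sd
print(f"pooled effect size d = {d:.2f}")          # approx. -0.42

n_per_group = TTestIndPower().solve_power(effect_size=abs(d), alpha=0.05,
                                          power=0.80, alternative='two-sided')
print(f"required sample size: {math.ceil(n_per_group)} per group "
      f"({2 * math.ceil(n_per_group)} in total)")  # approx. 90 per group, N = 180
```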

Fig. 6. Colour map of the coefficient of variation displayed within a sagittal slice of the brain. The colour bar indicates the correspondence between the different colours and the coefficients of variation (in percent).

4. Discussion

The aim of the current study was to investigate the reliability of MRI and variability of MRI-based volumetric measures across multiple sites.


The major results of the phantom test showed that two out of 11 centers did not meet the reliability criteria for slice position and thickness, low contrast object detectability and high contrast spatial resolution. Thus, the majority of centers showed accurate measurements across all dimensions assessed. In order to estimate the variability of the MRI measurements at a follow-up examination, we computed the 95% CIs of the measurement deviations. The 95% CIs exceeded the limits of the criteria on slice position and thickness, primarily due to those two centers that exceeded the criteria for these two measures, suggesting that MRI measurements at a follow-up assessment are likely to be confined within the criterion limits at most centers. However, one has to caution that this conclusion rests on the assumption that the current measurements are representative of subsequent measurements, which is valid only to a limited extent, since fluctuations in measurement accuracy over time have been reported in previous studies [19].

A bias was observed in the measurement of the length and the slice thickness. However, for the measurement of length, the underestimation, although statistically significant, was rather small, since the deviation across all centers did not exceed the limits of the ACR criteria. For the bias in the slice thickness, the effect was due to the extreme values of the two outlier centers, since no bias could be detected after removing those two centers from the analysis, suggesting that the elevation of slice thickness above the limit of the criterion was not representative of the whole group.

One potential explanation of the variance in measurement accuracy between centers may be differences in the routine quality assurance of MRI implemented at the centers. However, the routine quality checks of MRI provided by the manufacturers (Siemens and Philips) did not seem to vary between the centers. Quality assurance used at the clinics includes a manufacturer-specific phantom test and the calibration of the scanner twice a year. Measures beyond the quality check provided by the manufacturer were not reported by the clinics. Thus, it is unlikely that any difference in the routine quality assurance between clinics accounts for the multicenter variability in measurement accuracy. For those two centers that did not meet the phantom test criteria used in the present study, it is possible that transient sources of variability may have led to measurement inaccuracies. This can only be evaluated within a longitudinal assessment of reliability within centers, which is currently under way at all centers of the network.

Although a number of multicenter MRI studies [19,27], including volumetric studies in patients with AD [12,16,20,21], have been reported, only few studies have assessed multicenter reliability [15,19,27]. The comparison with previous studies using an MRI phantom test shows that the deviation in the measurement of length (about 0.92 mm) observed in the current study is relatively low.


Prott et al. [27] found, in a European multicenter study, a deviation in the measurement of length of about 1.8 mm at 23 out of 27 scanners tested. McRobbie and Quest [19] assessed the reliability of 17 MRI scanners over an observation period of 8 years, finding a mean deviation in the measurement of length of 1.39 mm. In their study, deviations in the measurement of length were one of the most frequent problems, in addition to deviations in intensity homogeneity and ghosting. Jack et al. [16] do not report statistics on phantom measurements, but mention that geometric distortion and signal-to-noise ratio were the most frequent problems in their longitudinal MRI-based assessment of AD patients. In the current study, we did not observe any failures on these measures at any center tested; however, we did find failures to meet the criteria regarding slice position, slice thickness and spatial resolution, which were observed at only a relatively low frequency in the study of McRobbie and Quest. However, a direct comparison between the studies is not possible due to differences in the methods and study design.

Another aim of the study was to assess to what extent the reliability of the MRI measurement determined in the phantom test can be generalized to manual and fully computerized measures of structural brain changes. Results of the automatic segmentation of the brain compartments showed a variability across centers that is in agreement with other studies on the quantification of structural brain deficits, such as the semi-automatic assessment of lesion volume in patients with multiple sclerosis [8]. Both the manual measurement of the hippocampus and the automatic volumetry of grey matter had a coefficient of variation below 5%. More recently, voxel-based morphometry has been applied for the assessment of changes in grey matter concentration in patients with AD and MCI in monocenter studies [2,11,28]. To our knowledge, our study presents the first data on the multicenter variability of voxel-based morphometry. The coefficients of variation of the voxel intensities measured across centers were higher than those of the segmented brain compartments. This increase in variability can at least partially be attributed to misregistration effects, which lead to a mismatch of brain voxel positions between centers. Moreover, misregistration of the brains may lead to an overlap of voxel positions at the border of the brain with voxel intensities of the background, resulting in the extreme values of coefficients of variation that were observed along the boundary of the brain. Note that our protocol for preprocessing of the images did not include any correction for intensity non-uniformity or any smoothing procedures, in order to maximize the sensitivity to variability in voxel intensities between centers. Thus, the current results can be interpreted as a liberal estimate of the multicenter variability of voxel intensities. Still, the estimate of the variability of the voxel intensities within the hippocampus was moderate and is comparable to the variability of other biological measurements, e.g. APP in cerebrospinal fluid [23].


Based on the observed multicenter variability of volumetric measures, it would be reasonable to expect a reduced effect size for multicenter data when compared to monocenter data. However, the present results on the effect size of the differences in grey matter between MCI and AD patients show that the multicenter effect size did not seem to differ systematically from the monocenter effect sizes, despite the fact that the two centers that did not meet the criteria of the ACR phantom test were included. In contrast, the results on the multicenter reliability of the measurement of grey matter volume based on the data obtained from the single subject scanned at all centers demonstrated variability between centers. A possible explanation could be that the estimate of the variability between centers based on the single-subject observation may partially reflect variability that may also occur within centers, e.g. positioning effects in the scanner or different radiologists operating the system. Thus, the estimate of the multicenter variability may have been inflated. On the other hand, some caveats need to be considered for the interpretation of the comparison of monocenter and multicenter effect sizes based upon the large sample of MCI and AD patients as well. Note that the estimation of the monocenter effect sizes was based on a rather small sample size per center. Although the effect of diagnosis based upon the pooled multicenter data was statistically significant, this was true for only one of the within-center comparisons of group differences (center IV, Table 2). Thus, it is possible that the rather large variability within centers due to the low sample size may have occluded variability between centers. The multicenter effect may have become detectable if larger sample sizes per center had been included.

It should be noted that acquisition parameters and scanner manufacturer did vary between centers to some extent. The competence network pursued a standardization of MRI acquisition by providing guidelines on the acquisition sequences to be used by each center. However, a liberal policy was adopted that did not strictly enforce uniformity of acquisition parameters and scanner hardware across centers, in order to allow for a reasonable degree of variability in sequences. A rigid standardization of MRI acquisition within the network would compromise the conclusiveness regarding the reliability of diagnostic criteria at clinical sites outside the network, where acquisition parameters and hardware manufacturers are likely to vary. Thus, our strategy was to define liberal standards of acquisition and to assess the variability in measurement accuracy by the phantom test and by measures of interest, i.e. the volumes of the hippocampus and grey matter that are relevant for the detection of AD. Our current results suggest that although measurement accuracy varies between centers, the extent appears to be within limits that allow for a reasonable multicenter assessment of volumetric differences between groups.

From a practical point of view, the current results are encouraging, since our analysis shows that a multicenter data analysis is sensitive to the effects of AD on grey matter volume when compared to MCI patients. Sample size estimation showed that a total of 180 patients is needed in order to reliably detect the expected effect size, which appears to be feasible for multicenter projects. On the basis of these results, we suggest that the ACR reliability criteria may be used in order to assess the extent of measurement accuracy across sites.

However, the meaning of a violation of the criteria needs to be determined relative to the specific clinical measures to be studied. In the present study, violation of the criteria did not seem to lead to detectable deviations in the measures of interest, i.e. manual tracing of the hippocampus as well as automatic analysis of grey matter volume. Although no direct test of the influence of measurement accuracy on volumetric measures was possible, the results of the in vivo MRI of a single subject scanned at 10 out of 11 sites do not suggest that failure of the phantom test necessarily leads to abnormal measurement of the volume of the hippocampus or grey matter. Moreover, the multicenter comparison between MCI and AD patients showed that the multicenter analysis produced effect sizes of the group difference in grey matter volume similar to those of the monocenter analyses. This was true even though the two centers that failed the phantom test were included in the analysis, suggesting that the error variance was not increased within a multicenter analysis. Together, the current results suggest that the measurement dimensions in which two centers showed significant deviations on the phantom test may not exert a significant effect on volumetric measures. Future studies may systematically explore the measurement dimensions that are of importance for volumetric measures.

Acknowledgement

The study was funded by a grant from the Bundesministerium für Bildung und Forschung (BMBF 01 GI 0102) awarded to the dementia network "Kompetenznetz Demenzen". We thank Dr. Jens Heidenreich (Charité Berlin), Dr. Christian Luckhaus (Rheinische Kliniken Düsseldorf), Drs. Thomas Kucinski and Thomas Müller-Thomsen (Universitätsklinikum Hamburg-Eppendorf), Dr. Marco Essig (University of Heidelberg) and Dr. Anke Hensel (University of Leipzig), and the centers of Bonn, Erlangen, Frankfurt, Freiburg and Mannheim for contributing data and helpful comments on the manuscript.

References

[1] ACR. Phantom test guidance for the ACR MRI accreditation program. Reston, VA: American College of Radiology; 2000.
[2] Baron JC, Chetelat G, Desgranges B, Perchey G, Landeau B, de la Sayette V, et al. In vivo mapping of gray matter loss with voxel-based morphometry in mild Alzheimer's disease. Neuroimage 2001;14(2):298–309.
[3] Cohen J. Statistical power analysis for the behavioral sciences. Rev. ed. New York: Academic Press; 1977.
[4] Du AT, Schuff N, Amend D, Laakso MP, Hsu YY, Jagust WJ, et al. Magnetic resonance imaging of the entorhinal cortex and hippocampus in mild cognitive impairment and Alzheimer's disease. J Neurol Neurosurg Psychiatry 2001;71(4):441–7.
[5] Faul F, Erdfelder E. GPOWER: a priori, post-hoc and compromise power analyses for MS-DOS [computer program]. Bonn; 1992.
[6] Filippi M, Horsfield MA, Campi A, Mammi S, Pereira C, Comi G. Resolution-dependent estimates of lesion volumes in magnetic resonance imaging studies of the brain in multiple sclerosis. Ann Neurol 1995;38(5):749–54.
[7] Filippi M, Yousry T, Baratti C, Horsfield MA, Mammi S, Becker C, et al. Quantitative assessment of MRI lesion load in multiple sclerosis. A comparison of conventional spin-echo with fast fluid-attenuated inversion recovery. Brain 1996;119:1349–55.
[8] Filippi M, van Waesberghe J, Horsfield M, Bressi S, Gasperini C, Yousry T, et al. Interscanner variation in brain MRI lesion load measurements in MS: implications for clinical trials. Neurology 1997;49(2):371–7.
[9] Frisoni GB, Testa C, Zorzan A, Sabattoli F, Beltramello A, Soininen H, et al. Detection of grey matter loss in mild Alzheimer's disease with voxel based morphometry. J Neurol Neurosurg Psychiatry 2002;73(6):657–64.
[10] Frisoni GB, Scheltens P, Galluzzi S, Nobili FM, Fox NC, Robert PH, et al. Neuroimaging tools to rate regional atrophy, subcortical cerebrovascular disease, and regional cerebral blood flow and metabolism: consensus paper of the EADC. J Neurol Neurosurg Psychiatry 2003;74(10):1371–81.
[11] Grossman M, McMillan C, Moore P, Ding L, Glosser G, Work M, et al. What's in a name: voxel-based morphometric analyses of MRI and naming difficulty in Alzheimer's disease, frontotemporal dementia and corticobasal degeneration. Brain 2004;127(Pt 3):628–49.
[12] Grundman M, Petersen RC, Ferris SH, Thomas RG, Aisen PS, Bennett DA, et al. Mild cognitive impairment can be distinguished from Alzheimer disease and normal aging for clinical trials. Arch Neurol 2004;61(1):59–66.
[13] Hensel A, Ibach B, Muller U, Kruggel F, Kiefer M, Gertz HJ. Does the pattern of atrophy of the corpus callosum differ between patients with frontotemporal dementia and patients with Alzheimer's disease? Dement Geriatr Cogn Disord 2004;18(1):44–9.
[14] Jack Jr CR. MRI-based hippocampal volume measurements in epilepsy. Epilepsia 1994;35(Suppl 6):S21–9.
[15] Jack Jr CR, Petersen RC, Xu YC, O'Brien PC, Smith GE, Ivnik RJ, et al. Prediction of AD with MRI-based hippocampal volume in mild cognitive impairment. Neurology 1999;52(7):1397–403.
[16] Jack Jr CR, Slomkowski M, Gracon S, Hoover TM, Felmlee JP, Stewart K, et al. MRI as a biomarker of disease progression in a therapeutic trial of milameline for AD. Neurology 2003;60(2):253–60.
[17] Killiany RJ, Gomez-Isla T, Moss M, Kikinis R, Sandor T, Jolesz F, et al. Use of structural magnetic resonance imaging to predict who will get Alzheimer's disease. Ann Neurol 2000;47(4):430–9.
[18] McKhann G, Drachman D, Folstein M, Katzman R, Price D, Stadlan EM. Clinical diagnosis of Alzheimer's disease: report of the NINCDS-ADRDA work group under the auspices of Department of Health and Human Services Task Force on Alzheimer's disease. Neurology 1984;34(7):939–44.
[19] McRobbie DW, Quest RA. Effectiveness and relevance of MR acceptance testing: results of an 8-year audit. Brit J Radiol 2002;75(894):523–31.
[20] Mungas D, Reed BR, Jagust WJ, DeCarli C, Mack WJ, Kramer JH, et al. Volumetric MRI predicts rate of cognitive decline related to AD and cerebrovascular disease. Neurology 2002;59(6):867–73.
[21] Mungas D, Jagust WJ, Reed BR, Kramer JH, Weiner MW, Schuff N, et al. MRI predictors of cognition in subcortical ischemic vascular disease and Alzheimer's disease. Neurology 2001;57(12):2229–35.
[22] Nestor PJ, Scheltens P, Hodges JR. Advances in the early detection of Alzheimer's disease. Nat Med 2004;10(Suppl):S34–41.
[23] Olsson A, Hoglund K, Sjogren M, Andreasen N, Minthon L, Lannfelt L, et al. Measurement of [alpha]- and [beta]-secretase cleaved amyloid precursor protein in cerebrospinal fluid from Alzheimer patients. Exp Neurol 2003;183(1):74–80.
[24] Pantel J, Kratz B, Essig M, Schroder J. Parahippocampal volume deficits in subjects with aging-associated cognitive decline. Am J Psychiatry 2003;160(2):379–82.
[25] Pantel J, Schroder J, Essig M, Jauss M, Schneider G, Eysenbach K, et al. In vivo quantification of brain volumes in subcortical vascular dementia and Alzheimer's disease. An MRI-based study. Dement Geriatr Cogn Disord 1998;9(6):309–16.
[26] Petersen RC, Smith GE, Waring SC, Ivnik RJ, Tangalos EG, Kokmen E. Mild cognitive impairment: clinical characterization and outcome. Arch Neurol 1999;56(3):303–8.
[27] Prott FJ, Haverkamp U, Willich N, Resch A, Stober U, Potter R. Comparison of imaging accuracy at different MRI units based on phantom measurements. Radiother Oncol 1995;37(3):221–4.
[28] Rombouts SA, Barkhof F, Witter MP, Scheltens P. Unbiased whole-brain analysis of gray matter loss in Alzheimer's disease. Neurosci Lett 2000;285(3):231–3.
[29] Scheltens P, Launer LJ, Barkhof F, Weinstein HC, van Gool WA. Visual assessment of medial temporal lobe atrophy on magnetic resonance imaging: interobserver reliability. J Neurol 1995;242(9):557–60.
[30] Smith S. Fast robust automated brain extraction. Human Brain Mapping 1999;17(3):143–55.
[31] Teipel SJ, Alexander GE, Schapiro MB, Moller HJ, Rapoport SI, Hampel H. Age-related cortical grey matter reductions in non-demented Down's syndrome adults determined by MRI with voxel-based morphometry. Brain 2004;127(Pt 4):811–24.
[32] Teipel SJ, Bayer W, Alexander GE, Bokde AL, Zebuhr Y, Teichberg D, et al. Regional pattern of hippocampus and corpus callosum atrophy in Alzheimer's disease in relation to dementia severity: evidence for early neocortical degeneration. Neurobiol Aging 2003;24(1):85–94.
[33] Tofts PS. Standardisation and optimisation of magnetic resonance techniques for multicentre studies. J Neurol Neurosurg Psychiatry 1998;64(Suppl):S37–43.
[34] Visser PJ, Verhey FR, Hofman PA, Scheltens P, Jolles J. Medial temporal lobe atrophy predicts Alzheimer's disease in patients with minor cognitive impairment. J Neurol Neurosurg Psychiatry 2002;72(4):491–7.
[35] Woods RP, Grafton ST, Holmes CJ, Cherry SR, Mazziotta JC. Automated image registration. I. General methods and intrasubject, intramodality validation. J Comp Assist Tomogr 1998;22(1):139–52.