www.elsevier.com/locate/ynimg NeuroImage 24 (2005) 1088 – 1098
On the modeling, construction, and evaluation of a probabilistic atlas of brain perfusion Torbjbrn Vik,a,b,* Fabrice Heitz,a Izzie Namer,b and Jean-Paul Armspachb a
Laboratoire des Sciences de l’Image de l’Informatique et de la Te´le´detection (LSIIT), UMR-7005 CNRS, 67412 Illkirch, France Institut de Physique Biologique, UMR-7004 CNRS, 67085 Strasbourg, France
b
Received 29 April 2004; revised 7 September 2004; accepted 21 October 2004 Available online 9 December 2004 To detect subtle, abnormal perfusion patterns in brain single photon emission computer tomography (SPECT) images, it is necessary to develop quantitative methods in which computer-aided statistical analysis takes advantage of information present in databases of normal subjects. The purpose of this study was to evaluate and examine aspects of the creation and the modeling power of three statistical models for representing brain perfusion as observed in ECD-SPECT. The first model is a local model of voxel-by-voxel mean and variance. The second model is a PCA-based global model that accounts for covariance patterns in the images. The third model is an original model that is a non-linear extension to the second model. This model is based on robust statistics for modeling abnormalities. To evaluate the models, a leave-one-out procedure combined with simulations of abnormal perfusion patterns was adopted. Abnormal perfusion patterns were simulated at different locations in the brain, with different intensities and different sizes. The procedure yields receiver operator characteristics (ROC) that present a combined measure of model-fit and model-sensitivity at detecting abnormalities. The scheme can further be used to compare models as well as the influence of different preprocessing steps. In particular, the influence of different registration approaches is studied and analyzed. The results show that the original non-linear model always performed better than the other models. Finally, location-dependent detection performance was found. Most notably, a higher variation of perfusion was observed in the right frontal cortex than in the other locations studied. D 2004 Elsevier Inc. All rights reserved. Keywords: Brain perfusion atlas; SPECT; Probabilistic PCA; Robust estimation; Evaluation
* Corresponding author. Institut de Physique Biologique, Faculte´ de Me´decine, Universite´ de Strasbourg, UMR-7004 CNRS, 4, Rue Kirschle´ger, 67085 Strasbourg Cedex, France. Fax: +33 90 24 40 84. E-mail address:
[email protected] (T. Vik). Available online on ScienceDirect (www.sciencedirect.com). 1053-8119/$ - see front matter D 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.neuroimage.2004.10.019
Introduction In clinical practice, single-photon emission computer tomography (SPECT) images of brain perfusion are mostly analyzed in a qualitative fashion. Less often, a semi-quantitative approach using an interactive region-of-interest (ROI) method is applied. Both approaches suffer from operator bias and are hence subjective. For the localization of seizure foci in epilepsy, a more objective method called SISCOM was developed (O’Brien et al., 1998; Zubal et al., 1995). An image acquired between seizures (interictal) is subtracted from an image acquired during seizure (ictal) after registration of the images with the patient’s MRI image. The interictal image is thus used as a reference image for the ictal image. Such a reference image would also be desirable for the semi-quantitative analysis of SPECT exams in other pathologies such as dementia and depression, but are unfortunately not generally available for the same patient. This, beside the interest in understanding normal brain perfusion itself, motivates the efforts in creating normative data (Tanaka et al., 2000; Van Laere et al., 2001), mean images (Imran et al., 1998), and atlases that describe normal brain perfusion based on sample SPECT images of normal volunteers. In particular, a computerized, voxel-based, probabilistic atlas of normal brain perfusion has the potential of improving the sensitivity and the objectivity in the evaluation of SPECT images. The objective of such an atlas is twofold: (1) the description of normal brain perfusion and (2) the design of statistical tests to detect significant abnormal perfusion patterns. However, the creation of such an atlas is complicated by several obstacles: (1) anatomical variability between subjects must be minimized by spatial normalization into a standard reference frame, (2) the issue of intensity normalization for semi-quantitative analysis needs to be solved, and (3), the validity of a statistical model must be assessed. The last point is particularly difficult since no ground truth information about the true activity distribution observed in a SPECT image exists. Several approaches have been presented for quantitatively comparing images of brain perfusion with a normal population by means of a normal database. Existing methods are mostly based on
T. Vik et al. / NeuroImage 24 (2005) 1088–1098
automatic registration schemes for spatially normalizing the images before statistical analysis. A first group of approaches can be characterized as volume/region of interest (VOI/ROI) methods: after registration with an atlas, the average perfusion on predefined VOIs is automatically calculated and analyzed either for asymmetry (Kang et al., 2001) or for differences in mean (Greitz et al., 1991; Rizzo et al., 1995). Similar methods have been used to create normative data (Imran et al., 1998; Koyama et al., 1997) and to examine the effects of age and gender (Van Laere et al., 2001). In Pagani et al. (2002), the authors use principal component analysis (PCA) on the mean values of a set of VOIs to aid the interpretation. A second approach consists in a voxel-by-voxel analysis after registration. In Chang et al. (2002), the authors compare the mean of differences in normal images with differences of ictal–interictal images of patients with temporal lobe epilepsy (TLE). Ictal images in TLE have also been compared in the same manner in (Lee et al., 2000). A similar study compared patients with head injury to normal subjects after correcting for age (Stamatakis et al., 2002). These kind of studies can conveniently be performed using statistical parametric mapping (SPM) (Acton and Friston, 1998). A third approach, fundamentally different in nature, was presented in Minoshima et al. (1995) where the authors calculate a cortical surface from an atlas image and find the maximal intensity in the PET image on an inward perpendicular vector to the surface. Mean values across subjects are then calculated and a z score is estimated for images to evaluate. The utility of this approach was recently assessed in Honda et al. (2003). A fourth and last approach, based on a multivariate analysis of voxel intensities has been proposed by Houston et al. (1994, 1998) where bbuilding blocksQ (eigenimages) of normal perfusion have been determined by PCA. The remaining residual, not explained by these eigenimages, was analyzed to find abnormal patterns of perfusion. The robust model presented in this work is an extension of this model. The importance of validation of medical image analysis techniques has been emphasized in Jannin et al. (2002) and Bowyer (2000). Validation evaluates the performance and limitations of a method and can clarify the clinical potential of a method. Two objectives are central for the evaluation of a probabilistic brain atlas. These are issued by the following two questions: (1) how well does the model fit real data? and (2) how well are abnormalities detected for the pathologies of interest? Several factors contribute to complicate the evaluation: (1) the definition of normal data is subjective, (2) data is rare and expensive, (3) ground truth is not available and (4) a quality metric is difficult to define (what is a good detection?). We have combined the two objectives into a statistical evaluation framework that can give partial, joint answers to both. An advantage of the proposed methodology is that it permits the comparison of different atlas models and different strategies for atlas creation which are otherwise difficult to evaluate. In this paper, we perform an evaluation study using a leave-oneout strategy combined with simulated abnormalities. First, the evaluation is used to compare three different models. We present a non-linear model that is original in this context. The first model evaluated is a model of the voxel-by-voxel type described above, the second is the PCA-based model presented in Houston et al. (1994, 1998) and the third, original model is based on a nonlinear extension of the second model. This non-linear model is robust to non-Gaussian perfusion patterns, which are likely in patients with different atrophies. Furthermore, we discuss and evaluate important
1089
aspects of the atlas construction: spatial and intensity normalization. Finally, the evaluation is used to analyze which brain regions are best modeled and hence where abnormalities are easier to detect. Other simulation studies for assessing methods that characterize cerebral lesions in SPECT and PET images have been described, but none attempt to evaluate the model assumptions themselves. In (Stamatakis et al., 1999), the authors decrease and increase the perfusion on a sphere in the right frontal lobe of the mean image to evaluate the capacity of SPM to detect changes. We think this approach yields an optimistic estimate because an abnormality can vary in an additive manner around the variation of normal images, not only around their mean. Furthermore, the study is limited to only one location of the abnormality, whereas we show in this article that the detection sensitivity of the atlas is locationdependent. Another study (Van Laere et al., 2002) adds inclusions to a software phantom, thereby simulating a single-subject activation study, whereas we are more concerned with multiple individuals. In Missimer et al. (1999), the authors compare SPM and the computerized brain atlas (CBA) (Greitz et al., 1991) for PET activation studies using both human volunteers and simulations. The simulated images are also derived from a single (simulated) PET image. Our context is somewhat different from a standard activation study with multiple conditions/multiple subjects: we have several control images in rest state (learning set) and only one activation image of a subject that is not represented in the learning set. Another study (Davatzikos et al., 2001) also compares the effects of different registration and filtering algorithms on the detection capacity of SPM by simulating PET images. These are simulated from the MR images, but this time from 16 different persons. This way natural anatomical variance is present in the database, but the functional variance still lacks.
Material and methods Database of normal subjects Our database contains 34 99mTc-ECD-SPECT images (Elsinct Ltd. Helix double-headed camera with parallel collimator, filtered back-projection with FWHM of 8 mm) and T1-weighted MRI (GE3D, FOV = 25.6 cm, voxel size 1 mm3) images of normal, healthy volunteers at rest (12 males, 22 females, two groups aged 26.7 F 6.1 and 46.9 F 4.3 years were pooled together). Atlas creation overview An overview of the different components of the atlas creation is depicted in Fig. 1. We first describe the spatial normalization (registration) procedure, then the intensity normalization before we consider the different statistical models, their estimation and finally the validation scheme. All image processing was implemented using the open-source image processing library ImLib3D, freely available under http://imlib3d.sourceforge.net. Registration The steps performed for spatial normalization are depicted in Fig. 2. In summary, the SPECT image of one subject is coregistered with the MR image of the same subject. Inter-subject registrations of the MR images are then performed and a final transformation is composed and applied to the original SPECT
1090
T. Vik et al. / NeuroImage 24 (2005) 1088–1098
We analyzed the influence of different approaches to intersubject MRI registration: !
!
!
!
Only affine registration, no deformable registration (steps 5 and 6 were left out). We shall refer to this as the affine registration strategy. Deformable registration at the third resolution and no filtering of the deformation field (step 6 was left out). We shall refer to this as the Deform3 registration strategy. Deformable registration at the sixth resolution and no filtering of the deformation field (step 6 was left out). We shall refer to this as the Deform6-NF registration strategy. Deformable registration at the third resolution and filtering of the deformation field. We shall refer to this as the Deform6-F registration strategy.
Finally, the different parametric transformations and deformation fields are combined, step 7, downsampled and scaled, step 8, before the final global transformation takes the original image into the reference space using a tri-linear interpolation, step 9.
Fig. 1. Overview of the different processing steps implicated in the atlas creation and its application to compare images with a normal population.
images so that anatomical variations is reduced to a minimum in a global reference space. Registration is crucial for voxel-based statistical modeling (Woods, 1996). We now detail the different processing steps in Fig. 2. In step 1 (see Fig. 2), the SPECT images are resampled to 1 mm isotropic resolution in order to prepare for intra-subject, SPECTMRI registration, step 2. In this second step, a rigid transformation was found by minimizing the mutual information between the images (Wells et al., 1996). Inter-subject MRI registration is performed in three steps 3–5. First, in step 3, each subject is registered using an affine transformation model to the database reference image. Then, in step 4, MRI registration using a deformable (non-linear) transformation model is performed. This step is further described in the next paragraph. The database reference image has been acquired using the same imaging device that was used to acquire the database images, but is itself not part of the atlas. This image is in step 5 brought into standard stereotactic space by means of a rigid registration with the ICBM brain template.1 The deformable registration used in step 4 is performed with the algorithm described in (Noblet et al., in press; Musse et al., 2001; Noblet et al., 2004). The transformation is based on a multiresolution spline decomposition of the transformation field. The algorithm permits the use of different numbers of free parameters for the deformation based on the resolution that is chosen. For example, by choosing the sixth level resolution, a deformation field with 750,141 free parameters is obtained. Choosing the third level resolution, a deformation field with 1 029 free parameters results. We found empirically that the field obtained with the sixth resolution level resulted in transformed SPECT images that were highly artificial. This was solved by low-pass filtering the resulting transformation field, step 6.
1
http://www.loni.ucla.edu/ICBM/ICBM_ICBMAtlases.html.
Intensity normalization In this study, we want to compare relative brain perfusion or patterns of brain perfusion, so we need to compensate global differences in intensity. For count images such as SPECT, a proportional scaling factor is most appropriate, eventually with an additive constant term (linear regression). The correlation scatterplot, or joint histogram, has been shown to be a useful source for determining these factors (Pe´rault et al., 2002). This is especially true when the images are well registered. However, instead of estimating this linear transfer function using standard least squares regression, we propose to use the total least squares technique (TLS) (de Groen, 1996). This type of linear regression considers an equal amount of errors in the reference image and in the image that is to be normalized. This is in contrast to least squares regression where the reference image is effectively supposed to be error-free. Statistical models and detection After having considered the spatial and intensity normalization problems, we now develop the statistical models that supports the creation of the probabilistic atlas. A probabilistic atlas is a statistical model, which is based on assumptions about the underlying distribution of normal images. Different assumptions lead to different ways of estimating the model parameters and different hypothesis tests for comparing an image with the atlas. The model parameters are estimated from the learning set images (database) in the stage of model estimation or learning. Images that are compared to the atlas are called test images, resulting in score images, see also Figs. 1 and 4. Note that the test and learning sets are disjoint and that we can make distinguished assumptions about their distributions. In the following, each image is associated with a vector by lexicographically ordering the voxels inside of a common brain mask into a d-dimensional vector. This mask is taken as the intersection of all (segmented) brains in the learning set. It is useful to think of such a vector (brain image) as a point in a ddimensional observation space. Together, all the brains of a population thus form a cloud of points in the observation space. This cloud represents the distribution of brains and a useful tool for compactly describing this distribution is a parametric probability
T. Vik et al. / NeuroImage 24 (2005) 1088–1098
1091
Fig. 2. Overview of registration process, see the text. The whole process was implemented by generating UNIX makefiles from a simple database that automatically calculate dependencies. This ensures reproducibility and automatically performs the necessary updates when new images are added or changes are made in a particular step.
density function (pdf). This is related to our belief that all SPECT images of normal brain perfusion patterns only occupy a subspace of this observation space and that their intrinsic dimension is considerably lower than d, say q. We have considered three models that can all be derived from the following linear, image generative model (Tipping and Bishop, 1999): Y y ¼ WY x þY lþY e;
ð1Þ Y relating a low dimensional (unknown) latent variable x of dimension q with the high dimensional image Y y of dimension d by means of the generation matrix W, which is d q. The vector Y l is the model mean and Y e is the voxel noise. The mean image Y l is estimated as the learning set average and is equal for all three models. The models result from different assumptions on the random variables Y x and Y e. Local model In the first model, which we refer to as the local model, the variations around the mean image is not taken into account W¼Y 0 and the noise to be Gaussian and independent is assumed for each voxel, Y e fN Y 0; 8 with 8 diagonal. This means that the model is entirely defined by the voxel-mean and -variance. Let Y e ð jÞ denote the observed residual of the jth image: Y e ð jÞ ¼ Y y ð jÞ Y l; j ¼ 1; N ; J :
ð2Þ
The variance 8 is then estimated from the J learning base images as ð 8Þ i ¼
J 1 X Y e ð jÞ Y e ð jÞT J 1 j¼1
!
where the subscript i is the ith diagonal element (voxel). This is simply the voxel-variance of the learning images. Abnormalities in a test image Y y are detected by comparing at each voxel the z-score Y t ¼ 81=2 Y y Y l against a threshold. The z score Y t can be related to the probability of observing the residual Y e¼Y yY l, which follows a student t distribution with J 1 degrees of freedom when the model assumptions are correct. Global linear model Though differently formulated and slightly different, the second model is akin to the PCA model that was used by (Houston et al., 1994) for analyzing abnormal perfusion patterns. Here, the q column vectors of W are the q first (normalized) principal components of the (mean-free) learning set images. The model is based on the assumption that the noise-distribution p(Y e ) is Gaussian (with 8 diagonal). What this means is that an image is considered to be the superimposition of a systematic part, Wx, and an observation noise, e. When the systematic part of the image is known, the noise is considered to be identically and independently distributed (as a Gaussian). Let Y ¼ Y y ð0Þ Y l NY y ð J 1Þ Y l denote the mean-free sample matrix. With the singular value decomposition of Y: 1 1 Y ¼ UL 2 V T ; J 1
where U and V are square orthonormal matrices and L is a diagonal matrix with elements k j , the first q principal components are the first q column vectors of U and are also called eigenimages. These are uncorrelated and eigenimage j accommodates for a fraction PkjJ 1 of the total learning base variance.!By choosing q Pq kj j
; i
ð3Þ
ð4Þ
principal components of the model,
kj
j 100 PJ 1 j
kj
% of the learn-
ing base variance is accommodated for by the model. If the
1092
T. Vik et al. / NeuroImage 24 (2005) 1088–1098
learning base images are effectively representative of the underlying population, these components capture the patterns with the largest variance of this distribution. We have conducted our experiments with 1–4 principal components. From an observed image Y y, the unknown corresponding latent variable Y x must be estimated. This can be done by maximum likelihood (ML) estimation. Given the likelihood of observing Y y under the model, 1 1 p Y y jY x ¼ pffiffiffiffiffiffi d expð ðY y WY xY lÞT 2 2p 8 8 1 ðY y WY xY l ÞÞ;
ð5Þ
the corresponding latent variable Y x is estimated by minimizing the exponential, which yields: 1 ˆ Y x W LS ¼ W T 81 W W T 81 Y yY l ð6Þ also known as the weighted least squares (WLS) estimate. Note that in standard PCA the latent variable, Y x, is estimated as the LS estimate 1 ˆ Y x LS ¼ W T W W T Y yY l ; ð7Þ and is obtained when the noise, 8, is considered isotropic r 2I. ˆx W LS , the score image is again defined With the estimate Y as the z score of the residual and is calculated as Y ˆx W LS . The noise variance 8 is yY l WY t ¼ 81=2 Y estimated as for the local model, but this time from the ˆx ðjÞ a s l e a r n i n g rP esiduals Y e ð jÞ ¼ Y y ð jÞ Y l WY LS Y 1 ð jÞ Y ð jÞT ; with the adjusted degrees-ofe ð8Þi ¼ J 1q j e i freedom J1q. The global model differs from the model in Houston et al. (1994) in that the WLS estimate, which is the ML estimate, is used instead of the LS estimate for calculating the residual. This ensures that the non-isotropic voxel variance is taken into account. Robust global non-linear model The third model that we have considered is a non-linear version of the above global model. As for the previous global model, the q column vectors of W are taken as the q first (normalized) principal components of the (mean-free) learning set images. This assumes a Gaussian distribution on the learning base images and the residual noise variance, 8, is estimated in the same way. In the robust model however, observed patient images (that are compared to the atlas) are modeled with a non-Gaussian noise distribution as these may contain lesions and artefacts. This is a more realistic model for large lesions of high (or low) intensity since these represent outliers to the Gaussian distribution. The standard LS/WLS estimates are known to be sensitive to such outliers. The reason for this is that the residuals are all weighted with a quadratic cost function, which is the exponential of Eq. (5), so that residuals with high values might influence the estimate in an arbitrary manner. The model in (Houston et al., 1994) attempts to control such outliers by truncating the latent variable at a statistically defined cut-off value. However, since the estimate before truncation can be corrupted, this method does not improve the break-down point (i.e., the lowest number of corrupted data that can arbitrarily skew the estimate) of the model. A more rigorous approach to reduce this sensitivity (adopted here), is to replace the quadratic function in the exponential of the Gaussian by a robust cost function q(d ), that down-weights the
contribution of the outliers. This leads to the family of M-estimators (Huber, 1981). With this model, the ML estimate of Y x is obtained as:
! X 1 e i ˆ Y x RM L ¼ arg max exp ; ð8Þ q rq wi 2 i Y x and must be estimated iteratively. The robust cost function q(d ) is one of the weight functions depicted in Fig. 3. For q(e) = Q = e 2, we have the case of Gaussian noise. By choosing a weight function that takes on smaller values for large residuals, we make the estimator robust to outliers. Typically, q(e) is chosen so that for small residuals, e, the cost follows closely the square function and at some point deviates from it. This assures that in the case of only Gaussian noise, the estimate will be close to the LS (WLS) estimate, which is then optimal (in the sense of minimum variance). The point of the weight function where the influence of the residual starts to diminish as compared to the square function is regulated by the global scale parameter r q. It can be set as a function of what significance we consider to be abnormal, for example, what we want to detect. Since we have a model with position-dependent variance, we weight the local residual with the local estimated variance, w i , which is the ith diagonal element of 8. ˆx RM L that maximizes (8) must be found iteratively. The Y Using one of the algorithms ARTUR or LEGEND (see Charbonnier et al., 1997 and Appendix A) all convex weight functions are guaranteed to yield one global minimum. However, efficient weight functions that almost eliminate the influence of outliers are non-convex and will not necessarily converge to the global minimum. For better convergence properties, we apply the three weight functions L1–L2, the non-convex Cauchy followed by the Geman–McClure function (see Fig. 3) in continuation: starting with the WLS estimate, each weight function is optimized after being initialized by the estimate of the preceding one. This way, a solution close to the global minimum can be found, that efficiently discards outliers. The score image is again defined as ˆ Y t ¼ 81=2 Y x RM L . yY l WY Validation procedure Fig. 4 shows an overview of the validation procedure. The different processing steps will now be outlined before we describe the evaluation criterion.
Fig. 3. Weight functions q(r) that we have used in continuation: Q, L1–L2, Cauchy and Geman–McClure (see Appendix A). As the residual error grows, the influence on the total cost function in Eq. (8) diminishes compared to the quadratic Q function, and so the resulting estimator becomes less sensitive to gross errors and voxels that follow the model dominate the ˆ The point of transition is regulated by the scale parameter r q. estimate Y x.
T. Vik et al. / NeuroImage 24 (2005) 1088–1098
(4)
1093
abnormality that is introduced is known, we have a ground truth of what the atlas comparison is supposed to detect. Compare the test image with the atlas to obtain the score image.
The procedure yields J score images and the corresponding ground truth images. These are averaged and analyzed using ROC analysis (Performance evaluation). Test images Images with abnormal perfusion (test images) were synthesized by mathematically adding inclusions to the base image. An inclusion is a sphere with a fixed intensity. An inclusion with a positive (negative) value thus leads to a test image with a hyperperfused (hypoperfused) abnormality. Intensities of F15%, F25%, F35% of average brain perfusion were used and six different positions were examined as shown in Fig. 5. Two different sizes of inclusions were used, small inclusions of 20 mm in diameter, and large inclusions of 64 mm in diameter. An example test image is shown in Fig. 6a.
Fig. 4. Schematic view of the validation strategy employed. See Validation procedure for a detailed description. A bgoodQ atlas describes the base image well (no detections) and is sensitive enough to detect the inclusions. The procedure is repeated for all images in the DB and the results are averaged to measure the overall performance. We used inclusions at six different locations, of two sizes and six intensities, so that we obtained a total set of 72 score images for every base image, model and preprocessing method.
Leave-one-out ( bJackknife Q) The leave-one-out methodology is a standard method in pattern classification to estimate recognition and modeling performance. It is particularly useful for small size data sets (Duda et al., 2001). For a database of size J (here J = 34), the following four steps are iterated to obtain a series of result images (Fig. 4): (1) Remove one image (base image) from the learning set. (2) Estimate/create atlas from the remaining learning set of J1 images. (3) Synthetize a test image (image with known abnormality) from the base image (as described in Test images). Since the
Performance evaluation Performance evaluation was done by calculating the receiver operating characteristics (ROC) and the partial area under this curve (AUC) was estimated. For this, we define the true detection rate (TDR, sensitivity) and the false detection rate (FDR, 1specificity) as probabilities (Bowyer, 2000; Duda et al., 2001). The score images are thresholded at different levels to obtain binary images from which we calculate the TDR and the FDR on a voxel basis (Fig. 6). By varying the threshold level, we obtain a ROC curve. Only the partial AUC, denoted AUC0.05, in the range of FDR a [0, 0.05] was calculated since a FDR of 0.05 already represents a large quantity of voxels (about 1000 voxels as compared to the small inclusions that are 110 voxels). The standard error of the AUC0.05 estimates was equally estimated using the jackknife technique as described in [Duda et al., 2001, p. 473]. This error was then used to assess the significance of the difference between two curves (Hanley and McNeil, 1982, 1983).
Results If not stated otherwise, all results were obtained using the deformable registration scheme (Deform6-F) described in Regis-
Fig. 5. All inclusions of 20 mm diameter size superimposed on the database reference MR image (Fig. 2). SPECT test images were constructed with only one inclusion at a time in the leave-one-out strategy (Fig. 4). LF/RF: left/right middle/superior frontopolar gyri, RT: right temporal gyrus, and LIP/RIP: left/right intraparietal sulcus.
1094
T. Vik et al. / NeuroImage 24 (2005) 1088–1098
Fig. 6. A transversal slice of (a) a test image with a small inclusion of 25% of average brain perfusion in the right frontal lobe (RF), (b) a score image, (c) a thresholded score image, which is compared to (d) the ground truth image in order to determine the FDR and TDR rates.
tration and the total least squares intensity normalization as described in Intensity normalization. We first compare models using small inclusions (20 mm) then using large inclusions (64 mm). We then compare registration schemes and show the overall improvement of our approach over the more common atlas approach (affine registration, average normalization and local model). Finally, we describe the location-specific results. Only a representative subset of hypo-inclusions is reported. The results for hyper-inclusions yielded similar results and the conclusions are similarly valid for these. In Fig. 7 are depicted the ROC curves for the local model and the global model using small inclusions. The curves were obtained by averaging over all locations and with a varying number of eigenvectors for the global model. The standard errors of these curves were also estimated as described in Performance evaluation and the significance levels of the differences between any two curves are tabulated in Table 1 for the 25% hypoperfusions. Here, we see that the PCA2–PCA4 models are all significantly above the local and PCA1 models. The difference between the local and the PCA1 model on one hand and the difference between the PCA3 and PCA4 models are
however not significant. With an odds of about 1.6 (1.7), the difference between the PCA3 (PCA4) model and the PCA2 model is not very significant. In Fig. 8 are depicted the ROC curves obtained using large inclusions. Whereas for small inclusions, the global model and the robust global model yield the same results, this is no longer the case when the size of abnormalities becomes larger. By design, the robust model (Robust-PCA3) will always be at least as good as the global model, which justifies one-sided P values for comparing these two models. These are 0.0061 and 0.1044 for the 25% and the 15% hypo-perfusions, respectively. The differences between the PCA3 and the local model have two-sided significance values as P b 0.0078 and P b 0.0671 for the 25% and the 15% hypoperfusions, respectively. Fig. 9 shows the results obtained using different registration schemes. These schemes were explained in Registration. All differences are significant (Deform6-F vs. Affine: P b 0.0046, Affine vs. Deform3 P b 0.0033, and Deform3 vs. Deform6-NF is negligible), again using two-sided P values. In Fig. 10 the overall improvement of our atlasing approach using the robust global model is shown, deformable registration (with filtering) and TLS normalization as compared to using the more widespread atlas method using a local model, affine registration and average normalization. The AUC0.05 for the ROC curves in Fig. 10 are 0.9101 in the first case (our approach) and 0.8359 in the second case, the z score being 4.83 ( P b 1.34E-6). Fig. 11 shows the ROC curves obtained for the different locations used in this experiment. The variance of each region as present in the database learning images is given as standard deviations in parenthesis. The significance table corresponding to
Table 1 Two-sided P values associated with the 25% hypo-inclusions in Fig. 7
Fig. 7. Comparison of models. Local is the local model, PCAq is the global model using q principal components. ROC curves averaged over all locations. The curves were obtained for small hypo-inclusions (20 mm) with intensities of 15% and 25% of average brain perfusion. Similar curves were obtained for the other intensities studied. Significance values are given in Table 1.
Model
PCA1
PCA2
PCA3
PCA4
Local PCA1 PCA2 PCA3
0.9972
0.0061 0.0045
0.0001 0.0001 0.1567
0.0001 4.24E-5 0.1398 0.9254
For example, one can see that the odds that the PCA3 model is different from the local model only by chance are 1:10,000 (second row, fourth column). Since the PCA3 model shows a better ROC curve in Fig. 7, we therefore conclude that this model’s performance is significantly better than that of the local model.
T. Vik et al. / NeuroImage 24 (2005) 1088–1098
Fig. 8. ROC curves averaged over all locations. The curves were obtained for large hypo-inclusions (64 mm) with intensities of 15% and 25% of average brain perfusion. Note that the effective size of the inclusions were inferior to the volume of the sphere because regions outside the brain covered by the sphere was ignored.
Fig. 11 is given in Table 2. We see that the abnormalities are more difficult to detect in the right frontal region (RF) than in the other regions ( P b 0.0108 or less). The difference between the right and left curves is highly significant in the frontal region ( P b 0.0003), significant in the parietal region ( P b 0.0463) and not significant in the temporal region. Averaging the right and left regions and comparing regions, we have that the temporal region is easier for detection than the parietal region with significance P b 0.0339.
1095
Fig. 10. Overall improvement of our atlasing approach that consists of deformable registration, total least squares intensity normalization and a robust global model as compared to the more usual approach that consists of affine registration, mean intensity normalization and the local model. Results are averaged over all locations using small 25% hypo-inclusions.
In this paper, we quantify the accuracy of different models and we show for the first time that the global model of Houston et al. performs significantly better than the local model as seen in Fig. 7 and Table 1. This is true for the model with 2, 3, and 4
eigenvectors. The appropriate choice of number of eigenvectors is however quite difficult. Too few components limit description power and too many may result in overfitting (Duda et al., 2001) yielding eigenimages or patterns that are specific of the sample images, not of the underlying distribution. Both cases lead to lower model performance. In Table 1, we see that between the models with 3 and 4 eigenvectors there is no significant difference (bottom-right entry), whereas these two are only moderately different from the PCA2 model. That covariance patterns in one or several populations carry useful information has also been shown in other studies (Jones et al., 1998; Pagani et al., 2002). Note however, that we do not interpret these patterns, they merely serve as complex descriptors of normal variation. In realistic cases, the zones of abnormal perfusion might be large or small. Our robust model showed better results for large abnormalities than the other models (Fig. 8) and had equal
Fig. 9. Comparison of different registration strategies for the PCA3 model. ROC curves averaged over all locations. The curves were obtained for small hypo-inclusions (20 mm) with intensities of 25% of average brain perfusion with similar curves for other intensities. See Registration for an explanation of the different strategies examined.
Fig. 11. Comparison of ROC curves for each location where inclusions were added using the PCA3 model. See Fig. 5 for legend. The curves were obtained for small hypo-inclusions (20 mm) with intensities of 25% of average brain perfusion. The curves were similar for other intensities. In parenthesis is given the average standard deviation in the area of the inclusion.
Discussion
1096
T. Vik et al. / NeuroImage 24 (2005) 1088–1098
Table 2 Two-sided P values associated with the 25% inclusions in Fig. 11 Location
RF
LT
RT
LIP
RIP
LF RF LT RT LIP
0.0003
0.0024 2.62E-8
0.0935 2.60E-6 0.2117
0.3036 0.0108 0.0005 0.0155
0.2463 1.32E-5 0.0759 0.6146 0.0463
performance for small inclusions. With increasingly good model description, due to a more representative database and better preprocessing, we expect the model to show even clearer advantages over the local and standard global model. In fact, in a non-realistic simulation study where we added inclusions to the database mean image (which is perfectly represented by the atlas), we were able to provoke a very large performance gain of the robust model over the global model. An influential factor on voxel-based, inter-subject studies is undeniably the spatial normalization or registration (Fig. 9) (Woods, 1996). Since the assumption is that only similar structures should (grossly) yield comparative functional signals, these must be correctly superimposed. The leave-one-out strategy presented in this paper is clearly valid for comparing differently registered, normal images under this assumption. However, this is not true as to the algorithm’s capacity of registering abnormal images since these have been simulated after registration. The best results were obtained with the deformable registration scheme and filtering of the deformation field. In particular, it is better than a linear 12parameter affine registration as is often used (Acton and Friston, 1998). However, this approach is only possible when both SPECT and MR images of the patient are available. The asymmetry in variation found in the frontal cortex (Fig. 11) is in accordance with a higher uptake in this region as found in Van Laere et al. (2002) as well as a known anatomical asymmetry in the frontal lobe, the right being larger than the left in general [Duvernoy, 1999, p. 17]. The performances obtained represent a combined measure of model-fit and detection sensitivity. The lower performance in the frontal region is therefore not only explained by a higher variance in this region, Fig. 11, but also reflects a limited capacity of the model to represent this region. For example is the average variance in the bLIPQ-region (6.722) lower than in the bLFQ region (7.982), even though the ROC curves show lower performance in that region (Fig. 11). The issue of model validation is intricate. It would be desirable to have statistical goodness-of-fit tests to evaluate whether a model is appropriate or not. Multivariate tests of normality for the models used in this study do indeed exist (Mardia, 1980), but the sample size ( J = 34) is too small to yield very meaningful results. Another possibility would be to test marginal probabilities by applying a Kolmogorov–Smirnov or Lilliefors test at each voxel. This is also problematic because one would need to interpret the results (e.g., is it reasonable to obtain rejections in 10% of all voxels, given the spatial correlations in the data?), the sample size is still small, and finally, we expect relatively simple models like these to be rather approximate. Another problem with goodness-of-fit tests is that to know anything about the power of the test, one needs to define an alternative hypothesis. In practice, we do not possess an alternative hypothesis that is constraining enough and the result is a bnoisyQ test with low statistical power.
In pattern recognition, where models are often considered to be approximate and/or methods might be non-statistical, a different approach is taken where one evaluates the generalizing power (i.e., performance on unseen data) of a method as a measure of how well a model describes the data. To evaluate this, it is important to avoid a biased estimate. In this evaluation study, a biased estimate is obtained by not removing the base image from the learning set in Fig. 4. We note that all the simulation/validation studies described in the literature (see the end of the introduction) have used some form of biased estimates. This is appropriate for activation studies (where the same person is scanned at rest and under activation), but not for subject–atlas comparison. Because the biased estimate yields better results than the unbiased estimate, the results obtained in these studies are therefore optimistic estimates of performance. We also performed the biased evaluation and found that the global models largely improved with respect to the local model (this is favored by the biased strategy). We therefore believe that the global models have the potential of increasing its performance when estimated from a larger cohort of normal images. Since the definition of the false detection rate does not make any distinction between bgoodQ and bbadQ false alarms, one might discuss whether the ROC curves used in this study are appropriate or not. In certain pathologies, the physicist is only concerned by grossly identifying an atrophical region or just lateralize abnormal brain perfusion (qualitative evaluation). However, the advantage of the chosen criterion is that it provides objective results. The SPECT images in this study were reconstructed without correcting for scatter events. We believe such correction to have quite an impact on the modeling performance. We will examine this impact in future work.
Conclusion A probabilistic, computerized atlas will not replace visual inspection of SPECT scans. However, it can add value to SPECT examinations by permitting the comparison of an image with normative data. For this to be possible, complex image processing must be undertaken, whereby the impact and consequences of such processing must be evaluated and well understood. Furthermore, the assumptions underlying the statistical models and tests must be validated. In this paper, we present a realistic evaluation study that can be used to quantitatively compare preprocessing steps as well as statistical models’ capacity of modeling normal images and detecting abnormalities. We propose a new robust model in this context, a pragmatic improvement to image registration, and a simple improvement to intensity normalization. The statistical significance of these improvements is assessed.
Appendix A. Optimization of a robust cost function The solution ˆx RML that minimizes Eq. (8) must be found iteratively in the case where q(d ) is not the quadratic function. Two different approaches in literature, the half-quadratic theory (Charbonnier et al., 1997; Geman and Yang, 1995; see also Dahyot, 2001) and the theory of Huber (1981), both lead to two possible and equivalent algorithms: the iteratively reweighted least squares algorithm (IRLS) or the equivalent ARTUR algorithm, and the less known iteratively reweighted residuals algorithm equivalent to the LEGEND algorithm.
T. Vik et al. / NeuroImage 24 (2005) 1088–1098
Using the LS estimate as an initial estimate, the algorithm ARTUR consists of the following three steps at each iteration m, which are repeated until convergence: Y e ðmÞ ¼ Y yY l WY x ðm Þ ðA:1Þ
ðmþ1Þ
bi
ð mÞ e qV rqi w i
¼
ðmÞ
2
ei rq wi
f or i ¼ 1; N ; d
ðA:2Þ
1 Y yY l ðA:3Þ x ðmþ1Þ ¼ W T 81 Bðmþ1Þ W W T 81 Bmþ1 Y Here, the matrix B is B ¼ diag Y b , where Y b ¼ ðb1: : : bd ÞT . The matrix 8 is a diagonal matrix with the voxel variances w 2i on the diagonal (see Statistical models and detection). The alternative algorithm LEGEND (Geman and Reynolds, 1992) can be written as: Y e ðmÞ ¼ Y yY l WY x ðm Þ ðA:4Þ 0 ðmþ1Þ
bi
¼
ðm Þ ei
B B1 rq wi @
qV
ð mÞ
ei rq wi ðmÞ
2
ei r q wi
1 C Ci ¼ 1; N ; d A
1 Y x ðmþ1Þ ¼ W T 81 W W T 81 1 b ðmþ1Þ Y yY l rq 8 2 Y
ðA:5Þ
ðA:6Þ
Note, that there is a matrix to invert at each iteration in Eq. (A.3) contrary to the projection matrix in Eq. (A.6) that need only to be calculated once. The algorithms are both guaranteed to converge for the cost functions depicted in Fig. 3 (see Charbonnier et al., 1997, for more details on convergence properties):
Name Q HS HL GM
q(e) e 2pffiffiffiffiffiffiffiffiffiffiffiffiffi 2 1 þ e2 2 log (1+e 2) (e 2/(1 + e 2))
Because of the matrix inversion at each iteration, an iteration of the ARTUR algorithm is more cost intensive than for the LEGEND algorithm. On the other hand, it has been shown that ARTUR converges faster than LEGEND (Dahyot, 2001) and so the overall performance depends on the actual problem present. A good compromise is to iterate a few times with ARTUR and subsequently iterating using LEGEND.
References Acton, P., Friston, K., 1998. Statistical parametric mapping in functional neuroimaging: beyond PET and fMRI activation studies. Eur. J. Nucl. Med. 25 (7), 663 – 667 (July, Editorial). Bowyer, K.W., 2000. Validation of medical image analysis techniques. In: Sonka, M., Fitzpatrick, J.M. (Eds.), Handbook of Medical Imaging, vol. 2. SPIE Press, pp. 567 – 607.
1097
Chang, D., Zubal, I., Gottschalk, C., Necochea, A., Stokking, R., Studholme, C., Corsi, M., Slawski, J., Spencer, S., Blumenfeld, H., 2002. Comparison of statistical parametric mapping and SPECT difference imaging in patients with temporal lobe epilepsy. Epilepsia 43 (1), 68 – 74. Charbonnier, P., Blanc-Fe´ raud, L., Aubert, G., Barlaud, M., 1997. Deterministic edge-preserving regularization in computed imaging. IEEE Trans. Image Process. 6 (2), 298 – 311. Dahyot, R., Analyse d’images se´quencielles de sce´nes routie´res par mode´les d’apparence pour la gestion du re´seau routier. PhD thesis, Universite´ Louis Pasteur- Strasbourg I, France, 2001. Davatzikos, C., Li, H.H., Herskovits, E., Resnick, S.M., 2001. Accuracy and sensitivity of detection of activation foci in the brain via statistical parametric mapping: a study using a PET simulator. NeuroImage 13, 176 – 184. de Groen, P., 1996. An introduction to total least squares. Nieuw Arch. Wiskd. 4 (14), 237 – 253. Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification, second ed. Wiley. Duvernoy, H.M., 1999. The Human Brain: Surface, Blood Supply, and Three- Dimensional Sectional Anatomy, second ed. Springer-Verlag. Geman, D., Reynolds, G., 1992. Constrained restoration and the recovery of discontinuities. IEEE Trans. Pattern Anal. Mach. Intell. 14 (3), 367 – 383 (March). Geman, D., Yang, C., 1995. Nonlinear image recovery with half-quadratic regularization and FFT’s. IEEE Trans. Image Process. 4 (7), 932 – 946. Greitz, T., Bohm, C., Holte, S., Eriksson, L., 1991. A computerized brain atlas: construction, anatomical content, and some applications. J. Comput. Assist. Tomogr. 15 (1), 26 – 38. Hanley, J.A., McNeil, B.J., 1982. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology 143, 29 – 36 (April). Hanley, J.A., McNeil, B.J., 1983. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148, 839 – 843 (September). Honda, N., Machida, K., Matsumoto, T., et al., 2003. Three-dimensional stereotactic surface projection of brain perfusion SPECT improves diagnosis of Alzheimer’s disease. Ann. Nucl. Med. 17 (8), 641 – 648. Houston, A.S., Kemp, P.M., Macleod, M.A., 1994. A method for assessing the significance of abnormalities in HMPAO brain SPECT images. J. Nucl. Med. 35 (2), 239 – 244 (February). Houston, A.S., Kemp, P.M., Macleod, M.A., James, T., Francis, R., Colohan, H.A., Matthews, H.P., 1998. Use of significance image to determine patterns of cortical blood flow abnormality in pathological and at-risk groups. J. Nucl. Med. 39 (3), 425 – 430 (March). Huber, P.J., 1981. Robust Statistics. John Wiley & Sons. Imran, M.B., Kawashima, R., Sato, K., Kinomura, S., Ito, H., Koyama, M., Goto, R., Ono, S., Yoshioka, S., Fukuda, H., 1998. Mean regional cerebral blood flow images of normal subjects using Technetium99m-HMPAO by automated image registration. J. Nucl. Med. 39 (1), 203 – 207 (January). Jannin, P., Fitzpatrick, J.M., Hawkes, D.J., Pennec, X., Shahidi, R., Vannier, M.W., 2002. Validation of medical image processing in image-guided therapy. IEEE Trans. Med. Imag. 21 (12), 1445 – 1449 (December, Editorial). Jones, K., Johnson, K., Becker, J., Spiers, P., Albert, M., Holman, B., 1998. Use of singular value decomposition to characterize age and gender differences in SPECT cerebral perfusion. J. Nucl. Med. 39 (6), 965 – 973 (June). Kang, K.W., Lee, D.S., Cho, J.H., et al., 2001. Quantification of F-18 FDG PET images in temporal lobe epilepsy patients using probabilistic brain atlas. NeuroImage 14, 1 – 6. Koyama, M., Kawashima, R., et al., 1997. SPECT imaging of normal subjects with technetium-99m-HMPAO and technetium-99m-ECD. J. Nucl. Med. 38 (4), 587 – 591 (April). Lee, J.D., Kim, H.-J., et al., 2000. Evaluation of ictal brain SPET using statistical parametric mapping in temporal lobe epilepsy. Eur. J. Nucl. Med. 27 (11), 1658 – 1665 (November).
1098
T. Vik et al. / NeuroImage 24 (2005) 1088–1098
Mardia, K.V., 1980. Tests of univariate and multivariate normality. In: Krishnaiah, P.R. (Ed.), Handbook of Statistics 1: Analysis of Variance. North-Holland, pp. 279 – 320. Minoshima, S., Frey, K., Koeppe, R., Foster, N., Kuhl, D., 1995. A diagnostic approach in Alzheimer’s disease using three-dimensional stereotactic surface projections of Fluorine-18-FDG PET. J. Nucl. Med. 36 (7), 1238 – 1248 (July). Missimer, J., Knorr, U., Maguire, R.P., Herzog, H., Seitz, R.J., Tellman, L., Leenders, K.L., 1999. On two methods of statistical image analysis. Hum. Brain Mapp. 8, 245 – 258. Musse, O., Heitz, F., Armspach, J.P., 2001. Topology preserving deformable image matching using constrained hierarchical parametric models. IEEE Trans. Im. Proc. 10 (7), 1081 – 1093. Noblet, V., Heinrich, C., Heitz, F., Armspach, J.-P., 2004. Topology preserving nonrigid registration method using a symmetric similarity function—Application to 3-D brain images. 8th European Conference on Computer Vision (ECCV), Prague (April). Noblet, V., Heinrich, C., Heitz, F., Armspach, J.-P., 2005. 3-D deformable image registration: a topology preserving scheme based on hierarchical deformation models and interval analysis optimization. IEEE Trans. Image Process (in press). O’Brien, T., So, E., Mullan, B., Hauser, M., Brinkmann, B., Bohnen, N., Hanson, D., Cascino, G., Cascino, G., Sharbrough, F., 1998. Subtraction SPECT co-registered to MRI improves postictal SPECT localization of seizure foci. Neurology 50 (2), 445 – 454 (February). Pagani, M., Salmaso, D., Jonsson, C., Hatherly, R., Jacobsson, H., Larsson, S.A., W?gner, A., 2002. Regional cerebral blood flow as assessed by principal component analysis and 99mTc-HMPAO SPET in healthy subjects at rest: normal distribution and effect of age and gender. Eur. J. Nucl. Med. 29 (1), 67 – 75 (January). Pe´rault, C., et al., 2002. Computer-aided intrapatient comparison of brain SPECT images: the gray-level normalization issue applied to children with epilepsy. J. Nucl. Med. 43, 715 – 724.
Rizzo, G., Gilardi, M.C., Prinster, A., Grassi, F., Scotti, G., Cerutti, S., Fazio, F., 1995. An elastic computerized brain atlas for the analysis of clinical PET/SPET data. Eur. J. Nucl. Med. 22 (11), 1313 – 1318. Stamatakis, E.A., Glabus, M.F., Wyper, D.J., Barnes, A., Wilson, J.T.L., 1999. Validation of statistical parametric mapping (SPM) in assessing cerebral lesions: a simulation study. NeuroImage 10, 397 – 407. Stamatakis, E.A., Wilson, J.T.L., Hadley, D.M., Wyper, D.J., 2002. SPECT imaging in head injury interpreted with statistical parametric mapping. J. Nucl. Med. 43 (4), 476 – 483. Tanaka, F., Vines, D., Tsuchida, T., Freedman, M., Ichise, M., 2000. Normal patterns on 99mTc-ECD brain SPECT scans in adults. J. Nucl. Med. 41 (9), 1456 – 1464 (September). Tipping, M.E., Bishop, C.M., 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61 (3), 611–622. Also available as technical report from Aston University. Van Laere, K., Versijpt, J., Audenaert, K., Koole, M., Goethals, I., Achten, E., Dierckx, R., 2001. 99mTc-ECD brain perfusion SPET: variability, asymmetry and effects of age and gender in healthy adults. Eur. J. Nucl. Med. 28 (7), 873 – 887 (July). Van Laere, K., Versijpt, J., Koole, M., Vandenberghe, S., Lahorte, P., Lemahieu, I., Dierckx, R.A., 2002. Experimental performance assessment of SPM for SPECT neuroactivation studies using a subresolution sandwich phantom design. NeuroImage 16, 200 – 216. Wells, W., Viola, P., Atsumi, H., Nakajima, S., Kikinis, R., 1996. Multimodal volume registration by maximization of mutual information. Med. Image Anal. 1 (1), 35 – 51. Woods, R.P., 1996. Modeling for intergroup comparisons of imaging data. NeuroImage 4, S84 – S94. Zubal, I., Spencer, S., Imam, K., Seibyl, J., Smith, E., Wisniewski, G., Hoffer, P., 1995. Difference images calculated from ictal and interictal technetium-99m-HMPAO SPECT scans of epilepsy. J. Nucl. Med. 36 (4), 684 – 689.