NeuroImage 47 (2009) 1435–1447

An evaluation of four automatic methods of segmenting the subcortical structures in the brain

Kolawole Oluwole Babalola, Brian Patenaude, Paul Aljabar, Julia Schnabel, David Kennedy, William Crum, Stephen Smith, Tim Cootes, Mark Jenkinson, Daniel Rueckert

University of Manchester, Imaging Science and Biomedical Engineering, Stopford Building, Oxford Road, Manchester M13 9PT, UK

Article history: Received 18 March 2008; Revised 6 May 2009; Accepted 7 May 2009; Available online 20 May 2009.

Abstract

The automation of segmentation of subcortical structures in the brain is an active research area. We have comprehensively evaluated four novel methods of fully automated segmentation of subcortical structures using volumetric, spatial overlap and distance-based measures. Two methods are atlas-based (classifier fusion and labelling, CFL, and expectation–maximisation segmentation using a brain atlas, EMS), and two incorporate statistical models of shape and appearance (profile active appearance models, PAM, and Bayesian appearance models, BAM). Each method was applied to the segmentation of 18 subcortical structures in 270 subjects from a diverse pool varying in age, disease, sex and image acquisition parameters. Our results showed that all four methods perform on par with recently published methods. CFL performed better than the others according to all three classes of metrics. In summary over all structures, the ranking by the Dice coefficient was CFL, BAM, joint EMS and PAM. The Hausdorff distance ranked the methods as CFL, joint PAM and BAM, EMS, whilst percentage absolute volumetric difference ranked them as joint CFL and PAM, joint BAM and EMS. Furthermore, as we had four methods of performing segmentation, we investigated whether the results obtained by each method were more similar to each other than to the manual segmentations using the Williams' Index. Reassuringly, the Williams' Index was close to 1 for most subjects (mean = 1.02, sd = 0.05), indicating that each method agreed with the gold standard at least as well as it agreed with the other methods. However, 2% of cases (mainly amygdala and nucleus accumbens) had values outside 3 standard deviations of the mean.

Introduction

Functional and structural brain imaging are playing an expanding role in neuroscience and experimental medicine. The amount of data produced by imaging increasingly exceeds the capacity for expert visual analysis, resulting in a growing need for automated image analysis. In particular, accurate and reliable methods for segmentation (classifying image regions) are a key requirement for the extraction of qualitative or quantitative information from images. Image segmentation aims to separate an image into a number of different and disjoint sets of voxels that correspond to anatomically meaningful regions. The segmentation of the brain in MR images is challenging. Initial approaches, e.g. Friston et al. (1995) and Smith et al. (2004), focussed primarily on the segmentation of brain MR images into different tissue classes: white matter (WM), grey matter (GM), and cerebrospinal fluid (CSF). The reason is that these tissue classes can be identified based on their characteristic signal intensity in T1- or T2-weighted MR images.

The segmentation of subcortical structures (thalamus, putamen, caudate, etc.) is typically more challenging, since the signal intensity alone is not sufficient to distinguish between different subcortical grey matter structures, as observed and characterised by Fischl et al. (2002). However, subcortical structures have characteristic shapes and spatial relationships with each other. Thus segmentation algorithms for these structures usually incorporate a priori information about their expected location and shape. Manual and semi-automatic segmentations of these structures involve the use of tools such as ITK-SNAP (Yushkevich et al., 2006) or methods developed specifically for neuroanatomical segmentation, e.g. Xia et al. (2007), in which the user specifies two coordinates for the AC–PC line for the segmentation of the caudate, and Chupin et al. (2007), which requires a bounding box and the position of two seeds for the segmentation of the hippocampus and amygdala. We are interested in fully automatic 3D methods requiring no user intervention, which are applicable to subcortical structures in general and not tailored to a specific structure. These can incorporate the use of anatomical atlases, statistical models of shape and intensity, machine learning techniques, or other segmentation methods such as level sets or region growing.

In atlas-based approaches the atlas may be a single labelled image or a probabilistic atlas obtained from more than one labelled image by non-rigid registration. To segment a query subject, it is registered with the atlas and the labels are transferred using the warp field. Methods using this standard approach (e.g. Gee et al., 1993; Collins et al., 1995; Iosifescu et al., 1997; Wang et al., 2005; Wu et al., 2006; Gouttard et al., 2007) differ in the way the atlas is created and how the registration between query subject and atlas is performed. Modifications to the standard atlas-based approach have given promising results. Fischl et al. (2002) incorporate structure-specific models using Markov Random Fields. Han and Fischl (2007) extend this work by including an intensity renormalisation step. Khan et al. (2008) augment the results produced by Fischl et al. (2002) with diffeomorphic warps. Heckemann et al. (2006) treat the result of a single atlas on a query subject as a classifier, and combine the results obtained by multiple atlases. Xue et al. (2001) use atlas-based registration as a precursor to region growing guided by a fuzzy model. Ciofolo and Barillot (2005) incorporate a priori information from an atlas using a fuzzy decision system and combine this with level sets. Zhou and Rajapakse (2005) use affine registration coupled with fuzzy templates and information fusion.

Shape and appearance model approaches involve establishing correspondences across a training set and learning the statistics of shape and intensity variation using PCA models. To segment a query subject, model parameters giving an instantiation that best approximates the subject are sought. Kelemen et al. (1999) use a statistical shape model and elastic deformations to segment the hippocampus, thalamus, putamen, and pallidum. Pitiot et al. (2004) employ a framework involving deformable templates constrained by statistical shape models and other expert prior knowledge. Tu et al. (2008) combine statistical models of shape with discriminative models based on appearance features at different scales.

Methods with strong machine learning elements include that of Akselrod-Ballin et al. (2006), who use a support vector machine classifier trained on sets of features (e.g. average intensities, maximum intensities, location, shape moments and boundary surface area of the subcortical structures) as well as prior knowledge from a probabilistic atlas. In further work, Akselrod-Ballin et al. (2007) used graphs at multiple scales coupled with features from a probabilistic atlas. Pohl et al. (2007) also adopt a graph-based approach, using a tree to capture hierarchical relationships between structures and an EM algorithm to apply the tree to segmentation¹. Tohka et al. (2006) use a deformable mesh followed by graph cuts to segment the caudate and putamen from PET images. Powell et al. (2008) use neural networks and support vector machines to classify voxels resulting from segmentation by an atlas-based method and demonstrated that this improved segmentation accuracy.

The objective evaluation of image segmentation methods is crucially important in order to get automated image segmentation methods accepted in clinical practice. One of the key challenges for the evaluation of automated image segmentation methods is the lack of a gold standard against which to compare segmentation methods. Attempts have been made to address this. Warfield et al. (2004) developed an EM-based algorithm (STAPLE) which uses multiple segmentations from manual and/or automatic methods to estimate the true gold standard and at the same time measures the performance of the methods.
Martin-Fernandez et al. (2005) introduced the Williams' Index (Williams, 1976) as an alternative to STAPLE in the case where ground truth is not available and showed that it gives comparable results for this special case. Bouix et al. (2007) compared STAPLE, the Williams' Index and multidimensional scaling in the evaluation of segmentations without ground truth. They concluded that each method had its advantages: STAPLE was comprehensive and provided an estimate of the ground truth, the Williams' Index was simple and efficient, and multidimensional scaling facilitated visual comparisons of the similarity of the methods.

However, in most cases expert manual segmentations are still regarded as the gold standard, and their reliability in terms of inter-rater and intra-rater variability is assessed using measures such as the intraclass correlation coefficient (ICC).

Given any gold standard segmentation, there exist a large number of different methodologies that can be used to evaluate the quality of a given segmentation. These can be broadly divided into three groups. The first group, spatial overlap measures, quantifies the overlap between regions, e.g. the Dice (1945) and Jaccard (1912) coefficients and generalised overlap indices (Crum et al., 2006). Distance measures quantify the distance between manually and automatically segmented surfaces, e.g. signed surface distances, the mean absolute distance and the Hausdorff distance (Gerig et al., 2001). Thirdly, there are volumetric measures based solely on the volumes of the segmented regions, e.g. the difference between volumes (Collins et al., 1995). The selection of an appropriate validation metric depends on the clinical application of interest: volumetric and spatial overlap measures are useful for evaluation where volume changes are of importance, and distance measures where the shape of the boundary of the structure is being considered. There have also been attempts at providing a figure of merit to encompass the different classes of metrics (Cardenes et al., 2007) or to allow clinically meaningful ranking (Popovic et al., 2007).

Another challenge to comprehensive evaluation, particular to medical imaging in 3D, is the dearth of freely available standard datasets. This is due to the costs of acquisition, the need for manual segmentation and the ethical considerations preventing sharing of data. The result is that methods addressing similar problems are evaluated on datasets composed of different types of cohorts, acquired in different ways, and sometimes consisting of small numbers of subjects. In practice the performance (robustness) of an algorithm may change with scanner type, imaging parameters and the state of the subject (disease or control, age, sex), amongst other factors. Consequently, it is not straightforward to compare different algorithms based on published evaluations. However, there are attempts to address this, for example the Segmentation Challenge Workshop held during MICCAI 2007 (Heimann et al., 2007).

In this paper we compare four state-of-the-art algorithms for automatic segmentation of subcortical structures in MR brain images. Two of these are primarily atlas-based, and the other two are based on statistical models of shape and appearance. For objective comparisons, and to address the limitations mentioned above, they have been evaluated using the same data and criteria. The dataset was large (270 subjects) and varied in age, disease and sex (see the Description of data section). Eighteen subcortical structures were segmented from each subject and evaluated using a mixture of spatial overlap, distance and volumetric measures. We present qualitative and quantitative results of the evaluation of each method with respect to manually annotated data, and with respect to the other methods. We also explore the agreement between the manual segmentations and the results from the automatic methods using the Williams' Index.
The four segmentation methods are described in the section Description of automated segmentation methods, and details of the quantitative measures used in validation are given in Quantitative metrics for evaluation. The experiments are described in the Experiments section and results are presented in Results. Analysis of these results allows us to draw conclusions about the accuracy and robustness of the methods. A discussion is given in the Discussion section.

¹ This method has been implemented in the National Alliance of Medical Imaging Computing (NAMIC) toolbox, Slicer3, http://www.slicer.org/slicerWiki/index.php/Slicer3:EM.

Materials and methods

Description of data

The dataset consisted of 270 MR brain images which had been extensively labelled manually. The images were T1-weighted and had been acquired at different times with different scanners.

Table 1
Details of the images used, including demographics of the subjects where available.

Group 1 (Makris et al., 2006). No. of images: 36. Image size: 256 × 128 × 256. Resolution (mm): 1 × 1.5 × 1.
Demographics: aged 13–72; 16 M, 20 F; 24 controls; 12 at risk of schizophrenia.

Group 2 (Frazier et al., 2005, 2008). No. of images: 118. Image size: 256 × 128 × 256 or 256 × 158 × 256. Resolution (mm): 0.9375 × 1.5 × 0.9375.
Demographics: aged 4–17; 18 ADHD; 9 at genetic risk for bipolar disorder; 51 bipolar disorder; 24 controls; 16 schizophrenia spectrum illness.

Group 3 (Desikan et al., 2006). No. of images: 17. Image size: 256 × 256 × 256. Resolution (mm): 1 × 1 × 1.
Demographics: aged 65–83; 11 M, 6 F; 4 Alzheimer's disease (3 M, 1 F); 9 controls (5 M, 4 F); 4 converters (3 M, 1 F).

Group 4 (Goldstein et al., 2007). No. of images: 85. Image size: 256 × 60 × 256. Resolution (mm): 0.9375 × 3 × 0.9375.
Demographics: aged 23–66; 55 M, 30 F; 45 controls (26 M, 19 F); 40 schizophrenic (29 M, 11 F).

Group 5 (Walhovd et al., 2007). No. of images: 14. Image size: 256 × 256 × 256. Resolution (mm): 1 × 1 × 1.
Demographics: children, aged 9–11; 7 M, 7 F; 8 prenatal exposure to cocaine (4 M, 4 F); 4 exposure to other drugs, but not cocaine (3 M, 1 F); 2 unexposed.

However, their labelling was carried out following a strict and detailed protocol (see the Segmentation procedure for gold standard section). For the purposes of this study, we used the manual labels of the brain stem, the fourth ventricle, and the left and right pairs of the accumbens, amygdala, caudate nucleus, hippocampus, lateral ventricles, pallidum, putamen and thalamus as gold standards. The imaged cohorts included control subjects as well as subjects with Alzheimer's Disease, Schizophrenia, Attention Deficit Hyperactivity Disorder (ADHD), and prenatal drug exposure. Their ages ranged from 4 years to 83 years. The datasets were separated into five groups based on their sources. This categorisation is solely to illustrate the diversity of the data. Properties of the images in each group are given below, and demographics of the subjects are summarised in Table 1.

Group 1 — superset of the Internet Brain Segmentation Repository (IBSR; www.cma.mgh.harvard.edu/ibsr/)

This group comprised 36 images of subjects aged between 13 and 72 years, 20 female and 16 male. 24 of the subjects were controls and 12 were categorised as "at risk of schizophrenia". 17 of the subjects are present in the publicly available IBSR dataset of 18 subjects. More detailed information about the imaging protocol of the IBSR dataset can be obtained from Makris et al. (2006). In summary, the T1 images were acquired at Massachusetts General Hospital on a 1.5 T Siemens Sonata scanner into 128 contiguous 1.33 mm slices, with TR = 2730 ms, TE = 3.39 ms, FA = 8°, FOV = 256 mm, matrix size = 256 × 256, and averages = 2. The images were acquired in the sagittal plane; however, after preprocessing (see the Segmentation procedure for gold standard section) the images used for the segmentation task were axially oriented with resolution 1 mm × 1.5 mm × 1 mm.

Group 2 — children and adolescents from a study of mental disorders

This group consisted of 118 images from three separate ongoing neuroimaging studies described in more detail (along with image acquisition information) in Frazier et al. (2008) and Frazier et al. (2005). The age of the subjects ranged between 4 years and 17 years, and Table 1 gives a breakdown of the subjects by mental disorder. The images were acquired at McLean Hospital Brain Imaging Centre, Massachusetts, on a 1.5 T GE Medical Systems Signa scanner (Milwaukee) into 124 1.5 mm slices in a 256 × 192 matrix.


The sequence was a 3D inversion recovery-prepped spoiled gradient echo coronal series with prep = 300 ms, TE = 60s, FOV = 24 cm², FA = 25° and number of excitations = 2. After preprocessing, the images segmented in this study were axially oriented with resolution 0.9375 mm × 1.5 mm × 0.9375 mm.

Group 3 — adults (over 65) from an Alzheimer's study

Group 3 was composed of 17 subjects (11 male, 6 female) aged 65–83 years from an Alzheimer's disease study. Image acquisition was similar to the description given in Desikan et al. (2006), and Table 1 gives a detailed breakdown of the cohort. After preprocessing, the images used in this study were axially oriented with an isotropic resolution of 1 mm.

Group 4 — adults from a schizophrenia study

There were 85 subjects aged between 23 and 66 years (45 controls — 26 male, 19 female, and 40 schizophrenics — 29 male, 11 female) in this group, which is described in detail in Goldstein et al. (2007). Acquisition of images was on a 1.5 T GE Signa scanner (Waukesha, Wisconsin) at the Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital. The images were obtained as spoiled gradient echo images in the coronal plane in contiguous 3.1 mm slices with TR = 40 ms, TE = 8 ms, FA = 50°, FOV = 30 cm, matrix size = 256 × 256, and averages = 1. After preprocessing, the images used here were axially oriented with resolution 0.9375 mm × 3 mm × 0.9375 mm.

Group 5 — children from a study on prenatal drug exposure

The 14 subjects in this group (aged 9 years to 11 years, 7 female and 7 male) were involved in an investigation into the effect of prenatal drug exposure on brain morphometry (Walhovd et al., 2007), from which further details can be obtained. The subjects were scanned on a Siemens Sonata 1.5 T scanner with an 8-channel head coil. Gradient echo MPRAGE T1 sequences were acquired for each subject with TR = 2730 ms, TE = 3.5 ms, TI = 1000 ms and FA = 7°. The matrix size was 256 × 256 with FOV = 256 mm. Acquisition was in the sagittal plane with 1.3 mm × 1 mm in-plane resolution and 1.33 mm slice thickness. After preprocessing, the images used here were axially oriented with isotropic 1 mm resolution.

Segmentation procedure for gold standard

The segmentation of subcortical structures in all cases was performed at the Centre for Morphometric Analysis, Boston. The protocol is semi-automated and has been extensively described (for example, see Filipek et al. (1994) and Nishida et al. (2006)).


It involves spatial alignment of each brain to a standard coordinate frame based on the AC–PC line and reslicing in the coronal orientation; however, information obtained from the sagittal and axial planes is also used. The brain is parcellated into units, and boundaries for the specific anatomical structures are obtained semi-automatically with manual guidance using intensity contour mapping and histogram analysis of intensity distributions between regions. The segmentation is also guided by detailed anatomical protocols for each subcortical structure (see Filipek et al., 1994). Estimates of inter-rater and intra-rater reliability have shown this procedure to be highly reliable. For example, Frazier et al. (2005) obtained the following intra-rater and inter-rater values: amygdala — 0.88 and 0.84; thalamus — 0.95 and 0.96; and hippocampus — 0.93 and 0.94. Frazier et al. (2008) obtained the following ICC values for the inter-rater reliability over two raters and ten scans of the same subject (left, right): accumbens — (0.79, 0.57); amygdala — (0.81, 0.83); caudate — (0.922, 0.924); hippocampus — (0.857, 0.867); pallidum — (0.845, 0.867); putamen — (0.897, 0.902); thalamus — (0.903, 0.913). It should be noted that the large discrepancy between the left and right ICC values of the accumbens is due to their small size.

Description of automated segmentation methods

Classifier Fusion and Labelling (CFL)

It is possible to obtain a structural segmentation by propagating labels from multiple atlases to the query subject, treating them as classifiers and fusing them to achieve a consensus estimate. This approach can be viewed as an extension of atlas-based segmentation methods (Iosifescu et al., 1997; Svarer et al., 2005). Classifier fusion, based on the majority vote rule, has been shown to be robust and accurate when used to segment brain structures (Rohlfing et al., 2004; Heckemann et al., 2006). However, problems of scale occur when classifier fusion is used in conjunction with a large repository of labelled atlases. These problems are discussed in Aljabar et al. (2007), and schemes for the selection of smaller numbers of appropriate classifiers, prior to propagation and fusion, are presented. The use of such selection schemes was shown to result in increased accuracy of the final segmentations. The work presented in Aljabar et al. (2007) focussed on estimating segmentations for images that are spatially normalised to a reference image. The approach used here is an extension allowing segmentation to be carried out in the native space of the query subject. This was carried out using the following steps:

1. Skull stripping is performed on all images.
2. Align all atlas anatomies to a reference image (MNI single subject image (Cocosco et al., 1997)) using affine registration.
3. Align the query image to the reference using an affine transformation.
4. Assess the similarity between the spatially normalised query image and atlas images using the normalised mutual information (Studholme et al., 1999) over the sub-cortical region.
5. Rank the atlas images based on similarity with the query.
6. Select the twenty top-ranked atlases as classifiers.
7. Non-rigidly register the selected classifiers with the query using a free-form deformation model (Rueckert et al., 1999).
8. Propagate the labels from the selected classifiers to the query image and fuse them using the vote rule, as sketched below.
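The fusion in step 8 reduces to a per-voxel majority vote over the propagated label images. The following is a minimal sketch of that vote rule, assuming the propagated atlas labellings are already resampled into the query image space as integer numpy arrays; the function name and array layout are illustrative and not taken from the published CFL implementation.

```python
import numpy as np

def majority_vote_fusion(propagated_labels):
    """Fuse label volumes propagated from the selected atlases by a per-voxel majority vote.

    propagated_labels: list of integer label arrays of identical shape, one per
    selected atlas, already non-rigidly warped into the query image space.
    Returns the consensus label array (ties resolved in favour of the lowest label).
    """
    stack = np.stack(propagated_labels, axis=0)            # (n_atlases, X, Y, Z)
    labels = np.unique(stack)
    # Count, for every voxel, how many atlases voted for each label.
    votes = np.zeros((labels.size,) + stack.shape[1:], dtype=np.int32)
    for k, lab in enumerate(labels):
        votes[k] = (stack == lab).sum(axis=0)
    return labels[votes.argmax(axis=0)]
```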
Profile Active Appearance Models (PAM)

An Active Appearance Model (AAM) is a statistical model of both the shape of a structure and its appearance, together with an algorithm for matching it to an image (Cootes et al., 2001). The method described here uses a variant of AAMs which models the intensities along profiles that are normal to the boundary of a structure. During image search the model is capable of synthesising the texture across the surface of the object of interest, and the residual differences between the synthesised texture and that in the query image are used to drive the search. To construct a profile AAM for a particular structure, that structure is extracted from the labelled image and converted into a binary image for each member of the training set. The crucial part of the construction process is the identification of homologous points across the images — the so-called correspondence problem. This is addressed in an automated manner by applying a groupwise registration algorithm (Cootes et al., 2005) to the binary images, with the end result that surface meshes approximating each member of the training set can be constructed. The surface meshes all have the same number of vertices, located at homologous points. A profile AAM is built using the vertex positions to represent shape, and grey level profiles from the T1 images sampled normal to the surface at each vertex to represent texture (a sketch of this sampling step is given below). A more complete description of the model building process is in Babalola et al. (2007). A separate model is constructed for each of the 18 structures, and a composite model containing all structures is also constructed. In this two-layer approach the composite model has fewer degrees of freedom than the single structure models and its search results are less accurate. However, the composite model is less prone to falling into local minima, and its results are used to provide a good initialisation for the single structure models, which give more accurate results. During search each model uses the number of eigenmodes representing 95% of the variance in the data. Given a query image, the following steps are used to segment it:

1. Perform skull stripping using the Brain Extraction Tool of FSL (Smith et al., 2004).
2. Align the query image with the image determining the pose of the model (the first image in the training set) using an affine transformation.
3. Search the aligned image using the composite model.
4. Use the inverse transformation of step 2 to transform the search results to the native space of the query image.
5. For each structure, search in the native space of the query image with a single structure model initialised with the result for the structure from the composite model.
6. Convert the result from a mesh to a binary voxel image with the same dimensions as the query image.
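As a concrete illustration of the texture part of the model, the sketch below extracts grey-level profiles along the surface normal at each mesh vertex with trilinear interpolation. The vertex and normal arrays, the profile length and the helper name are assumptions made for the example rather than the actual model-building code.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def sample_normal_profiles(image, vertices, normals, k=4, spacing=1.0):
    """Sample 2k+1 intensities along the outward normal at each vertex.

    image    : 3D numpy array (the T1 volume, indexed in voxel coordinates)
    vertices : (n, 3) array of vertex positions in voxel coordinates
    normals  : (n, 3) array of unit surface normals at those vertices
    Returns an (n, 2k+1) array of interpolated grey-level profiles.
    """
    offsets = np.arange(-k, k + 1) * spacing               # sample positions along the normal
    # points[i, j, :] = vertices[i] + offsets[j] * normals[i]
    points = vertices[:, None, :] + offsets[None, :, None] * normals[:, None, :]
    coords = points.reshape(-1, 3).T                       # (3, n*(2k+1)) for map_coordinates
    profiles = map_coordinates(image, coords, order=1, mode='nearest')
    return profiles.reshape(len(vertices), offsets.size)
```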
Bayesian Appearance Models (BAM)

Similar to the profile AAM, the BAM models texture along the normal to a surface representation of the structural boundary (Patenaude et al., 2007). The BAM also restricts the search space to linear combinations of the eigenvectors of a shape model. The BAM differs mainly in the method of representing the relationship between shape and intensity, which is modelled via the conditional distribution of intensity given shape. Rather than synthesising a single predicted intensity image, the predicted intensity distribution is calculated; this is the conditional distribution of intensity given the shape (vertex locations). The BAM aims to maximise the probability of the shape given the observed intensities. The surface parameterisations used to train the shape model were achieved through the use of deformable models. The point correspondence criteria were embedded within the deformation process using shape-based forces. Prior to applying the deformable models to the manual labels, the training data was linearly registered to the MNI152 template (based on the T1-weighted image). A subcortical reference-space mask was used to place emphasis on the subcortical regions. The same registration process is applied to new images so that the BAM may be registered to the native image space using the inverse transform.
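Both PAM and BAM constrain their search to linear combinations of the eigenvectors of a shape model built from corresponded training meshes. The sketch below illustrates this underlying point distribution model idea (PCA on corresponded vertex coordinates) and how a shape instance is generated from a parameter vector. It is a generic illustration under simplified assumptions, not the actual PAM or FIRST code, and all function names are hypothetical.

```python
import numpy as np

def build_shape_model(training_shapes):
    """PCA point distribution model from corresponded training meshes.

    training_shapes: (n_subjects, n_vertices*3) array; each row is a mesh with
    vertices at homologous points, flattened to a single vector.
    Returns the mean shape, the modes (eigenvectors) and the eigenvalues.
    """
    mean = training_shapes.mean(axis=0)
    u, s, vt = np.linalg.svd(training_shapes - mean, full_matrices=False)
    eigenvalues = s ** 2 / (training_shapes.shape[0] - 1)
    return mean, vt, eigenvalues                            # modes are the rows of vt

def instantiate_shape(mean, modes, b):
    """Generate a shape x = mean + sum_k b_k * mode_k for a parameter vector b."""
    return mean + b @ modes[:len(b)]

def modes_for_variance(eigenvalues, fraction=0.95):
    """Number of modes needed to explain the requested fraction of the variance
    (the PAM search uses the modes covering 95% of the training variance)."""
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)
```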

Expectation–maximisation-based segmentation using a brain atlas (EMS)

This method is a probabilistic approach to brain segmentation, combining a standard expectation–maximisation segmentation (Van Leemput et al., 1999) with a brain atlas (Hill et al., 2002) constructed from the training data. The method operates in the following steps:

1. Affine registration followed by non-rigid registration of the T1-weighted volumes of all the training data to the defined reference space using the free-form deformation (FFD) technique described in Rueckert et al. (1999). An isotropic FFD mesh resolution of 20 mm was used to obtain a crisper structure definition than when using only affine registration.
2. Construction of a brain atlas from the training data by mapping all subcortical and cortical structures from the labelled versions of the training data using the non-rigid mappings obtained in the previous step, and averaging the result.
3. Mapping of the brain atlas to the query subject using the non-rigid mappings from step 1, to obtain a subject-specific probabilistic atlas.
4. Application of an in-house implementation (Murgasova et al., 2007) of the standard EMS technique (Van Leemput et al., 1999), using the subject-specific atlas as a prior; a simplified sketch of this kind of atlas-driven EM update is given below.
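To make the role of the atlas prior concrete, the sketch below shows the core EM loop of this family of methods: the warped probabilistic atlas acts as a spatially varying prior in the E-step, and the M-step re-estimates a Gaussian intensity model per class. This is a generic illustration under simplified assumptions (single-channel data, no bias-field correction or MRF regularisation), not the in-house implementation cited above, and the function name is hypothetical.

```python
import numpy as np

def em_segment(intensities, atlas_priors, n_iter=20, eps=1e-12):
    """Expectation-maximisation classification with a subject-specific atlas prior.

    intensities : (n_voxels,) array of T1 intensities inside the brain mask
    atlas_priors: (n_voxels, n_classes) prior probabilities (rows sum to one),
                  obtained by warping the probabilistic atlas to the subject
    Returns the (n_voxels, n_classes) posterior probabilities; hard labels are
    obtained with posteriors.argmax(axis=1).
    """
    w = atlas_priors                                        # initialise responsibilities from the prior
    for _ in range(n_iter):
        # M-step: prior/posterior-weighted Gaussian parameters per class.
        mu = (w * intensities[:, None]).sum(axis=0) / (w.sum(axis=0) + eps)
        var = (w * (intensities[:, None] - mu) ** 2).sum(axis=0) / (w.sum(axis=0) + eps) + eps
        # E-step: posterior proportional to atlas prior times Gaussian likelihood.
        likelihood = np.exp(-0.5 * (intensities[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        w = atlas_priors * likelihood
        w /= w.sum(axis=1, keepdims=True) + eps
    return w
```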

Quantitative metrics for evaluation

The manually labelled images of each subject are used as gold standards, and the results of each segmentation method are converted into binary images of the same voxel resolution and image dimensions as the query image. A number of overlap, volumetric and distance-based measures often used in the evaluation of segmentation are computed for each segmentation result. In the following description of these measures, the segmentation result is termed A and the gold standard G.

Dice coefficient

The Dice coefficient D (Dice, 1945) is one of a number of measures of the extent of spatial overlap between two binary images. It is commonly used in reporting the performance of segmentation, and its values range between 0 (no overlap) and 1 (perfect agreement). In this paper the Dice values are expressed as percentages and obtained using Eq. (1):

D = \frac{2\,|A \cap G|}{|A| + |G|} \times 100    (1)

The Jaccard coefficient (Jaccard, 1912) is another widely used spatial overlap measure. It can be expressed as a percentage as shown in Eq. (2); however, we do not compute this measure as it can be obtained from the Dice measure via Eq. (3):

J = \frac{|A \cap G|}{|A \cup G|} \times 100    (2)

J = \frac{D}{2 - D}    (3)

False positive and false negative Dice

To further characterise the segmentations by the different methods, the false positive Dice (FPD) and the false negative Dice (FND) are obtained using Eqs. (4) and (5), where \bar{G} and \bar{A} are the complements of the gold standard and the segmentation result respectively. The FPD gives a measure of over-segmentation and the FND of under-segmentation:

FPD = \frac{2\,|A \cap \bar{G}|}{|A| + |G|} \times 100    (4)

FND = \frac{2\,|\bar{A} \cap G|}{|A| + |G|} \times 100    (5)
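For binary masks stored as boolean numpy arrays, Eqs. (1)–(5) reduce to a few voxel counts. The sketch below computes all four overlap measures in one pass; the function name is illustrative and the masks are assumed to be non-empty.

```python
import numpy as np

def overlap_metrics(seg, gold):
    """Dice, Jaccard, false positive Dice and false negative Dice (Eqs. (1)-(5)),
    all as percentages. `seg` and `gold` are boolean arrays of identical shape;
    at least one of them is assumed to contain foreground voxels."""
    seg, gold = seg.astype(bool), gold.astype(bool)
    denom = seg.sum() + gold.sum()
    dice = 200.0 * np.logical_and(seg, gold).sum() / denom
    jaccard = dice / (2.0 - dice / 100.0)                  # Eq. (3) applied to the percentage value
    fpd = 200.0 * np.logical_and(seg, ~gold).sum() / denom  # over-segmentation, Eq. (4)
    fnd = 200.0 * np.logical_and(~seg, gold).sum() / denom  # under-segmentation, Eq. (5)
    return dice, jaccard, fpd, fnd
```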

Symmetric mean absolute surface distance

The symmetric mean absolute surface distance attempts to estimate the error between the surfaces of A and G using distances between their surface voxels. We define the surface distance of the i-th surface voxel on A, d_i^{ag}, as the distance to the closest surface voxel on G. A sign convention can be adopted, such as assigning negative values if a particular voxel in A is within the surface defined by G, and positive values otherwise. The mean absolute surface distance (MAD) for A is the mean of the absolute values of the surface distances over all n_a surface voxels (Eq. (6)). This is, in general, different for G, and the symmetric mean absolute surface distance (SMAD) is the combined mean of the surface distances for A and G, as given in Eq. (7):

MAD_{ag} = \frac{1}{n_a} \sum_{i=1}^{n_a} |d_i^{ag}|    (6)

SMAD_{ag} = \frac{1}{n_a + n_g} \left( \sum_{i=1}^{n_a} |d_i^{ag}| + \sum_{j=1}^{n_g} |d_j^{ga}| \right)    (7)

The surface distance values were calculated using 3D distance transforms as described in Gerig et al. (2001). To calculate the 3D distance transform, the surface voxels of the binary image are obtained and the distance transform based on these is computed using the D-Euclidean metric proposed by Borgefors (1984). In the distance transform image, the value of each voxel is the Euclidean distance in millimetres to the nearest surface voxel; the values on the surface voxels themselves are 0. Given the distance transforms, calculating the surface distances of the binary image A with respect to G reduces to looking up the values of the surface voxels of A in the distance transform image of G.

Hausdorff distance

The directed Hausdorff distance H_{ag} between two sets of points A and G can be obtained in a two-stage manner. First, for each point in A the minimum distance to all points in G is obtained; H_{ag} is then the maximum of this set of minimum distances. In the present case, the minimum distance from the i-th surface voxel in A to the set of surface voxels in G is d_i^{ag}, therefore H_{ag} is the maximum value of the surface distance over all surface voxels in A (Eq. (8)). The Hausdorff distance, H, is the maximum of the directed form for A → G and G → A (Eq. (9)):

H_{ag} = \max \{ d_i^{ag} ; i = 1, \ldots, n_a \}    (8)

H = \max ( H_{ag}, H_{ga} )    (9)
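The surface-distance measures (Eqs. (6)–(9)) can be computed exactly as described: take the distance transform of each surface and look up the surface voxels of the other mask. The sketch below uses scipy's exact Euclidean distance transform rather than the chamfer (D-Euclidean) approximation of Borgefors (1984), so values may differ slightly from those reported; helper names are illustrative.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def surface_voxels(mask):
    """Boolean map of the surface voxels of a binary mask (foreground voxels with a background neighbour)."""
    return mask & ~binary_erosion(mask)

def surface_distances(seg, gold, voxel_dims=(1.0, 1.0, 1.0)):
    """Symmetric mean absolute surface distance (Eq. (7)) and Hausdorff distance (Eq. (9)), in mm."""
    sa, sg = surface_voxels(seg.astype(bool)), surface_voxels(gold.astype(bool))
    # Distance (mm) from every voxel to the nearest surface voxel of the other mask.
    dt_g = distance_transform_edt(~sg, sampling=voxel_dims)
    dt_a = distance_transform_edt(~sa, sampling=voxel_dims)
    d_ag = dt_g[sa]                       # distances from the surface of A to the surface of G
    d_ga = dt_a[sg]                       # distances from the surface of G to the surface of A
    smad = (d_ag.sum() + d_ga.sum()) / (d_ag.size + d_ga.size)     # Eq. (7)
    hausdorff = max(d_ag.max(), d_ga.max())                        # Eqs. (8)-(9)
    return smad, hausdorff
```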

Volumes

Finally, we also look at the most basic of measures — the volumes. For each individual segmentation result we find the volume, V, as the number of labelled voxels multiplied by the voxel dimensions. We then calculate the percentage absolute volumetric difference (AVD) as the ratio of the absolute difference between the gold standard volume and the segmented volume to the gold standard volume (Eq. (10)). The absolute value is used to account for some segmentation results having a lower volume than the gold standard, and others having a higher volume:

AVD = \frac{|V_a - V_g|}{V_g} \times 100    (10)
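With the same boolean masks as above, the volume and Eq. (10) are one line each; a minimal sketch (voxel dimensions in mm, helper names illustrative):

```python
import numpy as np

def volume_mm3(mask, voxel_dims):
    """Volume of a binary mask: number of labelled voxels times the voxel volume in mm^3."""
    return mask.sum() * float(np.prod(voxel_dims))

def percentage_absolute_volume_difference(seg, gold, voxel_dims):
    """Eq. (10): absolute volume difference as a percentage of the gold-standard volume."""
    va, vg = volume_mm3(seg, voxel_dims), volume_mm3(gold, voxel_dims)
    return abs(va - vg) / vg * 100.0
```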

Table 2
Summary of results for all methods applied to all structures over all measures, with respect to manual annotations. Values are means over the left and right of each structure; standard deviations are in parentheses. The fourth ventricle was combined with the brain stem in BAM, so no BAM row is shown for it.

Structure            Method  Dice coefficient  Symmetric MAD (mm)  Hausdorff dist (mm)  Percentage mean AVD  FP Dice        FN Dice
Accumbens            CFL     75.82 (7.18)      0.91 (0.27)         3.09 (1.01)          17.55 (16.92)        24.21 (14.12)  24.15 (11.51)
                     PAM     67.69 (9.91)      1.21 (0.42)         3.79 (1.19)          18.24 (13.48)        25.56 (13.85)  39.06 (15.38)
                     BAM     68.73 (7.85)      1.09 (0.24)         3.47 (1.03)          31.72 (29.06)        36.49 (17.25)  26.04 (20.62)
                     EMS     67.90 (7.78)      1.17 (0.28)         4.33 (1.53)          28.63 (33.14)        40.11 (17.75)  24.08 (11.51)
Amygdala             CFL     77.67 (5.75)      1.16 (0.31)         4.37 (1.64)          17.02 (15.77)        22.91 (12.98)  21.74 (10.48)
                     PAM     66.88 (12.33)     1.68 (0.66)         5.32 (2.15)          20.91 (17.64)        28.38 (14.08)  37.87 (20.74)
                     BAM     73.10 (6.91)      1.38 (0.40)         4.80 (2.15)          24.67 (22.82)        30.16 (15.28)  23.65 (16.59)
                     EMS     70.83 (7.35)      1.43 (0.35)         5.43 (1.61)          22.10 (22.68)        33.40 (16.40)  24.93 (10.74)
Brain stem           CFL     94.17 (1.38)      0.85 (0.19)         4.83 (1.52)          3.98 (2.98)          5.10 (2.54)    6.57 (3.04)
                     PAM     87.80 (3.01)      1.36 (0.33)         6.02 (1.75)          6.80 (5.74)          12.49 (4.20)   11.91 (6.21)
                     BAM     88.53 (1.98)      1.32 (0.22)         6.37 (2.09)          7.80 (5.82)          10.94 (4.49)   12.00 (5.91)
                     EMS     82.88 (3.61)      1.59 (0.32)         7.74 (2.05)          21.10 (8.42)         5.14 (3.20)    29.10 (8.54)
Caudate              CFL     88.09 (2.80)      0.77 (0.16)         4.07 (1.94)          7.67 (6.23)          12.68 (5.85)   11.13 (5.00)
                     PAM     83.43 (5.11)      0.95 (0.21)         4.10 (1.96)          5.02 (5.46)          17.54 (7.13)   15.60 (4.82)
                     BAM     85.60 (3.53)      0.88 (0.16)         4.60 (2.14)          13.17 (11.46)        17.16 (8.86)   11.64 (7.37)
                     EMS     82.61 (5.66)      1.09 (0.38)         6.41 (3.18)          14.04 (12.85)        15.99 (12.09)  18.78 (8.39)
Fourth ventricle     CFL     83.28 (4.73)      0.92 (0.30)         6.61 (2.92)          14.96 (11.48)        14.82 (11.01)  18.62 (9.26)
                     PAM     70.62 (9.88)      1.34 (0.45)         7.72 (3.11)          15.44 (13.90)        28.94 (15.42)  29.81 (11.57)
                     EMS     77.37 (8.63)      1.25 (0.48)         9.04 (4.21)          39.10 (34.22)        36.85 (19.46)  8.40 (5.44)
Hippocampus          CFL     83.50 (3.73)      0.96 (0.21)         4.51 (1.51)          9.23 (8.68)          18.71 (7.95)   14.28 (4.57)
                     PAM     76.80 (6.23)      1.20 (0.27)         5.24 (1.62)          11.97 (7.40)         17.68 (7.79)   28.73 (8.58)
                     BAM     79.12 (4.32)      1.18 (0.23)         5.01 (1.71)          22.14 (16.52)        28.76 (10.73)  13.00 (8.43)
                     EMS     76.37 (5.88)      1.32 (0.36)         6.44 (1.97)          14.53 (13.31)        27.73 (11.04)  19.54 (7.68)
Lateral ventricle    CFL     91.25 (3.66)      0.86 (0.46)         9.76 (7.31)          6.50 (6.27)          9.72 (6.37)    7.79 (4.53)
                     PAM     80.94 (6.83)      1.51 (0.59)         13.97 (8.43)         7.28 (9.85)          19.84 (10.79)  18.28 (5.32)
                     BAM     79.53 (9.62)      1.70 (0.76)         16.16 (9.47)         38.95 (34.66)        21.80 (26.45)  19.14 (16.34)
                     EMS     82.87 (12.02)     1.12 (0.50)         10.46 (6.84)         39.40 (53.25)        29.84 (25.52)  4.42 (4.29)
Pallidum             CFL     81.87 (4.79)      0.96 (0.20)         3.64 (1.05)          9.94 (7.13)          15.54 (7.40)   20.72 (7.74)
                     PAM     79.33 (5.09)      1.13 (0.25)         3.76 (1.00)          9.43 (9.93)          21.90 (8.37)   19.43 (7.56)
                     BAM     79.47 (4.26)      1.14 (0.21)         3.75 (0.97)          22.79 (15.68)        30.01 (9.28)   11.05 (5.91)
                     EMS     80.47 (4.49)      1.03 (0.20)         3.92 (1.09)          13.83 (10.53)        20.44 (9.43)   18.63 (9.19)
Putamen              CFL     89.84 (2.37)      0.76 (0.12)         3.60 (1.14)          6.86 (6.17)          10.92 (5.15)   9.41 (4.72)
                     PAM     86.33 (2.80)      0.94 (0.15)         3.79 (1.13)          3.96 (4.18)          14.61 (3.86)   12.72 (3.75)
                     BAM     86.40 (2.61)      0.97 (0.14)         4.38 (1.46)          14.57 (8.89)         19.93 (5.78)   7.26 (4.22)
                     EMS     86.63 (2.48)      0.96 (0.13)         4.50 (1.16)          8.20 (6.56)          15.69 (5.52)   11.05 (4.46)
Thalamus             CFL     90.80 (1.61)      0.93 (0.14)         3.98 (1.00)          4.83 (4.09)          9.87 (3.33)    8.52 (3.56)
                     PAM     87.68 (2.75)      1.16 (0.24)         4.12 (1.01)          4.03 (3.33)          13.20 (3.85)   11.43 (3.41)
                     BAM     87.62 (2.45)      1.19 (0.20)         4.29 (0.96)          13.71 (9.31)         18.20 (6.28)   6.56 (3.82)
                     EMS     85.16 (2.07)      1.28 (0.17)         5.46 (1.36)          10.60 (6.45)         10.34 (4.92)   19.34 (5.72)
Over all structures  CFL     85.29 (7.16)      0.91 (0.28)         4.75 (3.42)          9.90 (10.82)         14.95 (10.26)  14.48 (9.38)
                     PAM     78.70 (10.33)     1.24 (0.45)         5.66 (4.47)          10.22 (11.68)        19.94 (11.07)  22.66 (14.47)
                     BAM     80.45 (8.45)      1.20 (0.41)         5.84 (5.24)          21.84 (21.92)        24.47 (15.45)  14.63 (13.53)
                     EMS     79.22 (9.01)      1.20 (0.37)         6.15 (3.61)          20.16 (27.01)        23.84 (17.72)  17.72 (10.61)

Williams' Index

The segmentation result of each method is an independent estimate of an anatomical structure whose ground truth is unknown, but has been assumed to be given by the manually segmented version. In order to investigate the validity of this assumption, the results of each method can be measured relative to each other, as well as to the gold standard, using any of the metrics described above. The Williams' Index (Williams, 1976) for the gold standard, Wg, can be calculated using Eq. (11), with the Dice coefficient as the similarity metric (Bouix et al., 2007).

W_g = \frac{(n-1) \sum_{i=1}^{n} D_{ig}}{2 \sum_{i=2}^{n} \sum_{j=1}^{i-1} D_{ij}}    (11)

where D_{ig} is the Dice coefficient of method i with respect to the gold standard, D_{ij} is that of method i with respect to method j, and n is the number of methods. The values of the Williams' Index fall into one of three categories:

Greater than 1: the results of the automatic methods are more similar to the gold standard than to each other.
Equal (or close) to 1: the results of the automatic methods are as similar to each other as they are to the gold standard.
Less than 1: the results of the automatic methods are more similar to each other than each is to the gold standard.
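Given the Dice of each method against the gold standard and the pairwise Dice values between methods, Eq. (11) is a short computation. The sketch below uses illustrative argument names and assumes the pairwise matrix is symmetric.

```python
import numpy as np

def williams_index(dice_to_gold, dice_between_methods):
    """Williams' Index of the gold standard (Eq. (11)).

    dice_to_gold         : length-n array of D_ig, method i against the gold standard
    dice_between_methods : (n, n) symmetric array of pairwise Dice values D_ij
    Values greater than 1 mean the methods agree with the gold standard more than with each other.
    """
    n = len(dice_to_gold)
    numerator = (n - 1) * np.sum(dice_to_gold)
    # Sum over the unordered pairs i > j (strictly lower triangle).
    pair_sum = np.sum(np.tril(dice_between_methods, k=-1))
    return numerator / (2.0 * pair_sum)
```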

Experiments

The 270 subjects were randomly assigned into 27 groups of 10. Each of the methods described in the section Description of automated segmentation methods was applied to segment each image in one of the groups of 10 using data from the 26 other groups. The results from each method were converted into binary voxel images with the same resolution as the input images. One binary file was obtained for each structure for each subject, and the measures described in the Quantitative metrics for evaluation section were applied to obtain quantitative results. Qualitative results were obtained by superimposing contours derived from the binary images onto the respective T1 images.
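A sketch of this leave-ten-out protocol is given below; `segment_fn` stands in for any of the four methods, and all names are illustrative.

```python
import numpy as np

def leave_ten_out_groups(n_subjects=270, group_size=10, seed=0):
    """Randomly partition subject indices into n_subjects/group_size groups."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n_subjects).reshape(-1, group_size)   # 27 rows of 10 indices

def run_experiment(n_subjects, segment_fn):
    """Segment each group of 10 using the remaining 26 groups as training/atlas data.

    segment_fn(test_ids, train_ids) should return the results for one group.
    """
    groups = leave_ten_out_groups(n_subjects)
    results = []
    for g in range(groups.shape[0]):
        test_ids = groups[g]
        train_ids = np.concatenate([groups[h] for h in range(groups.shape[0]) if h != g])
        results.append(segment_fn(test_ids, train_ids))
    return results
```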

Results

Accuracy with respect to manual annotations

Table 2 shows the results of applying the metrics of the Quantitative metrics for evaluation section to the segmentation results for each structure, using the manual labels as gold standards. The results are averages for each structure (left and right pairs combined) over the 27 sets of leave-ten-out experiments (n = 540 for all structures except the brain stem and fourth ventricle, where n = 270).

Fig. 1. Box and whisker plots of Dice coefficients of each structure for each method. The whiskers are 1.5 × the inter-quartile range, and values outside these are plotted individually.

Box and whisker plots of the values of the Dice coefficient, Hausdorff distance and percentage absolute volumetric difference are given in Figs. 1–4. Table 3 ranks the methods according to the values obtained by the Dice, Hausdorff distance and percentage absolute volume difference metrics. For each metric, differences between the methods are considered significant if the 95% confidence intervals of the estimates of the true means do not overlap. To obtain the ranking for a structure, 18 t-tests are carried out (6 possible pairings of the methods for each of the 3 metrics); adjustment for multiple comparisons was performed by Bonferroni correction of the confidence intervals.
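This ranking criterion can be reproduced directly: build Bonferroni-corrected 95% confidence intervals for each method's mean and call a difference significant when the intervals do not overlap. A sketch under those assumptions, with illustrative function names, follows.

```python
import numpy as np
from scipy import stats

def confidence_interval(values, alpha=0.05, n_comparisons=18):
    """Bonferroni-corrected confidence interval for the mean of `values`."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(values.size)
    # Correct the per-test level for the 18 comparisons made per structure.
    t = stats.t.ppf(1.0 - (alpha / n_comparisons) / 2.0, df=values.size - 1)
    return mean - t * sem, mean + t * sem

def significantly_better(lower_is_better, values_a, values_b):
    """True if method A is significantly better than B: the corrected intervals
    do not overlap and A's mean lies on the better side."""
    lo_a, hi_a = confidence_interval(values_a)
    lo_b, hi_b = confidence_interval(values_b)
    if lower_is_better:                     # e.g. Hausdorff distance, AVD
        return hi_a < lo_b
    return lo_a > hi_b                      # e.g. Dice coefficient
```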

Fig. 5 shows the outlines of these segmentations and those of the gold standard overlaid on slices of the MR images for a subject with average values of the metrics.

Comparative analysis of automatic and manual segmentations

The Williams' Index for the gold standard was calculated for each structure of each subject using Eq. (11). Fig. 6 is a box and whisker plot of Wg by structure.

Fig. 2. Box and whisker plots of Hausdorff distance by structure for each method. The whiskers are 1.5 × the inter-quartile range, and values outside these are plotted individually.


Fig. 3. Box and whisker plots of percentage absolute volumetric difference by structure for each method. The whiskers are 1.5 × the inter-quartile range, and values outside these are plotted individually.

The median values are very close to 1 for all structures, and the values for the nucleus accumbens and the amygdala have the largest ranges. The mean of the values was 1.02 ± 0.05. In total 4860 structures were segmented (18 × 270), and of these 108 had Wg values outside 3 standard deviations of the mean. The breakdown by structure is shown in Table 4. Fig. 7 shows the results of each automatic method overlaid on the gold standard outline for the structures that had the highest and lowest values of Wg.

Discussion

Summary of results relative to gold standard

Table 2 contains results for each method relative to the gold standard as measured by spatial overlap, volumetric and distance-based metrics. The results are shown by structure, and a summary over all structures is given in the last set of rows.

Fig. 4. Box and whisker plots of Dice coefficients, Hausdorff distance, and percentage absolute volumetric difference over all structures for each method. The whiskers are 1.5 × the inter-quartile range, and values outside these are plotted individually.

Table 3
Ranking of methods according to the Dice, Hausdorff distance and percentage absolute volume difference metrics.

Structure            Metric      Ranking
Accumbens            Dice        CFL > BAM = EMS = PAM
                     Hausdorff   CFL > BAM > PAM > EMS
                     AVD         CFL = PAM > EMS = BAM
Amygdala             Dice        CFL > BAM > EMS > PAM
                     Hausdorff   CFL = BAM; CFL > PAM = EMS; BAM = PAM; BAM > EMS
                     AVD         CFL > PAM = EMS = BAM
Brain stem           Dice        CFL > BAM > PAM > EMS
                     Hausdorff   CFL > PAM = BAM > EMS
                     AVD         CFL > PAM = BAM > EMS
Caudate              Dice        CFL > BAM > PAM = EMS
                     Hausdorff   CFL = PAM > BAM > EMS
                     AVD         PAM > CFL > BAM = EMS
Fourth ventricle¹    Dice        CFL > EMS > PAM
                     Hausdorff   CFL > PAM > EMS
                     AVD         CFL = PAM > EMS
Hippocampus          Dice        CFL > BAM > PAM = EMS
                     Hausdorff   CFL > BAM = PAM > EMS
                     AVD         CFL > PAM > EMS > BAM
Lateral ventricle    Dice        CFL > EMS = PAM; EMS > BAM; PAM = BAM
                     Hausdorff   CFL = EMS > PAM > BAM
                     AVD         CFL = PAM > EMS = BAM
Pallidum             Dice        CFL > EMS = BAM; EMS > PAM; BAM = PAM
                     Hausdorff   CFL = BAM = PAM; CFL > EMS; BAM = PAM = EMS
                     AVD         PAM = CFL > EMS > BAM
Putamen              Dice        CFL > EMS = BAM = PAM
                     Hausdorff   CFL = PAM > BAM = EMS
                     AVD         PAM > CFL = EMS > BAM
Thalamus             Dice        CFL > PAM = BAM > EMS
                     Hausdorff   CFL = PAM; CFL > BAM > EMS; PAM = BAM > EMS
                     AVD         PAM = CFL > EMS > BAM
All structures       Dice        CFL > BAM > EMS = PAM
                     Hausdorff   CFL > PAM = BAM > EMS
                     AVD         CFL = PAM > BAM = EMS

For each metric, differences between the methods are considered significant if the 95% confidence intervals of the estimates of the true means do not overlap. Adjustment for multiple comparisons (18 tests per structure) was performed by Bonferroni correction of the confidence intervals. The leftmost method performs best, '>' indicates significantly better, and '=' means no significant difference.
¹ The fourth ventricle was combined with the brain stem in BAM.

Using the summary over all structures, the methods were ranked in order of decreasing performance by a spatial overlap, a distance-based, and a volume-based metric (see Table 3) as follows:

Spatial overlap (Dice): CFL → BAM → EMS and PAM
Distance-based (Hausdorff): CFL → BAM and PAM → EMS
Volumetric (AVD): CFL and PAM → EMS and BAM

The CFL method clearly gives the best overall performance with respect to the gold standard. When considering performance on a structure-by-structure basis, the Dice coefficients of CFL are significantly better than those of the other three methods for all structures. The Dice values of BAM are either the same as or better than those of PAM and EMS for all structures except the lateral ventricles (and the pallidum in the case of EMS). The Dice values of EMS were better than those of PAM for the amygdala, fourth ventricle, lateral ventricle and pallidum. The Hausdorff distances of CFL at the structure level are better than those of all other methods for all structures except the caudate, pallidum and thalamus, where it performs at least as well as the other methods. BAM and PAM have a mixture of better and equal results relative to each other for this metric. EMS only performed better than BAM and PAM for the lateral ventricle. According to AVD, CFL performed better than BAM and EMS for all structures. However, it did not perform as well as PAM on the caudate and putamen. PAM was better than EMS and BAM for all structures except the amygdala, where it performed the same as EMS, and the brain stem, where it was the same as BAM. EMS performed better than BAM on the hippocampus, pallidum, putamen and thalamus.


When comparing how well a method performed on the different structures, the order of best performance depends on the metric used to judge the performance. According to the spatial overlap metrics, the accumbens gives the worst overlap results for all methods, and the brain stem gives the best results. However, the accumbens gives the best (lowest) Hausdorff distance and the lateral ventricles the worst for all methods. These differences are a combination of the fact that the accumbens are the smallest structures (hence small errors in overlap give large changes in the spatial overlap measures) and the shape of the lateral ventricles (in particular the length of the occipital horns coupled with partial volume effects along them), which makes them difficult to segment, especially for methods relying on prior shape topology (BAM and PAM).

Comparing the values obtained here with those from other studies

The values presented for the methods are comparable with results from recently published studies performing fully automated segmentation of subcortical structures in the brain. Table 5 lists values from five recent studies. The comparison is purely indicative, as the number of subjects and their demographics, the scanning protocol and the image dimensions all affect the performance of methods. However, to our knowledge, the results presented here cover the largest number of subcortical structures and use the largest and most varied subject base of any paper published on the subject to date. Unlike other studies in which the principal goal is the exposition of a particular method, we have dealt with four methods based on different principles and applied them to the same data.

Gold standard agreement with automatic methods

Fig. 6 shows that in general the segmentations of the automatic methods are as similar to each other as they are to the gold standard (as the Williams' Index values for the gold standard are close to 1). The cases where Wg is greater than 1 suggest that one or more of the automatic methods is not performing as well as the others. Where it is less than 1, it suggests that either most of the automatic methods failed, or there were errors in the manual segmentations. An interesting observation is that the greatest deviation from 1 is attributed to the amygdala and the nucleus accumbens. The cases with low Williams' Index can be flagged for inspection and, if necessary, corrections to the segmentations can be made.

The CFL method

The CFL approach to segmentation, when combined with the selection of classifiers, can be regarded as a type of kNN classifier where the nearest neighbours are represented by the atlases selected for label propagation and fusion. The main features of the CFL approach are the use of registrations for identifying correspondences between atlas images and the query, and the avoidance of the construction of an explicit model to represent the structure being segmented. The higher levels of accuracy achieved by CFL in segmenting the structures studied would indicate that there is a benefit to this approach, although there is a trade-off with the need for larger amounts of memory to store all the atlases and the need for more computation time. The time required for generating a segmentation typically ranges from 120 to 180 min, with the non-rigid atlas-to-query registrations making up the bulk of this time.
The non-rigid registrations were carried out in parallel using a cluster of standard desktop machines managed by the Condor scheduling system (http://www.cs.wisc.edu/condor).


Fig. 5. Overlays of the segmentation results of all four methods for a typical subject. Contours in red are the gold standard and blue the result of each method. The mean Dice, Hausdorff and AVD over all structures for this subject were CFL (86.1, 5.4, 8.6), PAM (79.2, 7.3, 10.9), BAM (79.3, 7.6, 18.9), and EMS (79.2, 6.4, 23.3). The value of Wg was 1.01.

Fig. 6. Box and whisker plot of the Williams' Index of the gold standard by structure. The whiskers are 1.5 × the inter-quartile range, and values outside these are plotted individually.

Table 4
Numbers of structures with Wg values outside three standard deviations of the mean.

Wg less than mean − 3 sd          Wg greater than mean + 3 sd
29  Accumbens                     11  Accumbens
43  Amygdala                       2  Amygdala
 1  Fourth ventricle              11  Fourth ventricle
 2  Hippocampi                     2  Caudate nuclei
 5  Pallidum                       2  Lateral ventricles
Total: 80                         Total: 28

An advantage of the CFL method is that a multi-label image can be generated in a single step for a query subject. This contrasts with the use of separate models for each structure, as done by PAM and BAM. The best accuracy rates for the CFL method were achieved when segmenting the brain stem and the ventricles. The relatively strong contrast boundaries of these structures indicate the dependence of CFL on good correspondence, as such boundaries are the easiest for the registration algorithm to align.

The PAM method

The PAM method is an application of the general active appearance model framework (Cootes et al., 1998) to the problem of segmenting subcortical structures in 3D medical images. The use of the two-stage approach, with the composite model followed by the single structure models, further augments its robustness. The results showed good volumetric agreement with the gold standard, but the Dice coefficients of the amygdala were lower than those of the other methods. This is probably a result of bad correspondences during the automated model building stage leading to a sub-optimal model. We anticipate that the performance on this structure (and the nucleus accumbens) can be improved by adjusting the parameters involved in the model building stage. Above, we have used registration to obtain an initial position for the model. Although a coarse translation would suffice, the protocol involved skull stripping and affine registration. Other approaches to initialisation are being investigated, the most promising of which is the identification of salient features which can be used to initialise the pose of the composite model, eliminating the need for registration-based initialisation. The PAM method, like the BAM method, is topologically constrained. This is advantageous for most structures, but in the case of the lateral ventricles, which exhibit gross deformations and topological change in some subjects, it can be a drawback. The model-based approach also limits localised deformations, which may prevent accurate surface fitting. However, the model parameters are useful for shape analysis and can allow the visualisation of specific types of shape changes. Applying the PAM to search for a structure takes less than 1 min on a standard Pentium 4 desktop computer with a 1 GHz CPU and 1 GB of RAM. However, to perform a full segmentation the current procedure is to initialise by registration of a skull-stripped image, and optionally convert the resulting meshes into binary voxel images.


This takes about 20 min per subject. Processing a batch of subjects takes less time on average, as the models only have to be loaded once for each batch.

The BAM method

BAM takes approximately 3 min for the initial linear registration and then approximately 5 min per structure on a 2.4 GHz Intel Core 2 Duo MacBook Pro. After the linear registration the structures are fitted independently, and this stage can be parallelised. The models provide a compact representation of the structural and intensity variation observed in the training data. Furthermore, point correspondence is implicit in the output, due to the correspondence imposed in the model creation. Like PAM, the point correspondence, as well as the parameters obtained from the model search, facilitates further shape analysis. Deformable surfaces are used to parameterise the manual labels and to enforce vertex correspondence. A limitation of the method is seen in the performance of BAM for the lateral ventricles. Due to the inability of the deformable surface to adapt to the long posterior horns of the ventricles, the horns were truncated. Since the horns are truncated in the model, the segmentation performance suffers because the model fails to fit to the horns. Another limitation of the BAM models is that the fourth ventricle and brain stem are combined into a single structure. The two structures were combined because of a difficulty in fitting the deformable model to the brain stem, since the fourth ventricle passes through the brain stem. Despite comparable spatial overlaps, BAM tends to produce a higher volumetric difference from the manual segmentations compared to the other methods. We believe this is a consequence of fitting surfaces onto the manual labels when establishing correspondence for model building, as the surfaces are not fitted to voxel centres. The segmentations typically have a larger false positive Dice (penultimate column of Table 2), indicating an over-estimation of volume.

The EMS method

Given the simplicity of EMS, it gave a good overall performance, similar to the shape-based methods PAM and BAM, but generally underperformed with respect to CFL. In structures with a high partial volume effect, such as the hippocampus and thalamus, misclassification is more likely to happen than for methods employing shape constraints, leading to lower Dice and higher Hausdorff values. In structures with more subject-specific, individual and elongated shapes or shape pose, such as the brain stem and associated fourth ventricle, registration errors in the atlas construction and mapping phase can propagate to the final EM classification, which is reflected by higher Hausdorff distances but not necessarily lower Dice coefficients, as there is still a good degree of overlap. On the other hand, since EMS does not necessarily produce a connected set of voxels, it can classify more remote voxels (e.g. at the occipital horns of the lateral ventricles)

Fig. 7. Axial overlays of the segmentation results of all four methods for the structure and subject that gave the highest and lowest values of Wg. Red is the gold standard, blue the result of each method. Note that for the value of Wg greater than 1, the automatic methods give very different results whereas for the value less than 1 their results are similar to each other.


Table 5
Dice values for the methods discussed here and for other automatic subcortical segmentation methods from recent publications. Acc = accumbens, Amy = amygdala, Bs = brain stem, Cau = caudate, Fv = fourth ventricle, Hip = hippocampus, Lv = lateral ventricle, Pal = pallidum, Put = putamen, Tha = thalamus.

Author and details                                            No. of subjects            Acc  Amy  Bs   Cau  Fv   Hip  Lv   Pal  Put  Tha
CFL (see Table 1)                                             270                        76   78   94   88   83   84   91   82   90   91
PAM (see Table 1)                                             270                        68   67   88   83   71   77   81   79   86   88
BAM (see Table 1)                                             270                        69   73   89   86   –    79   80   80   86   88
EMS (see Table 1)                                             270                        68   71   83   83   77   76   83   81   87   85
Pohl et al. (2007), hierarchical parcellation,                50 (left)                  –    87   –    –    –    81   –    –    –    –
  left and right separate                                     50 (right)                 –    85   –    –    –    81   –    –    –    –
Gouttard et al. (2007), atlas to autistic                     18 (IBSR)                  –    64   –    76   –    67   85   71   78   –
  and IBSR datasets                                           20 (Autism)                –    79   –    84   –    75   64   84   88   –
Han and Fischl (2007), atlas based,                           14 (GE)                    –    69   –    81   –    79   –    71   –    88
  images from different scanners                              13 (Siemens)               –    81   –    88   –    86   –    81   –    88
Akselrod-Ballin et al. (2007), multiscale                     18 (IBSR)                  –    63   84   80   –    69   –    74   79   84
  segmentation on IBSR dataset
Powell et al. (2008), multimodal images,                      25 (T1, T2 and PD)         –    –    –    91   –    85   –    –    92   93
  atlas + artificial neural network                           (n = 15, hippocampus)
Relevance to practitioners of volumetry and segmentation

The BAM method has been implemented as FIRST (FMRIB's Integrated Registration and Segmentation Tool) in the current release of FSL (version 4.0; www.fmrib.ox.ac.uk/fsl/), and the PAM method will be available for download from the website of the Imaging Science and Biomedical Engineering group (ISBE) of the University of Manchester (www.isbe.man.ac.uk/~kob/ibim). The code implementing the CFL method on a database of images is available from the Visual Information Processing Group of Imperial College London (www.doc.ic.ac.uk/~dr/software/). The EMS method is not available for download in its current form but can be implemented from Murgasova et al. (2007).

For practitioners seeking to segment one or more brain structures, the rankings in Tables 3–5 can guide the choice of method. The nature of the investigation should determine the emphasis placed on the results of the various metrics. For instance, where the goal is the measurement of global volumes or volumetric differences, the Dice overlap and volumetric measures will be more relevant than the Hausdorff distance. Where the shape of the structure's surface is of interest, the Hausdorff distance comes into play; BAM and PAM have an advantage in such cases as they inherently allow localised measurements on the surfaces. Other practical issues, such as turnaround time and the availability of computing and technical resources, will also affect the choice of method. CFL and EMS require access to a database of manually labelled images, whereas the models required by BAM and PAM are readily accessible. BAM, EMS and PAM run in the order of minutes and are therefore suited to cases where a quick turnaround time is important or computing resources are limited.
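As a concrete illustration of the volumetric and distance-based measures discussed above, a minimal Python sketch is given below, assuming binary NumPy masks and SciPy. Note that distance measures are often computed over boundary voxels only; this sketch uses all foreground voxels for simplicity and is not the exact implementation used in this study.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def volume_difference_pct(auto, manual, voxel_volume=1.0):
    """Percentage absolute volumetric difference, relative to the manual volume."""
    v_auto = auto.astype(bool).sum() * voxel_volume
    v_manual = manual.astype(bool).sum() * voxel_volume
    return 100.0 * abs(v_auto - v_manual) / v_manual

def hausdorff_distance(auto, manual, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance between the foreground voxel sets of two
    binary masks, in the units of the voxel spacing."""
    pts_a = np.argwhere(auto.astype(bool)) * np.asarray(spacing)
    pts_m = np.argwhere(manual.astype(bool)) * np.asarray(spacing)
    return max(directed_hausdorff(pts_a, pts_m)[0],
               directed_hausdorff(pts_m, pts_a)[0])
```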


Conclusion

We have presented a comprehensive comparison of four methods for fully automatic segmentation of 18 subcortical structures in the brain. The results presented are among the best published. Overall the CFL approach gives the best results, though it is also the most computationally expensive; if resources permit, it is the current method of choice for most structures. The results presented will enable other groups to compare their work against these methods and allow practitioners to select the most appropriate technique for their application. All four methods are subject to ongoing development, and each has a number of parameters that can be tuned for the complex task of brain segmentation. The results presented here are the best produced by each method at the time of writing but are likely to improve.

Acknowledgments

This work was funded by the EPSRC under the IBIM project. We are grateful to Christian Haselgrove and the Center for Morphometric Analysis, Boston, for the MR images used, and to the different groups that provided the images.

References

Akselrod-Ballin, A., Galun, M., Gomori, J.M., Brandt, A., Basri, R., 2006. Atlas guided identification of brain structures by combining 3D segmentation and SVM classification. In Proceedings of the 9th International Conference on Medical Image Computing and Computer-Assisted Intervention, volume 4191 of LNCS, 209–216.
Akselrod-Ballin, A., Galun, M., Gomori, J.M., Brandt, A., Basri, R., 2007. Prior knowledge driven multiscale segmentation of brain MRI. In Proceedings of the 10th International Conference on Medical Image Computing and Computer-Assisted Intervention, volume 4792 of LNCS, 118–126.
Aljabar, P., Heckemann, R., Hammers, A., Hajnal, J.V., Rueckert, D., 2007. Classifier selection strategies for label fusion using large atlas databases. In Proceedings of the 10th International Conference on Medical Image Computing and Computer-Assisted Intervention, volume 4791 of LNCS, 523–531.
Babalola, K.O., Petrovic, V., Cootes, T.F., Taylor, C.J., Twining, C.J., Williams, T.G., Mills, A., 2007. Automated segmentation of the caudate nuclei using active appearance models. In Proceedings of the workshop on 3D Segmentation in the Clinic: A Grand Challenge at the 10th International Conference on Medical Image Computing and Computer-Assisted Intervention, 57–64. http://mbi.dkfz-heidelberg.de/grand-challenge2007/.
Borgefors, G., 1984. Distance transforms in arbitrary dimensions. Comput. Vis. Graph. Image Process. 27, 321–345.
Bouix, S., Martin-Fernandez, M., Ungar, L., Nakamura, M., Koo, M., McCarley, R.W., Shenton, M.E., 2007. On evaluating brain tissue classifiers without a ground truth. NeuroImage 36, 1207–1224.

Cardenes, R., Bach, M., Chi, Y., Marras, I., Luis, R., Anderson, M., Cashman, P., Bultelle, M., 2007. Multimodel evaluation for medical image segmentation. In Proceedings of the 12th International Conference on Computer Analysis of Images and Patterns, volume 4673 of LNCS, 229–236.
Chupin, M., Mukuna-Bantumbakulu, A.R., Hasboun, D., Bardinet, B., Baillet, S., Kinkingnhun, S., Lemieux, L., Dubois, B., Garnero, L., 2007. Anatomically constrained region deformation for the automated segmentation of the hippocampus and the amygdala: method and validation on controls and patients with Alzheimer's disease. NeuroImage 34, 996–1019.
Ciofolo, C., Barillot, C., 2005. Brain segmentation with competitive level sets and fuzzy control. In Proceedings of the 19th Conference on Information Processing in Medical Imaging, volume 3565 of LNCS, 333–344.
Cocosco, C.A., Kollokian, V., Kwan, R.K.S., Evans, A.C., 1997. BrainWeb: online interface to a 3D MRI simulated brain database. NeuroImage 5 (4), S425.
Collins, D.L., Holmes, C.J., Peters, T.M., Evans, A.C., 1995. Automatic 3-D model-based neuroanatomical segmentation. Hum. Brain Mapp. 3 (3), 190–208.
Cootes, T.F., Edwards, G.J., Taylor, C.J., 1998. Active appearance models. In Proceedings of the 5th European Conference on Computer Vision, volume 1407 of LNCS, 484–498.
Cootes, T.F., Edwards, G.J., Taylor, C.J., 2001. Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell. 23 (6), 681–685.
Cootes, T.F., Twining, C.J., Petrovic, V., Schestowitz, R., Taylor, C.J., 2005. Groupwise construction of appearance models using piece-wise affine deformations. In Proceedings of the 16th British Machine Vision Conference, Oxford, 879–888.
Crum, W.R., Camara, O., Hill, D.L.G., 2006. Generalised overlap measures for evaluation and validation in medical image analysis. IEEE Trans. Med. Imag. 25 (11), 1451–1461.
Desikan, R.S., Segonne, F., Fischl, B., Quinn, B.T., Dickerson, B.C., Blacker, D., Buckner, R.L., Dale, A.M., Maguire, R.P., Hyman, B.T., Albert, M.S., Killiany, R.J., 2006. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage 31, 968–980.
Dice, L.R., 1945. Measures of the amount of ecologic association between species. Ecology 26, 297–302.
Filipek, P., Richelme, C., Kennedy, D., Caviness, V., 1994. The young adult human brain: an MRI-based morphometric analysis. Cereb. Cortex 4, 344–360.
Fischl, B., Salat, D.H., Busa, E., Albert, M., Dieterich, M., Haselgrove, C., van der Kouwe, A., Killiany, R., Kennedy, D., Klaveness, S., Montillo, A., Makris, N., Rosen, B., Dale, A.M., 2002. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 33, 341–355.
Frazier, J.A., Chiu, S., Breeze, J.L., Makris, N., Lange, L., Kennedy, D.N., Herbert, M.R., Bent, E.K., Koneru, V.K., Dieterich, M.E., Hodge, S.M., Rauch, S.L., Grant, P.E., Cohen, B.M., Seidman, L.J., Caviness, V.S., Biederman, J., 2005. Structural brain magnetic resonance imaging of limbic and thalamic volumes in pediatric bipolar disorder. Am. J. Psychiatry 162, 1256–1265.
Frazier, J.A., Hodge, S.M., Breeze, J.L., Giuliano, A.J., Terry, J.E., Moore, C.M., Kennedy, D.N., Lopez-Larson, M.P., Caviness, V.S., Seidman, L.J., Zablotsky, B., Makris, N., 2008. Diagnostic and sex effects on limbic volumes in early-onset bipolar disorder and schizophrenia. Schizophr. Bull. 34 (1), 37–46.
Friston, K.J., Holmes, A.P., Worsley, K.J., Poline, J.B., Frith, C.D., Frackowiak, R.S.J., 1995. Statistical parametric maps in functional imaging: a general linear approach. Hum. Brain Mapp. 2, 189–210.
Gee, J.C., Reivich, M., Bajcsy, R., 1993. Elastically deforming a three dimensional atlas to match anatomical brain images. J. Comput. Assist. Tomogr. 17 (2), 225–236.
Gerig, G., Jomier, M., Chakos, M., 2001. Valmet: a new validation tool for assessing and improving 3D object segmentation. In Proceedings of the 4th International Conference on Medical Image Computing and Computer-Assisted Intervention, volume 2208 of LNCS, 516–523.
Goldstein, J.M., Seidman, L.J., Makris, N., Ahern, T., O'Brien, L.M., Caviness, V.S., Kennedy, D.N., Faraone, S.V., Tsuang, M.T., 2007. Hypothalamic abnormalities in schizophrenia: sex effects and genetic vulnerability. Biol. Psychiatry 61, 935–945.
Gouttard, S., Styner, M., Joshi, S., Smith, R.G., Cody, H., Gerig, G., 2007. Subcortical structure segmentation using probabilistic atlas priors. In Proceedings of the SPIE, volume 6512, 65122J.
Han, X., Fischl, B., 2007. Atlas renormalization for improved brain MR image segmentation across scanner platforms. IEEE Trans. Med. Imag. 26 (4).
Heckemann, R.A., Hajnal, J.V., Aljabar, P., Rueckert, D., Hammers, A., 2006. Automatic anatomical brain MRI segmentation combining label propagation and decision fusion. NeuroImage 33 (1), 115–126.
Heimann, T., Styner, M., Ginneken, B., 2007. 3D segmentation in the clinic: a grand challenge. In Workshop at the 10th International Conference on Medical Imaging and Computer-Assisted Intervention. http://mbi.dkfz-heidelberg.de/grand-challenge2007/.
Hill, D.L.G., Hajnal, J.V., Rueckert, D., Smith, S.M., Hartkens, T., McLeish, K., 2002. A dynamic brain atlas. In Proceedings of the 5th International Conference on Medical Image Computing and Computer-Assisted Intervention, volume 2488/9 of LNCS, 532–539.
Iosifescu, D.V., Shenton, M.E., Warfield, S.K., Kikinis, R., Dengler, J., Jolesz, F.A., McCarley, R.W., 1997. An automated registration algorithm for measuring MRI subcortical brain structures. NeuroImage 6 (1), 13–25.
Jaccard, P., 1912. The distribution of flora in the alpine zone. New Phytol. 11, 37–50.
Kelemen, A., Szekely, G., Gerig, G., 1999. Elastic model-based segmentation of 3-D neuroradiological data sets. IEEE Trans. Med. Imag. 18 (10), 828–839.


Khan, A.R., Wang, L., Beg, M.F., 2008. Freesurfer-initiated fully-automated subcortical brain segmentation in MRI using large deformation diffeomorphic metric mapping. NeuroImage 41, 735–746.
Makris, N., Kaiser, J., Haselgrove, C., Seidman, L.J., Biederman, J., Boriel, D., Valera, E.M., Papadimitriou, G.M., Fischl, B., Caviness, V.S., Kennedy, D.N., 2006. Human cerebral cortex: a system for the integration of volume and surface-based representations. NeuroImage 33, 139–153.
Martin-Fernandez, M., Bouix, S., Ungar, L., McCarley, R.W., Shenton, M.E., 2005. Two methods for validating brain tissue classifiers. In Proceedings of the 8th International Conference on Medical Image Computing and Computer-Assisted Intervention, volume 3749 of LNCS, 515–522.
Murgasova, M., Dyet, L., Edwards, A.D., Rutherford, M., Hajnal, J., Rueckert, D., 2007. Segmentation of brain MRI in young children. Acad. Radiol. 11, 1350–1366.
Nishida, M., Makris, N., Kennedy, D.N., Vangel, M., Fischl, B., Krishnamoorthy, K.S., Caviness, V.S., Grant, P.E., 2006. Detailed semiautomated MRI based morphometry of the neonatal brain: preliminary results. NeuroImage 32, 1041–1049.
Patenaude, B., Smith, S., Kennedy, D., Jenkinson, M., 2007. Bayesian Shape and Appearance Models. Technical Report TR07BP1. FMRIB Centre, University of Oxford.
Pitiot, A., Delingette, H., Thompson, P.M., Ayache, N., 2004. Expert knowledge-guided segmentation system for brain MRI. NeuroImage 23, S85–S96.
Pohl, K.M., Bouix, S., Nakamura, M., Rohlfing, T., McCarley, R.W., Kikinis, R., Grimson, E.L., Shenton, M.E., Wells, W.M., 2007. A hierarchical algorithm for MR brain image parcellation. IEEE Trans. Med. Imag. 26 (9).
Popovic, A., Fuente, M., Engelhardt, M., Radermacher, K., 2007. Statistical validation metric for accuracy assessment in medical image segmentation. Int. J. Comput. Assist. Radiol. Surg. 2 (3–4), 169–181.
Powell, S., Magnotta, V.A., Johnson, H., Jammalamadaka, V.K., Pierson, R., Andreasen, N.C., 2008. Registration and machine learning-based automated segmentation of subcortical and cerebellar brain structures. NeuroImage 39 (1), 238–247.
Rohlfing, T., Brandt, R., Menzel, R., Maurer, C.R., 2004. Evaluation of atlas selection strategies for atlas-based image segmentation with application to confocal microscopy images of bee brains. NeuroImage 21 (4), 1428–1442.
Rueckert, D., Sonoda, L.I., Hayes, C., Hill, D.L.G., Leach, M.O., Hawkes, D.J., 1999. Non-rigid registration using free-form deformations: application to breast MR images. IEEE Trans. Med. Imag. 18 (8), 712–721.
Smith, S.M., Jenkinson, M., Woolrich, M.W., Beckmann, C.F., Behrens, T.E.J., Johansen-Berg, H., Bannister, P.R., De Luca, M., Drobnjak, I., Flitney, D.E., Niazy, R., Saunders, J., Vickers, J., Zhang, Y., De Stefano, N., Brady, J.M., Matthews, P.M., 2004. Advances in functional and structural MR image analysis and implementation as FSL. NeuroImage 23 (S1), 208–219.
Studholme, C., Hill, D.L.G., Hawkes, D.J., 1999. An overlap invariant entropy measure of 3D medical image alignment. Pattern Recognition 32, 71–86. Software available from www.colin-studholme.net.
Svarer, C., Madsen, K., Hasselbalch, S.G., Pinborg, L.H., Haugbol, S., Frokjaer, V.G., Holm, S., Paulson, O.B., Knudsen, G.M., 2005. MR-based automatic delineation of volumes of interest in human brain PET images using probability maps. NeuroImage 24 (4), 969–979.
Tohka, J., Wallius, E., Hirvonen, J., Hietala, J., Ruotsalainen, U., 2006. Automatic extraction of caudate and putamen in [11C]raclopride PET using deformable surface models and normalized cuts. IEEE Trans. Nucl. Sci. 53 (1), 220–227.
Tu, Z., Narr, K.L., Dollar, P., Dinov, I., Thompson, P.M., Toga, A.W., 2008. Brain anatomical structure segmentation by hybrid discriminative/generative models. IEEE Trans. Med. Imag. 27 (4), 495–508.
Van Leemput, K., Maes, F., Vandermeulen, D., Suetens, P., 1999. Automated model-based tissue classification of MR images of the brain. IEEE Trans. Med. Imag. 18 (10), 897–908.
Walhovd, K.B., Moe, V., Slinning, K., Due-Tonnessen, P., Bjornerud, A., Dale, A.M., van der Kouwe, A., Quinn, B.T., Kosofsky, B., Greve, D., Fischl, B., 2007. Volumetric cerebral characteristics of children exposed to opiates and other substances in utero. NeuroImage 36, 1331–1344.
Wang, Q., Seghers, D., D'Agostino, E., Maes, F., Vandermeulen, D., Suetens, P., Hammers, A., 2005. Construction and validation of mean shape atlas templates for atlas-based brain image segmentation. In Proceedings of the 19th International Conference on Information Processing in Medical Imaging, volume 3565 of LNCS, 689–700.
Warfield, S.K., Zou, K.H., Wells, W.M., 2004. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans. Med. Imag. 23 (7), 903–921.
Williams, G.W., 1976. Comparing the agreement of several raters with another rater. Biometrics 32, 619–627.
Wu, M., Carmichael, O., Lopez-Garcia, P., Carter, C.S., Aizenstein, H.J., 2006. Quantitative comparison of AIR, SPM, and the fully deformable model for atlas-based segmentation of functional and structural MR images. Hum. Brain Mapp. 27, 747–754.
Xia, Y., Bettinger, K., Shen, L., Reiss, A.L., 2007. Automatic segmentation of the caudate nucleus from human brain MR images. IEEE Trans. Med. Imag. 26 (4), 509–517.
Xue, J.H., Ruan, S., Moretti, B., Revenu, M., Bloyet, D., 2001. Knowledge-based segmentation and labeling of brain structures from MRI images. Pattern Recogn. Lett. 22 (3–4), 395–405.
Yushkevich, P.A., Piven, J., Hazlett, H.C., Smith, R.G., Ho, S., Gee, J.C., Gerig, G., 2006. User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. NeuroImage 31 (3), 1116–1128. Software available for download from http://www.itksnap.org.
Zhou, J., Rajapakse, C.J., 2005. Segmentation of subcortical brain structures using fuzzy templates. NeuroImage 28 (4), 915–924.