Integrating geometric configuration and appearance information into a unified framework for anatomical landmark localization



Medical Image Analysis 43 (2018) 23–36


Medical Image Analysis journal homepage: www.elsevier.com/locate/media

Martin Urschler a,b,c, Thomas Ebner a, Darko Štern a,b,∗

a Ludwig Boltzmann Institute for Clinical Forensic Imaging, Graz, Austria
b Institute for Computer Graphics and Vision, Graz University of Technology, Austria
c BioTechMed-Graz, Austria

Article history: Received 29 March 2017; Revised 27 July 2017; Accepted 11 September 2017; Available online 21 September 2017

Keywords: Anatomical landmarks; Localization; Random regression forest; Coordinate descent

Abstract

In approaches for automatic localization of multiple anatomical landmarks, disambiguation of locally similar structures as obtained by locally accurate candidate generation is often performed by solely including high level knowledge about geometric landmark configuration. In our novel localization approach, we propose to combine both image appearance information and geometric landmark configuration into a unified random forest framework integrated into an optimization procedure that iteratively refines joint landmark predictions by using the coordinate descent algorithm. Depending on how strongly multiple landmarks are correlated in a specific localization task, this integration has the benefit that it remains flexible in deciding whether appearance information or the geometric configuration of multiple landmarks is the stronger cue for solving a localization problem both accurately and robustly. Furthermore, no preliminary choice on how to encode a graphical model describing landmark configuration has to be made. In an extensive evaluation on five challenging datasets involving different 2D and 3D imaging modalities, we show that our proposed method is widely applicable and delivers state-of-the-art results when compared to various other related methods. © 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license. (http://creativecommons.org/licenses/by-nc-nd/4.0/)

1. Introduction

A large number of medical image analysis applications rely on automatic anatomical landmark localization algorithms as a preliminary step for, e.g. segmentation based on deformable and statistical shape models (Heimann and Meinzer, 2009; Zhang et al., 2012; Lay et al., 2013), registration of images using rigid (Hajnal et al., 2001) and deformable transformations (Johnson and Christensen, 2002; Urschler et al., 2006), construction of anatomical atlases for population studies (Toews et al., 2010), or to focus on anatomical structures of interest for computer-aided diagnosis (Doi, 2007) and regression tasks like skeletal age estimation (Thodberg et al., 2009; Štern et al., 2014). However, due to anatomical variation and especially in the presence of potentially ambiguous (i.e. locally similar) landmarks, the task of both accurate and robust anatomical landmark localization becomes challenging.

∗ Corresponding author at: Institute for Computer Graphics and Vision, Graz University of Technology, Austria. E-mail address: [email protected] (D. Štern).

The majority of recent machine learning approaches for multiple landmark localization either solely makes use of appearance features from the whole input image, or combines appearance features derived from the local vicinities of landmarks with a parametric or graphical model fitting step on top. Criminisi et al. (2013) have shown that global anatomical configuration can be captured with a regression-based prediction strategy, when the scale of exclusively used appearance features is allowed to vary up to the image size. By learning from the appearance of all anatomical structures present in the training dataset, their regression-based random forest (RF) for organ bounding box localization demonstrates robustness in the presence of ambiguities; however, their approach lacks accuracy. On the other hand, restricting appearance features to a local neighborhood during training enables accurate prediction of landmark positions; however, on its own it inevitably leads to false positives due to locally similar anatomical structures not being distinguishable. Therefore, to select a likely global landmark configuration, sophisticated geometric configuration models like graphical (Donner et al., 2013) or constrained local models (Lindner et al., 2015) are required on top for disambiguation.

https://doi.org/10.1016/j.media.2017.09.003 1361-8415/© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license. (http://creativecommons.org/licenses/by-nc-nd/4.0/)


Fig. 1. Our proposed method for multiple landmark localization integrates appearance information and geometric configuration by first accurately generating landmark candidates (top), followed by iteratively refining landmark configuration based on coordinate descent optimization to achieve robustness in the presence of locally similar structures (bottom).

In this work we propose to integrate image appearance information and geometric landmark configuration into a unified framework. Thus, we aim for structured output prediction, where the landmark configuration is implicitly modeled by simultaneously taking into account both appearance features derived from the whole image and geometric constraints that landmarks impose onto each other. We pose our novel approach as an optimization problem that maximizes the joint probability for locating all landmarks simultaneously given an input image. This problem is solved by the coordinate descent algorithm (Wright, 2015), which decouples the NP hard problem of simultaneously searching for all landmarks by successively optimizing for individual ones. Thus, as illustrated in Fig. 1, by passing information to other landmarks and receiving information required to update its own position, each landmark location iteratively achieves a self-improvement according to both image appearance features and the current best beliefs of where the other landmarks are located.
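At its core, the decoupling described above is ordinary coordinate descent: fix all landmarks but one, optimize that one over a finite candidate set, and cycle. A minimal generic sketch of this scheme (our own illustration, not the paper's code):

```python
def coordinate_descent(f, x0, candidates_per_dim, n_sweeps=5):
    """Maximize f by cycling through coordinates, trying a finite
    candidate set per coordinate while the others stay fixed."""
    x = list(x0)
    for _ in range(n_sweeps):
        for i in range(len(x)):
            # Pick the candidate value for coordinate i that maximizes f
            # with all other coordinates held at their current values.
            x[i] = max(candidates_per_dim[i],
                       key=lambda v: f(x[:i] + [v] + x[i + 1:]))
    return x
```

For a smooth concave objective this converges to the optimum; for a non-convex objective like joint landmark localization it reaches a local optimum depending on the initialization, which is why accurate initial candidates matter.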

1.1. Related work

While localization of multiple anatomical landmarks from medical images is in principle possible by solely relying on heuristic intensity-based image processing techniques (Wörz and Rohr, 2006; Donner et al., 2007), nowadays machine learning approaches capturing prior knowledge on appearance and shape from training data are predominantly used due to better generalizability in the presence of anatomical variation. These learning methods allow landmark localization either by using predictions for each landmark based on appearance information extracted from the whole image, or by combining potentially ambiguous predictions based on locally restricted appearance information with an explicit regularization model encoding geometric landmark configuration.

Using a regression-based approach implemented in an RF framework, Pauly et al. (2011) and Criminisi et al. (2013) proposed to learn the relative positions between the organs of interest and all the anatomy available in the training data solely with arbitrary scale Haar-like appearance features, i.e. the size of the appearance features and their distance from the training voxels are arbitrary. In a test image, the position of organs of interest is then obtained from the recognized anatomy, implemented by accumulating the relative positions of each voxel of the image to the organ of interest. Mainly aiming for image retrieval applications, the focus of Criminisi et al. (2013) lay on fast and robust but approximate and inaccurate bounding box localization of larger organ structures. The stratified decision forest method (Oktay et al., 2017) extended the work of Criminisi et al. (2013) for cases when there is significant variation of pose and size in a dataset. Gauriau et al. (2015) combined cascaded regression with a simple statistical shape prior derived from segmentation masks. For a multi-object segmentation task, Glocker et al. (2012b) developed an RF based method combining a classification and regression objective function. Ebner et al. (2014) adapted the multi-output regression approach of Criminisi et al. (2013) to the landmark localization task. They improved localization accuracy by putting more trust into the surrounding anatomy of the predicted landmark, since the relative position of closer anatomy shows less variation regarding the landmark location. This was achieved by introducing a distance weight at testing time that reduces the influence of regions farther away from the anatomical landmark. Additionally, they used a two-stage regression RF cascade, with the scale range of Haar-like appearance features in the second stage restricted to capturing only the appearance of the structures locally surrounding a landmark.
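The distance weighting idea can be sketched as a Hough-style accumulator where votes cast over long predicted offsets count less; the inverse-distance weight below is our own illustrative choice, not necessarily the exact function used by Ebner et al. (2014):

```python
import numpy as np

def accumulate_votes(voxel_positions, voting_vectors, shape, alpha=1.0):
    """Hough-style accumulator: each voxel votes for position + offset,
    weighted so that shorter voting vectors (closer anatomy) count more."""
    acc = np.zeros(shape)
    for p, d in zip(voxel_positions, voting_vectors):
        target = np.round(p + d).astype(int)
        if np.all(target >= 0) and np.all(target < np.array(shape)):
            weight = 1.0 / (alpha + np.linalg.norm(d))  # assumed weight form
            acc[tuple(target)] += weight
    return acc
```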
Applied to a segmentation task, Peter et al. (2015) developed this idea further by automatically selecting the most informative scale range of the appearance features defining the surrounding anatomy. All these regression-based approaches showed that by learning just from appearance information extracted from training data, they are robust in the presence of locally similar structures. However, due to variations in relative positions of the anatomy used for robust prediction of landmarks, a tradeoff between accuracy and robustness is involved. Recently, a new direction in regression-based landmark localization started to appear, where CNNs, i.e. convolutional neural networks (LeCun et al., 1998), are used to model geometric configuration solely from appearance features (Payer et al., 2016). Despite promising initial results, the accuracy and robustness of this method depend on the availability of large training datasets and are limited by the need to restrict the resolution of training images due to computational complexity and memory requirements.

To achieve high accuracy in landmark localization, most state-of-the-art methods derive predictions solely from local appearance features describing close anatomy, with feature extraction performed in the RF framework (Breiman, 2001). As a consequence, an additional regularization step involving an explicit model of geometric configuration is required for jointly treating all landmarks to improve robustness in the presence of locally similar structures. Here, a widely used regularization strategy fits a learned point distribution model to local appearance based continuous predictions containing potentially ambiguous responses. This strategy goes back to seminal work from Cootes et al. (1995), where Active Shape Model fitting was implemented as iteratively updating a point configuration by optimizing landmark locations locally using appearance-based predictions and regularizing the landmark configuration with the point distribution model.
Variations of this shape prior constraint (Heimann and Meinzer, 2009) are still widely used in medical image analysis applications, however, these methods have shown limited flexibility in the presence of anatomical abnormalities or pathologies (Beichel et al., 2005) due to their strong reliance on the constructed parametric model and the dependency on robust initialization to avoid local minima during fitting. Despite achieving state-of-the-art localization results with their accurate local RF regression voting step, the constrained local model of Lindner et al. (2015) still suffers from some of these drawbacks due to their complex, parameter-intense multiscale landmark fitting. An alternative strategy for regularizing potentially ambiguous responses discretizes the prediction derived from local appearance features by extracting a set of local maxima as candidates for each landmark. This enables the use of learned spatial priors by tree-structured (Felzenszwalb and Huttenlocher, 2005) or Markov Random Field (MRF)-based (Besbes et al., 2009) graphical models, which were successfully used for lung segmentation (Ibragimov et al., 2012), robust detection of the spinal column in human (Glocker et al., 2012a; 2013; Kelm et al., 2013) or zebrafish (Richmond et al., 2015), landmark localization in brain images (Toews and Arbel, 2007), and for labeling landmarks in hand (Donner et al., 2013) or whole-body CT images (Potesil et al., 2015). Representative for these algorithms, Donner et al. (2013) performs landmark localization using local appearance based RFs in a Hough voting scheme (Gall et al., 2011), which leads to landmark candidates that are regularized by solving an MRF encoding the spatial constraints for the most likely landmark configuration. A common property of all methods using an explicit model of geometric configuration is that after generating local predictions, appearance information is never further used in the regularization stage. Štern et al. 
(2016) have recently shown that the regression-based approach using the same arbitrary scale appearance features as in (Criminisi et al., 2013) can also be used for regularization without any explicitly modeled shape constraint. However, since each landmark is trained independently of the others, a simple MRF model is required on top for joint prediction to achieve state-of-the-art results on a challenging hand X-ray dataset.

1.2. Contributions

By integrating image appearance information and geometric landmark configuration into a single regression-based landmark prediction framework, we overcome the limitations of related localization approaches. Depending on how strongly multiple landmarks are correlated in a specific localization task, this integration has the benefit that it remains flexible in deciding whether appearance information or the geometric configuration of multiple landmarks is the stronger cue for solving a localization problem. Furthermore, no preliminary choice on how to encode a graphical model describing landmark configuration has to be made; instead, geometric constraints are learned directly from training data and encoded implicitly in the regression framework. For this integration, RFs are perfectly suited, since the decision between geometric and appearance features can be made randomly at the node level. Finally, using a single framework for both accurate and robust landmark localization removes the need for heuristics to initialize landmark fitting and simplifies parameter tuning, which are limitations of point distribution and constrained local models.

Crucial for the performance of our novel method is the iterative refinement of joint landmark predictions, implemented as coordinate descent optimization. Inspiration for this idea comes from the auto context work of Tu and Bai (2010) used for image segmentation. In auto context, initial predictions of a labeling are refined in a cascade of RFs, where prediction results of a previous cascade stage are used as additional features in the next stage. Since each cascade stage has to be trained on a separate dataset, this method is dependent on a large amount of available training data. In contrast to auto context, in our approach refinement is not performed with a cascade, but is solely based on a single RF trained on the ground-truth landmark annotations, which is iteratively applied during testing as explained in Section 2.

Overall, our novel integrated landmark localization approach is applicable in a very generic way and can be used for localization of multiple anatomical landmarks in different 2D and 3D imaging modalities (X-ray, CT, MRI) depicting distinct anatomical structures. We consider our experimental evaluation (see Section 3) as another important contribution, since, in addition to the diversity of the investigated datasets, we also extensively compare our proposed method to a variety of approaches, both regression-based and approaches that incorporate geometric constraints using explicitly defined models. As shown in Section 4, with these experiments we quantitatively demonstrate state-of-the-art performance on all evaluated benchmark datasets, as well as superior performance when no strong geometric correlation between landmarks is available.

2. Method

To jointly localize multiple anatomical landmarks from a given input image, we propose a coordinate descent optimization strategy (Wright, 2015), which is an iterative scheme decoupling the simultaneous search for all landmark locations by successively optimizing for individual landmarks. During each iteration, this scheme maximizes the conditional probability of each landmark position given appearance information from the input image and the current best beliefs of where all other landmarks are located (see Fig. 1, bottom). To estimate this conditional probability, we use a classifier trained on ground-truth landmark locations that encodes geometric constraints. However, during the iterative optimization anatomically infeasible landmark configurations may occur. We reduce the dependency of the estimated conditional probability on


Fig. 2. The iterative refinement step of our coordinate descent optimization, showing the ensemble of lower complexity models trained on randomly chosen subsets R_e(i) of K landmarks other than i. Refined predictions x_i^{t+1} involve the combination of local appearance based predictions with the ensemble output.

feasible landmark configurations by training the classifier as an ensemble with a bagging strategy (Breiman, 1996), where each classifier instance predicts using solely a subset of landmarks (see Fig. 2). As shown in Fig. 1 (top), our coordinate descent strategy is initialized with landmark candidates predicted from independently trained landmark regressors based on local appearance features. Since this initialization represents highly accurate localizations of multiple landmarks, but a potentially erroneous global landmark configuration due to ambiguities, our proposed iterative coordinate descent strategy effectively leads to both accurate and robust landmark predictions. In Section 2.1 we describe our optimization strategy in more detail, followed by a description of the RF framework that we use for implementing this strategy in Section 2.2.

2.1. Coordinate descent optimization

Selecting the most likely landmark configuration x̂ = (x̂_1, .., x̂_L)^T from a set of candidate positions for each landmark can be defined by the following optimization problem:

    x̂ = arg max_x p(x | θ, I),    (1)

where θ represents a model derived from annotated training data that jointly estimates landmark positions and I is an input image at testing time. This general discrete optimization problem is NP-hard; a widely used approach to solve it approximately is an MRF of discretized landmark candidates as proposed e.g. in Donner et al. (2013). The discrete MRF solver requires a graph structure to be specified, since connecting all landmarks with each other quickly makes solving the MRF intractable when the number of landmarks L exceeds small numbers. Instead of simultaneously optimizing over a large set of variables, we consider a simpler problem that iteratively optimizes each landmark location independently, while keeping the other locations fixed. With t ∈ {1, .., T} resembling multiple iterations, this strategy is given by

    x_i^{t+1} = arg max_{x ∈ P_i} p(x_1^{t+1}, .., x_{i-1}^{t+1}, x, x_{i+1}^t, .., x_L^t | θ_i, I),    (2)

where P_i represents the set of candidate positions for landmark i ∈ {1, .., L} and θ_i is a model independently estimating the position of landmark i. By alternating between each of the L landmarks, the strategy in (2) is known as the coordinate descent optimization algorithm (Wright, 2015), which can be shown to converge to the global optimum in the case of convex and smooth cost functions. In our case of a non-convex cost function, a local optimum is achieved which depends on the initialization; most importantly, however, the solution takes into account the geometrical constraints that landmarks impose onto each other.

Defining x_{\i}^t as the set {x_1^{t+1}, .., x_{i-1}^{t+1}, x_{i+1}^t, .., x_L^t}, the notation is simplified:

    x_i^{t+1} = arg max_{x ∈ P_i} p(x, x_{\i}^t | θ_i, I).    (3)

When optimizing for a single landmark, the joint probability distribution in (3) can be reformulated using the product rule of probability:

    x_i^{t+1} = arg max_{x ∈ P_i} p(x | x_{\i}^t, θ_i, I) · p(x_{\i}^t | θ_i, I).    (4)

By omitting the prior probabilities of landmarks other than i as being constant for the optimization, the proposed iterative optimization can be written as maximizing the following conditional probabilities for all i ∈ {1, .., L}:

    x_i^{t+1} = arg max_{x ∈ P_i} p(x | x_{\i}^t, θ_i, I).    (5)

We distinguish two kinds of models θ_i: a model θ_i^l that is based on local appearance features providing accurate but ambiguous predictions, and our novel global context model θ_i^g, which combines the geometrical configuration of landmarks with arbitrarily scaled appearance features to improve robustness of predictions in the presence of locally similar structures. Thus, (5) can be rewritten as:

    x_i^{t+1} = arg max_{x ∈ P_i} p(x | x_{\i}^t, θ_i^l, θ_i^g, I).    (6)

Using Bayes' rule we can reformulate the conditional probability in (6) under the assumption that the two models θ_i^l, θ_i^g are conditionally independent regarding x and image I:

    x_i^{t+1} = arg max_{x ∈ P_i} [ p(x | θ_i^l, I) · p(x | x_{\i}^t, θ_i^g, I) ] / p(x | I),    (7)

where the probabilities being constant for the optimization are omitted. Without giving any prior preference to landmark candidates, the probability p(x | I) is constant for the maximization. Thus, as illustrated in Figs. 1 and 2, we propose our coordinate descent optimization scheme as follows:

    x_i^{t+1} = arg max_{x ∈ P_i} p(x | θ_i^l, I) · p(x | x_{\i}^t, θ_i^g, I).    (8)

Since p(x | θ_i^l, I) does not take into account the iterative improvement of the geometrical landmark configuration, the prediction based on local appearance features has a fixed contribution to each landmark candidate during the iteration. While the multiplication in (8) can be performed in a fully probabilistic manner, in our implementation the N_M strongest local maxima of p(x | θ_i^l, I) are used as the set of candidate positions P_i for each landmark i during the iterations (see Table 3 in Appendix A for more details). Additionally, for each landmark i the global maximum of the local appearance based prediction

    x_i^0 = arg max_{x ∈ P_i} p(x | θ_i^l, I)    (9)

provides the initial landmark configuration x^0 = {x_1^0, .., x_L^0} for the optimization (see Fig. 1, top).
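The update rule in (8), together with the initialization (9), can be sketched as a small loop; `local_prob` and `global_prob` below are illustrative stand-ins for the localRRF and GCRF outputs, not the authors' implementation:

```python
import numpy as np

def refine_landmarks(candidates, local_prob, global_prob, n_iters=10):
    """Illustrative coordinate descent update following Eq. (8).

    candidates[i] : array of N_M candidate positions for landmark i
    local_prob[i] : fixed per-candidate scores p(x | theta_l, I)
    global_prob   : callable (i, other_positions) -> per-candidate scores
                    p(x | x_others, theta_g, I), recomputed each sweep
    """
    L = len(candidates)
    # Eq. (9): initialize with the strongest local candidate per landmark.
    best = [int(np.argmax(local_prob[i])) for i in range(L)]
    for _ in range(n_iters):
        for i in range(L):
            others = [candidates[j][best[j]] for j in range(L) if j != i]
            # Eq. (8): fixed local term times configuration-aware global term.
            best[i] = int(np.argmax(local_prob[i] * global_prob(i, others)))
    return [candidates[i][best[i]] for i in range(L)]
```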


Fig. 3. (a) Evaluation of the number of outliers evolving over the iterations, calculated on the 2D Hand dataset. Iteration zero is the initialization, coming from localRRF with 396 outliers. For two depicted landmarks, (b) shows on the left an accurate but ambiguous prediction of the local accumulators p(x | θ_i^l, I) and in the middle an image series for different iterations t, showing how the probabilities p(x | x_{\i}^t, θ_i^g, I) improve during optimization. Each color corresponds to a landmark, with its intensity indicating the probability, calculated by evaluating the GCRF on every pixel of the image. The image on the right shows the final probabilities p(x | x_{\i}^t, θ_i^g, I) · p(x | θ_i^l, I), which are based on both local appearance and global context.

Overall, in our method the robustness of the accurate, but ambiguous prediction from local appearance features is iteratively improved with a prediction based on a model θ_i^g combining arbitrarily scaled appearance features and the geometrical configuration of the current best beliefs of where the other landmarks are located (x_{\i}^t), as shown in the example in Fig. 3(b).

2.1.1. Estimation of conditional probabilities

To obtain the most probable structured output prediction x̂ using (8), we require an estimation of the conditional probabilities modeling local appearance θ_i^l and global context θ_i^g, respectively. The estimation of p(x | θ_i^l, I) is performed using an RF approach, modeling θ̂_i^l based on local appearance features extracted from a set of training images D for each landmark i independently, which will be explained in Section 2.2.1.

Estimating for each landmark i the conditional probability describing global context requires modeling θ̂_i^g from arbitrarily scaled appearance features extracted from a set of training images D and from the geometric configuration of the landmarks x_{\i} other than landmark i. While learning a model for all combinations of geometric configurations of landmarks x_{\i} that may be seen during testing is computationally intractable, learning from the configuration that is expected to be seen during testing as in auto context (Tu and Bai, 2010) has its own drawbacks. Adapting the auto context strategy for localization requires larger amounts of training data and may still lead to infeasible landmark configurations during testing due to overfitting to predictions from the training dataset.

We suggest a different strategy to model the parameters θ̂_i^g of the conditional probability by describing global context solely from ground truth annotations of the landmark locations x^∗_{\i} together with arbitrarily scaled appearance features extracted from the training images D as follows:

    θ̂_i^g = arg max_θ p(θ | x^∗_{\i}, D).    (10)

Since a single model per landmark i is used for all iterations of our coordinate descent optimization, this strategy does not require additional validation sets for training as in auto context. However, in our approach there is a discrepancy between training on ground truth landmark annotations x^∗_{\i} and testing on the best beliefs of where other landmarks are located resulting from the coordinate descent iteration x_{\i}^t. To overcome inevitable overfitting to the geometrical configurations x^∗_{\i}, we treat possible anatomically infeasible landmark configurations x_{\i}^t as noisy observations. Thus, the discrepancy between training a model θ̂_i^g on ground truth landmark annotations x^∗_{\i} and testing on configurations unseen during training x_{\i}^t can be interpreted as finding a suitable bias-variance tradeoff involving different levels of model complexity. When using all L − 1 available landmarks for modeling θ̂_i^g during training, we achieve a model with low bias but high variance, which will overfit, i.e. it will not generalize well to testing data. On the other hand, a model with very low complexity (K << L − 1) will have better generalization capabilities but shows low overall localization performance. To avoid overfitting, we therefore use an ensemble of lower complexity models trained on randomly chosen subsets of K landmarks, where the T_E predictions of the ensemble are averaged:

    θ̂_i^g = arg max_{θ = {θ_1, .., θ_{T_E}}} (1 / T_E) Σ_{e=1}^{T_E} p(θ_e | x^∗_{R_e(i)}, D).    (11)

Here, for predictor θ_e, e ∈ {1, .., T_E}, R_e(i) is a random subset of K landmarks other than the landmark of interest i. Depending on the number of landmarks located on anatomically infeasible positions during initialization, an optimal tradeoff between bias and variance can be found by varying the model complexity parameter K. For estimating the conditional probability (11), combining global context from arbitrarily scaled appearance features and geometric configuration of landmarks, we use the RF framework due to its simplicity in integrating different feature types, as explained in the next section.

2.2. Random forest framework

We use RFs (Breiman, 2001) in (8) to model our local appearance based regressor θ̂_i^l generating locally accurate landmark predictions, as well as our ensemble of classifiers θ̂_i^g that is used to remove ambiguities due to locally similar structures by taking the global landmark configuration into account. During RF training, for each landmark a separate forest consisting of T_E trees is constructed by using selected voxels v_j from the training images D. Starting at the root node with the set S_0, the voxels of set S_n, reaching node n, are recursively split into two subsets S_n^{lft}, S_n^{rgt} according to a splitting decision associated with the node. The splitting decision is made by thresholding a feature response evaluated on the voxels in S_n. We use two different types of features:

• Haar-like appearance feature responses are calculated as differences between the mean image intensity of two rectangles with maximal size s and maximal offset o relative to a voxel position v_j ∈ S_n.
• Geometric feature (GF) responses are calculated as the difference in a random dimension x, y or z between the voxel v_j ∈ S_n and a randomly selected reference landmark from the set of landmarks R_e(i) as defined in (11).
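The two feature types can be sketched for the 2D case as below; the integral-image Haar response and the single-axis geometric difference follow the descriptions above, while the function names and offset handling are our own assumptions (in practice the integral image would be computed once per image, not per response):

```python
import numpy as np

def rect_mean(integral, top_left, size):
    """Mean intensity of a rectangle via a zero-padded integral image."""
    (r, c), (h, w) = top_left, size
    s = (integral[r + h, c + w] - integral[r, c + w]
         - integral[r + h, c] + integral[r, c])
    return s / (h * w)

def haar_response(img, v, off1, off2, size):
    """Difference of mean intensities of two rectangles offset from voxel v."""
    integral = np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    return rect_mean(integral, v + off1, size) - rect_mean(integral, v + off2, size)

def geometric_response(v, reference_landmark, dim):
    """Signed difference between voxel v and a reference landmark along one axis."""
    return v[dim] - reference_landmark[dim]
```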


Until the maximum depth N_D of the tree is reached, each node stores a feature and threshold selected from a pool of N_F randomly generated Haar-like appearance or geometric features and N_T thresholds, maximizing a regression or classification objective function.

• The regression objective function J_R is given by:

    J_R = Σ_{j ∈ S_n} ||d_j − d̄(S_n)||_2^2 − Σ_{c ∈ {lft, rgt}} Σ_{j ∈ S_n^c} ||d_j − d̄(S_n^c)||_2^2,    (12)

where d_j is the distance vector between the voxel v_j and the landmark i for which the forest is trained, while d̄(S_n) is the mean voting vector of the voxels in S_n. For later testing, we store at each leaf node ζ the mean value of the relative voting vectors d_ζ of all voxels reaching ζ.

• The classification objective function J_C is defined using the information gain:

    J_C = H(S_n) − Σ_{c ∈ {lft, rgt}} (|S_n^c| / |S_n|) H(S_n^c),  with  H(S) = − Σ_{b ∈ {pos, neg}} p_H(b) log(p_H(b)),    (13)
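Both node objectives can be written compactly; a sketch assuming `d` holds the voting vectors of the voxels in S_n, `labels` their pos/neg classes, and `left_mask` the candidate split:

```python
import numpy as np

def regression_objective(d, left_mask):
    """J_R: reduction of voting-vector scatter after a split (cf. Eq. 12)."""
    def scatter(vecs):
        # Sum of squared deviations from the mean voting vector.
        return ((vecs - vecs.mean(axis=0)) ** 2).sum() if len(vecs) else 0.0
    return scatter(d) - scatter(d[left_mask]) - scatter(d[~left_mask])

def entropy(labels):
    """Binary entropy of a pos/neg label set (natural log)."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return -sum(q * np.log(q) for q in (p, 1 - p) if q > 0)

def classification_objective(labels, left_mask):
    """J_C: information gain of a split (cf. Eq. 13)."""
    n = len(labels)
    nl = left_mask.sum()
    return (entropy(labels) - nl / n * entropy(labels[left_mask])
            - (n - nl) / n * entropy(labels[~left_mask]))
```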

where |S| denotes the cardinality of the respective set S and p_H(b) is calculated as the normalized empirical histogram of the positive (pos) or negative (neg) class corresponding to the training voxels in S_n. For later testing, the probability p_H(b) of being the landmark is stored at each leaf node ζ.

2.2.1. LocalRRF - accurate candidate generation

As the first stage of our method, different approaches for feature extraction could be used to deliver accurate but locally ambiguous predictions, e.g. convolutional neural networks as in Aubert et al. (2016). However, we implemented the localRRF stage using Hough regression RFs (Gall et al., 2011) to benefit from having a unified framework for all stages of our method. To capture the local appearance model of each landmark i as required in (8), we select training voxels S_0 locally around the landmark within a radius r_rrf and use only Haar-like appearance features with a small maximum offset o_rrf and size s_rrf (see Fig. 1, top). LocalRRF is trained as a regression RF using objective function (12). During testing, all voxels of a previously unseen image I are pushed through all T_E trees of localRRF until they reach the leaf nodes. Based on the voxel's position and the relative voting vector d_ζ stored in the reached leaf node, each tree contributes to an accumulator image a_i defining the most likely location of landmark i ∈ {1, .., L} in the coordinate system of the image I. We extract the N_M strongest maxima P_i from the accumulator a_i of each landmark i, defining possible candidate locations, which are accurately located on locally similar structures (see Fig. 1, top). Finally, the strongest candidate x_i^0 of each landmark i defines the landmark configuration x^0 = {x_1^0, .., x_L^0} that is used to initialize the coordinate descent optimization.

2.2.2. Global context random forest (GCRF)

A crucial part of our proposed method is the integration of appearance information and the geometrical configuration of landmarks in our global context RF model as required in (8). Global appearance information is captured using Haar-like features with a large maximum offset o_gcrf and a large maximum size s_gcrf. The geometrical configuration is captured using GFs, where during training for each landmark i, information about the ground-truth locations of all other landmarks x*_{\i} is used. Since the ground-truth positions of the other landmarks are not available during testing, they are replaced by the best beliefs of where the other landmarks are located (x^t_{\i}), as explained in Section 2.1.1. To overcome overfitting to the ground-truth landmark positions x*_{\i}, we train the GCRF as an ensemble with a bagging strategy by generating for each tree a random pool of K reference landmarks Re(i), which are used as input for computing the GFs (see Fig. 2).

Optimizing the classification objective (13), GCRF training starts by selecting N_pos positive and N_neg negative voxels from each training image. Inspired by the voxel selection in Štern et al. (2016), training voxels for landmark i are randomly selected within a maximal radius r_gcrf around the candidate locations P_i obtained for the training images with localRRF. Voxels around the ground-truth landmark position are defined as positive; the remaining training voxels are used as negatives. Thus, negative voxels are concentrated around locally similar anatomical structures, forcing the GCRF to learn how to distinguish potentially ambiguous landmark candidates. During testing, to calculate the conditional probability of a candidate location being landmark i given the current beliefs x^t_{\i} and the image I, as required in (8), we push the candidate locations P_i through all trees and average the probabilities p_H(b) of being landmark i stored in the reached leaf nodes of all trees. Embedded in the coordinate descent scheme, this leads to an iterative improvement of all landmark locations x^t according to information on both appearance and geometrical configuration. Thus, an implicit connection between the landmarks modeled independently in separate RFs is established.

3. Experimental setup

To show that our proposed landmark localization algorithm is applicable to a large variety of medical data from different 2D and 3D imaging modalities, we evaluate performance on five challenging datasets (three publicly available, two in-house) and extensively compare our results to state-of-the-art landmark localization methods. As evaluation measure, we use the Euclidean distance between estimated and ground-truth landmark positions.
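The coordinate descent scheme of Section 2.2.2 can be sketched as follows; this is a minimal sketch in which `score` stands in for the averaged GCRF leaf probabilities and the candidate lists stand in for the localRRF maxima P_i (all names are ours, not from the authors' implementation):

```python
def coordinate_descent(candidates, score, n_iters=4):
    """Iteratively refine a joint landmark configuration.

    candidates: per-landmark lists of candidate locations P_i, strongest first.
    score(i, c, others): stand-in for the averaged GCRF leaf probability of
    candidate c being landmark i, given the current beliefs `others` about
    the locations of all other landmarks.
    """
    # x^0: initialize with the strongest localRRF candidate of each landmark
    x = [cands[0] for cands in candidates]
    for _ in range(n_iters):
        # simultaneous update: every landmark re-selects its best candidate
        # given the previous iteration's beliefs about the others
        x = [max(cands,
                 key=lambda c: score(i, c, [p for j, p in enumerate(x) if j != i]))
             for i, cands in enumerate(candidates)]
    return x

# toy 1-D example: two landmarks whose true spacing is 10; the wrong
# strongest candidate (5) of landmark 0 is corrected to 0
toy = coordinate_descent([[5, 0], [10]],
                         lambda i, c, others: -abs(abs(c - others[0]) - 10))
# toy == [0, 10]
```

The simultaneous update over all landmarks mirrors the parallelized implementation described in Section 3.2.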
We perform cross-validation experiments and report accuracy as median deviations on the testing images, while robustness is assessed using the number of outliers, defined as localization errors larger than a dataset-specific threshold. Additionally, the mean and standard deviation of the localization errors are computed.

3.1. Data sets

2D hand dataset (2DHand) consists of 895 2D X-ray hand images of subjects older than 10 years from the publicly available Digital Hand Atlas Database¹. Since the images lack a physical pixel resolution, we normalized the image resolution according to wrist widths as suggested in (Lindner et al., 2015; Štern et al., 2016). We used a histogram matching step to normalize intensities regarding contrast and brightness. This normalization was performed solely inside an Otsu threshold-based segmentation of the outline of the hand, using histogram matching to find a nonlinear intensity mapping that transforms each image according to the histogram of a reference image (Gonzalez and Woods, 1992). For evaluation, L = 37 landmarks were manually annotated by three experts (see Fig. 4). These annotations are available at our website². The dataset was evaluated using a three-fold cross-validation, splitting the images into 67% training and 33% testing data for each fold.

Cephalogram dataset (2DCep) contains 2D X-ray images of the skull from 400 subjects with an image size of 1935 × 2400 pixels and a pixel spacing of 0.1 × 0.1 mm² from the Automatic Cephalometric X-Ray Landmark Detection Challenge (Wang et al., 2016). The annotation consists of L = 19 anatomical landmarks (see Fig. 4). The dataset was evaluated by training on 150 images and testing on

¹ http://www.ipilab.org/BAAweb/, as of Feb. 2017.
² https://tinyurl.com/y7787nsv


Fig. 4. Hand-crafted MRF graph structures for 3DBody, 2DHand and 2DCep. For 3DHand, a graph structure similar to 2DHand was used. Graphs are shown using ground truth landmark annotations for the respective image.

two test-sets, Test1 and Test2 (a total of 250 images), as proposed in Wang et al. (2016). We compare to the challenge results in terms of different success detection thresholds between 2 and 4 mm.

Whole body dataset (3DBody) consists of 20 whole body CTs from the Whole Body Morphometry Project with an average size of 512 × 512 × 1900 voxels and a voxel size of 1.3 × 1.3 × 1 mm³. Data and annotation, consisting of L = 57 anatomical landmarks, were provided by Donner et al. (2013) (compare Fig. 4 for the annotation). The dataset was evaluated using a two-fold cross-validation with 10 training and 10 testing images.

3D hand dataset (3DHand) consists of left hand T1-weighted 3D gradient echo MR images of scans from 60 male Caucasian subjects between 13 and 23 years from an in-house forensic age estimation study (Urschler et al., 2015). The average dimension of the volumes is 294 × 512 × 72 voxels with a voxel size of 0.45 × 0.45 × 0.9 mm³. In each volume, L = 28 landmarks were manually annotated by a scientist, similar to the annotation of 2DHand. The dataset was evaluated using the same setup as in Ebner et al. (2014), a cross-validation with five rounds obtained by randomly splitting into 43 training and 17 testing images.

3D teeth dataset (3DTeeth) consists of 280 3D proton density weighted MR images of the left or right side of the head from an in-house forensic age estimation study (Štern et al., 2017). Left side images were mirrored to create a consistent dataset of images with 208 × 256 × 30 voxels and a physical resolution of 0.59 × 0.59 × 1 mm³ per voxel. Specifying their center locations, two wisdom

teeth per dataset were annotated by a dentist. The dataset was evaluated using a three-fold cross-validation, splitting the data into 67% training and 33% testing data for each fold.

3.2. Implementation details

All trees of the RFs in our proposed method (localRRF and GCRF) were trained independently for each landmark up to a maximum depth of N_D = 25. When experimenting with different numbers of randomly generated features N_F and thresholds N_T, we found that N_F = 20 and N_T = 10 were sufficient to maintain localization performance while minimizing training time. Haar-like appearance features were calculated using integral image data structures (Viola and Jones, 2004), with a memory efficient integral volume implementation (Urschler et al., 2013) being used for the 3D datasets. For localRRF the forests consisted of T_E = 8 trees, while for GCRF we used T_E = 128 trees.

LocalRRF was trained by selecting training voxels inside a radius of r_rrf = 10 mm around each landmark. Haar-like features were restricted to a maximum offset o_rrf = 3 mm and a maximum size s_rrf = 1 mm for the 2DHand and 3DHand datasets; for 2DCep we used o_rrf = 5 mm and s_rrf = 3 mm, and for 3DBody o_rrf = 10 mm and s_rrf = 4 mm. To generate the candidate locations P_i, the N_M = 100 strongest maxima with a minimum distance of 10 mm to each other were extracted using non-maximum suppression.

GCRF was trained by selecting N_pos = N_neg = 10,000 voxels for all datasets, except for 3DBody, where 100,000 voxels per


class were used. The radius r_gcrf for voxel selection in GCRF was set to 10 mm, while the Haar-like appearance feature offset o_gcrf and size s_gcrf were each restricted to 50 mm. We implemented the coordinate descent optimization by simultaneously updating all landmarks during each iteration to benefit from parallelized execution. As will be explained in Section 4.4, we found that K = 4 is a parameter value that can be applied generically to all investigated datasets, except for 3DTeeth, where only two landmarks were available. Therefore, in accordance with the resulting bias-variance tradeoff behavior shown in Fig. 8, we chose to train 10% of the GCRF trees with both geometric and appearance features on this dataset, while the other trees solely used appearance information. For localRRF+MRF, solutions to the MRF models were computed using loopy belief propagation (Wainwright and Jordan, 2008). All experiments were performed on an Intel CoreDuo i7 920 workstation with 24 GB RAM.

4. Experimental results and discussion

In the following sections we provide a detailed discussion of the quantitative and qualitative results from the experimental evaluation, with Section 4.1 describing our MainExperiment and Section 4.2 describing the experiment when no strong geometrical model is available (ReducedExperiment). Further, we investigate in more detail the benefit of combining geometrical and appearance information in Section 4.3 (IntegrationExperiment), evaluate model complexity in Section 4.4 (ModelComplexityExperiment), and comment on the runtime of our proposed method in Section 4.5. We finish the discussion with an outlook in Section 4.6.

4.1. MainExperiment

The main evaluation of our novel approach was performed on a variety of challenging datasets, including three publicly available and one in-house dataset, that show different anatomical structures and come from three distinct imaging modalities (2D X-ray, 3D CT and 3D MRI).
Furthermore, we extensively compared our approach to a large number of state-of-the-art methods representing different strategies (direct regression, model based regularization and forest based regularization) for the localization of multiple anatomical landmarks. As a baseline method representing the direct regression strategy, we used the accurate but ambiguous localRRF, where independent RFs were trained for each landmark. The implementations of the methods Criminisi,2013 and Ebner,2014 were taken from the work of Ebner et al. (2014) and applied to 2DHand, while results for 3DHand were taken from their paper. Both methods were trained as multi-output regression RFs, where each tree votes for all landmark locations simultaneously. Payer et al. (2016) provided implementations of two convolutional neural network based landmark localization algorithms, involving their SpatialConfigurationNet (Payer,2016) and a U-Net (Ronneberger et al., 2015) that was adapted for the localization task. Both algorithms were applied to all four datasets. Lindner et al. (2015) applied their model based regularization method to our 2DHand dataset using our preprocessed images, and for 2DCep we took their results from Wang et al. (2016). Similar to Štern et al. (2016), we implemented an MRF based approach that adds robustness to the localRRF using an explicit graphical model (localRRF+MRF) and applied it to all available datasets. We experimented with different hand-crafted loosely connected MRF configurations and chose the graphical models depicted in Fig. 4. The results of Donner et al. (2013) on 3DBody and Ibragimov et al. (2012) on 2DCep were taken from the respective publications. Finally, we compared our method with forest based regularization approaches. For localRRF+Criminisi,2013 we used the multi-output forest implementation of Criminisi,2013

from Ebner et al. (2014) and multiplied its output with the localRRF response to obtain a probabilistic prediction that takes the geometrical configuration among landmarks into account. For localRRF+Štern,2016 we used the classification based backprojection RF trained independently for each landmark as proposed in Štern et al. (2016), which regularizes the locally accurate but ambiguous predictions of localRRF solely based on appearance information.

Regarding accuracy and robustness, the overall quantitative results of our MainExperiment are shown in Table 1. Fig. 5 compares the better performing methods graphically by cumulative localization error distributions. For our proposed method, Fig. 6(a) qualitatively illustrates the landmark localization results on all five datasets of MainExperiment, drawn relative to the ground truth positions of the depicted image based on the error vectors from cross-validation. To enable this visualization, for the 3D datasets we project the volumetric data to sagittal and coronal views, respectively. Additionally, Fig. 6(b) shows a few characteristic cases where outliers occur. In the following subsections, we compare the performance of the methods representing the direct regression, model based regularization and forest based regularization strategies in more detail.

4.1.1. Direct regression methods

Local appearance based regression methods (localRRF) are capable of accurately localizing landmarks, as can be seen from Table 1 and Fig. 5a-d; however, in the presence of locally similar structures their robustness is low, leading to approximately 10% outliers for all investigated datasets. Extending the scale of appearance features to the whole image as proposed in Criminisi et al. (2013) may improve robustness by giving localization results near to the mean landmark locations seen in the training data, but at the same time reduces the accuracy of predictions (see results of the 2DHand experiment in Table 1).
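The Hough-style vote accumulation underlying localRRF (Section 2.2.1) can be sketched as follows; the offsets below are toy stand-ins for the mean voting vectors d_ζ stored in the trained leaf nodes, and the function name is ours:

```python
def accumulate_votes(shape, leaf_offsets):
    """Hough accumulator a_i: each voxel casts one vote per reached leaf at
    (voxel position + stored mean offset); the accumulator maximum is the
    most likely landmark location."""
    acc = {}
    for pos, offsets in leaf_offsets.items():
        for off in offsets:
            # clamp the vote target to the image domain
            target = tuple(min(max(int(round(p + o)), 0), s - 1)
                           for p, o, s in zip(pos, off, shape))
            acc[target] = acc.get(target, 0) + 1
    return acc

# three voxels whose leaf offsets agree on the location (2, 2)
votes = accumulate_votes((5, 5), {(0, 0): [(2, 2)],
                                  (4, 4): [(-2, -2)],
                                  (1, 3): [(1, -1)]})
# max(votes, key=votes.get) → (2, 2)
```

Extracting the N_M strongest accumulator maxima with non-maximum suppression then yields the candidate set P_i.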
Using the same large scale appearance features, the work of Ebner et al. (2014) showed how to improve accuracy by introducing a distance weighting and a refinement stage, but still the number of outliers is large, especially for the 2DHand dataset (see Table 1 and Fig. 5b). A novel direction in machine learning based medical image analysis is to use deep neural networks to automatically learn the scale and type of appearance features instead of pre-defined Haar-like features. If a large enough training dataset is available and the image resolution is small enough to be stored in memory, the regression method of Payer et al. (2016) and the adaptation of the U-Net (Ronneberger et al., 2015) architecture for landmark localization show state-of-the-art robustness, while, due to working with downsampled images, their accuracy is worse than that of the state-of-the-art methods on 2DHand and 2DCep (see Fig. 5a,c). For the 3DBody dataset, which has only 10 available training images that additionally are too large for downsampling without merging individual landmarks, both CNN methods failed to converge during training.

4.1.2. Model based regularization methods

Starting from accurate local appearance based regression methods like localRRF, robustness can be improved by introducing an explicit model of geometric configuration for regularization. Using a constrained local model for regularization as in Lindner et al. (2015) shows the best results on the 2DCep dataset as well as results close to the best for 2DHand, when using their implementation with parameters tuned for the respective datasets (see Table 1 and Fig. 5a and c). Drawbacks of their method are its dependency on a good initialization of the multi-scale landmark fitting algorithm and that an extension to 3D data is not trivial.
After discretizing predictions from a localRRF, finding the best geometric configuration from a set of landmark candidate locations can efficiently be performed by pre-defining a graphical model and


Fig. 5. Cumulative localization error distributions for different datasets.

solving for the optimal configuration. As illustrated in Fig. 5a-d, in our experiments such an approach (localRRF+MRF) shows one of the best performances on all evaluated datasets, being robust in terms of outliers while keeping high accuracy. To obtain these results we had to test different hand-crafted graphical models for each dataset. For 2DHand, 2DCep and 3DHand we found loosely connected, cyclic graphs for which belief propagation could be used to efficiently compute solutions. However, for 3DBody we had to further reduce the number of connections for belief propagation, which was achieved by modeling the connections in a tree structure reflecting anatomy at different scales (see Fig. 4). Donner et al. (2013) proposed to find an optimal graph structure for the MRF automatically, which could be solved using standard belief propagation as well. This approach is closely related to the localRRF+MRF model, but our hand-crafted MRF model performs better than the automatically generated one on the 3DBody dataset (see Table 1). Using a different strategy, Ibragimov et al. (2012) investigated a fully connected graph for representing landmark connections. However, since belief propagation is computationally intractable in this case, they used a game-theoretic optimization approach to find a Nash equilibrium. Comparison on the 2DCep dataset indicates slightly worse performance of this approach in both accuracy and robustness compared to the state-of-the-art methods (see Fig. 5c).

4.1.3. Forest based regularization methods

Random forest based approaches can also be used to disambiguate the locally accurate but ambiguous predictions of local appearance based regression methods. In our experiments we investigated two such regularization models that are solely trained on Haar-like appearance features. Using localRRF+Criminisi,2013, the arrangement between landmarks encoded in the leaf nodes of the multi-output regression random forest (Criminisi et al., 2013) is taken into account for regularization. As shown in Table 1, an improvement in accuracy and robustness over Criminisi,2013 can be observed for 2DHand and 3DHand; however, all evaluated methods with a strong geometrical model outperform it. Although trained independently for each landmark as a classification RF, localRRF+Štern,2016 does not have any embedded geometrical model, but due to its backprojection step, which focuses on hard to distinguish locally similar structures, it shows state-of-the-art results on 2DCep (see Fig. 5c); however, on 2DHand and 3DBody it lacks robustness (see Table 1). These comparisons show that training an RF to discriminate locally similar structures based on


Fig. 6. Qualitative results on all five datasets, presented on 2D projections for 3D datasets. Images in (a) show the results of all cross-validation folds with error vectors drawn relative to the ground truth landmark positions of the respective image, while the images (b) depict some specific outlier cases of predicted landmarks compared to their ground-truth locations.

appearance features alone is not sufficient and a stronger model of the anatomical configuration of landmarks is needed. In our proposed approach, we used our novel geometric features to integrate a strong regularization model into the RF framework. This allows geometrical information of relative distances between landmarks to interact with appearance information during coordinate descent optimization, thus iteratively improving initial

landmark locations from an accurate but ambiguous local prediction stage. On the 2DHand dataset, the performance of our integrated approach is in line with the accuracy of the best performing methods using explicit geometric models (localRRF+MRF, Lindner et al. (2015)) and with the robustness of the upcoming CNN-based approaches (see Table 1 and Fig. 5a). Regarding results on the 2DCep dataset, the approaches Lindner et al. (2015), localRRF+MRF,

Table 1
Landmark localization results of MainExperiment, extensively comparing our proposed method with state-of-the-art approaches (localRRF; Criminisi,2013; Ebner,2014; Payer,2016; U-Net; Lindner,2015; localRRF+MRF; Ibragimov,2012; Donner,2013; localRRF+Criminisi,2013; localRRF+Štern,2016; proposed) on the 2DHand, 2DCep, 3DBody and 3DHand datasets. Accuracy and robustness in terms of median (mdn), mean and standard deviation (std) of errors as well as number and percentage of outliers.

Table 2
Localization results of ReducedExperiment, given as the number of outliers when using a subset of 2DHand landmarks and 3DTeeth.

Method                  2DHand > 10 mm    3DTeeth > 7 mm
localRRF+MRF            495 (9.22%)       74 (13.41%)
localRRF+Štern, 2016    11 (0.20%)        40 (7.25%)
proposed                4 (0.07%)         27 (4.89%)

Table 3
Localization results with different numbers of candidates on the 2DHand dataset. We use the same error measures as in Table 1.

Number of candidates        mdn (mm)   Outliers > 10 mm   Mean (mm)   std (mm)
5 candidates                0.52       99 (0.898%)        1.26        5.54
10 candidates               0.52       22 (0.200%)        0.89        2.53
25 candidates               0.52       8 (0.073%)         0.83        1.66
50 candidates               0.52       6 (0.054%)         0.81        1.03
100 candidates (proposed)   0.52       5 (0.045%)         0.81        0.95
dense                       0.51       6 (0.054%)         0.80        0.91

localRRF+Štern,2016, and our proposed method do not show practically relevant differences (see Fig. 5c). On the 3DBody dataset, we also achieve nearly the same result as the best performing localRRF+MRF (see Fig. 5d). Finally, despite the accuracy of the CNN-based methods (see Fig. 5b), for the 3DHand dataset we achieve the overall best quantitative results compared to all approaches in Table 1. Our MainExperiment therefore shows that our method delivers state-of-the-art results on all investigated datasets coming from a variety of 2D and 3D imaging modalities, demonstrating its generic applicability by effectively putting a point distribution model into our RF learning framework.

4.2. ReducedExperiment

While the strategies combining local appearance based regression with a graphical model show state-of-the-art localization results (see Section 4.1.2), they are limited when no strong geometric model is available. Such limitations range from the case of weak correlation between landmarks, where the contribution of a geometric model is negligible, up to the extreme case of only a single annotated landmark being available, where the use of a graphical model is not meaningful. Furthermore, the need for costly and tedious expert annotation is a regular problem in medical image analysis, with the localization of age-relevant anatomical structures for automatic age estimation (Štern et al., 2014) being one specific example. In multi-factorial age estimation (Schmeling et al., 2011), several anatomical structures like hand bones and teeth have to be localized for studying the closing of epiphyseal gaps and the mineralization of tooth roots and crowns. This could be achieved by the costly expert annotation of all teeth as well as hand bones, as in our 2DHand dataset, or by solely annotating the wisdom teeth (Demirjian et al., 1973) as well as the radius and ulna bones (Tanner et al., 1983) in an age estimation application.
In our ReducedExperiment, we therefore reduced the number of investigated landmarks to six radius/ulna landmarks in our 2DHand dataset, and we utilized the 3DTeeth dataset where only two landmarks representing the wisdom teeth are available. This experiment was performed for localRRF+MRF and localRRF+Štern,2016 in addition to our proposed method. As can be seen from Table 2, the localization performance of the localRRF+MRF method explicitly representing geometrical configuration is poor in terms of outliers compared to our proposed approach. Thus, if solely a low number of weakly correlated annotated landmarks is available, our method outperforms state-of-the-art approaches based on a strong graph-


Fig. 7. Cumulative distribution of the results of our algorithm on one cross-validation fold of the 2DHand dataset, when using different feature types for GCRF.

ical model (localRRF+MRF) or on forest-based regression from appearance information (localRRF+Štern,2016). This performance gain is enabled by our integration of geometrical landmark configuration with appearance information into the nodes of the GCRF regularizer, where geometric and Haar-like appearance features are randomly generated. This integration will be further investigated in the following subsection. 4.3. IntegrationExperiment Solely using appearance information to select the best among a set of local appearance based landmark candidates, localRRF+Štern,2016 outperforms the model-based localRRF+MRF method when only a low number of weakly correlated annotated landmarks is available (see Table 2). On the other hand, when there is strong correlation between landmarks, we show in MainExperiment that methods utilizing geometrical configuration are superior (see Table 1 and Fig. 5). In our proposed approach, we therefore exploit the strengths of integrating appearance information with geometric landmark configuration for regularization in the localization task. To investigate this integration in more detail, we compared the cumulative error distribution of our proposed approach with the two variants, where solely Haar-like appearance features, very similar to localRRF+Štern,2016, or solely GFs were used in the nodes of the GCRF regularizer. This experiment was performed on a single fold of the 2DHand cross-validation that had the largest number of outliers in MainExperiment. Fig. 7 shows that both types of features alone demonstrate reasonable localization results (43 and 96 outliers, respectively), but are outperformed significantly regarding robustness (13 outliers) by our proposed combination that integrates both kinds of features. 
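To make this interplay concrete, one plausible form of the GFs used in the GCRF nodes — relative offsets and distances from a voxel to each tree's random reference pool Re(i) — can be sketched as follows (function names and the exact feature form are illustrative; the paper's precise GF definition is given in its Section 2):

```python
import math
import random

def sample_reference_pool(num_landmarks, i, k, rng):
    """Bagging over landmarks: each tree draws its own random pool Re(i) of
    k reference landmarks, excluding landmark i itself."""
    return rng.sample([j for j in range(num_landmarks) if j != i], k)

def geometric_features(voxel, reference_positions):
    """Relative offsets and Euclidean distances from `voxel` to the current
    best beliefs about the reference landmarks' locations."""
    feats = []
    for ref in reference_positions:
        offset = [v - r for v, r in zip(voxel, ref)]
        feats.extend(offset)                                  # relative position
        feats.append(math.sqrt(sum(o * o for o in offset)))   # relative distance
    return feats

# a 2-D voxel at (3, 4) relative to one reference landmark at the origin
# geometric_features((3, 4), [(0, 0)]) → [3, 4, 5.0]
```

At each tree node, such GFs compete with Haar-like appearance features during the randomized split selection, so a tree can pick whichever cue is more discriminative.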
Therefore, both kinds of information are needed to achieve the state-of-the-art results of our MainExperiment, while our method outperforms both the approaches using solely appearance information and those using solely a graphical model when there is weak correlation among landmarks, as in our ReducedExperiment. Thus, of the compared approaches, our proposed method is the most flexible regarding the representation of different landmark configurations.

4.4. ModelComplexityExperiment

In our proposed approach landmark locations are iteratively improved from previous predictions, which has similarities to the auto context strategy (Tu and Bai, 2010) that is based on a cascade of stages. The drawback of auto context is the need for a

Fig. 8. Performance evaluation on one cross-validation fold of the 2DHand dataset using different numbers of landmarks K selected as subsets for each tree, with different initializations. The selected optimal value is K = 4. The red line shows the algorithm's behavior when initialized using localRRF, the blue line when initialized with ground truth locations. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

large amount of data separated into individual sets for training a model in each cascade stage. Contrary to that, our single prediction model is trained on the full amount of data, while predictions are iteratively improved at testing time. On the 2DHand dataset, we exemplarily show in Fig. 3(a) how the number of outliers decreases during the coordinate descent iterations, reaching an optimum after four iterations. This behavior is visualized qualitatively in Fig. 3(b), where, starting from the localRRF initialization, two depicted landmarks are iteratively refined.

Since we use the annotated ground-truth locations of landmarks (compare Fig. 4) for our GFs, training a model with high model complexity by using all other L − 1 landmarks leads to overfitting to the ground-truth landmark configuration. We show this with our ModelComplexityExperiment, performed on a reduced training and test set consisting of 50% of the images of one fold of the 2DHand cross-validation, due to the large number of training steps required for the evaluation. Overfitting can be seen as the difference between the red and blue lines for K = L − 1 = 36 in Fig. 8, where the red line shows the error in terms of the number of outliers larger than 10 mm on the test set when the coordinate descent optimization is initialized with the localRRF predictions x^0. When the localRRF prediction is artificially replaced with the known ground-truth locations x* for initializing coordinate descent, the blue error line resembles the best-case scenario for testing, since it simulates the case that the local regression stage produces a perfect initialization. At the same time, the blue line represents the worst-case scenario of the training error, allowing us to observe the tradeoff between bias and variance when varying the model complexity parameter K.
To overcome overfitting to the ground-truth landmark configuration, we use a bagging strategy and ensemble learning by reducing the number of randomly selected landmarks K when training the GFs of each tree of the RF. Effectively, by lowering K in our bagging strategy, we reduce the probability of a tree to select landmarks being located on anatomically infeasible positions, thus increasing the probability of the tree voting for the correct position and leading to more robust predictions of the RF ensemble. While a model with large K leads to a low bias–high variance model that does not generalize well, reducing K to too small numbers generates a high bias model with low prediction performance in terms of robustness. When varying the model complexity parameter K, we observed for all of our evaluated datasets a similar


behavior regarding the tradeoff between bias and variance, with the sweet spot located between 10% and 20% of the total number of landmarks, as exemplarily shown in Fig. 8 for 2DHand. By choosing K = 4 we obtained a generically applicable parameter value located inside this range for all investigated datasets, except for the special case of the 3DTeeth dataset, where only one other landmark can be used for computing GFs. Nevertheless, by choosing 10% of trees trained on both geometric and appearance features, while the other trees use appearance information only, we reproduce the same bias-variance tradeoff behavior as for the other evaluated datasets.

4.5. Runtime

Exemplarily shown for the 2DHand dataset, on our workstation localRRF requires an execution time of 56 s per image during testing, due to the need for pushing all pixels/voxels of an image through the RF. The runtime of this method scales with the number of input pixels/voxels. All methods that use localRRF as a first step require additional time for regularization. For Štern,2016 an additional 56 s are needed, MRF requires three extra seconds, and Criminisi,2013 as well as our proposed method need an additional four seconds. The total runtime of the approach from Ebner et al. (2014) is 60 s. Without being able to directly compare runtimes due to the implementations on their workstations, the reported runtimes of the CNN-based methods (Payer et al., 2016) were ten seconds, and the fastest reported timings are those from Lindner et al. (2015) with five seconds. To produce the results of our proposed method, we did not use any GPU acceleration for forest evaluation as proposed e.g. in Sharp (2008), nor did we investigate a hierarchical localization approach as in Lindner et al. (2015).
While runtimes of the regularization steps for our proposed method and comparable state-of-the-art methods are essentially the same, we did not focus on execution speed in our C++ implementation of the localRRF, thus improvements in runtime are possible.

4.6. Outlook

From our quantitative results we still see a few outliers that are caused by occlusions in the datasets during testing, as illustrated in Fig. 6(b), where A shows a missing tooth in the lower jaw, and B an occlusion due to a label on the X-ray image. We intend to further study this type of limitation of our method in future work, by investigating datasets with stronger occlusions or varying fields of view, where not all landmarks are always visible. This problem is also of interest in computer vision tasks like human pose estimation (Girshick et al., 2011; Chen and Yuille, 2014), which could be a potential future application of our method. Another source of outliers in our results are landmarks whose annotation is ambiguous due to hard-to-define anatomical structures; an example is shown in Fig. 6(b) with the cyan landmark in C and D. However, our method shares this problem with all other state-of-the-art approaches. Finally, we have not yet explored how models containing many densely sampled landmarks, as often used in point distribution models for segmentation of anatomical structures, will behave in our framework, an issue that may be considered another interesting direction for future work.

5. Conclusions

State-of-the-art methods for multiple anatomical landmark localization use locally accurate but ambiguous predictions regularized by explicit models of geometric configuration. While in these methods appearance and geometric information do not interact during the landmark prediction stage, our proposed novel approach follows this regularization strategy differently by combining geometric landmark configuration with appearance information in a unified random forest based framework. Thus, our method is capable of learning from both kinds of information simultaneously, internally deciding on the combination of geometric configuration and appearance information that is most informative for a specific localization task. Without requiring an explicit hand-crafted geometric model and with only a few iterations of coordinate descent, our model is capable of regularizing initial predictions of landmark positions according to the best beliefs of where other landmarks are located. From the extensive comparison of our proposed approach to various state-of-the-art methods on a total of five challenging datasets, we observe that there is currently no single method that dominates localization performance, despite compared methods being tuned for best parameters in author-provided or reimplemented versions. However, our approach is able to achieve results that are best or similar to the individually best performing methods for all datasets simultaneously, without dataset-specific parameter tuning or the need for task-specific hand-crafted models, thus showing its generic applicability and flexibility. Moreover, when there is no strong correlation among landmarks and therefore appearance information might be most informative for the localization task, our novel combination of otherwise separate knowledge about appearance information and geometric configuration outperformed the compared state-of-the-art methods. Due to its simplicity, and because it does not require an explicitly defined model of geometric configuration, we believe that our novel flexible approach will generalize well to other application scenarios.
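The few iterations of coordinate descent mentioned above can be illustrated with a toy sketch (Python; the candidate sets and the unary/pairwise scores below are made-up stand-ins for the learned appearance and geometric terms of the actual method, not the paper's potentials):

```python
def coordinate_descent(candidates, unary, pairwise, n_iter=3):
    """Update one landmark at a time: pick the candidate maximizing its
    appearance-like (unary) score plus geometric compatibility with the
    current estimates of all other landmarks."""
    # initialize each landmark from appearance information only
    sel = [max(range(len(u)), key=u.__getitem__) for u in unary]
    for _ in range(n_iter):
        for l in range(len(candidates)):
            def score(c):
                return unary[l][c] + sum(
                    pairwise(candidates[l][c], candidates[m][sel[m]])
                    for m in range(len(candidates)) if m != l)
            sel[l] = max(range(len(candidates[l])), key=score)
    return sel

# Two landmarks on a 1D axis; the geometric term prefers a distance of 5.
candidates = [[0.0, 1.0], [4.0, 9.0]]
unary = [[0.0, 0.1], [0.0, 0.2]]  # appearance alone would pick positions (1, 9)

def pairwise(p, q):
    return -abs(abs(p - q) - 5.0)

sel = coordinate_descent(candidates, unary, pairwise)
print([candidates[l][c] for l, c in enumerate(sel)])  # → [0.0, 4.0]
```

Starting from the appearance-only choice (1, 9), the geometric term pulls the joint estimate to the consistent pair (0, 4) within two sweeps, mirroring how the iterative refinement regularizes initial predictions.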

Acknowledgements

We thank the anonymous reviewers for their thorough work and excellent comments that helped us to improve our manuscript. We acknowledge Franz Zehentner for helping with our quantitative evaluation and Alexander Shekhovtsov for fruitful discussions. We thank Claudia Lindner for applying their algorithm on our 2DHand dataset, Christian Payer for results from the CNN-based methods, and Rene Donner for providing 3DBody landmark annotations. This work was partly supported by the KIRAS program (no. 850183) under supervision of the Austrian Research Promotion Agency (FFG) as well as by the Austrian Science Fund (FWF): P28078-N33.

Appendix A. Number of candidates from localRRF predictions

We performed an additional experiment on one cross-validation fold of the 2DHand dataset, where we compare the results of varying the number of candidates per landmark (5, 10, 25, 50, 100) with the fully probabilistic treatment in which all voxels in the image are used as candidates (dense). Results of this experiment are shown in Table 3. We conclude that even with 50 candidates, accurate and robust landmark localization can be achieved, thus our choice of 100 selected candidates is a conservative one. Further increasing the number of candidates towards a fully probabilistic treatment does not bring any benefit in localization performance; however, it would lead to an unnecessary increase in runtime.

References

Aubert, B., Vazquez, C., Cresson, T., Parent, S., De Guise, J., 2016. Automatic spine and pelvis detection in frontal X-rays using deep neural networks for patch displacement learning. In: Proc. Int. Symp. Biomed. Imaging, pp. 1426–1429. doi:10.1109/ISBI.2016.7493535.



Beichel, R., Bischof, H., Leberl, F., Sonka, M., 2005. Robust active appearance models and their application to medical image analysis. IEEE Trans. Med. Imag. 24 (9), 1151–1169. doi:10.1109/TMI.2005.853237.
Besbes, A., Komodakis, N., Langs, G., Paragios, N., 2009. Shape priors and discrete MRFs for knowledge-based segmentation. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 1295–1302. doi:10.1109/CVPRW.2009.5206649.
Breiman, L., 1996. Bagging predictors. Mach. Learn. 24 (421), 123–140. doi:10.1007/BF00058655.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Chen, X., Yuille, A.L., 2014. Articulated pose estimation by a graphical model with image dependent pairwise relations. Adv. Neural Inf. Process. Syst. 1736–1744.
Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J., 1995. Active shape models - their training and application. Comput. Vis. Image Understand. 61 (1), 38–59.
Criminisi, A., Robertson, D., Konukoglu, E., Shotton, J., Pathak, S., White, S., Siddiqui, K., 2013. Regression forests for efficient anatomy detection and localization in computed tomography scans. Med. Image Anal. 17 (8), 1293–1303.
Demirjian, A., Goldstein, H., Tanner, J.M., 1973. A new system of dental age assessment. Hum. Biol. 45 (2), 211–227.
Doi, K., 2007. Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Comput. Med. Imag. Graph. 31 (4–5), 198–211. doi:10.1016/j.compmedimag.2007.02.002.
Donner, R., Menze, B.H., Bischof, H., Langs, G., 2013. Global localization of 3D anatomical structures by pre-filtered hough forests and discrete optimization. Med. Image Anal. 17 (8), 1304–1314. doi:10.1016/j.media.2013.02.004.
Donner, R., Micusik, B., Langs, G., Szumilas, L., Peloschek, P., Friedrich, K., Bischof, H., 2007. Object localization based on Markov random fields and symmetry interest points. In: Proc. Med. Image Comput. Comput. Interv. Springer, pp. 460–468.
Ebner, T., Štern, D., Donner, R., Bischof, H., Urschler, M., 2014. Towards automatic bone age estimation from MRI: localization of 3D anatomical landmarks. In: Proc. Med. Image Comput. Comput. Interv. Springer, pp. 421–428. doi:10.1007/978-3-319-10470-6_53.
Felzenszwalb, P.F., Huttenlocher, D.P., 2005. Pictorial structures for object recognition. Int. J. Comput. Vis. 61 (1), 55–79. doi:10.1023/B:VISI.0000042934.15159.49.
Gall, J., Yao, A., Razavi, N., van Gool, L., Lempitsky, V., 2011. Hough forests for object detection, tracking, and action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 33 (11), 2188–2201.
Gauriau, R., Cuingnet, R., Lesage, D., Bloch, I., 2015. Multi-organ localization with cascaded global-to-local regression and shape prior. Med. Image Anal. 23 (1), 70–83. doi:10.1016/j.media.2015.04.007.
Girshick, R., Shotton, J., Kohli, P., Criminisi, A., Fitzgibbon, A., 2011. Efficient regression of general-activity human poses from depth images. In: Proc. IEEE Int. Conf. Comput. Vis., pp. 415–422. doi:10.1109/ICCV.2011.6126270.
Glocker, B., Feulner, J., Criminisi, A., Haynor, D.R., Konukoglu, E., 2012. Automatic localization and identification of vertebrae in arbitrary field-of-view CT scans. In: Proc. Med. Image Comput. Comput. Interv., pp. 590–598. doi:10.1007/978-3-642-33454-2_73.
Glocker, B., Pauly, O., Konukoglu, E., Criminisi, A., 2012. Joint classification-regression forests for spatially structured multi-object segmentation. In: Proc. Eur. Conf. Comp. Vis., 7575 LNCS, pp. 870–881. doi:10.1007/978-3-642-33765-9_62.
Glocker, B., Zikic, D., Konukoglu, E., Haynor, D.R., Criminisi, A., 2013. Vertebrae localization in pathological spine CT via dense classification from sparse annotations. In: Proc. Med. Image Comput. Comput. Interv., 8150 LNCS, pp. 262–270. doi:10.1007/978-3-642-40763-5_33.
Gonzalez, R.C., Woods, R.E., 1992. Digital Image Processing, second ed. Addison-Wesley, Reading, MA, USA.
Hajnal, J., Hill, D., Hawkes, D.J. (Eds.), 2001. Medical Image Registration. CRC Press, Boca Raton.
Heimann, T., Meinzer, H.P., 2009. Statistical shape models for 3D medical image segmentation: a review. Med. Image Anal. 13 (4), 543–563. doi:10.1016/j.media.2009.05.004.
Ibragimov, B., Likar, B., Pernuš, F., Vrtovec, T., 2012. A game-theoretic framework for landmark-based image segmentation. IEEE Trans. Med. Imag. 31 (9), 1761–1776. doi:10.1109/TMI.2012.2202915.
Johnson, H.J., Christensen, G.E., 2002. Consistent landmark and intensity-based image registration. IEEE Trans. Med. Imag. 21 (5), 450–461. doi:10.1109/TMI.2002.1009381.
Kelm, B.M., Wels, M., Zhou, S.K., Seifert, S., Suehling, M., Zheng, Y., Comaniciu, D., 2013. Spine detection in CT and MR using iterated marginal space learning. Med. Image Anal. 17 (8), 1283–1292. doi:10.1016/j.media.2012.09.007.
Lay, N., Birkbeck, N., Zhang, J., Zhou, S.K., 2013. Rapid multi-organ segmentation using context integration and discriminative models. In: Proc. Inf. Proc. Med. Imag., 7917 LNCS, pp. 450–462. doi:10.1007/978-3-642-38868-2_38.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324.
Lindner, C., Bromiley, P.A., Ionita, M.C., Cootes, T.F., 2015. Robust and accurate shape model matching using random forest regression-voting. IEEE Trans. Pattern Anal. Mach. Intell. 37 (9), 1862–1874. doi:10.1109/TPAMI.2014.2382106.
Oktay, O., Bai, W., Guerrero, R., Rajchl, M., de Marvao, A., O'Regan, D.P., Cook, S.A., Heinrich, M.P., Glocker, B., Rueckert, D., 2017. Stratified decision forests for accurate anatomical landmark localization in cardiac images. IEEE Trans. Med. Imag. 36 (1), 332–342. doi:10.1109/TMI.2016.2597270.
Pauly, O., Glocker, B., Criminisi, A., Mateus, D., Möller, A.M., Nekolla, S., Navab, N., 2011. Fast multiple organ detection and localization in whole-body MR Dixon sequences. In: Proc. Med. Image Comput. Comput. Interv., 6893 LNCS, pp. 239–247. doi:10.1007/978-3-642-23626-6_30.
Payer, C., Štern, D., Bischof, H., Urschler, M., 2016. Regressing heatmaps for multiple landmark localization using CNNs. In: Proc. Med. Image Comput. Comput. Interv., pp. 230–238.
Peter, L., Pauly, O., Chatelain, P., Mateus, D., Navab, N., 2015. Scale-adaptive forest training via an efficient feature sampling scheme. In: Proc. Med. Image Comput. Comput. Interv., pp. 637–644.
Potesil, V., Kadir, T., Platsch, G., Brady, M., 2015. Personalized graphical models for anatomical landmark localization in whole-body medical images. Int. J. Comput. Vis. 111 (1), 29–49. doi:10.1007/s11263-014-0731-7.
Richmond, D., Kainmueller, D., Glocker, B., Rother, C., Myers, G., 2015. Uncertainty-driven forest predictors for vertebra localization and segmentation. In: Proc. Med. Image Comput. Comput. Interv. Springer, pp. 653–660. doi:10.1007/978-3-319-24553-9_80.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: convolutional networks for biomedical image segmentation. In: Proc. Med. Image Comput. Comput. Interv., pp. 234–241. doi:10.1007/978-3-319-24574-4_28.
Schmeling, A., Garamendi, P.M., Prieto, J.L., Landa, M.I., 2011. Forensic age estimation in unaccompanied minors and young living adults. In: Vieira, D.N. (Ed.), Forensic Med. - From Old Probl. to New Challenges. InTech, pp. 77–120. doi:10.5772/19261.
Sharp, T., 2008. Implementing decision trees and forests on a GPU. In: Proc. Eur. Conf. Comput. Vis., 5305 LNCS, pp. 595–608. doi:10.1007/978-3-540-88693-8_44.
Štern, D., Ebner, T., Bischof, H., Grassegger, S., Ehammer, T., Urschler, M., 2014.
Fully automatic bone age estimation from left hand MR images. In: Proc. Med. Image Comput. Comput. Interv. Springer, pp. 220–227. doi:10.1007/978-3-319-10470-6_28.
Štern, D., Ebner, T., Urschler, M., 2016. From local to global random regression forests: exploring anatomical landmark localization. In: Proc. Med. Image Comput. Comput. Interv., pp. 221–229.
Štern, D., Kainz, P., Payer, C., Urschler, M., 2017. Multi-factorial age estimation from skeletal and dental MRI volumes. In: Proc. MICCAI Workshop Mach. Learn. Med. Imaging (MLMI 2017). Springer International AG, pp. 61–69. doi:10.1007/978-3-319-67389-9_8.
Tanner, J.M., Whitehouse, R.H., Cameron, N., Marshall, W.A., Healy, M.J.R., Goldstein, H., 1983. Assessment of Skeletal Maturity and Prediction of Adult Height (TW2 Method), second ed. Academic Press.
Thodberg, H.H., Kreiborg, S., Juul, A., Pedersen, K.D., 2009. The BoneXpert method for automated determination of skeletal maturity. IEEE Trans. Med. Imag. 28 (1), 52–66. doi:10.1109/TMI.2008.926067.
Toews, M., Arbel, T., 2007. A statistical parts-based model of anatomical variability. IEEE Trans. Med. Imag. 26 (4), 497–508. doi:10.1109/TMI.2007.892510.
Toews, M., Wells, W., Collins, D.L., Arbel, T., 2010. Feature-based morphometry: discovering group-related anatomical patterns. Neuroimage 49 (3), 2318–2327. doi:10.1016/j.neuroimage.2009.10.032.
Tu, Z., Bai, X., 2010. Auto-context and its application to high-level vision tasks and 3D brain image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 32 (10), 1744–1757. doi:10.1109/TPAMI.2009.186.
Urschler, M., Bornik, A., Donoser, M., 2013. Memory efficient 3D integral volumes. In: IEEE Int. Conf. Comput. Vis. Work., pp. 722–729. doi:10.1109/ICCVW.2013.99.
Urschler, M., Grassegger, S., Štern, D., 2015. What automated age estimation of hand and wrist MRI data tells us about skeletal maturation in male adolescents. Ann. Hum. Biol. 42 (4), 356–365. doi:10.3109/03014460.2015.1043945.
Urschler, M., Zach, C., Ditt, H., Bischof, H., 2006. Automatic point landmark matching for regularizing nonlinear intensity registration: application to thoracic CT images. In: Proc. Med. Image Comput. Comput. Interv. Springer, pp. 710–717. doi:10.1007/11866763_87.
Viola, P., Jones, M.J., 2004. Robust real-time face detection. Int. J. Comput. Vis. 57 (2), 137–154.
Wainwright, M.J., Jordan, M.I., 2008. Graphical models, exponential families, and variational inference. Found. Trends® Mach. Learn. 1, 1–305. doi:10.1561/2200000001.
Wang, C.W., Huang, C.T., Lee, J.H., Li, C.H., Chang, S.W., Siao, M.J., Lai, T.M., Ibragimov, B., Vrtovec, T., Ronneberger, O., Fischer, P., Cootes, T.F., Lindner, C., 2016. A benchmark for comparison of dental radiography analysis algorithms. Med. Image Anal. 31, 63–76. doi:10.1016/j.media.2016.02.004.
Wörz, S., Rohr, K., 2006. Localization of anatomical point landmarks in 3D medical images by fitting 3D parametric intensity models. Med. Image Anal. 10 (1), 41–58. doi:10.1016/j.media.2005.02.003.
Wright, S.J., 2015. Coordinate descent algorithms. Math. Program. 151 (1), 3–34. doi:10.1007/s10107-015-0892-3.
Zhang, S., Zhan, Y., Dewan, M., Huang, J., Metaxas, D.N., Zhou, X.S., 2012. Towards robust and effective shape modeling: sparse shape composition. Med. Image Anal. 16 (1), 265–277. doi:10.1016/j.media.2011.08.004.