A multi-scale data fusion framework for bone age assessment with convolutional neural networks

Computers in Biology and Medicine (2019)
PII: S0010-4825(19)30088-5
DOI: https://doi.org/10.1016/j.compbiomed.2019.03.015
Reference: CBM 3243
Received 28 November 2018; Revised 14 March 2019; Accepted 14 March 2019

Please cite this article as: Y. Liu, C. Zhang, J. Cheng, X. Chen, Z.J. Wang, A multi-scale data fusion framework for bone age assessment with convolutional neural networks, Computers in Biology and Medicine (2019), doi: https://doi.org/10.1016/j.compbiomed.2019.03.015.
Yu Liu^a, Chao Zhang^a, Juan Cheng^a, Xun Chen^b,*, Z. Jane Wang^c

a Department of Biomedical Engineering, Hefei University of Technology, Hefei 230009, China.
b Department of Electronic Science and Technology, University of Science and Technology of China, Hefei 230026, China.
c Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada.
Abstract

Bone age assessment (BAA) has various clinical applications, such as the diagnosis of endocrine disorders and the prediction of final adult height for adolescents. Recent studies indicate that deep learning techniques have great potential for automated BAA, with significant advantages over conventional methods based on hand-crafted features. In this paper, we propose a multi-scale data fusion framework for bone age assessment with X-ray images based on the non-subsampled contourlet transform (NSCT) and convolutional neural networks (CNNs). Unlike existing CNN-based BAA methods that feed the original spatial domain image directly into the network, we pre-extract a rich set of features from the input image by performing NSCT to obtain its multi-scale and multi-direction representations. This feature pre-extraction strategy can benefit network training, because the number of annotated examples for BAA is typically quite limited. The obtained NSCT coefficient maps at each scale are fed into a convolutional network individually, and the information from different scales is then merged to produce the final prediction. Specifically, two CNN models with different data fusion strategies are presented for BAA: a regression model with feature-level fusion and a classification model with decision-level fusion. Experiments on the public BAA dataset Digital Hand Atlas demonstrate that the proposed method obtains promising results and outperforms many state-of-the-art BAA methods. In particular, the proposed approaches exhibit clear advantages over the corresponding spatial domain approaches (generally an improvement of more than 0.1 years in mean absolute error), showing great potential for future study in this field.

Keywords: Bone age assessment (BAA), non-subsampled contourlet transform (NSCT), convolutional neural networks (CNNs), feature extraction, data fusion

*Corresponding author. Email addresses: [email protected] (Yu Liu), [email protected] (Xun Chen).

1. Introduction
As a frequently performed procedure for skeletal maturity evaluation in pediatric radiology, bone age assessment (BAA) can determine the discrepancy between a child's skeletal age and chronological age, which may indicate abnormalities in skeletal development. BAA is widely used in the diagnosis of endocrine disorders and the monitoring of growth hormone therapy [1]. In addition, accurate assessment of skeletal age is of great significance for the prediction of a child's adult height [2, 3]. Other related studies may also benefit from the BAA technique, such as the assessment of cortical bone fracture resistance curves for estimating the risk of bone fracture [4].

The most popular approach to BAA is a radiological examination of the left (non-dominant) hand and wrist to reveal the maturity of skeletal development, owing to advantages such as simplicity and minimal radiation exposure. Figure 1 (left subfigure) shows a typical example of a hand radiograph for skeletal age assessment. The image is from the public BAA dataset Digital Hand Atlas [5] (available at http://ipilabmysql.usc.edu/newindex.php). The patient is a 10.49-year-old (chronological age) girl, while the skeletal ages estimated from this radiograph by two radiologists are 11.5 and 11.0 years, respectively. The patient in this case is therefore likely to have an advanced bone age, suggesting possible physical abnormalities (e.g., precocious puberty) [1].

Figure 1: Left: a radiograph from the Digital Hand Atlas dataset for bone age assessment. Right: the distribution of examples in this dataset based on the ground truth bone age.
Currently, there are two widely used clinical methods for radiologists and pediatricians to assess skeletal age from hand X-ray images: Greulich and Pyle (GP) [6] and Tanner-Whitehouse (TW) [2, 7]. The GP method determines the skeletal age by comparing the whole target radiograph with a set of standard bone age maps known as an atlas. In contrast, the TW method gives a more detailed solution to this problem: it considers a set of local regions of interest (ROIs) in the X-ray image and assigns a numerical score to each ROI, which corresponds to a specific bone. The final bone age is estimated by combining all the scores using pre-defined strategies. Although the GP and TW methods have their own strengths, such as the simplicity of GP and the precision of TW, these manual methods share several drawbacks. Both rely heavily on the domain knowledge and experience of radiologists, and considerable intra-rater and inter-rater variability exists in clinical practice [8], as shown in the above example where the two expert readings differ by 0.5 years. In addition, the time efficiency of manual assessment is relatively low [9-11]. As stated in [9], it takes around 30 minutes per image for a radiologist to complete the assessment procedure using the TW method. GP-based evaluation is more efficient due to its simplicity, but an experienced radiologist still generally requires a few minutes to accomplish the task.
In the last two decades, computer-assisted methods for automated bone age assessment have emerged as an active topic in medical image analysis, and a variety of automatic BAA approaches have been proposed [12-25]. These methods usually treat BAA as a classification or regression problem involving essential steps such as hand segmentation, ROI detection, feature extraction, and classifier or regressor design. High-quality X-ray images are usually required to obtain robust hand segmentation and ROI detection results. Moreover, these methods need manually designed distinctive features for the subsequent classification or regression, leading to a heavy dependence on experience and skill. All of these steps have a critical impact on system performance and are highly correlated with each other, which greatly limits the convenience of designing effective BAA methods.
More recently, deep learning techniques [26], and in particular convolutional neural networks (CNNs) [27, 28], have achieved great success in a wide range of medical image analysis problems [29-31]. For bone age assessment, a number of CNN-based methods have been proposed in the last three years [9-11, 32-35], yielding promising results and clear advantages over conventional methods based on hand-crafted features. However, as in most medical imaging applications [30, 31], a main challenge in CNN-based BAA is the lack of annotated data for supervised learning, due to factors such as privacy constraints and the high demand for professional knowledge. Transfer learning is recognized as an effective solution to this problem and has been widely applied in this area: deep models trained on natural images are fine-tuned using the labelled medical images. Despite this, in many cases the scale of a medical image dataset may still be too small for satisfactory training. For instance, the BAA dataset Digital Hand Atlas contains nearly 1400 X-ray images, and the range of the ground truth bone age (defined as the average of the two expert readings given by the dataset) is 0-19 years. The BAA task on this dataset can therefore be naturally modelled as a classification problem with 20 classes, while each class contains only about 70 images on average. The right subfigure in Figure 1 shows the distribution of X-ray images in this dataset based on the ground truth bone age after rounding. Commonly used data augmentation techniques such as cropping and flipping can alleviate this problem, but only to a limited extent.
In this paper, a multi-scale data fusion framework for bone age assessment based on deep convolutional networks is proposed. Unlike the existing CNN-based BAA methods that directly use the original spatial domain image as network input, we pre-extract a rich set of features from the input image by performing the non-subsampled contourlet transform (NSCT), an effective multi-scale geometric analysis approach that can capture multi-scale and multi-direction features of an image. This feature pre-extraction procedure can benefit network training when the quantity of annotated examples in BAA is quite limited. Furthermore, by performing NSCT, the information contained in the spatial domain image is well separated into different scales, which significantly improves the flexibility in designing effective CNN models via data fusion. The main contributions of this paper are summarized as follows:
1. We propose a multi-scale data fusion framework for BAA based on NSCT and CNNs, which has several advantages over conventional spatial domain CNN-based BAA methods.
2. Under the proposed framework, a regression model for BAA using a feature-level fusion strategy is presented.
3. Under the proposed framework, a classification model for BAA using a decision-level fusion strategy is presented.
4. A novel performance evaluation approach for BAA based on an error-inspired cumulative percentage curve is introduced. The curve describes the percentage of validation examples whose error is less than a certain threshold. In comparison to traditional evaluation metrics such as the mean absolute error (MAE), this curve-based evaluation provides much more insight into the performance of a method and offers a more intuitive way to compare different methods.
5. Experimental results on the Digital Hand Atlas dataset demonstrate that the proposed method achieves very competitive performance when compared with state-of-the-art methods.
A preliminary version of this work was reported in [36], which only presented a regression model for bone age assessment. In this paper, we have made a series of extensions, which mainly include:

1. A more complete review of existing CNN-based BAA methods is provided in Section 2.1.
2. For the methodology, a classification BAA model based on NSCT and CNNs with different decision-level fusion rules is added in Section 3.3.
3. A novel performance evaluation approach for BAA based on the error-inspired cumulative percentage curve is presented in Section 4.3. In [36], we only briefly mentioned this idea without a full description or experimental verification. In this paper, the evaluation approach is detailed and its effectiveness is verified.
4. Extensive experimental validations are supplemented in Sections 4.4-4.6. First, a systematic study of different parameter settings, including the number of NSCT decomposition levels and the decision-level fusion rule used in the classification model, is given. Second, a set of experiments focusing on the distinction between the two genders is conducted. Third, the proposed method is compared with more state-of-the-art BAA methods, including a recent CNN-based method [35] and the well-known BoneXpert method [18].
5. More discussions are provided to further clarify the characteristics of the proposed method in Section 5.
The rest of this paper is organized as follows. Section 2 introduces related work and the motivations of this work. Section 3 presents the proposed regression model and classification model for BAA in detail. Experimental results are given in Section 4, and further discussions are provided in Section 5. Finally, Section 6 concludes the paper.

2. Related Work and Motivations

2.1. Convolutional neural networks for bone age assessment
As a representative supervised deep learning architecture, convolutional neural networks (CNNs) have achieved great success in a variety of medical image analysis problems. In the field of bone age assessment (BAA), CNN-based studies started relatively late. In 2016, Stern et al. [32] first proposed a CNN-based BAA method using magnetic resonance imaging (MRI) volumes. They followed the idea of the TW2 method and obtained age information by combining the estimated ossification stages of 13 individual bones. The first CNN-based BAA method using 2D X-ray scans was reported by Chen [33], which presents a classification model for BAA using the VGGNet [37] with transfer learning; the GP framework is applied in that method for its simplicity and efficiency. In 2017, Lee et al. [10] proposed a fully automated CNN-based BAA system involving data preparation, image preprocessing, CNN-based hand region detection and CNN-based age classification. In their method, GoogLeNet [38] is employed as the age classification model under the GP framework. Spampinato et al. [34] introduced a general CNN-based regression model for BAA, which consists of a convolutional network for feature extraction and a regression network for bone age estimation. They also applied the GP framework and tested GoogLeNet and VGGNet with different transfer learning techniques. In addition, they proposed a new architecture for BAA called BoNet by introducing a deformation layer into the convolutional network. Zhou et al. [35] defined five regions of interest (ROIs) in the X-ray image by referring to the TW method; a VGGNet model pre-trained on general imagery is employed to classify each ROI, and a multi-model fusion strategy is adopted to obtain the final prediction. Iglovikov et al. [9] proposed an end-to-end network architecture for automated BAA and studied the impact of ROI selection on algorithm performance by defining three specific regions in the X-ray image. Most recently, Wang et al. [11] proposed a BAA approach that extracts the distal radius and ulna areas from the X-ray image for CNN-based classification.
Figure 2: The schematic diagram of a two-level NSCT decomposition. NSP is the abbreviation of non-subsampled pyramid; NSDFB is the abbreviation of non-subsampled directional filter bank.
2.2. Non-subsampled Contourlet Transform

The non-subsampled contourlet transform (NSCT) [39] is a representative multi-scale geometric analysis approach that has been widely used in many image processing problems such as denoising [40] and fusion [41]. In comparison with earlier multi-scale image decomposition approaches such as the pyramid, wavelet and curvelet transforms, the contourlet transform has been verified to be a more effective image representation, as it captures the details of an image at diverse scales and directions more effectively. By applying a non-subsampling strategy to the contourlet transform, NSCT additionally possesses the shift-invariance property. The implementation of NSCT is mainly based on the non-subsampled pyramid (NSP) decomposition and the non-subsampled directional filter bank (NSDFB). The NSP decomposition is employed to obtain multi-scale representations of the input image from fine to coarse and is achieved by a two-channel non-subsampled 2-D filter bank. At each decomposition level, one high-frequency band and one low-frequency band, each of the same size as the input image, are produced. As a result, an L-level NSP decomposition yields L + 1 sub-bands: L high-frequency bands and one low-frequency band. For each high-frequency band, the NSDFB is applied to obtain its multi-direction representations at the corresponding scale. The band obtained in each direction is also of the same size as the input image. The number of directions at each scale can be adjusted and is typically set to a power of two. Figure 2 shows the schematic diagram of a two-level NSCT decomposition.

Figure 2 shows the schematic diagram of a two-level NSCT decomposition.

184

2.3. Motivations of This Work

SC

RI PT

180

To the best of our knowledge, all the existing CNN-based BAA methods

186

adopt the spatial domain image (either the whole image or ROIs) as network

187

input. However, the problem of BAA usually suffers from a severe lack of

188

annotated examples for network training. Even though some transfer learning

189

and data augmentation approaches have been applied to solve this problem, the

190

effect may still be limited.

191

192

M AN U

185

In this work, we perform NSCT on the spatial domain image before feeding it into the convolutional architecture. The motivations include: 1. The NSCT can effectively extract a rich set of low-level features such as

194

edges, corners and textures with multiple scales and directions, due to

195

its good characteristics described in Section 2.2. As a result, dozens of

196

coefficient maps are generated from a single spatial domain image. Using

197

this feature pre-extraction procedure, a more thorough training for the

198

deep convolutional networks with limited examples can be expected.

EP

TE D

193

2. By applying NSCT, the information contained in the original spatial do-

200

main image can be well separated into different scales. Consequently, the

AC C

199

201

flexibility in designing CNN architectures can be greatly improved and

202

data fusion strategies of different levels (e.g., feature level and decision

203

level) can be easily introduced to promote the system performance.

204

3. In fact, other multi-scale transform approaches such as wavelets and curvelets

205

can also be used for feature pre-extraction. The main consideration of se-

206

lecting NSCT is that its representation ability for an image is widely rec-

207

ognized to be better than most of the other multi-scale transform methods

208

[42].

9

Scale #1 (16 bands)

…… Scale #2 (8 bands)

……

M AN U

Scale #3 (8 bands)

SC

……

RI PT

ACCEPTED MANUSCRIPT

……

Scale #4 (4 bands)

……

TE D

Scale #5 (4 bands)

Figure 3: A 5-level NSCT decomposition on the example image in Figure 1.

3. The Proposed Method

210

3.1. NSCT for hand X-ray images

EP

209

In consideration of the factors including computational efficiency, memory

212

cost and the input size of popular CNN models like VGGNet and GoogLeNet,

213

all the X-ray images are resized to 256 × 256 pixels, keeping aspect ratio via zero

214

padding. The resized version of the X-ray example given in Figure 1 is shown

215

in Figure 3 at the upper-left corner.

AC C

211

216

The NSCT is performed on the resized X-ray image to obtain its multi-scale

217

and multi-direction representations. It is clear that the number of decomposition

218

levels is an important factor in NSCT. To get a grasp of this issue, the results

219

of a 5-level decomposition on the resized image are exhibited in Figure 3. The

10

ACCEPTED MANUSCRIPT

numbers of directional bands at different scales from fine to coarse are set to

221

16, 8, 8, 4 and 4 [42, 43], which is in accord with the principle that a finer

222

scale needs more directions for description. Due to limitation of space, we only

223

show parts of the high-frequency bands at each scale (the absolute values of

224

high-frequency bands are used for a better visualization). The intermediate

225

low-frequency bands are also shown in Figure 3. Please refer to Section 4.2 for

226

more implementation details of NSCT decomposition.

SC

RI PT

220

It can be seen that there exists a clear trend of the features captured at

228

different scales. That is, the features become coarser with the increase of scale.

229

Specifically, the finest details are extracted at the first scale. At the second

230

and third scales, crucial edge-like features are well captured. The features have

231

become very coarse after the fourth scale and only the main contours are ex-

232

tracted in the fifth scale. The consistent trend can be found from the series of

233

intermediate low-frequency bands. The low-frequency band after 5-level decom-

234

position contains almost no valuable information for BAA. Similar observations

235

are obtained from other X-ray images. Therefore, it is not necessary to set the

236

number of decomposition levels larger than five. In Section 4, more quantitative

237

analysis and discussions on this issue are provided.

238

3.2. Regression Model

EP

TE D

M AN U

227

239

Figure 4 shows the schematic diagram of the proposed regression model

240

for BAA. The architecture consists of three modules: NSCT decomposition, convolution and regression.

AC C 241

242

The NSCT decomposition module is used to generate multi-scale and multi-

243

direction features of the input image. An L-level NSCT decomposition is em-

244

ployed. At each scale, all the directional sub-bands are concatenated to generate

245

a data cube of multi-channel format. Thus, L data cubes are constructed in to-

246

tal.

247

The data cubes obtained by NSCT decomposition are fed into the convolu-

248

tion module, which is a multi-stream architecture and each stream is constructed

249

by a series of convolutional layers and max-pooling layers. Considering that the 11

Convolution

Regression

Scale #2

VGGNet-16 convolutional architecture

Concatenation

Scale #1

VGGNet-16 convolutional architecture

SC

NSCT decomposition

RI PT

ACCEPTED MANUSCRIPT

M AN U

Input image

1024

Bone age 128

1 Fully-connected

Fully-connected

Fully-connected

Scale #L

TE D

VGGNet-16 convolutional architecture

Figure 4: Schematic diagram of the proposed regression model for bone age assessment. The

AC C

2×2 Max-pool

3×3 Conv, 512

3×3 Conv, 512

3×3 Conv, 512

2×2 Max-pool

3×3 Conv, 256

3×3 Conv, 256

3×3 Conv, 256

2×2 Max-pool

3×3 Conv, 128

3×3 Conv, 128

3×3 Conv, 128

2×2 Max-pool

3×3 Conv, 128

3×3 Conv, 128

2×2 Max-pool

3×3 Conv, 64

3×3 Conv, 64

EP

details of VGGNet-16 convolutional architecture are given in Figure 5.

VGGNet-16 convolutional architecture

Figure 5: The convolutional architecture of VGGNet-16 [37]. The “Conv” denotes the convolutional layer and the kernel size is uniformly 3 × 3. The number of output feature maps per

layer is given. The “Max-pool” denotes the max-pooling layer and the kernel size is uniformly 2 × 2. Please refer to [37] for more details.

12

ACCEPTED MANUSCRIPT

16-layer VGGNet [37] (denoted as VGGNet-16) has been verified to be a very

251

effective network for BAA [9, 11, 33–35], it is employed as the architecture of

252

each stream in this work. The convolutional architecture of the VGGNet-16 is

253

shown in Figure 5. It contains 13 convolutional layers and 5 max-pooling layers

254

in total. More details of this network can be found in [37]. As with all the

255

existing CNN-based BAA methods, transfer learning is applied in our method.

256

Specifically, the VGGNet-16 model2 , which is fully trained on the well-known

257

image dataset adopted by the ImageNet Large-Scale Visual Recognition Chal-

258

lenge (ILSVRC), is employed to provide the initial weights for further training

259

in the BAA task. The output feature maps of different streams are finally

260

concatenated to achieve a feature-level fusion.

M AN U

SC

RI PT

250

The regression module consisting of several fully-connected layers adopts

262

the concatenated feature maps as the input. In our method, based on extensive

263

experiments, we use three fully-connected layers with 1024, 128 and 1 neurons,

264

respectively. The final single neuron outputs the estimated bone age. We apply

265

the mean squared error as our loss function:

TE D

261

n

L(Θ) =

1 X 2 |f (Xi , Θ) − yi | , 2n i=1

(1)

where Θ denotes the network parameters to be determined and n indicates the

267

number of training examples in a mini-batch. Xi is the i-th training image in

268

the batch and f (Xi , Θ) denotes its prediction by the network. yi is its ground

269

truth result, which is generally provided by radiologists or domain experts. The

AC C

EP

266

270

popular mini-batch stochastic gradient descent (SGD) algorithm is employed to

271

minimize the loss function.

272

3.3. Classification Model

273

Figure 6 shows the schematic diagram of the proposed classification model

274

for bone age assessment, which mainly contains four modules: NSCT decompo-

275

sition, classification, fusion and estimation. The NSCT decomposition module 2 The

pre-trained model is available at: https://github.com/BVLC/caffe/wiki/Model-Zoo.

13

Classification

Fusion

Estimation

FC, #class

FC, 4096

VGGNet-16 convolutional architecture

FC, 4096

Scale #1

M AN U

NSCT decomposition

SC

RI PT

ACCEPTED MANUSCRIPT

#class

FC, #class

FC, 4096

FC, 4096

Scale #2

VGGNet-16 convolutional architecture

Bone age

Fusion

#class

Input image

1

FC, 4096

FC, #class

VGGNet-16 convolutional architecture

FC, 4096

#class

EP

Scale #L

TE D

#class

Figure 6: Schematic diagram of the proposed classification model for bone age assessment. The details of VGGNet-16 convolutional architecture are given in Figure 5. The “FC” denotes

AC C

the fully-connected layer and the number of neurons per FC layer is given. Please note that the FC layers also originate from the VGGNet-16 [37], while the number of output neurons is adjusted to the number of quantized classes (denoted as “#class”) in a specific BAA task.

14

ACCEPTED MANUSCRIPT

is the same as that in the regression model presented in Section 3.2.

RI PT

276

In the classification module, each data cube obtained above is individually

278

classified by a deep convolutional network. Just as the above regression model,

279

the pre-trained VGGNet-16 model is adopted for each sub-problem. The number

280

of output neurons is adjusted to the number of quantized categories in a specific

281

BAA task. For each sub-problem, the softmax function is applied to obtain the

282

output and the cross entropy loss is used for network training. The mini-batch

283

SGD algorithm is employed to update the network parameters.

SC

277

The fusion module conducts a decision-level fusion on the multiple predic-

285

tions obtained above. Let Pi = (pi,1 , pi,2 , ..., pi,C ) denote the i-th source pre-

286

diction and i ∈ {1, 2, ..., N }, where N is the number of predictions and C is

287

the dimension of each prediction. The fused prediction Pf = (pf,1 , pf,2 , ..., pf,C )

288

can be obtained with different fusion rules including element-wise maximum,

289

element-wise mean and element-wise multiplication. Specifically, the j-th el-

290

ement (j ∈ {1, 2, ..., C}) of the fused prediction pf,j is calculated using the

291

element-wise maximum rule:

TE D

(2)

N 1 X pi,j , N i=1

(3)

i

pf,j =

and the element-wise multiplication rule:

AC C

293

pf,j = max{pi,j },

the element-wise mean rule:

EP

292

M AN U

284

pf,j =

N Y

pi,j .

(4)

i=1

294

To obtain a meaningful probability vector, the fused prediction Pf is normalized

295

by its l1 -norm.

296

Finally, the estimation module outputs the predicted bone age R as R=

C X

pf,j · aj ,

(5)

j=1 297

where aj denotes the specific bone age indicated by the j-th category according

298

to the quantization. 15

4. Experimental Results

4.1. Dataset and Compared Methods

In the field of bone age assessment (BAA), public X-ray datasets are quite limited. Most existing BAA methods are tested on private datasets and their source code is not released for reproducible study, making a fair comparison impossible. Fortunately, Spampinato et al. [34] recently provided a benchmark for BAA performance comparison, which is a great contribution to this field. They tested several automated BAA methods, including hand-crafted feature based ones and deep learning based ones, on the Digital Hand Atlas [5], a public and comprehensive X-ray BAA dataset created by the Children's Hospital Los Angeles. The dataset contains nearly 1400 left-hand radiographs of children ranging from 0 to 18 years old (chronological age) and covering different genders and races. For each X-ray image in the dataset, two expert readings independently given by two radiologists are provided as references. The minimum and maximum bone age readings over the whole dataset are 0 and 19.0 years, respectively.

In this paper, for the sake of fair comparison, we mainly adopt the above benchmark to verify the effectiveness of the proposed method. Specifically, the Digital Hand Atlas dataset is used, and eight existing BAA methods that have been tested on the benchmark are employed for performance comparison: the methods proposed in [14], [5], [19] and [23], the VGGNet-based method [34], the GoogLeNet-based method [34], the BoNet-based method [34], and a recent CNN-based method proposed by Zhou et al. [35]. Among them, the first four are conventional hand-crafted feature based methods, while the other four are CNN-based ones with the spatial domain image as network input, applying either the whole image [34] or a set of local ROIs [35]. The results of the first seven methods are provided in [34], and the performance of the last one is reported in its original publication [35]. All of these methods and the proposed method are tested using the same train-validation strategy.

The proposed method is also compared with the well-known automated BAA method, Thodberg et al.'s BoneXpert [18, 44], which has been a widely recognized commercial software in this field (https://www.bonexpert.com/). In [44], the BoneXpert method is tested on the Digital Hand Atlas dataset, making this comparison possible. Please refer to Section 4.6.2 for more related details.
4.2. Implementation Details

SC

336

Following the train-validation strategy used in the benchmark [34], a 5-fold cross validation is adopted to evaluate the algorithm performance.

According to the analysis in Section 3.1, the NSCT decomposition is performed up to five levels on the resized X-ray images, and the numbers of directional bands at the different scales from fine to coarse are set to 16, 8, 8, 4 and 4. In this work, the decomposition is implemented with the MATLAB toolbox for NSCT released by its original authors [39] (available at https://ww2.mathworks.cn/matlabcentral/fileexchange/10049). This feature pre-extraction step is performed separately from the whole BAA system, i.e., the contourlet coefficients are pre-computed and saved as the inputs of the subsequent neural networks. For both the regression model and the classification model, the number of decomposition levels is increased from 1 to 5 with a stride of 1 to study its impact on performance, and the approach that directly adopts the spatial domain image (i.e., without performing NSCT) as network input with the same architecture is also implemented for comparison.

In the regression model, the average of the two expert readings is employed as the ground truth for training. Data augmentation is carried out by extracting nine uniformly spaced crops and their horizontally flipped versions. The spatial size of a cropped patch is 224 × 224 to fit the VGGNet input. In our training procedure, the number of training examples in a mini-batch is set to 4, in accord with the setting used by the other CNN-based methods in the benchmark [34]. The momentum factor and weight decay are fixed at the popular values of 0.9 and 0.0005, respectively. The training process is conducted for about 80 epochs over the augmented dataset. The learning rate is initially set to 0.0001 and dropped by a factor of 10 after the first 60 epochs.

In the classification model, the rounded value of the average reading, ranging from 0 to 19, is used as the label of the corresponding example; thus, the classification problem has 20 classes in total. Data augmentation including random cropping and horizontal flipping is performed online during training. The spatial size of a cropped patch is also 224 × 224. The batch size, momentum factor and weight decay are the same as those used in the regression model. For each sub-problem, the training process is conducted for about 180 epochs. The learning rate is initialized at 0.001 and dropped by a factor of 10 after 90 epochs. The three decision-level fusion rules presented in Section 3.3 are each employed in the classification model.

All the experiments on network training are conducted with the popular deep learning framework Caffe [45]. The hardware environment mainly includes an Intel Core i7-6800K CPU and an NVIDIA GTX 1080 Ti GPU.
Intel Core i7-6800K CPU and a NVIDIA GTX 1080 Ti GPU.

372

4.3. Objective Evaluations

TE D

369

The mean absolute error (MAE) between the prediction and the ground truth is the most explicit evaluation metric in bone age assessment. Since the dataset contains two independent expert readings, the benchmark presented in [34] provides three MAEs: the MAE between the prediction and the first expert reading, the MAE between the prediction and the second expert reading, and the average of these two MAEs. In this paper, besides these three MAEs, we also calculate the MAE between the estimated bone age and the average of the two readings, because the average reading is the actual ground truth used in network training. Therefore, four MAEs are employed in our experiments. Let p_i, s_{1,i} and s_{2,i} denote the prediction, the first expert reading and the second expert reading of the i-th validation example, respectively, with i ∈ {1, 2, ..., K}, where K is the number of validation examples. The above four MAEs are calculated as

e_1 = \frac{1}{K} \sum_{i=1}^{K} | p_i - s_{1,i} |,    (6)

e_2 = \frac{1}{K} \sum_{i=1}^{K} | p_i - s_{2,i} |,    (7)

\bar{e} = \frac{1}{2} (e_1 + e_2),    (8)

e = \frac{1}{K} \sum_{i=1}^{K} \left| p_i - \frac{1}{2}(s_{1,i} + s_{2,i}) \right|.    (9)

Figure 7: The error-inspired cumulative percentage curve for performance evaluation of bone age assessment.

In this paper, in addition to the above numerical metrics, we introduce a simple yet effective curve-based evaluation approach for bone age assessment. Specifically, we focus on the percentage of validation examples that have an absolute error less than a certain threshold. By increasing the threshold from 0 with a stride such as 0.1 years, we obtain a series of percentage values that form a cumulative percentage curve. An example is shown in Figure 7. The horizontal axis is the threshold of a certain error measure, while the vertical axis is the cumulative percentage of validation examples. In contrast to the MAE metric, which only gives the mean value over the whole validation set, this cumulative percentage curve provides much more insight into the performance of a method by describing the error distribution. By plotting the curves of different methods in one coordinate system, a more intuitive and detailed comparison can be achieved. Furthermore, the area of the region under the curve (the filled region shown in Figure 7) can be used as an extra numerical metric for performance evaluation; a larger area indicates better performance. The area can be approximately calculated as the sum of a series of trapezoids:

s_c = \sum_{i=1}^{M} \frac{u}{2} (p_{i-1} + p_i),    (10)

where u is the stride of the error threshold, M is the length of the percentage array p_i, i ∈ {1, 2, ..., M}, and p_0 is defined as 0.

To the best of our knowledge, this curve-based evaluation has not appeared in previous BAA-related publications. However, for the existing BAA methods, the output predictions for individual examples are not available, which makes it impossible to generate their curves. In this paper, this evaluation approach is therefore mainly used in Section 4.4 for the parameter study of the proposed method.

410

output predictions detailed to each example are not available, making the gen-

411

eration of their curves not possible. In this paper, this evaluation approach is

412

mainly used in Section 4.4 for parameter study of the proposed method.

413

4.4. Study of Parameter Settings

M AN U

408

In this section, the impact of the number of NSCT decomposition levels

415

and the decision-level fusion rule applied in the classification model are quan-

416

titatively studied. For simplicity, the error between the prediction and average

417

value of two expert readings (the ground truth used in training) are mainly

418

employed for evaluation, while the other three types of errors show very similar

419

observations and share the same trends. Based on this error, the MAE defined

420

in Eq. (9) and the cumulative percentage curve based evaluation approach (the

421

curve along with the metric defined in Eq. (10)) are used. Moreover, for each

AC C

EP

TE D

414

422

network, we repeat the 5-fold cross validation 4 times for statistical analysis. In

423

this work, the Wilcoxon signed-rank test [46], which is a popular non-parametric

424

hypothesis test approach, is employed to determine whether there exists a sig-

425

nificant difference between the performance of two networks. As each network

426

is actually trained 20 (i.e., 5 × 4) times, the Wilcoxon signed-rank test here is

427

based on 20 pairs of MAEs obtained by the corresponding two networks. The

428

average prediction of each image (has 4 predictions due to the repetition) in

429

the X-ray dataset is used for the calculations of the MAEs and the cumulative

430

percentage curves reported in this section. 20

ACCEPTED MANUSCRIPT

RI PT

100 90 80

60

SC

Percentage (%)

70

50 40

spatial domain: 1.6822 1-level: 1.6118 2-level: 1.7121 3-level: 1.7650 4-level: 1.7866 5-level: 1.7841

20 10 0 0

0.5

M AN U

30

1

1.5

2

2.5

Error (year)

Figure 8: The error-inspired cumulative percentage curves of the proposed regression model

TE D

with different NSCT decomposition levels.

Table 1: The performance of the proposed regression model with different NSCT decomposition levels.

spatial domain

1-level

2-level

3-level

4-level

5-level

e

0.8512

0.9339

0.8107

0.7529

0.7136

0.7158

1.6118

1.7121

1.7650

1.7866

1.7841

sc

1.6822

4.4.1. Regression model

AC C

431

EP

metric

432

Table 1 lists the performance of the proposed regression model with differ-

433

ent NSCT decomposition levels. The best result for each metric is indicated in

434

bold. The corresponding error-inspired cumulative percentage curves are shown

435

in Figure 8. For each curve in Figure 8 (also in Figure 9 and Figure 10), the

436

corresponding area-based metric sc is provided in its legend for better compar-

437

ison. The approach that directly adopts the spatial domain image as network

438

input (denoted as “spatial domain”) is also included in Table 1 and Figure 8.

439

It can be seen that a higher decomposition level can generally obtain better

21

ACCEPTED MANUSCRIPT

results, as more useful information can be involved. The performance tends to

441

be stable after the 4-level decomposition. Furthermore, when compared with

442

the method based on spatial domain image, the proposed method with a 2-

443

level NSCT decomposition has been able to achieve better performance, which

444

indicates the significance of the NSCT-based feature pre-extraction approach

445

for this task.

SC

RI PT

440

We can observe from Table 1 (also from the subsequent Table 3 and Table 4)

447

that the metric sc can obtain very consistent results with the MAE-based metric,

448

which demonstrates the effectiveness of the curve-based evaluation approach

449

presented in this paper.

M AN U

446

Table 2 lists the p-values given by the Wilcoxon signed-rank test from the

451

comparison between the “4-level” approach and each of the other approaches.

452

The test is implemented using the stats.wilcoxon function developed in the SciPy

453

library based on Python, with default parameter settings. Clearly, the “4-level”

454

approach has a statistically significant advantage over the first four approaches

455

as the related p-values are much smaller than 0.05 (the p-value 8.9e-5 is obtained

456

when one method wins all the 20 comparisons). There is no distinct difference

457

between the performance of the “4-level” and “5-level” approaches. However,

458

due to the lower model complexity, the “4-level” approach is favored as the

459

specific choice of the proposed regression model.

EP

TE D

450

Table 2: The p-values given by the Wilcoxon signed-rank test from the comparison between

AC C

the “4-level” approach and each other approach for the regression model. spatial domain

1-level

2-level

3-level

4-level

5-level

8.9e-5

8.9e-5

8.9e-5

0.0032

n/a

0.28

Table 3: The performance of using the NSCT coefficient maps at each scale individually. metric

spatial domain

scale #1

scale #2

scale #3

scale #4

scale #5

e

0.8289

0.7632

0.7941

0.8168

1.0032

1.2924

sc

1.6908

1.7451

1.7170

1.7129

1.5569

1.3598

22

ACCEPTED MANUSCRIPT

4.4.2. Classification model

RI PT

460

Table 3 reports the performance of using the NSCT coefficient maps at each

462

scale individually (i.e., each branch in the classification model). The correspond-

463

ing curve-based performance is shown in the upper-left subfigure of Figure 9.

464

The approach that adopts spatial domain image as network input is also in-

465

cluded in Table 2 and Figure 9. It can be seen that a finer (smaller) scale tends

466

to perform better, which indicates that more distinctive information are con-

467

tained in fine scales. The difference among first three scales are relatively small,

468

showing clear advantages over the fourth and fifth scales, which is in accord

469

with the perceptual observations obtained from Figure 3. Furthermore, it is

470

remarkable that for the first three scales, the performance of using the NSCT

471

features at a single scale is even better than that of using the spatial domain

472

image. This observation further verifies the effectiveness of the NSCT-based

473

feature pre-extraction for this BAA problem.

M AN U

SC

461

TE D

Table 4: The performance of the proposed classification model with different NSCT decomposition levels and decision-level fusion rules. metric

e

fusion rule

1-level

2-level

3-level

4-level

5-level

maximum

0.8289

0.7632

0.7172

0.7032

0.7511

0.8249

mean

0.8289

0.7632

0.7140

0.6885

0.7163

0.7440

0.8289

0.7632

0.7238

0.7302

0.7659

0.7575

EP

multiplication

maximum

1.6908

1.7451

1.7873

1.8033

1.7587

1.6933

mean

1.6908

1.7451

1.7920

1.8183

1.7904

1.7683

1.6908

1.7451

1.7786

1.7799

1.7437

1.7495

AC C

sc

spatial domain

multiplication

474

Table 4 lists the performance of the proposed classification model with dif-

475

ferent NSCT decomposition levels and decision-level fusion rules. The approach

476

that adopts spatial domain image as network input is also included. For the

477

comparison of different decomposition levels, the best result in each row is in-

478

dicated in bold. For the comparison of different fusion rules, the best result

479

in each column is underlined. It is noted that both the “spatial domain” and

480

“1-level” approaches have only one branch in the classification model, so the 23

90

80

80

70

70

60 50 40

spatial domain: 1.6908 scale #1: 1.7451 scale #2: 1.7170 scale #3: 1.7129 scale #4: 1.5569 scale #5: 1.3598

30 20 10 0

60 50 40 30 20 10

spatial domain: 1.6908 1-level-maximum: 1.7451 2-level-maximum: 1.7873 3-level-maximum: 1.8033 4-level-maximum: 1.7587 5-level-maximum: 1.6933

0

0

0.5

1

1.5

Error (year)

100

2

2.5

0

0.5

1

1.5

2

2.5

Error (year)

100

90

80 70 60 50 40

80 70

Percentage (%)

TE D

90

Percentage (%)

RI PT

90

SC

100

Percentage (%)

100

M AN U

Percentage (%)

ACCEPTED MANUSCRIPT

spatial domain: 1.6908 1-level-mean: 1.7451 2-level-mean: 1.7920 3-level-mean: 1.8183 4-level-mean: 1.7904 5-level-mean: 1.7683

20 10 0

0.5

AC C

0

1

1.5

50 40 spatial domain: 1.6908 1-level-multiplication: 1.7451 2-level-multiplication: 1.7786 3-level-multiplication: 1.7799 4-level-multiplication: 1.7437 5-level-multiplication: 1.7495

30

EP

30

60

20 10 0

2

2.5

0

Error (year)

0.5

1

1.5

2

2.5

Error (year)

Figure 9: The error-inspired cumulative percentage curves related to the proposed classification model. Upper-left: the performance of using the NSCT coefficient maps at each scale individually. Other three: the impact of the number of NSCT decomposition levels on the model performance with different fusion rules.

24

80

80

70

70

60 50 40 30

60 50 40 30

20

20

2-level-maximum: 1.7873 2-level-mean: 1.7920 2-level-multiplication: 1.7786

10 0

3-level-maximum: 1.8033 3-level-mean: 1.8183 3-level-multiplication: 1.7799

10 0

0

0.5

1

1.5

Error (year)

2

2.5

0

0.5

1

1.5

2

2.5

Error (year)

100

90 80 70 60 50

EP

40 30 20

AC C

10

90 80 70

Percentage (%)

TE D

100

Percentage (%)

RI PT

90

SC

100

90

Percentage (%)

100

M AN U

Percentage (%)

ACCEPTED MANUSCRIPT

0.5

50 40 30 20

4-level-maximum: 1.7587 4-level-mean: 1.7904 4-level-multiplication: 1.7437

5-level-maximum: 1.6933 5-level-mean: 1.7683 5-level-multiplication: 1.7495

10

0

0

60

0 1

1.5

2

2.5

0

0.5

Error (year)

1

1.5

2

2.5

Error (year)

Figure 10: The error-inspired cumulative percentage curves to study the impact of applied decision-level fusion rule on the proposed classification model.

25

ACCEPTED MANUSCRIPT

fusion rule makes no difference to the results.

RI PT

481

The impact of the number of NSCT decomposition levels on the model per-

483

formance with different fusion rules is shown in Figure 9 (the upper-right, lower-

484

left and low-right subfigures). It can be observed from Table 4 and Figure 9 that

485

using 2-level and 3-level decompositions can generate better performance than

486

others, no matter which fusion rule is applied. In particular, the result obtained

487

by a 3-level decomposition is more competitive. This is not difficult to interpret.

488

On one hand, when the number of decomposition levels is too small (i.e., 1),

489

many useful distinctive details are ignored. On the other hand, when the num-

490

ber is too large (i.e., 4 or 5), the information provided by larger scales become

491

unreliable (see Table 3), delivering an opposite effect on the fusion result.

M AN U

SC

482

The impact of the applied decision-level fusion rule on the model performance

493

is shown in Figure 10. Since the “1-level” approach actually has no fusion

494

process, there are four sets of comparisons by fixing the number of decomposition

495

levels. It can be seen that the performance obtained by the mean rule (indicated

496

by the green curves) is the most competitive. From Table 4, similar observations

497

exist as the mean rule wins the first place in all the cases.

TE D

492

Table 5: The p-values given by the Wilcoxon signed-rank test from the comparison between

EP

the “3-level-mean” approach and each other approach for the classification model. spatial domain

1-level

2-level

3-level

4-level

5-level

8.9e-5

8.9e-5

0.0032

0.048

8.9e-5

8.9e-5

8.9e-5

8.9e-5

0.0032

n/a

3.9e-4

8.9e-5

8.9e-5

8.9e-5

8.9e-5

8.9e-5

8.9e-5

8.9e-5

maximum

AC C

mean

multiplication

498

Table 5 lists the p-values given by the Wilcoxon signed-rank test from the

499

comparison between the “3-level-mean” approach and each of the other ap-

500

proaches. It can be seen that all the p-values are smaller than 0.05, indicat-

501

ing that the “3-level-mean” approach significantly outperforms others, so it is

502

selected as the default method of our classification model in the subsequent

503

experiments.

26

ACCEPTED MANUSCRIPT

4.5. Impact of the Gender: Mixing versus Separation

RI PT

504

In the above experiments, we mix the data of two genders (female and male)

506

for network training. However, it is well established that the biological devel-

507

opment is different for females and males. In this section, we study the impact

508

of the gender by comparing two different training manners: mixing and sepa-

509

ration. In the mixing-based manner, the networks are trained using the mixed

510

data (just as the above experiments), but the evaluation results on each gender

511

are separately reported here. In the separation-based manner, both training

512

and testing are separately conducted on each gender. According to the anal-

513

ysis in Section 4.4, the regression model “4-level” and the classification model

514

“3-level-mean” (denoted as “proposed”) are employed for this study. The re-

515

lated spatial domain based approaches are also included for comparison. For

516

the separation-based manner, the settings of network training are exactly the

517

same as the mixing-based manner.

M AN U

SC

505

TE D

Table 6: The evaluation results (on the metric e) of different approaches using two different training manners (mixing and separation) for two genders. gender

classification model

proposed

spatial domain

proposed

mixing

0.8287

0.7033

0.8027

0.6925

separation

0.6208

0.5316

0.5992

0.5189

mixing

0.8738

0.7239

0.8550

0.6844

separation

0.6875

0.5667

0.6695

0.5518

AC C

male

regression model

spatial domain

EP

female

manner

518

The related evaluation results on the metric e are reported in Table 6. It

519

can be seen that for either the mixing-based or the separation-based manner,

520

the proposed methods have clear advantages over the corresponding spatial

521

domain based methods on both genders, which further verifies the effectiveness

522

of our NSCT-based feature pre-extraction strategy. In addition, we can see

523

that the separation-based manner can obtain better results in comparison to

524

the mix-based manner. This is straightforward as the separation of two genders

27

ACCEPTED MANUSCRIPT

will decrease the complexity of the age assessment problem, leading to higher

526

prediction accuracy.

RI PT

525

Nevertheless, it is worth noting that the mixed training manner still makes

528

sense from the viewpoint of deep learning. Using the mixing-based manner, the

529

mapping between the input and output tends to be more complicated because

530

it should fit more data, but deep learning models like CNNs have powerful ca-

531

pacity to tackle this problem. The meaningful results obtained by this manner

532

(although they are relatively inferior to the separation-based manner) verify

533

this point. In fact, the recent BAA benchmark proposed by Spampinato et

534

al. [34] also adopts the mixed training manner on this X-ray dataset. But

535

of course, from the viewpoint of biology, the separation manner offers a more

536

“natural” solution to this problem and produces better results, which provides

537

some valuable clues for the future study in this field. Based on the above con-

538

siderations, both of these two manners are attempted in this work (in Section

539

4.4, we use the mixing-based manner for the study of parameter settings be-

540

cause it can provide a more compact description than separating two genders.

541

Actually, we experimentally find that the characteristics of the results obtained

542

by the separation-based manner are very similar), and the obtained results are

543

employed for the comparison with the state of the art in Section 4.6.

544

4.6. Comparison with the State of the Art

545

4.6.1. Comparison on Spampinato et al.’s BAA benchmark

EP

TE D

M AN U

SC

527

Based on the experimental results presented in Section 4.4, we select the regression model "4-level" and the classification model "3-level-mean" for comparison with eight existing BAA methods that have been tested on the benchmark [34] (more details are provided in Section 4.1). Table 7 reports the performance of the different methods on the four mean absolute errors (MAEs) presented in Section 4.3. Some values in Table 7 are not given because they were not reported in the related publications [34, 35]. The best results are indicated in bold. It is noted that for a given method the value of e can be smaller than both e1 and e2; this is in fact typically the case, because the average of the two readings acts as the ground truth for training. It can be seen from Table 7 that the CNN-based methods (the last six) significantly outperform the conventional methods based on hand-crafted features (the first four). The proposed classification model obtains the best performance among all the methods. In comparison with the CNN-based baselines (VGGNet, GoogLeNet and BoNet) presented in [34], the proposed method achieves an improvement of about 0.1 years. When compared with the improved TW-based method [35], the proposed method still obtains competitive results.

Table 7: The performance of different BAA methods on the four mean absolute errors (MAEs).

Method                       e1      e2      e       fourth MAE (Section 4.3)
Hsieh et al. [14]            2.78    2.37    2.57    n/a
Gertych et al. [5]           2.60    1.70    2.15    n/a
Giordano et al. [19]         2.46    2.38    2.42    n/a
Giordano et al. [23]         1.92    1.73    1.82    n/a
VGGNet [34]                  0.88    0.79    0.83    n/a
GoogLeNet [34]               0.86    0.79    0.82    n/a
BoNet [34]                   0.80    0.79    0.79    n/a
TW-based VGGNet [35]         n/a     n/a     n/a     0.72
Our regression model         0.73    0.71    0.71    0.75
Our classification model     0.74    0.70    0.69    0.73
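For clarity, a minimal sketch of how the three reading-based MAEs could be computed is given below (NumPy; the arrays of predictions and expert readings are assumed inputs, and the toy numbers are illustrative only). It also makes explicit why e, which is measured against the average reading used as the training target, tends to be the smallest of the three.

    import numpy as np

    def reading_maes(pred, reading1, reading2):
        # MAEs of the predictions against each expert reading and their average.
        e1 = np.mean(np.abs(pred - reading1))                     # vs. the first reader
        e2 = np.mean(np.abs(pred - reading2))                     # vs. the second reader
        e = np.mean(np.abs(pred - 0.5 * (reading1 + reading2)))   # vs. the training target
        return e1, e2, e

    # Toy example: predictions that sit close to the average of the two readings
    # give a smaller e than e1 or e2.
    r1 = np.array([8.0, 10.0, 12.0])
    r2 = np.array([9.0, 11.0, 13.0])
    pred = np.array([8.6, 10.4, 12.5])
    print(reading_maes(pred, r1, r2))   # e1 = 0.50, e2 = 0.50, e is about 0.07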

4.6.2. Comparison with Thodberg et al.'s BoneXpert

To date, BoneXpert is the most influential commercial product in the field of bone age assessment and has been licensed to more than 130 hospitals around the world. The BoneXpert method [18] applies the active appearance model (AAM) [47] to automatically locate 15 bones in the hand and wrist, and determines bone age based on shape, intensity and texture features. In [44], the BoneXpert method was tested on the Digital Hand Atlas dataset and obtained very impressive results. The bone age ranges tested in [44] are 2-15 years for girls and 2.5-17 years for boys. In addition, the root mean square error (RMSE) between the automated prediction and the average of the two expert readings is mainly employed for performance evaluation. Based on these settings, we compare the proposed method with the BoneXpert method in this subsection. Specifically, according to the results given in Table 6, we select the best performing approach, the "3-level-mean" classification model trained with the separation-based manner, as the proposed method for comparison. The RMSEs are calculated over the samples within the above BA ranges (2-15 years for females and 2.5-17 years for males) for the female samples, the male samples and the whole set. Table 8 lists the evaluation results of the BoneXpert method (as reported in [44]) and the proposed method. It can be seen that the BoneXpert method obtains RMSEs that are smaller by about 0.08 years. This is not surprising, because the BoneXpert method is elaborately engineered in every part and has been commercially deployed for clinical routine use, whereas the proposed method still has large potential for further improvement (more discussion on this issue is provided in Section 5).

Table 8: The performance of the BoneXpert method and the proposed method on RMSE for the samples with BA range 2-15 years for females and 2.5-17 years for males.

Method       female    male    whole
BoneXpert    0.57      0.64    0.61
Proposed     0.66      0.71    0.69
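The sketch below shows one way the restricted-range RMSEs of Table 8 could be computed (NumPy; the array names are illustrative, and the bone age ranges follow the settings of [44]); it is a minimal sketch rather than our exact evaluation code.

    import numpy as np

    def rmse(pred, target):
        return np.sqrt(np.mean((pred - target) ** 2))

    def restricted_rmse(pred, ba_avg, gender):
        # RMSE vs. the average expert reading, restricted to the BA ranges of [44]:
        # 2-15 years for females and 2.5-17 years for males.
        keep_f = (gender == 'F') & (ba_avg >= 2.0) & (ba_avg <= 15.0)
        keep_m = (gender == 'M') & (ba_avg >= 2.5) & (ba_avg <= 17.0)
        keep = keep_f | keep_m
        return {
            'female': rmse(pred[keep_f], ba_avg[keep_f]),
            'male':   rmse(pred[keep_m], ba_avg[keep_m]),
            'whole':  rmse(pred[keep],   ba_avg[keep]),
        }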

5. Discussion

This work mainly presents an NSCT-based feature pre-extraction strategy for CNN-based BAA methods, so the most fundamental comparison is with the methods that directly adopt spatial domain images as network input. The results given in Table 1 and Table 4 show that the proposed methods outperform the corresponding spatial domain CNN-based methods by about 0.14 years on the MAE, for both the regression models (0.8512 − 0.7136 = 0.1376) and the classification models (0.8289 − 0.6885 = 0.1404). We calculate the MAE between the two expert readings as a reference, and the value is about 0.36 years. Considering that the MAE for each expert reading (e1 and e2) is very close to that for the average reading (e) (see Table 7), this value can be approximately viewed as an "ideal" target for an automated BAA method, since a method reaching such an error has attained the professional standard of an expert. Therefore, when comparing the proposed methods with the spatial domain methods (for simplicity, we take the regression model as the example; the situation of the classification model is very similar), the gap between the actual performance and the "ideal" target decreases from about 0.49 (i.e., 0.85 − 0.36) years to about 0.35 (i.e., 0.71 − 0.36) years, a relative reduction of about 29%, which we believe is an obvious improvement⁵. Similar observations can be obtained from Table 6 using the above evaluation process.

⁵ This evaluation process may not be strictly precise from a statistical viewpoint, but we believe it approximately reflects the extent of the improvement.
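As a small worked restatement of the gap calculation above (the numbers are taken directly from the reported MAEs; this is purely illustrative arithmetic):

    # MAEs in years: spatial domain regression model, proposed regression model,
    # and the inter-reader MAE used as the "ideal" target.
    mae_spatial, mae_proposed, mae_ideal = 0.85, 0.71, 0.36

    gap_spatial = mae_spatial - mae_ideal                              # about 0.49 years
    gap_proposed = mae_proposed - mae_ideal                            # about 0.35 years
    relative_reduction = (gap_spatial - gap_proposed) / gap_spatial    # about 0.29, i.e., 29%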

Table 9: The average running time and standard deviations of the proposed method and the spatial domain method on training the CNN models in Section 4.4 (Unit: hour).

                        regression model               classification model
                        spatial domain    proposed     spatial domain    proposed
average running time    3.96              14.61        1.35              4.14
standard deviation      0.04              0.06         0.02              0.03

However, the proposed BAA method is inevitably more memory- and time-consuming than the spatial domain methods. Fortunately, owing to the popularization of high-performance GPUs, this is not a crucial issue. In our experiments, the employed GPU is an NVIDIA GTX 1080 Ti (11 GB), and we verified that all the networks can also be trained on a GTX 1080 GPU (8 GB). Table 9 reports the average running time and standard deviations of the proposed method and the spatial domain method on training the CNN models in Section 4.4. As mentioned before, the regression model "4-level" and the classification model "3-level-mean" are employed in the proposed method. Although the time cost of the proposed method is larger than that of the spatial domain method due to the higher model complexity, it takes only about 14.61 hours and 4.14 hours for the proposed method to train the regression model and the classification model, respectively. Furthermore, considering that the training procedure is performed offline, this difference in training cost is not crucial and the cost is generally acceptable in practice. Table 10 lists the average running time and standard deviations of the proposed method and the spatial domain method on testing an image (unit: millisecond). The time cost of the image pre-processing part (including image resizing for all the methods and NSCT decomposition for the proposed methods) is not included in the results of Table 10. The time cost of image resizing is negligible. The average running time of a 4-level NSCT decomposition for a 256 × 256 image using Matlab on our platform (Intel Core i7-6800K CPU) is 230 milliseconds. Clearly, this process can be further accelerated by more efficient implementation techniques. Even so, the total running time of the proposed method is below 0.3 seconds, which generally meets the requirement of practical usage.

Table 10: The average running time and standard deviations of the proposed methods and the spatial domain methods on testing an image (Unit: millisecond).

                        regression model               classification model
                        spatial domain    proposed     spatial domain    proposed
average running time    5.58              22.69        8.92              35.36
standard deviation      0.03              0.08         0.04              0.12
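As a rough illustration of how per-image test times such as those in Table 10 could be measured, a timing sketch is given below (Python; predict_bone_age is a hypothetical wrapper around a trained network, not part of our implementation, and pre-processing is assumed to be done beforehand).

    import time
    import numpy as np

    def average_inference_time(predict_bone_age, images, repeats=5):
        # Mean and standard deviation (in milliseconds) of the per-image prediction time.
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            for img in images:
                predict_bone_age(img)   # forward pass only; pre-processing excluded
            runs.append(1000.0 * (time.perf_counter() - start) / len(images))
        return float(np.mean(runs)), float(np.std(runs))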

The comparison on Spampinato et al.'s BAA benchmark [34] shows that the proposed method obtains a clear advantage over the CNN-based methods (VGGNet, GoogLeNet and BoNet) under the GP framework, with an improvement of 0.08-0.12 years on the MAE. There is no significant improvement when compared with the CNN-based method using the TW framework [35]. With its more detailed modeling strategy, the TW-based approach has emerged as a popular way to improve the performance of GP-based methods, but problems such as appropriately defining and accurately detecting local ROIs remain challenging. The proposed approach offers another way to improve traditional GP-based methods. Moreover, the proposed feature pre-extraction strategy can also be adopted under the TW framework.

There exists a disparity between the performance of the proposed method and the BoneXpert method [18, 44] (about 0.08 years on the RMSE), but we believe this study still has its significance. On the one hand, although the BoneXpert method has achieved great success in clinical practice, it still has a few limitations, such as rejecting a small percentage of radiographs with excessive noise and relying on a relationship between chronological age and bone age [10, 33]. On the other hand, the proposed method still has large room for future improvement. In this work, for the sake of a fair comparison using Spampinato et al.'s BAA benchmark [34], the original X-ray images are simply resampled as network input. By applying image pre-processing procedures such as hand region detection to eliminate the impact of background regions [10], the performance of the BAA method may be further improved. In addition, as mentioned above, the proposed approach can be viewed as a generalized improvement strategy for the spatial domain methods, so it can also be utilized in TW-based methods to pursue better performance.

Finally, we discuss a potential benefit of the feature pre-extraction strategy presented in this work. In current CNN-based BAA studies (and in many other medical image applications), in order to adopt popular convolutional networks that are pre-trained on general imagery, almost all existing methods first resample the input images to a relatively low resolution such as 256 × 256 to fit the employed network architecture. Obviously, this operation causes a loss of image details that may be very important for the assessment task, leading to a considerable limitation of these methods. Nevertheless, the proposed feature pre-extraction strategy may provide a potential way to overcome this limitation. Specifically, by employing feature pre-extraction approaches to assist the training process, it may not be necessary to adopt pre-trained models. As a result, the restriction on image resolution can be removed and the flexibility of architecture design can be largely improved. However, the effectiveness of this idea needs a systematic verification involving a series of factors, such as the size of the input images, the specific CNN architecture and the feature pre-extraction technique, which is beyond the scope of this paper.

6. Conclusion

In this study, a multi-scale data fusion framework for automated bone age assessment (BAA) based on convolutional neural networks (CNNs) is proposed. Instead of directly employing the spatial domain image as network input, as in existing CNN-based BAA methods, the non-subsampled contourlet transform (NSCT) is first performed for feature pre-extraction, aiming to achieve better training effects with limited annotated examples. Furthermore, since the information contained in the original spatial domain image is well separated by NSCT, this approach improves the flexibility in designing network architectures via different data fusion strategies. Based on this idea, we present a regression model using feature-level fusion and a classification model using decision-level fusion for BAA. Extensive experiments are conducted on the Digital Hand Atlas dataset. The results demonstrate that the proposed method achieves an obvious improvement over the corresponding spatial domain CNN-based methods. Moreover, the proposed method obtains very competitive performance when compared with some state-of-the-art BAA methods. In the future, we will explore the potential of the proposed feature pre-extraction strategy for overcoming the limitation on input image resolution in CNN-based bone age assessment, as discussed in the last paragraph of Section 5. The impact of different settings, including the size of the input images, the CNN architecture and the feature pre-extraction approach, will be systematically studied.

Conflict of Interest

The authors declare no potential conflict of interest.

Acknowledgements

The authors would like to thank the editors and anonymous reviewers for their insightful comments and constructive suggestions. This work was supported by the National Natural Science Foundation of China (Grants 61701160 and 81571760), the Provincial Natural Science Foundation of Anhui (Grant 1808085QF186), the Fundamental Research Funds for the Central Universities (Grant JZ2018HGTB0228) and the SenseTime Research Fund.

References

[1] D. D. Martin, J. M. Wit, Z. Hochberg, L. Savendahl, R. R. van Rijn, O. Fricke, N. Cameron, J. Caliebe, T. Hertel, D. Kiepe, K. A. Wikland, H. H. Thodberg, G. Binder, M. B. Ranke, The use of bone age in clinical practice - part 1, Hormone Research in Paediatrics 76 (1) (2011) 1-9.

[2] J. M. Tanner, M. J. R. Healy, H. Goldstein, N. Cameron, Assessment of Skeletal Maturity and Prediction of Adult Height (TW3 Method), W.B. Saunders, London, 2001.

[3] H. H. Thodberg, O. G. Jenni, J. Caflisch, M. B. Ranke, D. D. Martin, Prediction of adult height based on automated determination of bone age, The Journal of Clinical Endocrinology & Metabolism 94 (12) (2009) 4868-4874.

[4] A. M. Vukicevic, G. R. Jovicic, M. N. Jovicic, V. L. Milicevic, N. D. Filipovic, Assessment of cortical bone fracture resistance curves by fusing artificial neural networks and linear regression, Computer Methods in Biomechanics and Biomedical Engineering 21 (2) (2018) 169-176.

[5] A. Gertych, A. Zhang, J. Sayre, S. Kurkowska, H. Huang, Bone age assessment of children using a digital hand atlas, Computerized Medical Imaging and Graphics 31 (2007) 322-331.

[6] W. Greulich, S. Pyle, Radiographic atlas of skeletal development of the hand and wrist, The American Journal of the Medical Sciences 238 (3) (1959) 393.

[7] J. M. Tanner, R. H. Whitehouse, N. Cameron, W. A. Marshall, M. J. R. Healy, H. Goldstein, Assessment of Skeletal Maturity and Prediction of Adult Height (TW2 Method), Academic Press, London, 1975.

[8] P. Kaplowitz, S. Srinivasan, J. He, R. MaCarter, M. R. Hayeri, R. Sze, Comparison of bone age readings by pediatric endocrinologists and pediatric radiologists using two bone age atlases, Pediatric Radiology 41 (6) (2011) 690-693.

[9] V. Iglovikov, A. Rakhlin, A. Kalinin, A. Shvets, Pediatric bone age assessment using deep convolutional neural networks, arXiv:1712.05053 (2017) 1-14.

[10] H. Lee, S. Tajmir, J. Lee, M. Zissen, B. Yeshiwas, T. Alkasab, G. Choy, S. Do, Fully automated deep learning system for bone age assessment, Journal of Digital Imaging 30 (2017) 427-441.

[11] S. Wang, Y. Shen, C. Shi, P. Yin, Z. Wang, P. W. Cheung, J. P. Cheung, K. D. Luk, Y. Hu, Skeletal maturity recognition using a fully automated system with convolutional neural networks, IEEE Access 6 (2018) 29979-29993.

[12] E. Pietka, A. Gertych, S. Pospiech, F. Cao, H. Huang, V. Gilsanz, Computer-assisted bone age assessment: image preprocessing and epiphyseal/metaphyseal ROI extraction, IEEE Transactions on Medical Imaging 20 (8) (2001) 715-729.

[13] E. Pietka, Computer-assisted bone age assessment - database adjustment, International Congress Series 1256 (2003) 87-92.

[14] C. Hsieh, T. Jong, C. Tiu, Bone age estimation based on phalanx information with fuzzy constrain of carpals, Medical & Biological Engineering & Computing 45 (3) (2007) 283-295.

[15] A. Zhang, A. Gertych, B. J. Liu, Automatic bone age assessment for young children from newborn to 7-year-old using carpal bones, Computerized Medical Imaging and Graphics 31 (2007) 299-310.

[16] A. Vega, J. Arribas, A radius and ulna TW3 bone age assessment system, IEEE Transactions on Biomedical Engineering 55 (5) (2008) 1463-1476.

[17] J. Liu, J. Qi, Z. Liu, Q. Ning, X. Luo, Automatic bone age assessment based on intelligent algorithms and comparison with TW3 method, Computerized Medical Imaging and Graphics 32 (8) (2008) 678-684.

[18] H. H. Thodberg, S. Kreiborg, A. Juul, K. D. Pedersen, The BoneXpert method for automated determination of skeletal maturity, IEEE Transactions on Medical Imaging 28 (1) (2009) 52-66.

[19] D. Giordano, C. Spampinato, G. Scarciofalo, R. Leonardi, An automatic system for skeletal bone age measurement by robust processing of carpal and epiphysial/metaphysial bones, IEEE Transactions on Instrumentation and Measurement 59 (10) (2010) 2539-2553.

[20] H. Lin, S. Shu, Y. Lin, S. Yu, Bone age cluster assessment and feature clustering analysis based on phalangeal image rough segmentation, Pattern Recognition 45 (2012) 322-332.

[21] M. Harmsen, B. Fischer, H. Schramm, T. Seidl, T. Deserno, Support vector machine classification based on correlation prototypes applied to bone age assessment, IEEE Journal of Biomedical and Health Informatics 17 (1) (2013) 190-197.

[22] M. Kashif, T. M. Deserno, D. Haak, S. Jonas, Feature description with SIFT, SURF, BRIEF, BRISK, or FREAK? A general question answered for bone age assessment, Computers in Biology and Medicine 68 (2016) 67-75.

[23] D. Giordano, I. Kavasidis, C. Spampinato, Modeling skeletal bone development with hidden Markov models, Computer Methods and Programs in Biomedicine 124 (2016) 138-147.

[24] S. Simu, S. Lal, A study about evolutionary and non-evolutionary segmentation techniques on hand radiographs for bone age assessment, Biomedical Signal Processing and Control 33 (2017) 220-235.

[25] M. Sabeti, R. Boostani, B. Davoodi, Improved particle swarm optimisation to estimate bone age, IET Image Processing 12 (2) (2018) 179-187.

[26] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436-444.

[27] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278-2324.

[28] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097-1105.

[29] P. Meyer, V. Noblet, C. Mazzara, A. Lallement, Survey on deep learning for radiotherapy, Computers in Biology and Medicine 98 (2018) 126-146.

[30] G. Litjens, T. Kooi, B. Bejnordi, A. Setio, F. Ciompi, M. Ghafoorian, J. van der Laak, B. van Ginneken, C. Sanchez, A survey on deep learning in medical image analysis, Medical Image Analysis 42 (2017) 60-88.

[31] D. Shen, G. Wu, H. Suk, Deep learning in medical image analysis, Annual Review of Biomedical Engineering 19 (2017) 221-248.

[32] D. Stern, C. Payer, V. Lepetit, M. Urschler, Automated age estimation from hand MRI volumes using deep learning, in: Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2016, pp. 194-202.

[33] M. Chen, Automated bone age classification with deep neural networks, Tech. rep., Stanford University, 2016.

[34] C. Spampinato, S. Palazzo, D. Giordano, M. Aldinucci, R. Leonardi, Deep learning for automated skeletal bone age assessment in X-ray images, Medical Image Analysis 36 (2017) 41-51.

[35] J. Zhou, Z. Li, W. Zhi, B. Liang, D. Moses, L. Dawes, Using convolutional neural networks and transfer learning for bone age classification, in: Proceedings of International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2017, pp. 1-6.

[36] X. Chen, C. Zhang, Y. Liu, Bone age assessment with X-ray images based on contourlet motivated deep convolutional networks, in: IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), 2018, pp. 1-6.

[37] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556v2 (2014) 1-10.

[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, arXiv:1409.4842 (2014) 1-12.

[39] A. L. da Cunha, J. Zhou, M. N. Do, The nonsubsampled contourlet transform: theory, design, and applications, IEEE Transactions on Image Processing 15 (10) (2006) 3089-3101.

[40] Y. Zhou, J. Wang, Image denoising based on the symmetric normal inverse Gaussian model and non-subsampled contourlet transform, IET Image Processing 6 (8) (2012) 1136-1147.

[41] Q. Zhang, B. Guo, Multifocus image fusion using the nonsubsampled contourlet transform, Signal Processing 89 (7) (2009) 1334-1346.

[42] S. Li, B. Yang, J. Hu, Performance comparison of different multi-resolution transforms for image fusion, Information Fusion 12 (2011) 74-84.

[43] Y. Liu, S. Liu, Z. Wang, A general framework for image fusion based on multi-scale transform and sparse representation, Information Fusion 24 (2015) 147-164.

[44] H. H. Thodberg, L. Savendahl, Validation and reference values of automated bone age determination for four ethnicities, Academic Radiology 17 (11) (2010) 1425-1432.

[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the ACM International Conference on Multimedia, 2014, pp. 675-678.

[46] F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin 1 (6) (1945) 80-83.

[47] T. F. Cootes, G. J. Edwards, C. J. Taylor, Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 681-685.

[Figure pages of the accepted manuscript; the images themselves are not reproducible in text. Recoverable content:
- Cumulative error curves (Percentage (%) vs. Error (year)) for the classification model at each decomposition level with the maximum/mean/multiplication fusion rules; legend scores: 2-level (1.7873 / 1.7920 / 1.7786), 3-level (1.8033 / 1.8183 / 1.7799), 4-level (1.7587 / 1.7904 / 1.7437), 5-level (1.6933 / 1.7683 / 1.7495).
- Histogram of the number of images vs. bone age in the dataset.
- Schematic of the NSCT decomposition (NSP and NSDFB stages, low/high-frequency subbands from coarse to fine scales) and an example decomposition from Scale #1 (16 bands) to Scale #5 (4 bands).
- The regression model: NSCT decomposition, one VGGNet-16 convolutional architecture per scale, feature concatenation, and fully-connected layers (1024-128-1) regressing the bone age.
- The VGGNet-16 convolutional architecture used in each branch (stacks of 3×3 convolutions with 2×2 max-pooling).
- The classification model: NSCT decomposition, one VGGNet-16 convolutional architecture per scale with FC-4096/FC-#class heads, decision-level fusion, and bone age estimation.
- Cumulative error curves (Percentage (%) vs. Error (year)) comparing the spatial domain baseline with the proposed models; legend scores: spatial domain 1.6822, 1-level 1.6118, 2-level 1.7121, 3-level 1.7650, 4-level 1.7866, 5-level 1.7841 (regression); spatial domain 1.6908, scale #1 1.7451, scale #2 1.7170, scale #3 1.7129, scale #4 1.5569, scale #5 1.3598 (per-scale results); maximum fusion 1.7451 / 1.7873 / 1.8033 / 1.7587 / 1.6933, mean fusion 1.7451 / 1.7920 / 1.8183 / 1.7904 / 1.7683, multiplication fusion 1.7451 / 1.7786 / 1.7799 / 1.7437 / 1.7495 (1- to 5-level).]

Highlights

1. Proposes a multi-scale data fusion framework for BAA based on NSCT and CNNs.
2. Presents a regression CNN model for BAA using a feature-level fusion strategy.
3. Presents a classification CNN model for BAA using a decision-level fusion strategy.
4. Introduces a novel curve-based performance evaluation approach for BAA.
5. Achieves state-of-the-art BAA results on the Digital Hand Atlas dataset.