A Multi-Scale Data Fusion Framework for Bone Age Assessment with Convolutional Neural Networks

Yu Liu (a), Chao Zhang (a), Juan Cheng (a), Xun Chen (b,*), Z. Jane Wang (c)

(a) Department of Biomedical Engineering, Hefei University of Technology, Hefei 230009, China.
(b) Department of Electronic Science and Technology, University of Science and Technology of China, Hefei 230026, China.
(c) Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada.

Computers in Biology and Medicine (2019), https://doi.org/10.1016/j.compbiomed.2019.03.015
Abstract
Bone age assessment (BAA) has various clinical applications such as diagnosis of endocrine disorders and prediction of final adult height for adolescents. Recent studies indicate that deep learning techniques have great potential in developing automated BAA methods, with significant advantages over conventional methods based on hand-crafted features. In this paper, we propose a multi-scale data fusion framework for bone age assessment with X-ray images based on the non-subsampled contourlet transform (NSCT) and convolutional neural networks (CNNs). Unlike the existing CNN-based BAA methods that adopt the original spatial domain image directly as network input, we pre-extract a rich set of features for the input image by performing NSCT to obtain its multi-scale and multi-direction representations. This feature pre-extraction strategy could be beneficial to network training, as the number of annotated examples in the problem of BAA is typically quite limited. The obtained NSCT coefficient maps at each scale are fed into a convolutional network individually, and the information from different scales is then merged to achieve the final prediction. Specifically, two CNN models with different data fusion strategies are presented for BAA: a regression model with feature-level fusion and a classification model with decision-level fusion. Experiments on the public BAA dataset Digital Hand Atlas demonstrate that the proposed method can obtain promising results and outperform many state-of-the-art BAA methods. In particular, the proposed approaches exhibit obvious advantages over the corresponding spatial domain approaches (generally with an improvement of more than 0.1 years on the mean absolute error), showing great potential for future study in this field.

Keywords: Bone age assessment (BAA), non-subsampled contourlet transform (NSCT), convolutional neural networks (CNNs), feature extraction, data fusion

* Corresponding author. Email addresses: [email protected] (Yu Liu), [email protected] (Xun Chen)

1. Introduction
As a frequently performed procedure for skeletal maturity evaluation in pediatric radiology, bone age assessment (BAA) can determine the discrepancy between a child's skeletal age and chronological age, which may indicate abnormalities in skeletal development. BAA is widely used in the diagnosis of endocrine disorders and the monitoring of growth hormone therapy [1]. In addition, accurate assessment of skeletal age is of great significance for the prediction of a child's adult height [2, 3]. Some other related studies may also potentially benefit from the BAA technique, such as the assessment of cortical bone fracture resistance curves for estimating the risk of bone fracture [4].

The most popular approach to BAA is based on a radiological examination of the left (non-dominant) hand and wrist to reveal the maturity of skeletal development, owing to advantages such as simplicity and minimal radiation exposure. Figure 1 (the left subfigure) shows a typical example of a hand radiograph for skeletal age assessment. The image is from the public BAA dataset Digital Hand Atlas [5] (the dataset is available at: http://ipilabmysql.usc.edu/newindex.php). The patient is a 10.49-year-old (chronological age) girl, while the estimates of her skeletal age from two radiologists using this radiograph are 11.5 and 11.0 years, respectively.
Figure 1: Left: a radiograph from the Digital Hand Atlas dataset for bone age assessment. Right: the distribution of examples in this dataset based on the ground truth bone age (number of images per bone age from 0 to 20 years).
Thus, the patient in this case is likely to have an advanced bone age, revealing some possible physical abnormalities (e.g., precocious puberty) [1].
Currently, there are two widely-used clinical methods for radiologists and pediatricians to assess skeletal age from hand X-ray images: Greulich and Pyle (GP) [6] and Tanner-Whitehouse (TW) [2, 7]. The GP method determines the skeletal age by comparing the whole target radiograph with a list of standard bone age maps known as an atlas. In contrast, the TW method gives a more detailed solution to this problem. It takes into account a set of local regions of interest (ROIs) in an X-ray image and provides an evaluation by a numerical score for each ROI that corresponds to a specific bone. The final bone age is estimated by combining all the scores using some pre-defined strategies. Although the GP and TW methods have their own characteristics, such as simplicity for GP and precision for TW, these manual methods suffer from several common drawbacks. They both rely heavily on the domain knowledge and experience of radiologists. Considerable intra-rater and inter-rater variability always exists in clinical practice [8], as shown in the above example where a difference of 0.5 years between the two expert readings occurs. In addition, the time efficiency of assessing bone age in a manual manner is relatively low [9–11]. As stated in [9], it takes around 30 minutes per image for a radiologist to accomplish the assessment procedure using the TW method. The GP-based evaluation is more efficient due to its simplicity, but an experienced radiologist still requires a few minutes to accomplish the task in general.
In the last two decades, computer-assisted methods for automated bone age assessment have emerged as an active topic in the field of medical image analysis, and a variety of automatic BAA approaches have been proposed [12–25]. These methods usually handle BAA as a classification or regression problem, which involves essential steps such as hand segmentation, ROI detection, feature extraction, and classifier or regressor design. In these methods, high-quality X-ray images are usually required to obtain robust hand segmentation and ROI detection results. Moreover, these methods need manually designed distinctive features for the subsequent classification or regression, leading to a heavy dependence on experience and skills. The above steps all have critical impacts on system performance and are highly correlated with each other, which greatly limits the convenience of designing effective BAA methods.
More recently, deep learning techniques [26], and in particular convolutional neural networks (CNNs) [27, 28], have achieved great success in a wide range of medical image analysis problems [29–31]. As for bone age assessment, several CNN-based methods have been proposed in the last three years [9–11, 32–35], leading to promising results and obvious advantages over conventional methods based on hand-crafted features. However, just like the situation in most medical image related applications [30, 31], a main challenge in CNN-based BAA studies is the lack of annotated data for supervised learning, due to factors such as privacy constraints and the high demand for professional knowledge. Transfer learning is recognized as an effective solution to this problem and has been widely applied in this area: deep models trained on natural images are fine-tuned using the labelled medical images. Despite this, in many cases the scale of a medical image dataset may still be too small to achieve satisfactory training. For instance, the BAA dataset Digital Hand Atlas contains nearly 1400 X-ray images, and the range of the ground truth bone age (defined as the average of the two expert readings given by the dataset) is 0-19 years. Therefore, the BAA task on this dataset can be naturally modelled as a classification problem with 20 classes, while each class only contains about 70 images on average. The right subfigure in Figure 1 provides the distribution of X-ray images in this dataset based on the ground truth bone age after the rounding operation. Some commonly used data augmentation techniques like cropping and flipping could alleviate this problem, but the extent may still be limited.
In this paper, a multi-scale data fusion framework for bone age assessment based on deep convolutional networks is proposed. Unlike the existing CNN-based BAA methods that directly use the original spatial domain image as network input, we pre-extract a rich set of features for the input image by performing the non-subsampled contourlet transform (NSCT), an effective multi-scale geometric analysis approach that can obtain multi-scale and multi-direction features of an image. This feature pre-extraction procedure can be beneficial to network training when the quantity of annotated examples in BAA is quite limited. Furthermore, by performing NSCT, the information contained in the spatial domain image is well separated into different scales, which significantly improves the flexibility in designing effective CNN models via data fusion. The main contributions of this paper are summarized as follows:
1. We propose a multi-scale data fusion framework for BAA based on NSCT and CNNs, which has several advantages over the conventional spatial domain CNN-based BAA methods.
2. Under the proposed framework, a regression model for BAA using a feature-level fusion strategy is presented.
3. Under the proposed framework, a classification model for BAA using a decision-level fusion strategy is presented.
4. A novel performance evaluation approach for BAA based on an error-inspired cumulative percentage curve is introduced. The curve describes the percentage of validation examples that have an error less than a certain threshold. In comparison to traditional evaluation metrics such as the mean absolute error (MAE), this curve-based evaluation provides much more insight into the performance of a method and offers a more intuitive way to compare different methods.
5. Experimental results on the Digital Hand Atlas dataset demonstrate that the proposed method can obtain very competitive performance when compared with the state-of-the-art methods.
A preliminary version of this work was reported in [36], which only presents a regression model for bone age assessment. In this paper, we have made a series of extensions, which mainly include:

1. A more complete review of existing CNN-based BAA methods is provided in Section 2.1.
2. For the methodology, a classification BAA model based on NSCT and CNNs with different decision-level fusion rules is added in Section 3.3.
3. A novel performance evaluation approach for BAA based on an error-inspired cumulative percentage curve is presented in Section 4.3. In [36], we simply mentioned this idea without a full description and experimental verification. In this paper, the evaluation approach is detailed and its effectiveness is verified.
4. Extensive experimental validations are supplemented in Sections 4.4-4.6. First, a systematic study of the impact of different parameter settings, including the number of NSCT decomposition levels and the decision-level fusion rule used in the classification model, is given. Second, a set of experiments focusing on the distinction between the two genders is conducted. Third, the proposed method is compared with more state-of-the-art BAA methods, including a recent CNN-based method [35] and the well-known BoneXpert method [18].
5. More discussions are provided to further clarify the characteristics of the proposed method in Section 5.
The rest of this paper is organized as follows. In Section 2, related work and the motivations of this work are introduced. Section 3 presents the proposed regression model and classification model for BAA in detail. Experimental results are given in Section 4 and further discussions are made in Section 5. Finally, Section 6 concludes the paper.
2. Related Work and Motivations

2.1. Convolutional neural networks for bone age assessment
As a representative supervised deep learning architecture, convolutional neural networks (CNNs) have achieved great success in a variety of medical image analysis problems. In the field of bone age assessment (BAA), CNN-based studies started relatively late. In 2016, Stern et al. [32] first proposed a CNN-based BAA method using magnetic resonance imaging (MRI) volumes. They followed the idea of the TW2 method to obtain age information by combining the estimated ossification stages of 13 individual bones. The first CNN-based BAA method using 2D X-ray scans was reported by Chen [33], which presents a classification model for BAA using the VGGNet [37] with transfer learning. The GP framework is applied in this method for its simplicity and efficiency. In 2017, Lee et al. [10] proposed a fully-automated CNN-based BAA system involving data preparation, image preprocessing, CNN-based hand region detection and CNN-based age classification. In their method, GoogLeNet [38] is employed as the age classification model under the GP framework. Spampinato et al. [34] introduced a general CNN-based regression model for BAA, which consists of a convolutional network for feature extraction and a regression network for bone age estimation. They also applied the GP framework in their work and tested GoogLeNet and VGGNet with different transfer learning techniques. In addition, they proposed a new architecture for BAA called BoNet by introducing a deformation layer in the convolutional network. Zhou et al. [35] defined five regions of interest (ROIs) in the X-ray image by referring to the TW method. The VGGNet model pre-trained on general imagery is employed for the classification of each ROI, and a multi-model fusion strategy is adopted to get the final prediction. Iglovikov et al. [9] proposed an end-to-end network architecture for automated BAA. They studied the impact of ROI selection on algorithm performance by defining three specific regions in the X-ray image. Most recently, Wang et al. [11] proposed a BAA approach by extracting the distal radius and ulna areas from the X-ray image for CNN-based classification.
Figure 2: The schematic diagram of a two-level NSCT decomposition. NSP is the abbreviation of non-subsampled pyramid, and NSDFB is the abbreviation of non-subsampled directional filter bank. The input image is decomposed from fine to coarse scales into directional high-frequency bands and a low-frequency band.
2.2. Non-subsampled Contourlet Transform

The non-subsampled contourlet transform (NSCT) [39] is a representative multi-scale geometric analysis approach that has been widely used in many image processing problems such as denoising [40] and fusion [41]. In comparison to some early multi-scale image decomposition approaches like the pyramid, wavelet and curvelet, the contourlet transform has been verified to be a more effective image representation method, as it can capture the details of an image at diverse scales and directions more effectively. By applying the non-subsampling strategy to the contourlet transform, NSCT also has the shift-invariance property. The implementation of NSCT is mainly based on non-subsampled pyramid (NSP) decomposition and the non-subsampled directional filter bank (NSDFB). The NSP decomposition is employed to obtain the multi-scale representations of an input image from fine to coarse. It is achieved by a two-channel non-subsampled 2-D filter bank. At each decomposition level, one high-frequency band and one low-frequency band of the same size as the input image are produced. As a result, by applying an L-level NSP decomposition, we can obtain L + 1 sub-bands, including L high-frequency bands and one low-frequency band. For each high-frequency band, the NSDFB is applied to obtain its multi-direction representations at the corresponding scale. The obtained band at each direction is also of the same size as the input image. The number of directions at each scale can be adjusted, and it is typically set to a power of two. Figure 2 shows the schematic diagram of a two-level NSCT decomposition.
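To make this band bookkeeping concrete, the following Python sketch illustrates the structure described above. Note that nsct_decompose is a hypothetical placeholder (the actual transform used in this work is the MATLAB toolbox referenced in Section 4.2); it merely returns arrays of the expected shapes.

```python
import numpy as np

def nsct_decompose(image, directions):
    """Hypothetical stand-in for an NSCT implementation. For an L-level
    decomposition (L = len(directions)), it returns one low-frequency band
    plus directions[s] directional high-frequency bands at each scale s,
    all with the same spatial size as the input. Zero arrays are returned
    here only to illustrate the expected output structure."""
    h, w = image.shape
    low = np.zeros((h, w))
    highs = [[np.zeros((h, w)) for _ in range(d)] for d in directions]
    return low, highs

image = np.zeros((256, 256))
directions = [8, 4]                       # two levels: 8 directions (fine), 4 (coarse)
low, highs = nsct_decompose(image, directions)
print(len(highs) + 1)                     # L + 1 sub-bands before directional filtering
print([len(bands) for bands in highs])    # directional bands per scale: [8, 4]
print(low.shape, highs[0][0].shape)       # every band matches the input size
```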
2.3. Motivations of This Work

To the best of our knowledge, all the existing CNN-based BAA methods adopt the spatial domain image (either the whole image or ROIs) as network input. However, the problem of BAA usually suffers from a severe lack of annotated examples for network training. Even though some transfer learning and data augmentation approaches have been applied to solve this problem, the effect may still be limited.

In this work, we perform NSCT on the spatial domain image before feeding it into the convolutional architecture. The motivations include:
194
edges, corners and textures with multiple scales and directions, due to
195
its good characteristics described in Section 2.2. As a result, dozens of
196
coefficient maps are generated from a single spatial domain image. Using
197
this feature pre-extraction procedure, a more thorough training for the
198
deep convolutional networks with limited examples can be expected.
EP
TE D
193
2. By applying NSCT, the information contained in the original spatial do-
200
main image can be well separated into different scales. Consequently, the
AC C
199
201
flexibility in designing CNN architectures can be greatly improved and
202
data fusion strategies of different levels (e.g., feature level and decision
203
level) can be easily introduced to promote the system performance.
204
3. In fact, other multi-scale transform approaches such as wavelets and curvelets
205
can also be used for feature pre-extraction. The main consideration of se-
206
lecting NSCT is that its representation ability for an image is widely rec-
207
ognized to be better than most of the other multi-scale transform methods
208
[42].
9
Scale #1 (16 bands)
…… Scale #2 (8 bands)
……
M AN U
Scale #3 (8 bands)
SC
……
RI PT
ACCEPTED MANUSCRIPT
……
Scale #4 (4 bands)
……
TE D
Scale #5 (4 bands)
Figure 3: A 5-level NSCT decomposition on the example image in Figure 1.
3. The Proposed Method
210
3.1. NSCT for hand X-ray images
In consideration of factors including computational efficiency, memory cost and the input size of popular CNN models such as VGGNet and GoogLeNet, all the X-ray images are resized to 256 × 256 pixels, keeping the aspect ratio via zero padding. The resized version of the X-ray example given in Figure 1 is shown in Figure 3 at the upper-left corner.
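For illustration, a minimal Python sketch of such an aspect-preserving resize with zero padding is given below (using Pillow and NumPy; the function name and implementation details are ours, not the authors' code):

```python
import numpy as np
from PIL import Image

def resize_with_zero_padding(img, target=256):
    """Resize a grayscale image so its longer side equals `target`, then
    zero-pad the shorter side symmetrically to a square target x target image."""
    w, h = img.size
    scale = target / max(w, h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = img.resize((new_w, new_h), Image.BILINEAR)

    canvas = np.zeros((target, target), dtype=np.uint8)   # zero (black) padding
    top = (target - new_h) // 2
    left = (target - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = np.asarray(resized)
    return Image.fromarray(canvas)

# Usage (hypothetical file name):
# padded = resize_with_zero_padding(Image.open("hand_xray.png").convert("L"))
```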
The NSCT is performed on the resized X-ray image to obtain its multi-scale and multi-direction representations. Clearly, the number of decomposition levels is an important factor in NSCT. To get a grasp of this issue, the results of a 5-level decomposition on the resized image are exhibited in Figure 3. The numbers of directional bands at different scales from fine to coarse are set to 16, 8, 8, 4 and 4 [42, 43], which is in accord with the principle that a finer scale needs more directions for description. Due to limited space, we only show parts of the high-frequency bands at each scale (the absolute values of the high-frequency bands are used for better visualization). The intermediate low-frequency bands are also shown in Figure 3. Please refer to Section 4.2 for more implementation details of the NSCT decomposition.
It can be seen that there is a clear trend in the features captured at different scales: the features become coarser as the scale increases. Specifically, the finest details are extracted at the first scale. At the second and third scales, crucial edge-like features are well captured. The features become very coarse after the fourth scale, and only the main contours are extracted at the fifth scale. A consistent trend can be found in the series of intermediate low-frequency bands. The low-frequency band after a 5-level decomposition contains almost no valuable information for BAA. Similar observations are obtained from other X-ray images. Therefore, it is not necessary to set the number of decomposition levels larger than five. In Section 4, more quantitative analysis and discussion of this issue are provided.
3.2. Regression Model
Figure 4 shows the schematic diagram of the proposed regression model for BAA. The architecture consists of three modules: NSCT decomposition, convolution and regression.

The NSCT decomposition module is used to generate multi-scale and multi-direction features of the input image. An L-level NSCT decomposition is employed. At each scale, all the directional sub-bands are concatenated to generate a data cube in multi-channel format. Thus, L data cubes are constructed in total.

The data cubes obtained by the NSCT decomposition are fed into the convolution module, which is a multi-stream architecture in which each stream is constructed from a series of convolutional layers and max-pooling layers.
Figure 4: Schematic diagram of the proposed regression model for bone age assessment. The details of the VGGNet-16 convolutional architecture are given in Figure 5.
Figure 5: The convolutional architecture of VGGNet-16 [37]. "Conv" denotes a convolutional layer, and the kernel size is uniformly 3 × 3; the number of output feature maps per layer is given. "Max-pool" denotes a max-pooling layer, and the kernel size is uniformly 2 × 2. Please refer to [37] for more details.
Considering that the 16-layer VGGNet [37] (denoted as VGGNet-16) has been verified to be a very effective network for BAA [9, 11, 33–35], it is employed as the architecture of each stream in this work. The convolutional architecture of VGGNet-16 is shown in Figure 5. It contains 13 convolutional layers and 5 max-pooling layers in total. More details of this network can be found in [37]. As with all the existing CNN-based BAA methods, transfer learning is applied in our method. Specifically, the VGGNet-16 model, which is fully trained on the well-known image dataset adopted by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), is employed to provide the initial weights for further training in the BAA task (the pre-trained model is available at: https://github.com/BVLC/caffe/wiki/Model-Zoo). The output feature maps of the different streams are finally concatenated to achieve feature-level fusion.
The regression module, which consists of several fully-connected layers, takes the concatenated feature maps as its input. In our method, based on extensive experiments, we use three fully-connected layers with 1024, 128 and 1 neurons, respectively. The final single neuron outputs the estimated bone age. We apply the mean squared error as our loss function:

L(\Theta) = \frac{1}{2n} \sum_{i=1}^{n} \left| f(X_i, \Theta) - y_i \right|^2,    (1)
where \Theta denotes the network parameters to be determined and n indicates the number of training examples in a mini-batch. X_i is the i-th training image in the batch and f(X_i, \Theta) denotes its prediction by the network. y_i is its ground truth, which is generally provided by radiologists or domain experts. The popular mini-batch stochastic gradient descent (SGD) algorithm is employed to minimize the loss function.
minimize the loss function.
272
3.3. Classification Model
273
Figure 6 shows the schematic diagram of the proposed classification model
274
for bone age assessment, which mainly contains four modules: NSCT decompo-
275
sition, classification, fusion and estimation. The NSCT decomposition module 2 The
pre-trained model is available at: https://github.com/BVLC/caffe/wiki/Model-Zoo.
13
Classification
Fusion
Estimation
FC, #class
FC, 4096
VGGNet-16 convolutional architecture
FC, 4096
Scale #1
M AN U
NSCT decomposition
SC
RI PT
ACCEPTED MANUSCRIPT
#class
FC, #class
FC, 4096
FC, 4096
Scale #2
VGGNet-16 convolutional architecture
Bone age
Fusion
#class
Input image
1
FC, 4096
FC, #class
VGGNet-16 convolutional architecture
FC, 4096
#class
EP
Scale #L
TE D
#class
Figure 6: Schematic diagram of the proposed classification model for bone age assessment. The details of VGGNet-16 convolutional architecture are given in Figure 5. The “FC” denotes
AC C
the fully-connected layer and the number of neurons per FC layer is given. Please note that the FC layers also originate from the VGGNet-16 [37], while the number of output neurons is adjusted to the number of quantized classes (denoted as “#class”) in a specific BAA task.
14
ACCEPTED MANUSCRIPT
is the same as that in the regression model presented in Section 3.2.
RI PT
276
In the classification module, each data cube obtained above is individually
278
classified by a deep convolutional network. Just as the above regression model,
279
the pre-trained VGGNet-16 model is adopted for each sub-problem. The number
280
of output neurons is adjusted to the number of quantized categories in a specific
281
BAA task. For each sub-problem, the softmax function is applied to obtain the
282
output and the cross entropy loss is used for network training. The mini-batch
283
SGD algorithm is employed to update the network parameters.
SC
277
The fusion module conducts a decision-level fusion on the multiple predic-
285
tions obtained above. Let Pi = (pi,1 , pi,2 , ..., pi,C ) denote the i-th source pre-
286
diction and i ∈ {1, 2, ..., N }, where N is the number of predictions and C is
287
the dimension of each prediction. The fused prediction Pf = (pf,1 , pf,2 , ..., pf,C )
288
can be obtained with different fusion rules including element-wise maximum,
289
element-wise mean and element-wise multiplication. Specifically, the j-th el-
290
ement (j ∈ {1, 2, ..., C}) of the fused prediction pf,j is calculated using the
291
element-wise maximum rule:
TE D
(2)
N 1 X pi,j , N i=1
(3)
i
pf,j =
and the element-wise multiplication rule:
AC C
293
pf,j = max{pi,j },
the element-wise mean rule:
EP
292
M AN U
284
pf,j =
N Y
pi,j .
(4)
i=1
294
To obtain a meaningful probability vector, the fused prediction Pf is normalized
295
by its l1 -norm.
296
Finally, the estimation module outputs the predicted bone age R as R=
C X
pf,j · aj ,
(5)
j=1 297
where aj denotes the specific bone age indicated by the j-th category according
298
to the quantization. 15
299
4. Experimental Results
300
4.1. Dataset and Compared Methods
RI PT
ACCEPTED MANUSCRIPT
In the field of bone age assessment (BAA), public X-ray datasets are quite
302
limited. Most of the existing BAA methods are tested on private datasets
303
and their source codes are not released for reproducible study, making a fair
304
comparison impossible. Fortunately, Spampinato et al. [34] recently provided a
305
benchmark for BAA performance comparison, which is a great contribution to
306
this field. They tested several automated BAA methods including hand-crafted
307
feature based ones and deep learning based ones on the Digital Hand Atlas
308
[5], a public and comprehensive X-ray BAA dataset created by the Children’s
309
Hospital Los Angeles. The dataset contains nearly 1400 left-hand radiographs of
310
children ranging from 0 to 18 years old (chronological age) and covering different
311
genders and races. For each X-ray image in the dataset, two expert readings
312
independently given by two radiologists are provided as the references. The
313
minimum and maximum bone age readings over the whole dataset are 0 and
314
19.0 years, respectively.
TE D
M AN U
SC
301
In this paper, for the sake of fair comparison, we mainly adopt the above
316
benchmark to verify the effectiveness of the proposed method. Specifically, the
317
Digital Hand Atlas dataset is used and eight existing BAA methods that have
318
been tested on the benchmark are employed for performance comparison, which
319
include the method proposed in [14], the method proposed in [5], the method
320
proposed in [19], the method proposed in [23], the VGGNet-based method [34],
321
the GoogLeNet-based method [34], the BoNet-based method [34] and a latest
322
CNN-based method proposed by Zhou et al. in [35]. Among them, the first four
323
are conventional hand-created feature based methods, while the other four are
324
CNN-based ones with the spatial domain image as network input, by applying
325
either the whole image [34] or a set of local ROIs [35]. The results of the
326
first seven methods are provided in [34], and the performance of the last one is
327
reported in its original publication [35]. It is noted that all these methods and
328
the proposed method are tested using the same train-validation strategy.
AC C
EP
315
16
ACCEPTED MANUSCRIPT
The proposed method is also compared with the well-known automated BAA
330
method, Thodberg et al.’s BoneXpert [18, 44], which has been a widely recog-
331
nized commercial software in this field3 . In [44], the BoneXpert method is tested
332
on the Digital Hand Atlas dataset, making this comparison possible. Please refer
333
to Section 4.6.2 for more related details.
334
4.2. Implementation Details
SC
336
Following the train-validation strategy used in the benchmark [34], a 5-fold cross validation is adopted to evaluate the algorithm performance.
M AN U
335
RI PT
329
According to the analysis in Section 3.1, the NSCT decomposition is per-
338
formed up to five levels on the resized X-ray images, and the numbers of direc-
339
tional bands at different scales from fine to coarse are set to 16, 8, 8, 4 and 4.
340
In this work, the decomposition procedure is implemented via a Matlab toolbox
341
for NSCT4 released by its original authors [39]. And, we perform this feature
342
pre-extraction step separately from the whole BAA system, i.e., the contourlet
343
coefficients are pre-computed and saved as the inputs of the subsequent neural
344
networks. For either the regression model or the classification model, the num-
345
ber of decomposition levels is set increasing from 1 to 5 with a stride of 1 to
346
study its impact on the performance, and the approach that directly adopts the
347
spatial domain image (i.e., without performing NSCT) as network input with
348
the same architecture is also implemented for comparison.
EP
349
TE D
337
In the regression model, the average value of two expert readings is employed as the ground truth for training. Data augmentation is carried out by extracting
351
nine uniformly spaced crops and their horizontally flipped versions. The spatial
352
size of a cropped patch is 224 × 224 to fit the VGGNet input. In our training
353
procedure, the number of training examples in a mini-batch is set to 4, which is
354
in accord with the setting used in other CNN-based methods in the benchmark
355
[34]. The momentum factor and weight decay are fixed as popular values of 0.9
AC C 350
3 The 4 The
website of BoneXpert is: https://www.bonexpert.com/. toolbox is available at: https://ww2.mathworks.cn/matlabcentral/fileexchange/10049.
17
ACCEPTED MANUSCRIPT
and 0.0005, respectively. The training process is conducted for about 80 epochs
357
over the augmented dataset. The learning rate is initially set to 0.0001 and
358
dropped by a factor of 10 after the first 60 epochs.
RI PT
356
In the classification model, the rounding value of the average reading ranging
360
from 0 to 19 is used as the label of the corresponding example. Thus, the
361
classification problem has 20 classes in total. Data augmentation including
362
random cropping and horizontal flipping is performed online during the training
363
process. The spatial size of a cropped patch is also 224 × 224. The batch
364
size, momentum factor and weight decay are all the same as those used in the
365
regression model. For each sub-problem, the training process is conducted for
366
about 180 epochs. The learning rate is initialized at 0.001 and dropped by a
367
factor of 10 after 90 epochs. Those three decision-level fusion rules presented in
368
Section 3.3 are employed in the classification model, respectively.
M AN U
SC
359
All the experiments on network training are conducted via the popular deep
370
learning framework Caffe [45]. The hardware environment mainly includes a
371
Intel Core i7-6800K CPU and a NVIDIA GTX 1080 Ti GPU.
372
4.3. Objective Evaluations
TE D
369
The mean absolute error (MAE) between the prediction and the ground
374
truth is the most explicit evaluation metric in bone age assessment. Since the
375
dataset contains two independent expert readings, the benchmark presented in
376
[34] provide three MAEs: the MAE between the prediction and first expert
EP
373
reading, the MAE between the prediction and the second expert reading, and
378
the average value of these two MAEs. In this paper, besides these three MAEs,
379
we also calculate the MAE between the estimated bone age and the average
380
value of two readings, for the reason that the average reading is the actual
381
ground truth used in network training. Therefore, four MAEs are employed in
382
our experiments. Let pi , s1,i and s2,i denote the prediction, first expert reading
383
and second expert reading of the i-th validation example, respectively. The
384
index i ∈ {1, 2, ..., K}, where K is the number of validation examples. The
AC C 377
18
ACCEPTED MANUSCRIPT
Percentage (%)
80
60
40
20
0 0.0
0.5
1.0
1.5
2.0
2.5
SC
Error (year)
RI PT
100
Figure 7: The error-inspired cumulative percentage curve for performance evaluation of bone
385
M AN U
age assessment.
above four MAEs are calculated as
386
K 1 X e1 = |pi − s1,i |, K i=1
(6)
K 1 X |pi − s2,i |, K i=1
(7)
e2 = 387
388
1 (e1 + e2 ), 2
TE D
e¯ =
K 1 X 1 e= pi − (s1,i + s2,i ) . K i=1 2
(8) (9)
In this paper, in addition to the above numerical metrics, we introduce a
390
simple yet effective curve-based evaluation approach for bone age assessment.
391
Specifically, we focus on the percentage of validation examples that have an
392
absolute error less than a certain threshold. By increasing the threshold from
AC C
EP
389
393
0 with a stride such as 0.1 years, we can obtain a series of percentage values
394
to generate a cumulative percentage curve. An example is shown in Figure 7.
395
The horizontal axis is the threshold of a certain error measure while the ver-
396
tical axis is the cumulative percentage of validation examples. In contrast to
397
the MAE metric which only gives the mean value over the whole validation
398
set, this cumulative percentage curve can provide much more insights into the
399
performance of a method via describing the error distribution. By exhibiting
400
the curves of different methods on one coordinate system, a more intuitive and
401
detailed comparison can be achieved. Furthermore, the area of the region under 19
ACCEPTED MANUSCRIPT
the curve (the filled region shown in Figure 6) can be derived as an extra nu-
403
merical metric for performance evaluation. A larger value of the area indicates
404
a better performance. The area can be approximately calculated as the sum of
405
a series of trapezoids: sc =
M X u i=1
2
(pi−1 + pi ),
RI PT
402
(10)
where u is the stride of error threshold, M is the length of percentage array pi ,
407
i ∈ {1, 2, ..., M } and the value of p0 is defined as 0.
SC
406
To the best of our knowledge, this curve-based evaluation has not appeared
409
in previous BAA-related publications. However, in existing BAA methods, the
410
output predictions detailed to each example are not available, making the gen-
411
eration of their curves not possible. In this paper, this evaluation approach is
412
mainly used in Section 4.4 for parameter study of the proposed method.
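To make the evaluation protocol concrete, the following NumPy sketch (our illustration; function and variable names are ours) computes the four MAEs of Eqs. (6)-(9) and the cumulative percentage curve together with its area s_c of Eq. (10), given per-example predictions and the two expert readings:

```python
import numpy as np

def mae_metrics(p, s1, s2):
    """Eqs. (6)-(9): MAEs of predictions p against each expert reading (e1, e2),
    their average, and the MAE against the mean of the two readings."""
    e1 = np.mean(np.abs(p - s1))
    e2 = np.mean(np.abs(p - s2))
    return e1, e2, 0.5 * (e1 + e2), np.mean(np.abs(p - 0.5 * (s1 + s2)))

def cumulative_curve(p, truth, stride=0.1, max_error=2.5):
    """Cumulative percentage curve and its area s_c (Eq. (10), trapezoidal sum
    with p_0 = 0 taken at a zero threshold)."""
    errors = np.abs(p - truth)
    thresholds = np.arange(stride, max_error + stride, stride)
    fractions = np.array([np.mean(errors < t) for t in thresholds])
    padded = np.concatenate(([0.0], fractions))
    sc = np.sum(stride / 2.0 * (padded[:-1] + padded[1:]))
    return thresholds, 100.0 * fractions, sc
```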
413
4.4. Study of Parameter Settings
M AN U
408
In this section, the impact of the number of NSCT decomposition levels
415
and the decision-level fusion rule applied in the classification model are quan-
416
titatively studied. For simplicity, the error between the prediction and average
417
value of two expert readings (the ground truth used in training) are mainly
418
employed for evaluation, while the other three types of errors show very similar
419
observations and share the same trends. Based on this error, the MAE defined
420
in Eq. (9) and the cumulative percentage curve based evaluation approach (the
421
curve along with the metric defined in Eq. (10)) are used. Moreover, for each
AC C
EP
TE D
414
422
network, we repeat the 5-fold cross validation 4 times for statistical analysis. In
423
this work, the Wilcoxon signed-rank test [46], which is a popular non-parametric
424
hypothesis test approach, is employed to determine whether there exists a sig-
425
nificant difference between the performance of two networks. As each network
426
is actually trained 20 (i.e., 5 × 4) times, the Wilcoxon signed-rank test here is
427
based on 20 pairs of MAEs obtained by the corresponding two networks. The
428
average prediction of each image (has 4 predictions due to the repetition) in
429
the X-ray dataset is used for the calculations of the MAEs and the cumulative
430
percentage curves reported in this section. 20
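For reference, the paired test described above can be run with SciPy as follows (a minimal sketch; the two arrays are placeholders standing in for the 20 per-run MAEs of the two networks being compared):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
# Placeholder arrays standing in for the 20 MAEs per network
# (5-fold cross validation repeated 4 times); real values come from training runs.
maes_network_a = 0.71 + 0.03 * rng.standard_normal(20)
maes_network_b = 0.85 + 0.03 * rng.standard_normal(20)

statistic, p_value = wilcoxon(maes_network_a, maes_network_b)  # default parameters
print(f"p-value: {p_value:.2e}")   # p < 0.05 indicates a statistically significant difference
```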
ACCEPTED MANUSCRIPT
RI PT
100 90 80
60
SC
Percentage (%)
70
50 40
spatial domain: 1.6822 1-level: 1.6118 2-level: 1.7121 3-level: 1.7650 4-level: 1.7866 5-level: 1.7841
20 10 0 0
0.5
M AN U
30
1
1.5
2
2.5
Error (year)
Figure 8: The error-inspired cumulative percentage curves of the proposed regression model
TE D
with different NSCT decomposition levels.
Table 1: The performance of the proposed regression model with different NSCT decomposition levels.
spatial domain
1-level
2-level
3-level
4-level
5-level
e
0.8512
0.9339
0.8107
0.7529
0.7136
0.7158
1.6118
1.7121
1.7650
1.7866
1.7841
sc
1.6822
4.4.1. Regression model
AC C
431
EP
metric
432
Table 1 lists the performance of the proposed regression model with differ-
433
ent NSCT decomposition levels. The best result for each metric is indicated in
434
bold. The corresponding error-inspired cumulative percentage curves are shown
435
in Figure 8. For each curve in Figure 8 (also in Figure 9 and Figure 10), the
436
corresponding area-based metric sc is provided in its legend for better compar-
437
ison. The approach that directly adopts the spatial domain image as network
438
input (denoted as “spatial domain”) is also included in Table 1 and Figure 8.
439
It can be seen that a higher decomposition level can generally obtain better
21
ACCEPTED MANUSCRIPT
results, as more useful information can be involved. The performance tends to
441
be stable after the 4-level decomposition. Furthermore, when compared with
442
the method based on spatial domain image, the proposed method with a 2-
443
level NSCT decomposition has been able to achieve better performance, which
444
indicates the significance of the NSCT-based feature pre-extraction approach
445
for this task.
SC
RI PT
440
We can observe from Table 1 (also from the subsequent Table 3 and Table 4)
447
that the metric sc can obtain very consistent results with the MAE-based metric,
448
which demonstrates the effectiveness of the curve-based evaluation approach
449
presented in this paper.
M AN U
446
Table 2 lists the p-values given by the Wilcoxon signed-rank test from the
451
comparison between the “4-level” approach and each of the other approaches.
452
The test is implemented using the stats.wilcoxon function developed in the SciPy
453
library based on Python, with default parameter settings. Clearly, the “4-level”
454
approach has a statistically significant advantage over the first four approaches
455
as the related p-values are much smaller than 0.05 (the p-value 8.9e-5 is obtained
456
when one method wins all the 20 comparisons). There is no distinct difference
457
between the performance of the “4-level” and “5-level” approaches. However,
458
due to the lower model complexity, the “4-level” approach is favored as the
459
specific choice of the proposed regression model.
EP
TE D
450
Table 2: The p-values given by the Wilcoxon signed-rank test from the comparison between
AC C
the “4-level” approach and each other approach for the regression model. spatial domain
1-level
2-level
3-level
4-level
5-level
8.9e-5
8.9e-5
8.9e-5
0.0032
n/a
0.28
Table 3: The performance of using the NSCT coefficient maps at each scale individually. metric
spatial domain
scale #1
scale #2
scale #3
scale #4
scale #5
e
0.8289
0.7632
0.7941
0.8168
1.0032
1.2924
sc
1.6908
1.7451
1.7170
1.7129
1.5569
1.3598
22
ACCEPTED MANUSCRIPT
4.4.2. Classification model
RI PT
460
Table 3 reports the performance of using the NSCT coefficient maps at each
462
scale individually (i.e., each branch in the classification model). The correspond-
463
ing curve-based performance is shown in the upper-left subfigure of Figure 9.
464
The approach that adopts spatial domain image as network input is also in-
465
cluded in Table 2 and Figure 9. It can be seen that a finer (smaller) scale tends
466
to perform better, which indicates that more distinctive information are con-
467
tained in fine scales. The difference among first three scales are relatively small,
468
showing clear advantages over the fourth and fifth scales, which is in accord
469
with the perceptual observations obtained from Figure 3. Furthermore, it is
470
remarkable that for the first three scales, the performance of using the NSCT
471
features at a single scale is even better than that of using the spatial domain
472
image. This observation further verifies the effectiveness of the NSCT-based
473
feature pre-extraction for this BAA problem.
M AN U
SC
461
TE D
Table 4: The performance of the proposed classification model with different NSCT decomposition levels and decision-level fusion rules. metric
e
fusion rule
1-level
2-level
3-level
4-level
5-level
maximum
0.8289
0.7632
0.7172
0.7032
0.7511
0.8249
mean
0.8289
0.7632
0.7140
0.6885
0.7163
0.7440
0.8289
0.7632
0.7238
0.7302
0.7659
0.7575
EP
multiplication
maximum
1.6908
1.7451
1.7873
1.8033
1.7587
1.6933
mean
1.6908
1.7451
1.7920
1.8183
1.7904
1.7683
1.6908
1.7451
1.7786
1.7799
1.7437
1.7495
AC C
sc
spatial domain
multiplication
474
Table 4 lists the performance of the proposed classification model with dif-
475
ferent NSCT decomposition levels and decision-level fusion rules. The approach
476
that adopts spatial domain image as network input is also included. For the
477
comparison of different decomposition levels, the best result in each row is in-
478
dicated in bold. For the comparison of different fusion rules, the best result
479
in each column is underlined. It is noted that both the “spatial domain” and
480
“1-level” approaches have only one branch in the classification model, so the 23
90
80
80
70
70
60 50 40
spatial domain: 1.6908 scale #1: 1.7451 scale #2: 1.7170 scale #3: 1.7129 scale #4: 1.5569 scale #5: 1.3598
30 20 10 0
60 50 40 30 20 10
spatial domain: 1.6908 1-level-maximum: 1.7451 2-level-maximum: 1.7873 3-level-maximum: 1.8033 4-level-maximum: 1.7587 5-level-maximum: 1.6933
0
0
0.5
1
1.5
Error (year)
100
2
2.5
0
0.5
1
1.5
2
2.5
Error (year)
100
90
80 70 60 50 40
80 70
Percentage (%)
TE D
90
Percentage (%)
RI PT
90
SC
100
Percentage (%)
100
M AN U
Percentage (%)
ACCEPTED MANUSCRIPT
spatial domain: 1.6908 1-level-mean: 1.7451 2-level-mean: 1.7920 3-level-mean: 1.8183 4-level-mean: 1.7904 5-level-mean: 1.7683
20 10 0
0.5
AC C
0
1
1.5
50 40 spatial domain: 1.6908 1-level-multiplication: 1.7451 2-level-multiplication: 1.7786 3-level-multiplication: 1.7799 4-level-multiplication: 1.7437 5-level-multiplication: 1.7495
30
EP
30
60
20 10 0
2
2.5
0
Error (year)
0.5
1
1.5
2
2.5
Error (year)
Figure 9: The error-inspired cumulative percentage curves related to the proposed classification model. Upper-left: the performance of using the NSCT coefficient maps at each scale individually. Other three: the impact of the number of NSCT decomposition levels on the model performance with different fusion rules.
24
80
80
70
70
60 50 40 30
60 50 40 30
20
20
2-level-maximum: 1.7873 2-level-mean: 1.7920 2-level-multiplication: 1.7786
10 0
3-level-maximum: 1.8033 3-level-mean: 1.8183 3-level-multiplication: 1.7799
10 0
0
0.5
1
1.5
Error (year)
2
2.5
0
0.5
1
1.5
2
2.5
Error (year)
100
90 80 70 60 50
EP
40 30 20
AC C
10
90 80 70
Percentage (%)
TE D
100
Percentage (%)
RI PT
90
SC
100
90
Percentage (%)
100
M AN U
Percentage (%)
ACCEPTED MANUSCRIPT
0.5
50 40 30 20
4-level-maximum: 1.7587 4-level-mean: 1.7904 4-level-multiplication: 1.7437
5-level-maximum: 1.6933 5-level-mean: 1.7683 5-level-multiplication: 1.7495
10
0
0
60
0 1
1.5
2
2.5
0
0.5
Error (year)
1
1.5
2
2.5
Error (year)
Figure 10: The error-inspired cumulative percentage curves to study the impact of applied decision-level fusion rule on the proposed classification model.
25
ACCEPTED MANUSCRIPT
fusion rule makes no difference to the results.
RI PT
481
The impact of the number of NSCT decomposition levels on the model per-
483
formance with different fusion rules is shown in Figure 9 (the upper-right, lower-
484
left and low-right subfigures). It can be observed from Table 4 and Figure 9 that
485
using 2-level and 3-level decompositions can generate better performance than
486
others, no matter which fusion rule is applied. In particular, the result obtained
487
by a 3-level decomposition is more competitive. This is not difficult to interpret.
488
On one hand, when the number of decomposition levels is too small (i.e., 1),
489
many useful distinctive details are ignored. On the other hand, when the num-
490
ber is too large (i.e., 4 or 5), the information provided by larger scales become
491
unreliable (see Table 3), delivering an opposite effect on the fusion result.
M AN U
SC
482
The impact of the applied decision-level fusion rule on the model performance
493
is shown in Figure 10. Since the “1-level” approach actually has no fusion
494
process, there are four sets of comparisons by fixing the number of decomposition
495
levels. It can be seen that the performance obtained by the mean rule (indicated
496
by the green curves) is the most competitive. From Table 4, similar observations
497
exist as the mean rule wins the first place in all the cases.
TE D
492
Table 5: The p-values given by the Wilcoxon signed-rank test from the comparison between
EP
the “3-level-mean” approach and each other approach for the classification model. spatial domain
1-level
2-level
3-level
4-level
5-level
8.9e-5
8.9e-5
0.0032
0.048
8.9e-5
8.9e-5
8.9e-5
8.9e-5
0.0032
n/a
3.9e-4
8.9e-5
8.9e-5
8.9e-5
8.9e-5
8.9e-5
8.9e-5
8.9e-5
maximum
AC C
mean
multiplication
498
Table 5 lists the p-values given by the Wilcoxon signed-rank test from the
499
comparison between the “3-level-mean” approach and each of the other ap-
500
proaches. It can be seen that all the p-values are smaller than 0.05, indicat-
501
ing that the “3-level-mean” approach significantly outperforms others, so it is
502
selected as the default method of our classification model in the subsequent
503
experiments.
26
ACCEPTED MANUSCRIPT
4.5. Impact of the Gender: Mixing versus Separation
RI PT
504
In the above experiments, we mix the data of two genders (female and male)
506
for network training. However, it is well established that the biological devel-
507
opment is different for females and males. In this section, we study the impact
508
of the gender by comparing two different training manners: mixing and sepa-
509
ration. In the mixing-based manner, the networks are trained using the mixed
510
data (just as the above experiments), but the evaluation results on each gender
511
are separately reported here. In the separation-based manner, both training
512
and testing are separately conducted on each gender. According to the anal-
513
ysis in Section 4.4, the regression model “4-level” and the classification model
514
“3-level-mean” (denoted as “proposed”) are employed for this study. The re-
515
lated spatial domain based approaches are also included for comparison. For
516
the separation-based manner, the settings of network training are exactly the
517
same as the mixing-based manner.
M AN U
SC
505
TE D
Table 6: The evaluation results (on the metric e) of different approaches using two different training manners (mixing and separation) for two genders. gender
classification model
proposed
spatial domain
proposed
mixing
0.8287
0.7033
0.8027
0.6925
separation
0.6208
0.5316
0.5992
0.5189
mixing
0.8738
0.7239
0.8550
0.6844
separation
0.6875
0.5667
0.6695
0.5518
AC C
male
regression model
spatial domain
EP
female
manner
518
The related evaluation results on the metric e are reported in Table 6. It
519
can be seen that for either the mixing-based or the separation-based manner,
520
the proposed methods have clear advantages over the corresponding spatial
521
domain based methods on both genders, which further verifies the effectiveness
522
of our NSCT-based feature pre-extraction strategy. In addition, we can see
523
that the separation-based manner can obtain better results in comparison to
524
the mix-based manner. This is straightforward as the separation of two genders
27
ACCEPTED MANUSCRIPT
will decrease the complexity of the age assessment problem, leading to higher
526
prediction accuracy.
RI PT
525
Nevertheless, it is worth noting that the mixed training manner still makes
528
sense from the viewpoint of deep learning. Using the mixing-based manner, the
529
mapping between the input and output tends to be more complicated because
530
it should fit more data, but deep learning models like CNNs have powerful ca-
531
pacity to tackle this problem. The meaningful results obtained by this manner
532
(although they are relatively inferior to the separation-based manner) verify
533
this point. In fact, the recent BAA benchmark proposed by Spampinato et
534
al. [34] also adopts the mixed training manner on this X-ray dataset. But
535
of course, from the viewpoint of biology, the separation manner offers a more
536
“natural” solution to this problem and produces better results, which provides
537
some valuable clues for the future study in this field. Based on the above con-
538
siderations, both of these two manners are attempted in this work (in Section
539
4.4, we use the mixing-based manner for the study of parameter settings be-
540
cause it can provide a more compact description than separating two genders.
541
Actually, we experimentally find that the characteristics of the results obtained
542
by the separation-based manner are very similar), and the obtained results are
543
employed for the comparison with the state of the art in Section 4.6.
544
4.6. Comparison with the State of the Art
545
4.6.1. Comparison on Spampinato et al.’s BAA benchmark
EP
TE D
M AN U
SC
527
Based on the experimental results exhibited in Section 4.4, we select the re-
547
gression model “4-level” and the classification model “3-level-mean” for compar-
548
ison with eight existing BAA methods that have been tested on the benchmark
549
[34] (more details have been provided in Section 4.1). Table 7 reports the perfor-
550
mance of different methods on the four mean squared errors (MAEs) presented
551
in Section 4.3. Some values in Table IV are not given because they were not
552
reported in related publications [34, 35]. The best results are indicated in bold.
553
It is noted that for a certain method the value of e can be smaller than both
554
e1 and e2 . Actually, this is always the case in practice because it is the average
AC C 546
28
ACCEPTED MANUSCRIPT
of two readings that acts as the ground truth for training. It can be seen from
556
Table 7 that the CNN-based methods (the last six) significantly outperform the
557
conventional hand-crafted feature based methods (the first four). The proposed
558
classification model obtains the best performance among all the methods. In
559
comparison to the baseline of CNN-based methods (VGGNet, GoogLeNet and
560
BoNet) presented in [34], the proposed method can achieve an improvement of
561
about 0.1 years. When compared with the improved TW-based method [35],
562
the proposed method can still obtain more competitive results.
M AN U
SC
RI PT
555
Table 7: The performance of different BAA methods on four mean squared errors (MAEs). Method Hsieh et al. [14] Gertych et al. [5] Giordano et al. [19] Giordano et al. [23]
e2
e¯
e
2.78
2.37
2.57
n/a
2.60
1.70
2.15
n/a
2.46
2.38
2.42
n/a
1.92
1.73
1.82
n/a
0.88
0.79
0.83
n/a
GoogLeNet [34]
0.86
0.79
0.82
n/a
BoNet [34]
0.80
0.79
0.79
n/a
TW-based VGGNet [35]
n/a
n/a
n/a
0.72
TE D
VGGNet [34]
e1
0.73
0.74
0.71
0.70
0.71
0.69
4.6.2. Comparison with Thodberg et al.’s BoneXpert To date, BoneXpert is the most influential commercial product in the field of
AC C
564
0.75
0.73
EP
563
Our regression model
Our classification model
565
bone age assessment and has been licensed to more than 130 hospitals around the world. The BoneXpert method [18] applies the active appearance model (AAM) [47] to automatically locate 15 bones in the hand and wrist for the determination of bone age based on shape, intensity and texture features. In [44], the BoneXpert method is tested on the Digital Hand Atlas dataset and obtains very impressive results. The bone age ranges tested in [44] are 2-15 years for girls and 2.5-17 years for boys. In addition, the root mean square error (RMSE) between the automated prediction and the average of the two expert readings is mainly employed for performance evaluation. Based on these settings, we compare the proposed method with the BoneXpert method in this subsection. Specifically, according to the results given in Table 6, we select the best performing approach, the "3-level-mean" classification model trained in the separation-based manner, as the proposed method for comparison. The RMSEs are calculated over the samples within the above BA ranges (2-15 years for females and 2.5-17 years for males), for the female samples, the male samples and the whole set. Table 8 lists the evaluation results of the BoneXpert method (as reported in [44]) and the proposed method. It can be seen that the BoneXpert method obtains smaller RMSEs, by about 0.08 years. This is not surprising, because the BoneXpert method is elaborately designed in every part and has been commercially deployed for clinical routine use, while the proposed method still has large potential for further improvement (more discussions on this issue are provided in Section 5).

Table 8: The performance of the BoneXpert method and the proposed method on RMSE (unit: year) for the samples with BA in the range of 2-15 years for females and 2.5-17 years for males.

Method       female    male    whole
BoneXpert    0.57      0.64    0.61
Proposed     0.66      0.71    0.69
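To make this evaluation protocol concrete, the subgroup RMSEs can be computed as in the short sketch below. This is only an illustrative NumPy example with made-up values; it is not the evaluation code actually used in this work.

```python
import numpy as np

def rmse(pred, target):
    """Root mean square error (in years) between predictions and ground truth."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Hypothetical test data: predicted bone age, ground truth (average of the two
# expert readings) and gender flag for each radiograph.
pred   = np.array([3.1, 7.8, 12.4, 16.2, 5.0, 9.6])
gt     = np.array([3.0, 8.0, 12.0, 16.5, 4.6, 9.9])
gender = np.array(['F', 'M', 'F', 'M', 'F', 'M'])

# Restrict to the BA ranges used in [44]: 2-15 years (F), 2.5-17 years (M).
female = (gender == 'F') & (gt >= 2.0) & (gt <= 15.0)
male   = (gender == 'M') & (gt >= 2.5) & (gt <= 17.0)

print('female RMSE:', rmse(pred[female], gt[female]))
print('male RMSE:  ', rmse(pred[male], gt[male]))
print('whole RMSE: ', rmse(pred[female | male], gt[female | male]))
```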
5. Discussion

This work mainly presents an NSCT-based feature pre-extraction strategy for CNN-based BAA methods, so the most fundamental comparison is with the methods that directly adopt spatial domain images as network input. The related results given in Table 1 and Table 4 show that the proposed methods outperform the corresponding spatial domain CNN-based methods with an improvement of about 0.14 years on the MAE, for both the regression models (0.8512 − 0.7136 = 0.1376) and the classification models (0.8289 − 0.6885 = 0.1404). We calculate the MAE between the two expert readings as a reference, and the value is about 0.36 years. Considering that the MAE with respect to each expert reading (e1 and e2) is very close to that with respect to the average reading (e) (see Table 7), this value can be approximately viewed as an "ideal" target for an automated BAA method, since a method reaching such an error has reached the professional standard of an expert. Therefore, when comparing the proposed methods with the spatial domain methods (for simplicity, we take the regression model as the example; the situation of the classification model is very similar), the gap between the actual performance and the "ideal" target decreases from about 0.49 years (i.e., 0.85 − 0.36) to about 0.35 years (i.e., 0.71 − 0.36), a relative reduction of about 29% (i.e., (0.49 − 0.35)/0.49 ≈ 0.29), which we believe is an obvious improvement (this evaluation process may not be strictly precise from the viewpoint of statistics, but it can approximately reflect the extent of the improvement). Similar observations can be obtained from Table 6 using the above evaluation process.
Table 9: The average running time and standard deviations of the proposed method and the spatial domain method on training CNN models in Section 4.4 (unit: hour).

                        regression model               classification model
                        spatial domain   proposed      spatial domain   proposed
average running time    3.96             14.61         1.35             4.14
standard deviation      0.04             0.06          0.02             0.03
However, the proposed BAA method is clearly more memory and time consuming than the spatial domain methods. Fortunately, owing to the popularization of high-performance GPUs, this is not a crucial issue. In our experiments, the employed GPU is an NVIDIA GTX 1080 Ti (11 GB), and we verify that all the networks can also be trained on a GTX 1080 GPU (8 GB). Table 9 reports the average running time and standard deviations of the proposed method and the spatial domain method for training the CNN models in Section 4.4. As mentioned before, the regression model "4-level" and the classification model "3-level-mean" are employed in the proposed method. Although the time cost of the proposed method is larger than that of the spatial domain method due to the higher model complexity, it takes only about 14.61 hours and 4.14 hours for the proposed method to train the regression model and the classification model, respectively. Furthermore, considering that the training procedure is performed offline, this difference in training cost is not crucial and the cost is generally acceptable in practice. Table 10 lists the average running time and standard deviations of the proposed methods and the spatial domain methods for testing an image (the time unit is millisecond). The time cost of the image pre-processing part (including image resizing for all the methods and NSCT decomposition for the proposed methods) is not included in the results of Table 10. The time cost of image resizing is negligible. The average running time of a 4-level NSCT decomposition for a 256 × 256 image using Matlab on our platform (Intel Core i7-6800K CPU) is 230 milliseconds. Clearly, this process can be further accelerated by applying more efficient implementation techniques. But even so, the total running time of the proposed method is lower than 0.3 seconds, which generally meets the requirement of practical usage.
Table 10: The average running time and standard deviations of the proposed methods and the spatial domain methods on testing an image (unit: millisecond).

                        regression model               classification model
                        spatial domain   proposed      spatial domain   proposed
average running time    5.58             22.69         8.92             35.36
standard deviation      0.03             0.08          0.04             0.12
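For reference, the per-image test time can be measured with a simple loop such as the sketch below. This is only an outline under assumed interfaces: nsct_decompose stands in for the NSCT implementation and scale_cnns for the trained per-scale networks; neither is a standard library call.

```python
import time

def predict_bone_age(image, nsct_decompose, scale_cnns, fuse):
    """One forward pass of the multi-scale pipeline for a single radiograph.

    nsct_decompose(image) is assumed to return a list with one stack of
    coefficient maps per scale; scale_cnns is the list of trained per-scale
    networks; fuse merges their outputs (feature- or decision-level fusion).
    """
    subbands = nsct_decompose(image)
    per_scale = [cnn(maps) for cnn, maps in zip(scale_cnns, subbands)]
    return fuse(per_scale)

def average_test_time_ms(images, nsct_decompose, scale_cnns, fuse):
    """Average wall-clock time per image, in milliseconds."""
    start = time.perf_counter()
    for img in images:
        predict_bone_age(img, nsct_decompose, scale_cnns, fuse)
    return 1000.0 * (time.perf_counter() - start) / len(images)
```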
The comparison on Spampinato et al.'s BAA benchmark [34] shows that the proposed method obtains a clear advantage over the CNN-based methods (VGGNet, GoogLeNet and BoNet) under the GP framework, with an improvement of 0.08-0.12 years on the MAE. There is no significant improvement when compared with the CNN-based method using the TW framework [35]. With its more detailed modeling strategy, the TW-based approach has emerged as a popular way to improve the performance of GP-based methods, but problems such as appropriately defining and accurately detecting local ROIs remain challenging. The proposed approach offers another way to improve traditional GP-based methods. Moreover, the proposed feature pre-extraction strategy can also be adopted under the TW framework.
There remains a gap between the performance of the proposed method and that of the BoneXpert method [18, 44] (about 0.08 years on RMSE), but we believe this study still has its significance. On the one hand, although the BoneXpert method has achieved great success in clinical practice, it still has a few limitations, such as rejecting a small percentage of radiographs with excessive noise and relying on a relationship between chronological age and bone age [10, 33]. On the other hand, the proposed method still leaves considerable room for future improvement. In this work, for the sake of a fair comparison on Spampinato et al.'s BAA benchmark [34], the original X-ray images are simply resampled and used as network input. By applying image pre-processing procedures such as hand region detection to eliminate the impact of background regions [10], the performance of the BAA method may be further improved. In addition, as mentioned above, the proposed approach can be viewed as a generalized improvement strategy for the spatial domain methods, so it can also be utilized in TW-based methods to pursue better performance.
Finally, we discuss a potential benefit of the feature pre-extraction strategy presented in this work. In current CNN-based BAA studies (and in many other medical image related applications), in order to adopt popular convolutional networks that are pre-trained on general imagery, almost all existing methods first resample the input images to a relatively low resolution such as 256 × 256 to fit the employed network architecture. Obviously, this operation causes a loss of image details that may be very important for the assessment task, leading to a considerable limitation of these methods. Nevertheless, the proposed feature pre-extraction strategy may provide a potential way to overcome this limitation. Specifically, by employing feature pre-extraction approaches to assist the training process, it may not be necessary to adopt pre-trained models. As a result, the restriction on image resolution can be removed and the flexibility of architecture design can be largely improved. However, the effectiveness of this idea needs a systematic verification that involves a series of factors, such as the size of the input images, the specific CNN architecture and the feature pre-extraction technique, which is beyond the scope of this paper.
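As a purely illustrative sketch of this idea (it is not a model evaluated in this paper), the toy network below takes the stacked directional sub-bands of one NSCT scale at an arbitrary resolution, e.g. 512 × 512 rather than 256 × 256, and is trained from scratch, so no pre-trained input size has to be matched; the number of bands, the channel widths and the layer count are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class SubbandRegressor(nn.Module):
    """Toy from-scratch regressor over the directional sub-bands of one NSCT
    scale; adaptive pooling makes the head independent of input resolution."""

    def __init__(self, num_bands: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(num_bands, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, 1)  # predicted bone age in years

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# Example: 8 directional bands of one scale at 512 x 512 resolution.
model = SubbandRegressor(num_bands=8)
dummy = torch.randn(2, 8, 512, 512)
print(model(dummy).shape)  # torch.Size([2, 1])
```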
6. Conclusion

In this study, a multi-scale data fusion framework for automated bone age assessment (BAA) based on convolutional neural networks (CNNs) is proposed. Instead of directly employing the spatial domain image as network input, as in existing CNN-based BAA methods, the non-subsampled contourlet transform (NSCT) is first performed for feature pre-extraction, aiming to achieve better training effects with limited annotated examples. Furthermore, since the information contained in the original spatial domain image is well separated by NSCT, this approach improves the flexibility in designing network architectures via different data fusion strategies. Based on this idea, we present a regression model using feature-level fusion and a classification model using decision-level fusion for BAA. Extensive experiments are conducted on the Digital Hand Atlas dataset. The results demonstrate that the proposed method achieves an obvious improvement over the corresponding spatial domain CNN-based methods. Moreover, the proposed method obtains very competitive performance when compared with several state-of-the-art BAA methods. In the future, we will explore the potential of the proposed feature pre-extraction strategy for overcoming the limitation of input image resolution in CNN-based bone age assessment, as discussed in the last paragraph of Section 5. The impact of different settings, including the size of the input images, the CNN architecture and the feature pre-extraction approach, will be systematically studied.
Conflict of Interest

The authors declare no potential conflict of interest.

Acknowledgements

The authors would like to thank the editors and anonymous reviewers for their insightful comments and constructive suggestions. This work was supported by the National Natural Science Foundation of China (Grants 61701160 and 81571760), the Provincial Natural Science Foundation of Anhui (Grant 1808085QF186), the Fundamental Research Funds for the Central Universities (Grant JZ2018HGTB0228) and the SenseTime Research Fund.

References
[1] D. D. Martin, J. M. Wit, Z. Hochberg, L. Savendahl, R. R. van Rijn, O. Fricke, N. Cameron, J. Caliebe, T. Hertel, D. Kiepe, K. A. Wikland, H. H. Thodberg, G. Binder, M. B. Ranke, The use of bone age in clinical practice - part 1, Hormone Research in Paediatrics 76 (1) (2011) 1–9.

[2] J. M. Tanner, M. J. R. Healy, H. Goldstein, N. Cameron, Assessment of Skeletal Maturity and Prediction of Adult Height (TW3 method), W.B. Saunders, London, 2001.

[3] H. H. Thodberg, O. G. Jenni, J. Caflisch, M. B. Ranke, D. D. Martin, Prediction of adult height based on automated determination of bone age, The Journal of Clinical Endocrinology & Metabolism 94 (12) (2009) 4868–4874.

[4] A. M. Vukicevic, G. R. Jovicic, M. N. Jovicic, V. L. Milicevic, N. D. Filipovic, Assessment of cortical bone fracture resistance curves by fusing artificial neural networks and linear regression, Computer Methods in Biomechanics and Biomedical Engineering 21 (2) (2018) 169–176.

[5] A. Gertych, A. Zhang, J. Sayre, S. Kurkowska, H. Huang, Bone age assessment of children using a digital hand atlas, Computerized Medical Imaging and Graphics 31 (2007) 322–331.

[6] W. Greulich, S. Pyle, Radiographic atlas of skeletal development of the hand and wrist, The American Journal of the Medical Sciences 238 (3) (1959) 393.

[7] J. M. Tanner, R. H. Whitehouse, N. Cameron, W. A. Marshall, M. J. R. Healy, H. Goldstein, Assessment of Skeletal Maturity and Prediction of Adult Height (TW2 method), Academic Press, London, 1975.

[8] P. Kaplowitz, S. Srinivasan, J. He, R. MaCarter, M. R. Hayeri, R. Sze, Comparison of bone age readings by pediatric endocrinologists and pediatric radiologists using two bone age atlases, Pediatric Radiology 41 (6) (2011) 690–693.

[9] V. Iglovikov, A. Rakhlin, A. Kalinin, A. Shvets, Pediatric bone age assessment using deep convolutional neural networks, arXiv 1712.05053 (2017) 1–14.

[10] H. Lee, S. Tajmir, J. Lee, M. Zissen, B. Yeshiwas, T. Alkasab, G. Choy, S. Do, Fully automated deep learning system for bone age assessment, Journal of Digital Imaging 30 (2017) 427–441.

[11] S. Wang, Y. Shen, C. Shi, P. Yin, Z. Wang, P. W. Cheung, J. P. Cheung, K. D. Luk, Y. Hu, Skeletal maturity recognition using a fully automated system with convolutional neural networks, IEEE Access 6 (2018) 29979–29993.

[12] E. Pietka, A. Gertych, S. Pospiech, F. Cao, H. Huang, V. Gilsanz, Computer-assisted bone age assessment: image preprocessing and epiphyseal/metaphyseal ROI extraction, IEEE Transactions on Medical Imaging 20 (8) (2001) 715–729.

[13] E. Pietka, Computer-assisted bone age assessment - database adjustment, International Congress Series 1256 (2003) 87–92.

[14] C. Hsieh, T. Jong, C. Tiu, Bone age estimation based on phalanx information with fuzzy constrain of carpals, Medical & Biological Engineering & Computing 45 (3) (2007) 283–295.

[15] A. Zhang, A. Gertych, B. J. Liu, Automatic bone age assessment for young children from newborn to 7-year-old using carpal bones, Computerized Medical Imaging and Graphics 31 (2007) 299–310.

[16] A. Vega, J. Arribas, A radius and ulna TW3 bone age assessment system, IEEE Transactions on Biomedical Engineering 55 (5) (2008) 1463–1476.

[17] J. Liu, J. Qi, Z. Liu, Q. Ning, X. Luo, Automatic bone age assessment based on intelligent algorithms and comparison with TW3 method, Computerized Medical Imaging and Graphics 32 (8) (2008) 678–684.

[18] H. H. Thodberg, S. Kreiborg, A. Juul, K. D. Pedersen, The BoneXpert method for automated determination of skeletal maturity, IEEE Transactions on Medical Imaging 28 (1) (2009) 52–66.

[19] D. Giordano, C. Spampinato, G. Scarciofalo, R. Leonardi, An automatic system for skeletal bone age measurement by robust processing of carpal and epiphysial/metaphysial bones, IEEE Transactions on Instrumentation and Measurement 59 (10) (2010) 2539–2553.

[20] H. Lin, S. Shu, Y. Lin, S. Yu, Bone age cluster assessment and feature clustering analysis based on phalangeal image rough segmentation, Pattern Recognition 45 (2012) 322–332.

[21] M. Harmsen, B. Fischer, H. Schramm, T. Seidl, T. Deserno, Support vector machine classification based on correlation prototypes applied to bone age assessment, IEEE Journal of Biomedical and Health Informatics 17 (1) (2013) 190–197.

[22] M. Kashif, T. M. Deserno, D. Haak, S. Jonas, Feature description with SIFT, SURF, BRIEF, BRISK, or FREAK? A general question answered for bone age assessment, Computers in Biology and Medicine 68 (2016) 67–75.

[23] D. Giordano, I. Kavasidis, C. Spampinato, Modeling skeletal bone development with hidden Markov models, Computer Methods and Programs in Biomedicine 124 (2016) 138–147.

[24] S. Simu, S. Lal, A study about evolutionary and non-evolutionary segmentation techniques on hand radiographs for bone age assessment, Biomedical Signal Processing and Control 33 (2017) 220–235.

[25] M. Sabeti, R. Boostani, B. Davoodi, Improved particle swarm optimisation to estimate bone age, IET Image Processing 12 (2) (2018) 179–187.

[26] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.

[27] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.

[28] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.

[29] P. Meyer, V. Noblet, C. Mazzara, A. Lallement, Survey on deep learning for radiotherapy, Computers in Biology and Medicine 98 (2018) 126–146.

[30] G. Litjens, T. Kooi, B. Bejnordi, A. Setio, F. Ciompi, M. Ghafoorian, J. van der Laak, B. van Ginneken, C. Sanchez, A survey on deep learning in medical image analysis, Medical Image Analysis 42 (2017) 60–88.

[31] D. Shen, G. Wu, H. Suk, Deep learning in medical image analysis, Annual Review of Biomedical Engineering 19 (2017) 221–248.

[32] D. Stern, C. Payer, V. Lepetit, M. Urschler, Automated age estimation from hand MRI volumes using deep learning, in: Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2016, pp. 194–202.

[33] M. Chen, Automated bone age classification with deep neural networks, Tech. rep., Stanford University (2016).

[34] C. Spampinato, S. Palazzo, D. Giordano, M. Aldinucci, R. Leonardi, Deep learning for automated skeletal bone age assessment in X-ray images, Medical Image Analysis 36 (2017) 41–51.

[35] J. Zhou, Z. Li, W. Zhi, B. Liang, D. Moses, L. Dawes, Using convolutional neural networks and transfer learning for bone age classification, in: Proceedings of International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2017, pp. 1–6.

[36] X. Chen, C. Zhang, Y. Liu, Bone age assessment with X-ray images based on contourlet motivated deep convolutional networks, in: IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), 2018, pp. 1–6.

[37] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv 1409.1556v2 (2014) 1–10.

[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, arXiv 1409.4842 (2014) 1–12.

[39] A. L. da Cunha, J. Zhou, M. N. Do, The nonsubsampled contourlet transform: theory, design, and applications, IEEE Transactions on Image Processing 15 (10) (2006) 3089–3101.

[40] Y. Zhou, J. Wang, Image denoising based on the symmetric normal inverse Gaussian model and non-subsampled contourlet transform, IET Image Processing 6 (8) (2012) 1136–1147.

[41] Q. Zhang, B. Guo, Multifocus image fusion using the nonsubsampled contourlet transform, Signal Processing 89 (7) (2009) 1334–1346.

[42] S. Li, B. Yang, J. Hu, Performance comparison of different multi-resolution transforms for image fusion, Information Fusion 12 (2011) 74–84.

[43] Y. Liu, S. Liu, Z. Wang, A general framework for image fusion based on multi-scale transform and sparse representation, Information Fusion 24 (2015) 147–164.

[44] H. H. Thodberg, L. Savendahl, Validation and reference values of automated bone age determination for four ethnicities, Academic Radiology 17 (11) (2010) 1425–1432.

[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the ACM International Conference on Multimedia, 2014, pp. 675–678.

[46] F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin 1 (6) (1945) 80–83.

[47] T. F. Cootes, G. J. Edwards, C. J. Taylor, Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 681–685.

[Figure: error curve, Percentage (%) vs. Error (year); legend: 2-level-maximum 1.7873, 2-level-mean 1.7920, 2-level-multiplication 1.7786.]

[Figure: error curve, Percentage (%) vs. Error (year); legend: 3-level-maximum 1.8033, 3-level-mean 1.8183, 3-level-multiplication 1.7799.]

[Figure: error curve, Percentage (%) vs. Error (year); legend: 4-level-maximum 1.7587, 4-level-mean 1.7904, 4-level-multiplication 1.7437.]

[Figure: error curve, Percentage (%) vs. Error (year); legend: 5-level-maximum 1.6933, 5-level-mean 1.7683, 5-level-multiplication 1.7495.]

[Figure: distribution of the dataset, Number of Images vs. Bone Age (years).]

[Figure: schematic of the NSCT decomposition; NSP and NSDFB stages applied from fine to coarse scales, producing directional high-frequency sub-bands and a low-frequency sub-band of the input image.]

[Figure: NSCT sub-band maps of an input image at five scales; Scale #1 (16 bands), Scale #2 (8 bands), Scale #3 (8 bands), Scale #4 (4 bands), Scale #5 (4 bands).]

[Figure: the regression model with feature-level fusion; the input image is decomposed by NSCT, each scale (#1 to #L) is fed into a VGGNet-16 convolutional architecture, the per-scale features are concatenated and passed through a convolution layer and fully-connected layers (1024, 128, 1) to regress the bone age. The VGGNet-16 convolutional architecture consists of stacked 3x3 convolution layers (64 to 512 channels) interleaved with 2x2 max-pooling layers.]

[Figure: the classification model with decision-level fusion; each NSCT scale (#1 to #L) is fed into a VGGNet-16 convolutional architecture followed by fully-connected layers (4096, 4096, #class), and the per-scale class outputs are fused to estimate the bone age.]

[Figure: error curve, Percentage (%) vs. Error (year).]

[Figure: error curve, Percentage (%) vs. Error (year); legend: spatial domain 1.6822, 1-level 1.6118, 2-level 1.7121, 3-level 1.7650, 4-level 1.7866, 5-level 1.7841.]

[Figure: error curve, Percentage (%) vs. Error (year); legend: spatial domain 1.6908, scale #1 1.7451, scale #2 1.7170, scale #3 1.7129, scale #4 1.5569, scale #5 1.3598.]

[Figure: error curve, Percentage (%) vs. Error (year); legend: spatial domain 1.6908, 1-level-maximum 1.7451, 2-level-maximum 1.7873, 3-level-maximum 1.8033, 4-level-maximum 1.7587, 5-level-maximum 1.6933.]

[Figure: error curve, Percentage (%) vs. Error (year); legend: spatial domain 1.6908, 1-level-mean 1.7451, 2-level-mean 1.7920, 3-level-mean 1.8183, 4-level-mean 1.7904, 5-level-mean 1.7683.]

[Figure: error curve, Percentage (%) vs. Error (year); legend: spatial domain 1.6908, 1-level-multiplication 1.7451, 2-level-multiplication 1.7786, 3-level-multiplication 1.7799, 4-level-multiplication 1.7437, 5-level-multiplication 1.7495.]

Highlights

1. Proposes a multi-scale data fusion framework for BAA based on NSCT and CNNs.
2. Presents a regression CNN model for BAA using a feature-level fusion strategy.
3. Presents a classification CNN model for BAA using a decision-level fusion strategy.
4. Introduces a novel curve-based performance evaluation approach for BAA.
5. Achieves state-of-the-art BAA results on the Digital Hand Atlas dataset.