Computers in Industry 112 (2019) 103121
Multi-class structural damage segmentation using fully convolutional networks

Juan Jose Rubio (a), Takahiro Kashiwa (b), Teera Laiteerapong (a), Wenlong Deng (a), Kohei Nagai (b), Sergio Escalera (c), Kotaro Nakayama (d), Yutaka Matsuo (d), Helmut Prendinger (a,*)
(a) National Institute of Informatics, Tokyo, Japan
(b) Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
(c) Universitat de Barcelona and Computer Vision Center, Barcelona, Spain
(d) Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
ARTICLE INFO

Article history: Received 31 December 2018; Received in revised form 3 July 2019; Accepted 5 August 2019; Available online 21 August 2019

ABSTRACT
Structural Health Monitoring (SHM) has benefited from computer vision and, more recently, Deep Learning approaches to accurately estimate the state of deterioration of infrastructure. In our work, we test Fully Convolutional Networks (FCNs) with a dataset of bridge deck areas for damage segmentation. We create a dataset for delamination and rebar exposure that has been collected from inspection records of bridges in Niigata Prefecture, Japan. The dataset consists of 734 images with three labels per image, which makes it the largest dataset of images of bridge deck damage. This data allows us to estimate the performance of our method based on regions of agreement, which emulates the uncertainty of in-field inspections. We demonstrate the practicality of FCNs for automated semantic segmentation of surface damages. Our model achieves a mean accuracy of 89.7% for delamination and 78.4% for rebar exposure, and a weighted F1 score of 81.9%.
© 2019 Published by Elsevier B.V.
Keywords: Bridge damage detection; Deep learning; Semantic segmentation
1. Introduction

Japan's transportation infrastructure includes a large number of bridges, 723,495 according to the Ministry of Land, Infrastructure, Transport and Tourism (MLIT). More than half of the bridges in Japan are below 100 m in length, and 48% of all bridges in Japan are expected to be more than 50 years old within the coming 10 years. The structural integrity of critical infrastructure needs to be evaluated regularly to guarantee the safety of its operation. At the same time, there is a strong demand for less time-consuming and more efficient inspections, because the bridge managers (71% of which are municipalities in Japan) lack civil engineers with sufficient knowledge of infrastructure maintenance [1]. According to the Concrete Standard Specification in Japan [2], bridge inspection is divided into two phases: overall inspection for global health monitoring, and detailed inspection for local health monitoring.
* Corresponding author. E-mail address: [email protected] (H. Prendinger).
https://doi.org/10.1016/j.compind.2019.08.002
The overall inspection is intended to check the existence of damage for all bridges. If the engineers find damage on a bridge and decide that more facts should be collected to evaluate the prospect of damage expansion, the bridge is subject to an additional detailed inspection with more sophisticated and quantitative methods, such as Non-Destructive Evaluation (NDE). To prevent accidents caused by the deterioration of bridges, both overall and detailed inspections need to be conducted. While visual judgment by highly skilled civil engineers is required for the inspection, Structural Health Monitoring (SHM) has been adopted for identifying the location and severity of damage. The concept of SHM is to determine the existence, location and severity of damage in a structure. Thanks to the development of acceleration sensors and NDE, SHM has been adopted especially for detailed inspection. However, because of the complicated, expensive equipment and time-consuming procedures, NDE is not practical for overall inspection [3]. Rapid improvement in image processing technologies allows us to conduct the overall inspection more efficiently, especially for cracks on concrete decks. However, overall inspection is also useful to identify other kinds of damage that can be observed on concrete. For instance, "delamination" refers to a state where there
are fracture planes on the surface of the concrete [4], and excess delamination will result in "rebar exposure" (here, "rebar" means the steel reinforcement bar inside the concrete). The current inspection code in Japan defines 14 kinds of damage, including delamination [5], but there has been little research that addresses the automated identification of such damage (except cracks). We acknowledge that the internal damage of a bridge can be quite extensive before it can be captured by vision-based methods. Nevertheless, such methods are well suited for cost-effective bridge inspection.

In our collaborative work between Artificial Intelligence researchers and civil engineers, we aim to develop a proof of concept for damage detection on the surface of concrete bridges in Japan. The main contribution of our project is an automated segmentation framework for finding delamination and rebar exposure in concrete bridges, which can potentially help to re-evaluate the criteria of damage definition in Japan. A related goal is the use of this methodology in developing countries, where the lack of skilled labor qualified for infrastructure inspection makes an automated evaluation framework especially beneficial. In order to evaluate the segmentation method, as a secondary contribution, a dataset of 734 images from the inspection records of Niigata Prefecture, Japan, has been collected, with multiple labels per sample in order to quantify the agreement among different inspectors. For this proof of concept, the dataset is limited to a certain region of a concrete bridge, the concrete deck.

The remainder of the paper is organized as follows. Section 2 reports on work related to our paper. The description of the datasets is given in Section 3. Section 4 briefly explains the FCN model used in this paper. Section 5 provides information on our experiments, such as parameter settings, the definition of our unique Intersection over Union (IoU) metric, and the evaluation methodology. Section 6 provides the analysis to evaluate model performance. Finally, Section 7 concludes the paper and suggests directions for future work.
2. Related works

Traditional approaches for SHM mainly address local damage evaluation, i.e., the identification of damage locations and quantitative evaluation. For the inspection of concrete structures, approaches such as the Impact Echo (IE) method [6,7], the combination of IE and the Impulse Response method [8], and the Infrared Thermography (IR) method [9,10] have been proposed as NDE methods for SHM. To perform NDE, sensors are attached to some of the large structures. However, because the apparatus has not been implemented on many other bridges, and because of legal regulations, inspectors in rural municipalities rely mostly on on-site visual judgment for their inspection. Recently, many infrastructure managers have shown an interest in applying SHM for overall inspection as global health monitoring, because the Ministry of Land, Infrastructure, Transport and Tourism in Japan (MLIT) requires bridge managers to conduct an overall inspection of all bridges every 5 years [5].

Historically, many researchers have tried to identify the existence of cracks on concrete surfaces. Classic computer vision methods comprise thresholding and the use of filters for crack detection in concrete or steel [11,12]. Lately, computer vision algorithms have been combined with machine learning, such as Support Vector Machines (SVM) or self-organizing maps, for crack detection [13,14]. Due to the success of Deep Learning in computer vision tasks, Convolutional Neural Networks (CNN) have recently been applied to damage detection. For example, Cha et al. [15] applied the AlexNet network [16] to crack detection on concrete surfaces in a patch-wise analysis of images taken in a controlled (not natural) scenario.

However, there are many other defects that might be automatically detected on concrete decks. One of the common defects on concrete surfaces besides cracks is delamination [4]. Delamination is a state where a thin cover of the concrete deck of a bridge is partially detached from the surface of the deck. When delamination becomes worse, the rebar inside the deck is exposed (rebar exposure), which is considered a serious deficiency. Although delamination and rebar exposure are regulated as kinds of damage that should be detected in overall inspection [5], little work has been done on identifying the location of these damages. The first work to detect and localize multiple types of civil infrastructure damage using flexible bounding boxes is described in [17]. They succeed in localizing five types of damage (concrete cracks, medium and high steel corrosion, steel delamination, bolt corrosion) appearing in concrete/steel structures. The goal of that research is the detection of damage by Faster R-CNN, in which the results are represented as rectangles. However, for inspectors to clearly estimate the remaining structural functionality, grasping the exact areas of the damages is very important. Therefore, the goal of our research, "semantic segmentation" (or pixel-wise labeling) of the damages, has great practical meaning. Further, we would like to note that the work of [17] is performed in a controlled scenario, whereas in our case, we use realistic images taken by inspectors in the real world, who were not anticipating that their images would be used for Deep Learning. [18] is the first paper to quantify concrete spalling damage in terms of accurate volume, which adds another dimension to investigating damages. While we also use Fully Convolutional Networks (FCNs) for image segmentation, we aim to apply FCNs for segmentation on a realistic dataset of texture-based damages in images collected by Niigata Prefecture, Japan, over several years. In contrast to images taken in an experimental environment or a controlled setting, such a dataset collected "in the wild" poses several challenges in terms of variation in quality and also labeling.

3. Dataset
3.1. Description of dataset

One of the major contributions of this work is the creation of a dataset that can be used to benchmark texture-based segmentation models, while also considering the uncertainty of the labels due to the complexity of the damage "object". The dataset consists of images taken from standard infrastructure inspection records of 26 municipalities of the Niigata Prefecture in Japan. The images are obtained from the deck area (see Fig. 1) of 9344 bridges, yielding an initial set of 1879 images containing damage. After automatic filtering based on resolution and manual inspection based on data quality, a set of 734 images has been labeled. Automatic filtering has been performed by setting a lower bound on resolution (1 Mpx).
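As an illustration of this filtering step, the sketch below (our own illustration, not the actual preprocessing script; the folder name is hypothetical) keeps only images of at least 1 Mpx, using Pillow to read image dimensions:

```python
# Minimal sketch of the 1 Mpx resolution filter described above.
# The folder name is hypothetical; only the threshold comes from the text.
from pathlib import Path
from PIL import Image

MIN_PIXELS = 1_000_000  # lower bound of 1 Mpx

def meets_resolution(path: Path) -> bool:
    """Return True if the image has at least MIN_PIXELS pixels."""
    with Image.open(path) as img:
        width, height = img.size
    return width * height >= MIN_PIXELS

candidates = sorted(Path("inspection_records").glob("*.jpg"))
selected = [p for p in candidates if meets_resolution(p)]
print(f"kept {len(selected)} of {len(candidates)} images")
```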
Fig. 1. Structure of a standard bridge, from the Standard Bridge Maintenance Instruction [5]. The dataset is based on the region marked as deck.
This filtering criterion targets an old image archive that contains very low resolution images. It is important to note that the origin of the dataset was purely the documentation of visual inspections in standard procedures, which yields a very rich dataset in terms of real-world representation, in contrast to datasets created in a controlled scenario. The images, taken by different inspectors, present a wide variety of angles, distances and lighting conditions.

In this work, we focus on two types of damage within our Regions Of Interest (ROI). First, delamination is present when the concrete surface gets detached from the structure, mainly due to poor material quality, the building process, or the age of the structure. When delamination takes place, the aggregate of the concrete becomes visible. Heavily delaminated regions lead to the second type of damage, rebar (reinforcing steel bar) exposure. Rebar corrosion is present when the steel bar is visible and gets corroded due to salt and humidity. Sample images can be seen in Fig. 2.

The dataset has been annotated using the web-based tool LabelMe [19] by three civil engineering students, and for a subset of the data, labeling was outsourced to three external labelers. The external labelers have been trained and instructed to follow the established criteria, based on the following definitions. Delamination: a state where aggregates (stones mixed in the concrete) are obviously visible. Rebar exposure: a state where rebar is obviously recognizable on the surface. We provided several examples of annotations for these damages so that all the labelers share the same criteria for delamination and rebar exposure. We communicated both face-to-face and over e-mail. A total of six labelers annotated three labels ("delamination", "rebar exposure", "no damage") per image to model the uncertainty in the process of evaluating the damages in a given inspection image.

A set of statistics has been computed to describe the dataset. As shown in Fig. 3, delamination represents the majority of blobs. Its histogram is more spread out, indicating that sizes and shapes are very diverse and, in general, much bigger than rebar exposure blobs. Regarding rebar exposure, the labeled regions are much smaller in size and usually present a more constant structure, with an elongated thin shape as steel bars are exposed. The texture representing rebar exposure tends to be more consistent compared to delamination areas, since exposed or corroded metal is a texture with less variation than the different levels of concrete delamination. Finally, the class "non-damage" is generally represented as only one blob, since it is defined as the area inside the ROI that is not labeled as either of the two damages. The average size of this class represents 86% of
the entire deck area. The images are in general around 1 Mpx; most have a standard resolution while others are crops of larger images.

To evaluate the level of agreement among the labelers, Table 1 presents the per-class IoU (Intersection over Union). This agreement has been computed by averaging the IoU of the three unique pairwise combinations of the labels per image across the entire dataset. We can observe that the agreement on non-damaged areas is very high, as they are generally easy to identify. Parts with no agreement tend to be small in relation to the entire area of non-damage. For delamination, on the other hand, even though its texture is easy to identify, disagreement arises when it comes to delimiting the precise area of delamination. The reason is that, depending on the degree of aggregation, some inspectors label a region as "damage" and others as "no damage". This damage type generally has a shared core among different labelers, and the disagreement tends to be on the border of the damage. Finally, rebar exposure has a slightly higher agreement due to its very well defined elongated shape, but still lacks high agreement due to the difficulty of defining rebar exposure in some cases: depending on lighting, and even dirt or concrete cracks, a region can be misinterpreted as rebar exposure. In this case, the agreement tends to be more refined, but the hit rate is lower than for delamination blobs.

3.2. Multi-label approach

Considering the nature of the dataset and its variability in the interpretation of the damages, every image has a set of three labels obtained independently from different experts. This allows us to obtain a more accurate estimate of human-level accuracy as well as a metric that considers areas of uncertainty. Note that our dataset is multi-class, because we discriminate multiple labels (types of damage), and it is multi-label, because each image pixel can have different labels according to the judgement of different labelers (annotators).

Multiple labels per image pixel allow us to train the model with different approaches based on the degree of agreement among the different labels. The strict approach only uses regions with agreement among all labelers. Another approach uses the multiple labels as a data augmentation strategy that yields a dataset three times bigger, where each image sample is trained with different labels.

To take advantage of having multiple labels per data sample, the labels have been extended to form another ground truth (GT) in which the certain, uncertain and penalization regions for each class C are considered. This allows us to take into account the labeling agreement among the different labelers. With $GT^{C}_{i}$ denoting the ground truth of labeler $i$ for class $C$, and $L_C$ the number of labelers:

$$GT^{C}_{\text{core}} = \bigcap_{i=1}^{L_C} GT^{C}_{i} \qquad (1)$$

$$GT^{C}_{\text{uncertain}} = \left( \bigcup_{i=1}^{L_C} GT^{C}_{i} \right) \setminus GT^{C}_{\text{core}} \qquad (2)$$

$$GT^{C}_{\text{error}} = \left\{ x \;\middle|\; x \notin \bigcup_{i=1}^{L_C} GT^{C}_{i} \right\} \qquad (3)$$

$GT^{C}_{\text{core}}$ is the main target of the prediction, since it represents full agreement. $GT^{C}_{\text{uncertain}}$ represents the area in which at least one label contains pixels of a particular class $C$ but which does not belong to the full agreement. Finally, $GT^{C}_{\text{error}}$ is the region in which there is full agreement on the absence of class $C$; therefore, a prediction of class $C$ within this region will penalize the score. An example of the resulting ground truth per class, given a set of three different original labels, can be seen in Fig. 4.
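The extended ground truth of Eqs. (1)–(3) reduces to set operations on binary masks. Here is a minimal NumPy sketch (our own illustration, not the authors' code), assuming one boolean mask per annotator for a given class:

```python
import numpy as np

def extended_ground_truth(annotator_masks):
    """Compute the regions of Eqs. (1)-(3) for one class.

    annotator_masks: list of L_C boolean (H, W) arrays, one per labeler,
    where True marks pixels labeled as the class C in question.
    """
    union = np.logical_or.reduce(annotator_masks)   # marked by at least one labeler
    core = np.logical_and.reduce(annotator_masks)   # Eq. (1): full agreement on C
    uncertain = union & ~core                       # Eq. (2): partial agreement
    error = ~union                                  # Eq. (3): full agreement of no C
    return core, uncertain, error

# Usage with three annotators' masks for one class:
# core, uncertain, error = extended_ground_truth([gt_1, gt_2, gt_3])
```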
Fig. 2. Sample images containing the Region Of Interest (ROI) in “red”, delamination damage in “blue”, rebar exposure in “purple” and non-damage regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 3. Histogram of blob pixel sizes.
Table 1
Average IoU of the 3 labels per image, reported per class.

       Delamination   Rebar exposure   Non-damage
IoU    0.435          0.466            0.887
4. Deep learning models

4.1. Convolutional neural networks
Convolutional Neural Networks (CNN) are a specific type of Neural Network (NN) targeted at applications involving structured data such as images, time series or 3D point clouds, among others. CNNs gained a lot of attention after AlexNet [16] was introduced in the 2012 ImageNet classification competition, and many current state-of-the-art methods in computer vision are based on the building blocks of CNNs.

The main elements of a regular CNN are convolutional layers, composed of three stages: a convolution, a nonlinearity and a pooling stage. Given an input tensor, a convolution computes a linear operation by means of matrix multiplication with filters, yielding
Fig. 4. Multilabel example and the corresponding extended label per class. A blue region corresponds to $GT^{C}_{\text{core}}$, a yellow region to $GT^{C}_{\text{error}}$, and a purple region to $GT^{C}_{\text{uncertain}}$. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
an output representation, or feature maps. The output tensor will have different dimensions depending on parameters such as the number of filters, the size of the filters, the stride at which the convolution is applied, and the presence or absence of padding in the input tensor. Considering that the filters have a smaller dimension than the input tensor, each output unit is related to a subset of input units, and the same weights in the filter are used at different positions. This improves computational efficiency by reducing the number of parameters compared to a Multilayer Perceptron (MLP), in which all the input units are connected to each output unit; at the same time, sharing the weights makes this type of network translation invariant.

After a convolution or a set of convolutions, the feature maps are usually subjected to an activation function in order to introduce nonlinearities into the network. Different types of nonlinearities can be applied, ReLU (rectified linear unit) being one of the most used in recent architectures [20]. In our work, ReLU was used as the activation function for all neurons in the network, since it was the activation applied in the network used as initialization for transfer learning.

The pooling stage substitutes the values at certain locations with values computed from neighbouring points. Common pooling approaches are max pooling and average pooling. These operations are fixed and therefore do not need to be learned. Pooling provides invariance to small translations and is commonly used to downsample the input feature maps by using a stride different from 1. Downsampling the input improves computational efficiency when concatenating convolutional layers, by summarizing the feature maps into representations of lower dimension and thus decreasing the number of operations in the next layer. Finally, in a common classification CNN, fully connected layers are used in the last stage of the network, and the whole network is trained with standard learning algorithms. For more in-depth information, the reader is referred to [21].
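To make the three stages concrete, here is a minimal sketch of a single convolution-ReLU-max pooling block in Chainer (the framework used in Section 5); the channel sizes are illustrative and not taken from our network:

```python
import chainer
import chainer.functions as F
import chainer.links as L

class ConvBlock(chainer.Chain):
    """One convolutional stage: convolution -> ReLU -> max pooling."""

    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        with self.init_scope():
            # 3x3 filters with stride 1 and padding 1 preserve spatial size:
            # out = (in + 2*pad - ksize) / stride + 1
            self.conv = L.Convolution2D(in_channels, out_channels,
                                        ksize=3, stride=1, pad=1)

    def __call__(self, x):
        h = F.relu(self.conv(x))       # nonlinearity stage
        return F.max_pooling_2d(h, 2)  # pooling stage: downsample by a factor of 2

# A (1, 3, 256, 256) input yields a (1, 64, 128, 128) output feature map.
```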
4.2. Fully convolutional networks

Fully Convolutional Network (FCN) [22] based models are currently the state of the art for semantic segmentation. Compared to previous approaches, FCNs provide end-to-end training and an inference pipeline on inputs of variable size. By convolutionalizing the entire network and avoiding pre-processing, FCNs yield an efficient framework for dense prediction tasks. The concept of convolutionalization can be applied to any regular CNN, and thus a common approach is to make use of transfer learning by transforming networks pre-trained on classification tasks, such as AlexNet [16] or VGG [23]. This process is done by discarding the last classifier layer and converting the fully connected layers to convolutions. The final classifier layer is substituted by a 1×1 convolution layer with as many filters as classes, which outputs a coarse heatmap. The loss in spatial dimension due to strided convolutions and max pooling is then compensated by upsampling with transposed convolutions to match the input size. The filters in the transposed convolution layers can be fixed to an upsampling method such as bilinear interpolation, but they can also be learned through backpropagation.

The final upsample from a coarse heatmap restricts the level of detail due to the loss of information in the spatial dimensions. This shortcoming can be tackled by combining feature maps from shallower layers with skip connections. Adding a 1×1 convolution layer to shallower layers in order to compute class scores allows us to fuse feature maps from different depths. Adding the scores from multiple feature maps before a final upsample provides a more accurate pixel-wise prediction. An example of an FCN is depicted in Fig. 5. Note that, due to slight disagreement in the definition of the deck area, only the intersection of the deck regions of the different annotations is considered as the ROI.
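As an illustration of the score fusion just described (a generic sketch, not our exact architecture), the module below computes 1×1 class scores at two depths and adds the 2× upsampled coarse scores to the shallower ones; it assumes the shallow feature map is exactly twice the spatial size of the deep one, whereas real FCN implementations also crop to align the maps:

```python
import chainer
import chainer.links as L

class SkipFusion(chainer.Chain):
    """FCN-style fusion of class scores from two depths (illustrative)."""

    def __init__(self, deep_channels, shallow_channels, n_classes):
        super().__init__()
        with self.init_scope():
            # 1x1 convolutions turn feature maps into per-class score maps.
            self.score_deep = L.Convolution2D(deep_channels, n_classes, ksize=1)
            self.score_shallow = L.Convolution2D(shallow_channels, n_classes, ksize=1)
            # Learnable 2x upsampling by transposed convolution; the filters
            # could instead be fixed to bilinear interpolation weights.
            self.upsample2x = L.Deconvolution2D(n_classes, n_classes,
                                                ksize=4, stride=2, pad=1)

    def __call__(self, deep_feat, shallow_feat):
        coarse = self.upsample2x(self.score_deep(deep_feat))  # coarse scores, 2x up
        fine = self.score_shallow(shallow_feat)               # shallow-layer scores
        return coarse + fine                                  # fused score map
```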
Fig. 5. Graph of a VGG-based FCN. Skip connections are represented as arrows from different levels after the maxpooling layers. The entire architecture consists of convolutional operations, lacking dense layers. Feature maps at different points in the network are fused and upsampled to match the original input size.
5. Methodology

5.1. Experimental setup

The experiments have been performed on an Intel i9-7900X with 10 cores (20 threads) @ 3.3 GHz and 2 Nvidia GTX 1080 Ti GPUs. The Deep Learning framework used is Chainer [24], and the implementation is based on an open-sourced generic implementation for the PASCAL dataset [25]. The images are masked using the Region Of Interest (ROI) of the deck area described in the dataset section. In order to keep consistency and to ignore small areas of disagreement in the region definition, the images and labels are masked with the intersection of the ROIs.

We selected VGG-16 as the backbone for the FCN (Fully Convolutional Network) given its accurate results in various visual tasks. With a step-by-step training procedure, our rather small and unbalanced dataset can be trained on and refined. For transfer learning from other datasets, we used a VGG-16 pre-trained on the PASCAL VOC 2012 dataset for weight initialization, and then trained it with Stochastic Gradient Descent (SGD) on 256×256 random crops with weight decay and momentum. The learning rate is doubled for biases and, for every skip connection added, the learning rate is decreased by a factor of 100. The hyperparameters used for training are defined in Table 2. Dataset augmentation is used by applying horizontal and vertical flips, changes in illumination and slight affine transformations.

Table 2
Training parameters.

Parameter        Value
Batch size       8
Optimizer        SGD + weight decay
Learning rate    1e-10
Weight decay     1e-3
Momentum         0.99

Two different training strategies have been tested regarding the multiple labels. A first and more strict approach is to train each image sample with the intersection of the labels, considering only the total agreement as the ground truth for an image. For the sample presented in Fig. 4, the corresponding label would be the one shown in Fig. 6. The second approach consists in using each of the multiple labels per image as a data sample. This can be seen as a form of data augmentation, since each data sample in the training set is used with 3 different labels. In this case, during training, the label is randomly sampled from the pool of different labels.

A total of 12 experiments have been run, evaluating the effect of (1) skip connections (no skip, 1 skip and 2 skip connections), (2) the two training strategies described above, (3) learnable or fixed bilinear upsampling weights (only applicable with 1 and 2 skip connections), and (4) data augmentation (tested with no skip connections and kept fixed for the rest). Data augmentation was kept fixed in all the combinations with 1 and 2 skip connections, since we observed that this method improved all the scenarios tested on FCN32. The model selected, for which the following results are presented, corresponds to 2 skip connections (FCN8), trained on the intersection of the labels and with learnable weights for the upsampling layers. These results have been validated in a 5-fold cross validation.

The experiments have been run in a sequential manner, i.e., running models without skip connections first and using the weights as initialization for the next model with a skip connection, and following the same procedure when adding an extra skip connection. The experiments using skip connections (FCN16 and FCN8) with a particular set of training parameters, such as data augmentation or the training strategy, have been initialized with the weights of the corresponding model trained on the same set of parameters. The values reported correspond to the IoU considering uncertainty regions.

Fig. 6. Label of a data sample based on the intersection of the original labels. Green represents the intersection area of delamination, yellow the rebar exposure intersection, blue the non-damage intersection, and purple the outside of the ROI area. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
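For illustration, the settings of Table 2 translate directly into Chainer's optimizer API; the sketch below uses a stand-in model and omits the per-parameter rules from the text (doubled learning rate for biases, reduced learning rate per added skip connection):

```python
import chainer.links as L
from chainer import optimizers
from chainer.optimizer_hooks import WeightDecay  # chainer.optimizer.WeightDecay in older versions

model = L.Convolution2D(3, 4, ksize=1)  # stand-in for the FCN model (assumed)

# SGD with momentum and weight decay, values from Table 2.
optimizer = optimizers.MomentumSGD(lr=1e-10, momentum=0.99)
optimizer.setup(model)
optimizer.add_hook(WeightDecay(1e-3))
```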
5.2. Metrics and evaluation methodology

The Jaccard index, or Intersection over Union (IoU), is a standard metric for assessing the performance of object segmentation models, both for models producing bounding boxes and for pixel-wise labeling. The IoU metric evaluates the level of agreement between the prediction and the ground truth by taking the ratio between the intersection and the union of the two masks. A formal definition for a particular class C is as follows (TP for "True Positive"; FN for "False Negative"; FP for "False Positive"):

$$\mathrm{IoU}^{C} = \frac{TP^{C}}{TP^{C} + FN^{C} + FP^{C}} \qquad (4)$$
This metric is computed as the ratio between the number of overlapping pixels and the size of the union of the masks. It yields a value between 0 and 1, corresponding to no overlap and exact overlap between the masks, respectively.

In our variant of this metric, we only consider as correctly classified those pixels that fall in the region we denominate as "core" (total agreement on a particular damage), while only those pixels that fall into the "error" region (total agreement of no damage for a particular class) have a negative impact on the metric. This allows us to count as neither positive nor negative those pixels classified as a particular class C in the uncertain region (pixels with partial agreement among the labelers). As explained in Section 3, the use of multiple labels per image pixel allows us to expand the labels to consider uncertainty in the ground truth. The evaluation metric used in the experiments consists in evaluating the IoU on the test set considering the regions described in Section 3; based on the original definition of IoU, the metric is defined in Eq. (5). With $P^{C}$ denoting the set of pixels predicted as class $C$:

$$\mathrm{IoU}^{C} = \frac{|P^{C} \cap GT^{C}_{\text{core}}|}{|GT^{C}_{\text{core}} \cup (P^{C} \cap GT^{C}_{\text{error}})|} \qquad (5)$$
By using this metric we have an analog of the standard IoU in which we consider three different regions (core, error and uncertain) per class and, in the same manner as in regular IoU, the score is affected neither positively nor negatively when the classified pixels are located within the uncertain region.
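A NumPy sketch of this region-aware metric (our illustration of Eq. (5), not the evaluation code used in the experiments), taking a predicted mask for class C together with the core and error regions of Section 3:

```python
import numpy as np

def uncertainty_iou(pred, core, error):
    """Region-aware IoU of Eq. (5) for one class.

    pred: boolean mask of pixels predicted as class C.
    core, error: full-agreement regions from Eqs. (1) and (3).
    Pixels in the uncertain region affect neither numerator nor denominator.
    """
    tp = np.logical_and(pred, core).sum()   # hits inside the full-agreement region
    fn = np.logical_and(~pred, core).sum()  # missed core pixels
    fp = np.logical_and(pred, error).sum()  # predictions inside the error region
    denom = tp + fn + fp
    return tp / float(denom) if denom else 1.0  # empty class: define as perfect
```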
Fig. 7. IoU (Eq. (5)) on the test set per class for the best model: FCN8 with learnable upsampling weights, data augmentation, and trained on the intersection of the labels. The introduction of a skip connection is marked as a vertical line in blue. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 8. Cross-validation results for the validation fold at different stages for the damage classes. We can observe how there is less variance in the results when adding skip connections.
The results presented in Section 6 include the IoU per class C as well as an analysis of the detection performance based on damage blobs per class.

6. Results

Fig. 7 shows the results per class for the best model. We performed a 5-fold cross validation on our entire train set (a 70% split of the total), and the test metrics are evaluated on the test set (Fig. 8). In Fig. 10 we can observe that the IoU described in Section 3 is consistent across all folds.

The "non-damage" area poses no problem in terms of segmentation, and both the reference evaluation on one set and the uncertainty evaluation yield good results. "Delamination" can be detected from an early stage due to its larger on-average size compared to rebar exposure. On the other hand, "rebar exposure", due to its smaller size and tendency to be mistaken for areas with surface rust, takes more iterations to detect and benefits substantially from the addition of skip connections due to its well defined shape.

Regarding qualitative results, we can observe that for cases of rebar exposure contained in a delaminated area, the FCN shows very good performance. This kind of damage is the most severe case, which makes it a suitable case for aid by automated detection. Examples can be found in Fig. 9(a). "Rebar exposure" is the class most consistent in shape and texture. Generally it is a strong pattern that is easy to detect. On the other hand, one of the main problems for this particular class is the presence of false positives. From the texture perspective, any rust or certain types of dirt on the structure can easily be misclassified as rebar exposure. This can be seen in Fig. 9(b).

Figs. 10 and 11 illustrate that blob size has an impact on damage detection accuracy. To better understand the model performance, we divide the result set into five blob size groups. The size ranges are determined by the statistics of pixel sizes in Fig. 3. We use the presented IoU (see Eq. (5)) and recall to evaluate the model's overall performance and its performance without considering false positives, respectively.
Fig. 9. Examples of accurate result pictures (a) and of a picture with false positives (b).
Fig. 10. Damage recognition accuracy calculated per damage type and grouped by average blob size in a picture. The overall rebar exposure size is small.
Both methods will be analyzed for different damage sizes and damage classes. Note that we study "recall" because it is important for civil engineers, who are very concerned not to miss any damage.
6.1. Blob size based uncertainty IoU evaluation

As stated in Section 3, due to the high agreement on the non-damage area and the fact that it represents the majority class by a large margin, the reported results focus on the two types of damage. The segmentation performance based on blob size is presented in Fig. 10. We can observe a tendency towards both increased performance and reduced variance as the blob size increases, for both types of damage. This is due to the fact that, in most cases, a larger blob implies a closer picture, resulting in a more detailed representation of the type of damage.

6.2. Blob size based recall evaluation

First, we provide the recall for the whole image. Then, within each kind of damage, we calculate the detection accuracy
Fig. 11. The Recall results for different blob sizes.
according to its corresponding damage size. Meanwhile, the missed-detection rate is calculated to better illustrate damage detection performance.

Fig. 11 presents the pixel-wise recall results, overall and for each damage class. It can be observed that an increase in size also improves the performance from a detection point of view. Small blobs pose a problem for the model due to the uncertainty of such regions.

To provide a clearer picture of model performance, results regarding the number of blobs are shown in Table 3. A blob is considered to be detected if there is an intersection between the predicted blob and the corresponding blob in the ground truth. This metric is presented considering that, in general, in industry applications, detection can be more important than segmentation performance. However, the number of blobs is smaller than in the individual ground truths, since we take the intersection of the three labelers (as stated in Section 3). The considerable variation among the labelers' opinions on the damage area causes many pixels that could be considered damage to be ignored during training. As a result, the detection ability is restricted, as we lose some information.

Table 3
Blob detection missing rate. Delamination_miss and Rebar_miss represent the number of blobs missed, i.e., with no intersection with the prediction.

Damage size (px)     Delamination_miss   Delamination_total   Rebar_miss   Rebar_total
1 - 1×10^2           54                  67                   12           14
1×10^2 - 1×10^3      46                  150                  21           86
1×10^3 - 1×10^4      21                  323                  4            59
1×10^4 - 1×10^5      2                   157                  0            5
1×10^5 - 1×10^6      0                   10                   0            0
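For illustration, the blob-level missing rate can be computed with connected components; the sketch below (not the authors' code) labels ground-truth blobs with SciPy and counts those with no intersection with the prediction:

```python
import numpy as np
from scipy import ndimage

def blob_miss_rate(pred, gt):
    """Fraction of ground-truth blobs with no overlapping prediction.

    pred, gt: boolean masks for one damage class; a blob is a connected
    component of gt, and it counts as detected if any predicted pixel
    intersects it.
    """
    labeled, n_blobs = ndimage.label(gt)  # enumerate ground-truth blobs
    if n_blobs == 0:
        return 0.0
    missed = sum(1 for i in range(1, n_blobs + 1)
                 if not np.any(pred[labeled == i]))
    return missed / n_blobs
```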
However, in this way we can avoid situations in which non-damage (e.g., dirt) is considered damage. Finally, our model achieves a weighted F1 score of 81.9%.

6.3. Discussion of results

This project aims at evaluating the fit of an FCN for the task of segmentation of surface-based damages. From the quantitative results, we can observe the suitability of this architecture for delimiting and classifying delamination and rebar exposure damages. For this purpose, multiple experiments have been carried out with different settings in order to obtain the best baseline model, and both quantitative and qualitative analyses have been provided.

We can observe differences in the results for each type of damage. In general, even though the pattern of rebar exposure damage is more consistent in terms of shape and texture, the network performs worse on it due to false positives caused by regions in the image that show similar characteristics (rust, dirt, strong shadows, etc.). On the other hand, delamination, even though its region is more difficult to delimit, performs significantly better. This is due to the fact that most of the disagreement among the labelers is around the boundaries that delimit the damage, while the majority agrees on the central damage area.

Regarding the qualitative analysis, the segmentation results have been inspected and validated by trained civil engineers. The results in some cases present shortcomings, but in most cases they are comparable to the segmentations of on-site inspectors.
7. Conclusions
This work addresses the problem of automatically assessing damage on concrete bridges. Based on a database of bridges in Niigata Prefecture, we have developed a Fully Convolutional Network (FCN) for detecting "delamination" and "rebar exposure". The special feature of our dataset is that the ground truth shows variation among labelers, and hence requires a more suitable metric for performance evaluation. In this work, we contributed the design of such a metric, based on regions of agreement among multiple labelers. Such a dataset is valuable because it provides multiple labels per image in addition to the pixel-wise labeling of bridge images.

Specifically, we study the use of a VGG-based FCN. The work shows promising results for semantic segmentation in the field of damage recognition and serves as a baseline for benchmarking other approaches. Our results suggest that FCNs, and particularly a VGG-based network pretrained on ImageNet, are well suited for texture recognition in the scope of semantic segmentation of civil infrastructure surface damage. Our study shows a mean accuracy of 89.7% for delamination and 78.4% for rebar exposure. The weighted F1 score is 81.9%.

From the civil engineering perspective, the model can be used as an automated damage judgement system for standard bridge deck inspection. In Japan, by law, the severity of the damage of a bridge has to be assessed by a human. Therefore, our results can be used to support such inspection tasks.

As future work, we are considering the following directions. First, we will implement a tablet-based bridge damage judgement system that uses our classifier, and ask the bridge inspectors for feedback. Second, we plan to develop a more advanced model to increase the performance of damage recognition. For example, the current model will classify a non-deck red rebar as "rebar exposure" and cannot accurately detect small damages. We also need to address the issue of the unbalanced data sizes of the different damages. Third, the explicit treatment of uncertainty can be incorporated into the model. Finally, we plan to shrink the model complexity.
References

[1] The Ministry of Land, Infrastructure, Transport and Tourism in Japan, Annual Road Maintenance, 2018.
[2] Japan Society of Civil Engineers, Concrete Standard Specification [Maintenance], 2018.
[3] P.C. Chang, A. Flatau, S.C. Liu, Health monitoring of civil infrastructure, Struct. Health Monit. 2 (2003) 257–267.
[4] S. Yehia, O. Abudayyeh, S. Nabulsi, I. Abdelqader, Detection of common defects in concrete bridge decks using nondestructive evaluation techniques, J. Bridge Eng. 12 (2) (2007) 215–225.
[5] The Ministry of Land, Infrastructure, Transport and Tourism in Japan, Standard Bridge Maintenance Instruction, 2014.
[6] C. Colla, R. Lausch, Influence of source frequency on impact-echo data quality for testing concrete structures, NDT&E Int. 36 (2003) 203–213.
[7] Y.S. Cho, Non-destructive testing of high strength concrete using spectral analysis of surface waves, NDT&E Int. 36 (2003) 229–235.
[8] J. Hola, L. Sadowski, K. Schabowicz, Nondestructive identification of delaminations in concrete floor toppings with acoustic methods, Automat. Constr. 20 (2011) 799–807.
[9] C. Maierhofer, A. Brink, M. Rollig, H. Wiggenhauser, Detection of shallow voids in concrete structures with impulse thermography and radar, NDT&E Int. 36 (2003) 257–263.
[10] M. Clark, D. McCann, M. Forde, Application of infrared thermography to the non-destructive testing of concrete and masonry bridges, NDT&E Int. 36 (2003) 265–275.
[11] T. Nishikawa, J. Yoshida, T. Sugiyama, Y. Fujino, Concrete crack detection by multiple sequential image filtering, Computer-Aided Civ. Inf. Eng. 27 (1) (2012) 29–47.
[12] Y. Liu, S. Cho, B.F. Spencer Jr., J. Fan, Automated assessment of cracks on concrete surfaces using adaptive digital image processing, Smart Struct. Syst. 14 (4) (2014) 719–741.
[13] P. Prasanna, K.J. Dana, N. Gucunski, B.B. Basily, H.M. La, R.S. Lim, H. Parvardeh, Automated crack detection on concrete bridges, IEEE Trans. Autom. Sci. Eng. 13 (2) (2016) 591–599.
[14] G. Li, X. Zhao, K. Du, F. Ru, Y. Zhang, Recognition and evaluation of bridge cracks with modified active contour model and greedy search-based support vector machine, Automat. Constr. 78 (2017) 51–61.
[15] Y.-J. Cha, W. Choi, O. Büyüköztürk, Deep learning-based crack damage detection using convolutional neural networks, Computer-Aided Civ. Inf. Eng. 32 (5) (2017) 361–378.
[16] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inform. Process. Syst. (2012) 1097–1105.
[17] Y.-J. Cha, W. Choi, G. Suh, S. Mahmoudkhani, O. Büyüköztürk, Autonomous structural visual inspection using region-based deep learning for detecting multiple damage types, Computer-Aided Civ. Inf. Eng. 33 (9) (2018) 731–747.
[18] G. Beckman, D. Polyzois, Y. Cha, Deep learning-based automatic volumetric damage quantification using depth camera, Automat. Constr. 99 (2019) 114–124.
[19] MIT Computer Science and Artificial Intelligence Laboratory, LabelMe, http://labelme.csail.mit.edu/.
[20] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, Proceedings of the 27th International Conference on Machine Learning (2010).
[21] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, vol. 1, MIT Press, Cambridge, 2016.
[22] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 3431–3440.
[23] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
[24] Preferred Networks, Chainer, https://chainer.org/.
[25] K. Wada, Chainer implementation of Fully Convolutional Networks, https://github.com/wkentaro/fcn.
Acknowledgements

This work was partially supported by (i) the NII International Internship Program, (ii) a grant on "Research on improving predictability by blending deep learning and symbol processing" (Kakenhi no. 16H06562) provided to the Graduate School of Engineering at The University of Tokyo, and (iii) the Spanish project TIN2016-74946-P (MINECO/FEDER, UE) and the CERCA Programme / Generalitat de Catalunya. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used for this research. This work is partially supported by ICREA under the ICREA Academia programme. We would also like to thank Jonas Hepp for assisting the project.