Automation in Construction 112 (2020) 103124, https://doi.org/10.1016/j.autcon.2020.103124
Dense construction vehicle detection based on orientation-aware feature fusion convolutional neural network
Yapeng Guo a, Yang Xu b, Shunlong Li a,⁎
a School of Transportation Science and Engineering, Harbin Institute of Technology, Harbin 150090, China
b School of Civil Engineering, Harbin Institute of Technology, Harbin 150090, China
⁎ Corresponding author. E-mail address: [email protected] (S. Li).
ARTICLE INFO

Keywords: Object detection; Dense multiple construction vehicles; Feature fusion; Orientation-aware bounding box; Unmanned aerial vehicle; Deep learning; Computer vision

ABSTRACT
During the construction process, many construction vehicles gather in a small area within a short period, so the accurate identification of dense multiple vehicles is of great significance for ensuring the safety of construction sites. In this study, a novel end-to-end deep learning network, namely orientation-aware feature fusion single-stage detection (OAFF-SSD), is proposed for the precise detection of dense multiple construction vehicles using images from an Unmanned Aerial Vehicle (UAV). The proposed OAFF-SSD consists of three main modules: (1) multi-level feature extraction, (2) a novel feature fusion module, and (3) a new orientation-aware bounding box (OABB) proposal and regression module. Meanwhile, specific strategies are designed for the fast convergence of the training losses. The application of OAFF-SSD to vehicle detection at real construction sites and the comparison with the well-known SSD (a benchmark using traditional bounding boxes) and orientation-aware SSD (OA-SSD) demonstrate the efficiency and accuracy of the proposed method.
1. Introduction

Construction sites gather a large amount of construction equipment, vehicles and workers in a limited space for a short time, leading to potentially high safety risks. Even though the workforce of the construction industry accounts for only 5% of the total in the US, construction accounts for 18% of all occupational deaths, and the incidence rate of nonfatal injuries is 30% higher than the average across industries [1]. For construction managers, the safety of construction processes is the primary consideration. Compared to immovable construction structures and large equipment, construction vehicles and workers are the most active factors at construction sites and the largest source of safety risks. The potential danger of construction vehicles is much greater than that of workers, so managing the status of construction vehicles has become a key issue of safety monitoring at construction sites [1–3].

With the increase of engineering complexity and the development of the construction machinery industry, the difficulty of detecting construction vehicles also increases, attracting the attention of many researchers. Automatic methods for construction vehicle detection have been developed based on widely used sensing techniques, such as RFID (Radio Frequency Identification), GPS (Global Positioning System), UWB (Ultra-Wideband), and BLE (Bluetooth Low Energy) [1,4–7]. These traditional methods have solved the problem of construction vehicle detection to a certain extent, but they all rely on specific, complex and expensive equipment. Moreover, they regard construction vehicles as points; vehicle types and sizes need to be input manually rather than fully automatically. Meanwhile, because the sensors are only installed in the vehicles, these methods cannot acquire information about the surroundings, leading to inefficiencies in risk analysis and incident handling.

Cameras have become the main sensors at construction sites with the development of optical technology in recent years. A camera has dense sensing characteristics: each pixel acts as a point sensor, so a large amount of target and environmental information can be captured at low cost [8]. However, current processing of images (or videos) is still based on manual work, i.e., performing specific tasks through analysis by experienced engineering managers. There are two inevitable shortcomings of this approach: fatigue and lack of objectivity. Computer vision (CV) techniques have been widely employed in civil engineering [9]. To help computers understand construction sites, many CV-based approaches for construction vehicle detection have been developed [10–13]. However, these methods are based on fixed-position cameras with narrow fields of view and limited access to information, which cannot support construction managers in making decisions. In addition, these conventional detection approaches can only roughly identify the sizes and locations of construction vehicles; they cannot detect arbitrarily rotated vehicles because of the difficulty of locating multi-angle objects and separating them effectively from the background.
To address these challenges, this study proposes a novel end-to-end deep learning-based approach that achieves higher detection accuracy than state-of-the-art methods. This article introduces the idea of rotated object detection into the automated construction field, enabling high-precision detection of dense construction vehicles. The proposed network architecture consists of a novel feature fusion module and an orientation-aware bounding box (OABB) proposal module, together with specifically designed training strategies.

The remainder of this paper is arranged as follows. Section 2 reviews recent progress in vision-based vehicle detection and orientation-aware object detection. Section 3 illustrates the overall construction vehicle detection architecture and training strategies. Section 4 describes the implementation details. Section 5 presents training and testing results and discusses some key hyperparameters. Section 6 concludes this article.
2. Related works

In recent years, many CV-based approaches have been developed in civil engineering. Yang et al. [14,15] proposed vision-based methods to extract vibration information from laboratory experiments. Kuddus et al. [16] presented a target-free vision-based technique for structures subjected to out-of-plane movements. Other research works mainly focus on civil object detection. Yeum et al. [17,18] developed vision-based methods for crack detection and for other objects of interest in bridge assessment. Kong and Li [19] used video feature tracking to detect structural fatigue cracks. Huang et al. [20] developed an approach for micro-seismic event detection and location in underground mines using CV and deep learning. These studies represent the latest research ideas and results of vision-based methods in civil engineering, and have greatly stimulated the application of object detection techniques in this field.

Not only in civil engineering but also in many other fields, vision-based object detection has experienced a research boom, especially for vehicle detection, and many methods have been proposed that produce good results under certain conditions. There are two main research branches for vehicle detection: on-road common vehicle detection [21] and construction vehicle detection at construction sites. For on-road common vehicle detection, Li et al. used a multiscale and-or graph model (i.e., a graph model containing three types of nodes: AND, OR, and terminal nodes) to detect vehicles of multiple sizes based on time-varying vehicle features, handling situations such as partial vehicle occlusion and various vehicle shapes [22]. Combining CV technology and information from the weigh-in-motion system of long-span bridges, Chen et al. proposed an approach to identify the spatiotemporal distribution of vehicle loads [23]. Kuang et al. presented a method to detect vehicles at night, using vehicle-light-based region-of-interest extraction and an object proposal approach with a nighttime image enhancement technique. To detect vehicles in video surveillance automatically, Noh et al. proposed an adaptive sliding-window approach, in which useful size templates are generated for a given scene and the sliding window is adaptively deformed using the obtained templates [24].

For construction vehicle detection, Azar and McCabe presented a method to identify off-highway dump trucks in videos and evaluated existing object recognition and background subtraction algorithms [25]. To support decision making for managers, Golparvar-Fard et al. developed an algorithm to recognize single actions of earthmoving equipment at construction sites; they also proposed a method to detect construction workers and equipment from site videos [10,12]. To build a benchmark for construction vehicle detection, Tajeen and Zhu created a dataset of construction site images to measure the detection performance of current object detection methods [3]. Ji et al. proposed an algorithm to detect hydraulic excavators and dump trucks in videos [26]. Although much work has been done in the field of automated construction site monitoring, it is still difficult to obtain a comprehensive understanding of a whole image of a construction site. Kim et al. developed a data-driven scene parsing method for better recognition of various objects in a whole image of a construction site [27]. Pose identification of construction equipment can help to estimate the time consumed by the operators; Soltani et al. proposed a 2D skeleton extractor for excavators using site videos [28]. Zhu et al. presented a novel framework for the detection of construction workforce and equipment with a visual tracking technique, which improved the recall rate while maintaining the precision rate [13].

The development of deep learning has brought CV to a whole new level [29]. Construction vehicle detection approaches based on deep learning have given promising results. Fang et al. developed a deep learning based approach called IFaster Region-based Convolutional Neural Network (CNN) to automatically detect objects at construction sites in real time [30]. Xiang et al. proposed an intelligent surveillance algorithm for detecting invading engineering vehicles based on a modified Faster R-CNN [31]. However, these methods mainly have two limitations: first, they are almost all based on in-car or fixed-position cameras, leading to inevitable problems such as difficulty of installation, serious occlusion and inefficient inspection of extensive areas; second, they have an inherent shortcoming in that they cannot locate rotated objects with a high overlapping ratio with respect to the expected regions.

The first issue has been gradually addressed by the emergence of aerial photography, which is more and more widely applied to vehicle detection with the development of unmanned aerial vehicles (UAVs) [32]. Chen et al. presented a segmentation method for high-resolution aerial images to control the segmentation effects [33]. Yoon et al. [34,35] developed methods based on unmanned aerial systems for structural displacement measurement and modal analysis. Razakarivony and Jurie introduced a benchmark for automatic target recognition algorithms based on aerial images in unconstrained environments [36]. Wang et al. developed a novel system for vehicle detection and tracking using image sequences from a UAV [37]. A hybrid scheme for vehicle detection was proposed by Xu et al. using low-altitude UAV images, combining the Viola-Jones method and a linear support vector machine classifier with histogram of oriented gradients features [38]. Audebert et al. presented a segment-before-detect method based on deep learning to segment and subsequently detect and classify several kinds of wheeled vehicles in high-resolution remote sensing images [39]. Cao et al. developed a vehicle detector for highway satellite images, which was transfer learned from aerial image datasets to a satellite dataset [40].

The second issue is addressed by researchers in the rotated object detection field, who aim to detect objects of arbitrary orientation in images. Jiang et al. developed an algorithm called Rotational Region CNN, whose modified region proposal part generates axis-aligned bounding boxes that enclose texts with different angles [41]. For situations where objects are arbitrarily oriented, a new detection method was proposed that applies a newly defined rotatable bounding box [42]. Strip-like rotated object detection, such as ship detection in high-resolution satellite images, hardly received satisfactory results with current state-of-the-art object detection algorithms. To address that problem, Liu et al. introduced a rotated region-based CNN into the ship detection field, with accurate extraction of rotated region features and precise localization of rotated objects [43]. Li et al. also proposed a multiscale rotated bounding box based detector using deep learning to detect target ships in complex backgrounds and acquire the orientations and locations of target ships [44]. Similarly, vehicles are also found in arbitrary directions in aerial images; Zhou et al. employed image local orientation, which provides a proper search direction for each pixel, to detect them without deep learning [45]. These methods are mainly based on state-of-the-art object detection frameworks: the two-stage Faster R-CNN [46] and the one-stage SSD [47]. However, for precise detection of dense objects, these methods have not considered multi-level features sufficiently, leading to low-accuracy detection of dense multiple vehicles.
This article aims to build an end-to-end deep learning framework that automates the extraction of the parameters of the orientation-aware prior boxes, so that the network is highly portable across different datasets. A novel end-to-end deep learning network (OAFF-SSD) is proposed for the precise detection of dense multiple construction vehicles based on UAV images, performing the specific detection task automatically. A feature fusion module and an orientation-aware bounding box (OABB) proposal module are proposed to identify dense multiple objects more effectively.
3. Overall construction vehicle detection network architecture

Most current object detection methods based on deep learning and CNNs are modified from Faster R-CNN [46] or SSD [47]. These methods employ the traditional bounding box (BB) to represent the locations of targets, as shown in Fig. 2. A BB cannot represent the orientation and accurate size of a rotated target, and for dense object detection it has difficulty separating targets effectively. Thus, the orientation-aware bounding box (OABB) was developed to cover the shortcomings of the BB [42]. The OABB extracts the orientation of a target and at the same time segments the target from the background efficiently; it describes dense targets clearly without overlapping, as shown in Fig. 2.

Fig. 2. Difference between BB (first row) and OABB (second row).

SSD has reached a good balance between detection precision and speed. A state-of-the-art study (OA-SSD) [42] integrated the OABB into the SSD framework in order to achieve good detection of rotated targets. The results showed that the method could achieve the expected purpose to a certain extent, but limitations remained for the detection of dense multiple objects. Through analysis, the main problem of OA-SSD is that SSD is not sensitive to small and dense objects because it does not exploit multi-scale features efficiently. To address this issue, FSSD [48] was proposed to improve the ability for small and dense object detection. Therefore, the network proposed in this study, called OAFF-SSD, derives from SSD and FSSD and applies the OABB to a feature fusion single-stage detection model. OAFF-SSD consists of three main parts: a feature extraction module, a feature fusion module, and an OABB proposal and regression module. The network takes a 300 × 300-pixel image as input. Non-Maximum Suppression (NMS) is added at the end of the network to suppress extra OABBs for a single target object. The architecture of the network is shown in Fig. 1.

Fig. 1. The architecture of OAFF-SSD.

3.1. Feature extraction module

Convolutional neural networks use convolution operations for feature extraction. Like SSD, the feature extraction module in OAFF-SSD is based on VGG-16 [49]. Although VGG-16 was proposed several years ago, it is still one of the most popular base networks. The original VGG-16 consists of five convolutional parts and three fully connected layers; the convolutional kernels are all 3 × 3. The first convolutional part consists of two convolutional layers with 64 kernels and a max-pooling layer (mp_1). The second convolutional part consists of two convolutional layers with 128 kernels and a max-pooling layer (mp_2). The third, fourth and fifth convolutional parts consist of three convolutional layers with 256, 512 and 512 kernels respectively, each followed by a max-pooling layer (mp_3, mp_4 and mp_5). The three fully connected layers (fc_6, fc_7, and fc_8) function as the classifier and require huge memory. In this study, in order to make better use of the pre-training model [50] and to add subsequent functional layers, the kernel size and stride of mp_5 are changed from 2 × 2 with stride 2 to 3 × 3 with stride 1. Fc_6 and fc_7 are transformed from fully connected layers into convolutional layers, with 3 × 3 and 1 × 1 kernels respectively. Two extra convolutional parts are added to extract higher-level features (conv_6 and conv_7); each extra part uses 1 × 1 and 3 × 3 kernels for its first and second layers (conv6_1, conv6_2, conv7_1 and conv7_2). The stride of layer conv6_2 is set to 1 to ensure that the feature map size of conv7_2 is 10 × 10.
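To make the layer bookkeeping concrete, the following PyTorch sketch (our illustration, not the authors' released code) assembles the modified VGG-16 backbone described above; the channel widths of the extra parts, the dilated fc_6 convolution and the exact paddings are assumptions chosen so that a 300 × 300 input yields the 38 × 38 (conv4_3), 19 × 19 (fc_7) and 10 × 10 (conv7_2) maps used later for fusion.

```python
# Minimal sketch of the modified VGG-16 feature extractor (Section 3.1).
# Assumed details: extra-part channel widths, fc_6 dilation, paddings.
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return layers

class OAFFBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(*vgg_block(3, 64, 2), nn.MaxPool2d(2, 2))       # mp_1 -> 150
        self.conv2 = nn.Sequential(*vgg_block(64, 128, 2), nn.MaxPool2d(2, 2))     # mp_2 -> 75
        self.conv3 = nn.Sequential(*vgg_block(128, 256, 3),
                                   nn.MaxPool2d(2, 2, ceil_mode=True))             # mp_3 -> 38
        self.conv4 = nn.Sequential(*vgg_block(256, 512, 3))                        # conv4_3: 38x38
        self.conv5 = nn.Sequential(nn.MaxPool2d(2, 2), *vgg_block(512, 512, 3),    # mp_4 -> 19
                                   nn.MaxPool2d(3, 1, padding=1))                  # mp_5: 3x3, stride 1
        self.fc6 = nn.Sequential(nn.Conv2d(512, 1024, 3, padding=6, dilation=6), nn.ReLU(inplace=True))
        self.fc7 = nn.Sequential(nn.Conv2d(1024, 1024, 1), nn.ReLU(inplace=True))  # fc_7: 19x19
        self.conv6 = nn.Sequential(nn.Conv2d(1024, 256, 1), nn.ReLU(inplace=True),
                                   nn.Conv2d(256, 512, 3, stride=1, padding=1),    # conv6_2: stride 1
                                   nn.ReLU(inplace=True))
        self.conv7 = nn.Sequential(nn.Conv2d(512, 128, 1), nn.ReLU(inplace=True),
                                   nn.Conv2d(128, 256, 3, stride=2, padding=1),    # conv7_2: 10x10
                                   nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.conv3(self.conv2(self.conv1(x)))
        conv4_3 = self.conv4(x)                          # 38x38 fusion base
        fc7 = self.fc7(self.fc6(self.conv5(conv4_3)))    # 19x19 fusion base
        conv7_2 = self.conv7(self.conv6(fc7))            # 10x10 fusion base
        return conv4_3, fc7, conv7_2

# print([t.shape[-1] for t in OAFFBackbone()(torch.zeros(1, 3, 300, 300))])  # -> [38, 19, 10]
```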
3.2. Feature fusion module

As noted in the study of multi-level feature fusion in FSSD [48], treating features of different levels as equivalent and directly producing predictions from them causes the network to lose the ability to combine global high-level semantics with local details. In this study, the feature fusion approach proposed in FSSD [48] is therefore adopted for more accurate object detection. Conv4_3 (size 38 × 38), fc_7 (size 19 × 19) and conv7_2 (size 10 × 10) are employed as fusion base layers; the contribution of feature maps smaller than 10 × 10 to the fusion effect is negligible, so only layers of size 10 × 10 or larger are selected. The FF_1 layer is generated from conv4_3 with 256 convolutional kernels (1 × 1). The FF_2 and FF_3 layers are generated from fc_7 and conv7_2, respectively, each with 256 convolutional kernels (1 × 1). Because FF_2 and FF_3 must match the 38 × 38 size of FF_1 while their base layers (19 × 19 and 10 × 10) are smaller, bilinear interpolation is utilized to bring them to a uniform size. FF_1, FF_2 and FF_3 are then concatenated to fuse features from different layers, and the batch normalization (BN) trick is applied for faster training and better prediction.
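A compact sketch of this fusion step is given below (our illustration under the sizes stated above; the input channel counts follow the backbone sketch and are assumptions).

```python
# Sketch of the feature fusion module: 1x1 projections, bilinear upsampling to 38x38,
# concatenation and batch normalization. Not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    def __init__(self, ch_conv4_3=512, ch_fc7=1024, ch_conv7_2=256, proj_ch=256):
        super().__init__()
        # FF_1 / FF_2 / FF_3: 1x1 projections of the three fusion base layers.
        self.ff1 = nn.Conv2d(ch_conv4_3, proj_ch, 1)
        self.ff2 = nn.Conv2d(ch_fc7, proj_ch, 1)
        self.ff3 = nn.Conv2d(ch_conv7_2, proj_ch, 1)
        self.bn = nn.BatchNorm2d(3 * proj_ch)  # BN over the concatenated 768-channel map

    def forward(self, conv4_3, fc7, conv7_2):
        size = conv4_3.shape[-2:]  # 38x38 for a 300x300 input
        ff1 = self.ff1(conv4_3)
        # fc_7 (19x19) and conv7_2 (10x10) are bilinearly upsampled to 38x38.
        ff2 = F.interpolate(self.ff2(fc7), size=size, mode="bilinear", align_corners=False)
        ff3 = F.interpolate(self.ff3(conv7_2), size=size, mode="bilinear", align_corners=False)
        return self.bn(torch.cat([ff1, ff2, ff3], dim=1))  # fused 768 x 38 x 38 feature map

# fused = FeatureFusion()(conv4_3, fc7, conv7_2)  # with the backbone outputs sketched earlier
```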
3.3. OABB proposal and regression module

In order to detect objects of multiple sizes, this module uses a feature pyramid generated from the fused feature layer for OABB proposal and regression. Six feature maps of different levels (MLF layers 1–6) are generated by 3 × 3 convolutional kernels, with sizes of 38 × 38 × 512, 19 × 19 × 512, 10 × 10 × 256, 5 × 5 × 256, 3 × 3 × 256 and 1 × 1 × 256, respectively. To deploy the OABB, a new deep learning based model is employed that modifies not only the feature representation part but also the subsequent regression step: the four-parameter BB regression (center point (cx, cy), width W and height H) is replaced by a five-parameter OABB regression, which adds the angle as a fifth parameter to perform rotation detection, as shown in Fig. 3.

Fig. 3. Five parameters of an OABB.

The OABB proposal module proposes prior OABBs. Like the anchor box in Faster R-CNN [46] and the prior box in SSD [47], prior OABBs are used for location regression. Since five parameters define an OABB, this module includes a center point proposal and an a_r and a_g proposal.

3.3.1. Center point (cx, cy) proposal

The center point represents the location of an object. Coordinates of center points are sampled from the MLF layers, so the number of OABBs is determined by the sizes of the MLF layers. For an m × n MLF layer, cx = (i + 0.5) × 300/m and cy = (j + 0.5) × 300/n, with i = 0, …, m − 1 and j = 0, …, n − 1. In this study, the sizes of the MLF layers are 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1.

3.3.2. a_r and a_g proposal

a_r (i.e., W/H) represents the shape of an object and a_g its rotation angle. Width, height and angle could be defined according to a uniform distribution; to simplify the characterization, a_r is used to replace W and H. In SSD [47], a_r = [1/3, 1/2, 1, 2, 3]. By analyzing the size characteristics of construction vehicles, a_r is set to [1.2, 2.4, 3.6] and a_g is set to [0°, 45°, 90°, 135°]. Fig. 4 shows examples of the proposed OABBs.

Fig. 4. a_r and a_g proposal method.

To detect objects of different sizes, multiscale widths and heights are defined based on a_r as in Eq. (1). Six different scales corresponding to the above six MLF layers are used; min_size is the amplification factor of each feature map and is set to [30, 60, 111, 162, 213, 264].

W = \text{min\_size} \times \sqrt{a_r}, \quad H = \text{min\_size} / \sqrt{a_r}    (1)
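The following sketch (our illustration; the function and variable names are ours) enumerates the prior OABBs exactly as described above: one center per MLF cell, combined with every (a_r, a_g) pair and the per-level min_size of Eq. (1).

```python
# Sketch of prior OABB generation for a 300x300 input (Section 3.3).
# Each prior is a tuple (cx, cy, W, H, angle_deg). Not the authors' released code.
import math

MLF_SIZES = [38, 19, 10, 5, 3, 1]
MIN_SIZES = [30, 60, 111, 162, 213, 264]   # amplification factor per MLF layer
ASPECT_RATIOS = [1.2, 2.4, 3.6]            # a_r = W / H
ANGLES = [0.0, 45.0, 90.0, 135.0]          # a_g in degrees

def propose_prior_oabbs(img_size=300):
    priors = []
    for fmap, min_size in zip(MLF_SIZES, MIN_SIZES):
        for j in range(fmap):
            for i in range(fmap):
                cx = (i + 0.5) * img_size / fmap
                cy = (j + 0.5) * img_size / fmap
                for a_r in ASPECT_RATIOS:
                    w = min_size * math.sqrt(a_r)   # Eq. (1)
                    h = min_size / math.sqrt(a_r)
                    for a_g in ANGLES:
                        priors.append((cx, cy, w, h, a_g))
    return priors

priors = propose_prior_oabbs()
# 12 priors per cell: (38^2 + 19^2 + 10^2 + 5^2 + 3^2 + 1^2) * 3 * 4 = 23,280
print(len(priors))
```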
Regression in this module includes localization regression and classification. For localization regression, six convolutional layers are added behind the six MLF layers with p_o output channels, where p_o = k × 5 (k is the product of the number of a_r values and the number of a_g values, and 5 is the number of OABB parameters: center point (cx, cy), width W, height H and angle). For classification, six more convolutional layers are added behind the six MLF layers with q_o output channels, where q_o = k × c (c is the number of object classes). The detailed architecture of the regression module is illustrated in Fig. 5.

Fig. 5. Detailed architecture of the regression module.
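A minimal sketch of these two sets of heads is given below (our illustration; k = 3 × 4 = 12 for the a_r and a_g sets above, the MLF channel widths follow Section 3.3, and the class count is an assumption).

```python
# Sketch of the localization and classification heads attached to the six MLF layers.
# Not the authors' released code.
import torch
import torch.nn as nn

MLF_CHANNELS = [512, 512, 256, 256, 256, 256]  # 38, 19, 10, 5, 3, 1 resolutions
K = 3 * 4        # priors per cell: |a_r| * |a_g|
NUM_CLASSES = 2  # "construction vehicle" + background (assumption)

class OABBHeads(nn.Module):
    def __init__(self):
        super().__init__()
        # p_o = K * 5 outputs per cell: (cx, cy, W, H, angle) offsets for each prior
        self.loc = nn.ModuleList([nn.Conv2d(ch, K * 5, 3, padding=1) for ch in MLF_CHANNELS])
        # q_o = K * c outputs per cell: class scores for each prior
        self.cls = nn.ModuleList([nn.Conv2d(ch, K * NUM_CLASSES, 3, padding=1) for ch in MLF_CHANNELS])

    def forward(self, mlf_layers):
        locs, confs = [], []
        for feat, loc_head, cls_head in zip(mlf_layers, self.loc, self.cls):
            # flatten each map to (batch, cells*K, 5) and (batch, cells*K, c)
            locs.append(loc_head(feat).permute(0, 2, 3, 1).reshape(feat.size(0), -1, 5))
            confs.append(cls_head(feat).permute(0, 2, 3, 1).reshape(feat.size(0), -1, NUM_CLASSES))
        return torch.cat(locs, dim=1), torch.cat(confs, dim=1)
```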
3.4. Training strategy
During training, the number of prior OABBs is much larger than the number of ground truths. In this study, a matching strategy is employed to determine how to pair priors and ground truths. Intersection over Union (IOU) is an evaluation metric widely used to measure the accuracy of an object detector on a dataset; it is the ratio of the intersection of the ground truth BB and the predicted BB to the union of those two BBs. In order to use the same evaluation criteria, an orientation-aware IOU (OA-IOU) is employed to measure the accuracy of the proposed detector; it is the ratio of the intersection of the ground truth OABB and the predicted OABB to the union of those two OABBs. First, each ground truth OABB is matched to the prior OABB with the best OA-IOU score. The OA-IOU is modified with the fifth parameter, the angle, to force angle regression, as shown in Eq. (2), where OABB_a is an OABB with parameters (cx_a, cy_a, w_a, h_a, ag_a) and OABB_b' = (cx_b, cy_b, w_b, h_b, ag_a) is OABB_b with its angle replaced by ag_a. Then prior OABBs are matched to any ground truth whose OA-IOU score is higher than a threshold (β). This simplifies the training problem and allows the network to predict high scores for multiple overlapping prior OABBs instead of requiring it to select only the one with the largest overlap [42].

OA\text{-}IOU = \begin{cases} \dfrac{|OABB_a \cap OABB_b'|}{|OABB_a \cup OABB_b'|} \cos(ag_a - ag_b), & \text{when } |ag_a - ag_b| < \theta \\ 0, & \text{otherwise} \end{cases}    (2)

The training loss of the proposed network is modified from SSD [47] and is shown in Eq. (3), in which x indicates the matching of the proposed OABBs, c is the ground truth categories, N is the number of matched prior OABBs, and α is a weight coefficient (usually 1.0). There are three kinds of OABBs: ground truth, prior and predicted. At the training stage, as shown in Fig. 6, g is the transforming relation between a ground truth OABB and a prior OABB, while l is the transforming relation between a predicted OABB and a prior OABB.

Fig. 6. The relationship among three kinds of OABBs.

L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha L_{OABB}(x, l, g) \right)    (3)

There are two parts in the loss: one is for object classification, called the confidence loss; the other is for localization. The confidence loss is a Softmax loss, given in Eq. (4), in which x_{ij}^p = 1 means that the i-th prior OABB is matched to the j-th ground truth of class p, and x_{ij}^p = 0 otherwise.

L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^p \log(\hat{c}_i^p) - \sum_{i \in Neg} \log(\hat{c}_i^0), \quad \text{where } \hat{c}_i^p = \frac{\exp(c_i^p)}{\sum_p \exp(c_i^p)}    (4)

The localization loss is a Smooth L1 loss between the predicted and ground truth OABBs, given in Eq. (5).

L_{OABB}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{k \in \{cx, cy, w, h, ag\}} x_{ij} \, \text{smooth}_{L1}(l_i^k - \hat{g}_j^k)    (5)

The transforming relation is encoded as Eq. (6) for efficient convergence and effective computation; a tangent function is used for the rotation angle regression.

\hat{t}^{cx} = (t^{cx} - p^{cx}) / p^{w}, \quad \hat{t}^{cy} = (t^{cy} - p^{cy}) / p^{h}, \quad \hat{t}^{w} = \log(t^{w} / p^{w}), \quad \hat{t}^{h} = \log(t^{h} / p^{h}), \quad \hat{t}^{ag} = \tan(t^{ag} - p^{ag})    (6)
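For concreteness, a small sketch of the OA-IOU matching criterion of Eq. (2) and the offset encoding of Eq. (6) is given below. It is our illustration under the definitions above: the shapely-based rotated-rectangle overlap and the degree-based angle convention are assumptions, not the authors' implementation.

```python
# Sketch of OA-IOU (Eq. 2) and the encoding of Eq. (6). Not the authors' released code.
import math
from shapely.geometry import Polygon

def oabb_polygon(cx, cy, w, h, angle_deg):
    """Corner polygon of an OABB (cx, cy, w, h, angle in degrees)."""
    a = math.radians(angle_deg)
    dx, dy = math.cos(a), math.sin(a)
    corners = []
    for sx, sy in [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)]:
        x = cx + sx * w * dx - sy * h * dy
        y = cy + sx * w * dy + sy * h * dx
        corners.append((x, y))
    return Polygon(corners)

def oa_iou(box_a, box_b, theta_deg=30.0):
    """Eq. (2): area ratio of box_a and box_b' (box_b with box_a's angle),
    modulated by cos(ag_a - ag_b); zero when the angle difference exceeds theta."""
    ag_a, ag_b = box_a[4], box_b[4]
    d_ang = ag_a - ag_b
    if abs(d_ang) >= theta_deg:
        return 0.0
    pa = oabb_polygon(*box_a)
    pb = oabb_polygon(box_b[0], box_b[1], box_b[2], box_b[3], ag_a)  # OABB_b'
    inter = pa.intersection(pb).area
    union = pa.union(pb).area
    return (inter / union) * math.cos(math.radians(d_ang)) if union > 0 else 0.0

def encode(gt, prior):
    """Eq. (6): encode a matched ground truth OABB with respect to a prior OABB."""
    return (
        (gt[0] - prior[0]) / prior[2],             # (t_cx - p_cx) / p_w
        (gt[1] - prior[1]) / prior[3],             # (t_cy - p_cy) / p_h
        math.log(gt[2] / prior[2]),                # log(t_w / p_w)
        math.log(gt[3] / prior[3]),                # log(t_h / p_h)
        math.tan(math.radians(gt[4] - prior[4])),  # tan(t_ag - p_ag)
    )

# A prior OABB is treated as positive when oa_iou(gt, prior) exceeds beta (0.3 in this paper).
```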
4. Implementation details

4.1. Dataset

The images used for creating the dataset are acquired by a consumer-grade drone (DJI Phantom 4 Pro) with an onboard camera (4864 × 3648 pixels). The drone takes images at a height of 40–60 m, with the camera lens perpendicular to the ground. The captured scenes cover as many construction vehicles as possible, with various background textures, lighting conditions and scales. In order to meet the needs of network training, the areas containing construction vehicles are cropped from the original images and resized to 300 × 300 pixels. The dataset in this study consists of 240 images and the corresponding annotations, which provide the location and class of each construction vehicle. 90% of the dataset is used for training and the remaining 10% for testing. Representative testing images are shown in Fig. 7. The construction vehicles in the dataset cover most common vehicle types, such as carrier vehicles, earthmovers, yard cranes and cement tankers; they are all labelled as a single category, construction vehicles, when building this dataset.

Fig. 7. Representative testing images.
4.2. Parameter setting

In CNNs, parameters that need to be set manually are called hyperparameters. There is no specific method for setting the hyperparameters of a CNN; most of them are adjusted based on network training experience. In this study, the hyperparameters include the learning rate, the optimizer parameters, the harmonic parameter in BN, the batch size, the threshold of OA-IOU (β) and the threshold of angle control (θ).

The learning rate is an important hyperparameter that controls the speed of the model update after loss back-propagation. Choosing a good learning rate is challenging: if the learning rate is too small, training will be very slow and time consuming and is likely to become stuck at a local minimum; on the contrary, if the learning rate is too large, the training process will be unstable and it is hard to find the global minimum. The learning rate setting is closely related to the choice of optimizer. Researchers often apply the stochastic gradient descent (SGD) optimizer or the Adam optimizer for training; however, in this study, experiments showed that the RMSprop optimizer [51] gives faster convergence. The learning rate of the optimizer is set to 5 × 10−5, alpha to 0.9, epsilon to 1 × 10−8, momentum to 0.9 and weight decay to 5 × 10−4. This configuration of the RMSprop optimizer produces a satisfactory training process and results. Batch normalization [52] is regarded as a technique for improving the speed, performance and stability of CNNs; the harmonic parameter in BN is set to 1 × 10−5, referring to the relevant study. The batch size mainly depends on the hardware environment, especially the GPUs: larger GPU memory allows a larger batch size and faster training. In this study, the batch size is set to 24.

The threshold of OA-IOU (β) determines the positive samples among the prior OABBs and is the most important hyperparameter of the whole training process. From Eq. (2), if β is too large, the number of qualified prior OABBs decreases sharply, leading to fewer positive samples during training; on the contrary, if β is too small, many prior OABBs that differ too much from the ground truth OABBs become positive samples, resulting in poor predictions of the network. After extensive experiments, β in this study is recommended to be set to 0.3. The threshold of angle control (θ) is part of the β criterion: θ is used to filter out prior OABBs whose rotation angles differ too much from the ground truth OABBs, which improves the stability and convergence speed of the network training. In this study, θ is set to 30°.
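A minimal PyTorch sketch of this optimizer configuration is shown below (our illustration; `model` is a placeholder for the OAFF-SSD network, not an identifier from the paper).

```python
# Sketch of the training configuration described in Section 4.2 (our illustration).
import torch

optimizer = torch.optim.RMSprop(
    model.parameters(),   # "model" is a placeholder for the OAFF-SSD network
    lr=5e-5,              # learning rate
    alpha=0.9,            # smoothing constant
    eps=1e-8,             # epsilon
    momentum=0.9,
    weight_decay=5e-4,
)
# The BN harmonic parameter of 1e-5 and a batch size of 24 would appear elsewhere, e.g.:
# nn.BatchNorm2d(num_features, eps=1e-5); DataLoader(dataset, batch_size=24, shuffle=True)
```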
5. Results and discussions

In this study, the created construction vehicle dataset is employed to train the proposed OAFF-SSD. Meanwhile, to illustrate the effectiveness of the proposed method, SSD [47] and OA-SSD are also trained for comparison. Referring to RBox [42], OA-SSD is another network modified from SSD that applies OABBs for object detection but has no feature fusion module.
Fig. 11. Precision-Recall curves of OAFF-SSD when (a) β = 0.2, (b) β = 0.3, (c) β = 0.4, (d) β = 0.5.
5.1. Training and testing results

The training loss is an important indicator for monitoring the training of deep CNNs. Fig. 8 shows the training loss curves of the proposed OAFF-SSD. The red curve represents the total loss of the network; as described in Section 3.4, the total loss consists of two parts, the confidence loss for classification and the regression loss for localization. In Fig. 8, the green curve is the classification loss and the blue curve is the localization loss. After 3000 iterations, all the loss curves decrease steadily, first quickly and then slowly, so the convergence and effectiveness of the training process are satisfactory. Although the specific value of the loss curve does not represent the degree of network fit, the relative change of the values indicates that the network already fits the training set well.

Fig. 8. Training loss of OAFF-SSD.

Images from the testing part of the created dataset are detected by SSD, OA-SSD and OAFF-SSD, respectively. As shown in Fig. 9, the complexity (interference) of the background gradually increases from the top row to the bottom, that is, the difficulty of object detection increases sequentially. Green boxes are the results from SSD, which are traditional bounding boxes with only four parameters. Although green boxes can locate the targets to some extent, the detected areas of rotated objects are too large to separate the objects accurately; for dense multiple objects, different objects cannot be distinguished effectively from the green boxes, resulting in detection failure. Yellow boxes represent the results detected by OA-SSD, which are OABBs with five parameters. Compared with the green boxes, the yellow boxes solve the problem of inefficient separation caused by traditional bounding boxes and object rotation, and they also separate dense multiple objects more effectively. From the results, the three approaches are robust to the change of background complexity, and there is no obvious difference in detection effect for different backgrounds. However, for the detection of dense small objects (i.e., the last image in Fig. 9), OA-SSD has a natural limitation because it uses few low-level features, which contribute a lot to the detection of small objects. Red boxes are the results of the OAFF-SSD method.
As seen in Fig. 9, whether for single or multiple, sparse or dense, large or small objects, OAFF-SSD performs better than the two approaches above and meets the requirement of precise detection of dense multiple objects.

Fig. 9. Testing results detected by SSD, OA-SSD and OAFF-SSD.
5.2. Evaluation metrics

In Fig. 9, the detection ability of the different models has been judged by human intuition. In order to illustrate the differences between the detection models more clearly, a quantitative evaluation method needs to be introduced [53]. Some basic concepts are commonly used in object detection evaluation. Intersection over Union (IOU) is measured based on the Jaccard index and, in this study, evaluates the overlap between two OABBs. A True Positive (TP) is a correct OABB detection with IOU ≥ threshold, a False Positive (FP) is a wrong OABB detection with IOU < threshold, and a False Negative (FN) is a ground truth OABB that is not detected. Precision is the ability of a model to identify only the relevant objects; it is the percentage of correct positive predictions and is given by Eq. (7):

Precision = \frac{TP}{TP + FP} = \frac{TP}{\text{all detected OABBs}}    (7)

Recall is the ability of a model to find all the relevant cases (all ground truth OABBs); it is the percentage of true positives detected among all relevant ground truths and is given by Eq. (8):

Recall = \frac{TP}{TP + FN} = \frac{TP}{\text{all ground truth OABBs}}    (8)

The Precision-Recall curve is used to evaluate the performance of an object detector as the confidence threshold is varied, by plotting a curve for each object class. An object detector of one class is considered good if its precision stays high as recall increases. As Precision-Recall curves are often zigzag curves going up and down, comparing different curves (different detectors) in the same plot is usually not an easy task. Thus, Average Precision (AP), a numerical metric, is used. AP is the precision averaged across all recall values between 0 and 1. The 11-point interpolation used to calculate AP summarizes the shape of the Precision-Recall curve by averaging the interpolated precision at a set of eleven equally spaced recall levels [0, 0.1, 0.2, …, 1]:

AP = \frac{1}{11} \sum_{\gamma \in \{0, 0.1, ..., 1\}} \rho_{interp}(\gamma), \quad \text{where } \rho_{interp}(\gamma) = \max_{\tilde{\gamma}: \tilde{\gamma} \ge \gamma} \rho(\tilde{\gamma})    (9)

where ρ(γ̃) is the measured precision at recall γ̃. Instead of using the precision observed at each point, the AP is obtained by interpolating the precision at the 11 levels, taking the maximum precision whose recall value is greater than or equal to γ.

At the inference stage of the proposed network, OAFF-SSD produces many more predicted OABBs than actually needed, so NMS is employed to filter these OABBs [46]. There is a key parameter in the NMS operation, the IOU threshold δ: among overlapping OABBs whose mutual IOU exceeds δ, only the OABB with the highest confidence score is kept. As shown in Fig. 10, OA-SSD and OAFF-SSD are evaluated using the metrics listed above. As δ increases, the AP does not increase monotonically. Table 1 shows that OA-SSD reaches its maximum AP of 0.895 at δ = 0.1, and OAFF-SSD reaches its maximum AP of 0.988 at δ = 0.3. Compared to OA-SSD, OAFF-SSD improves the AP by 0.093, significantly improving the detection effect, which is consistent with the previous observations.

Fig. 10. Precision-Recall curves of OA-SSD and OAFF-SSD.

Table 1
APs of OA-SSD and OAFF-SSD.

Methods\δ   0.1        0.2        0.3        0.4        0.5        0.6
OA-SSD      0.895332   0.802597   0.889518   0.875600   0.866052   0.817298
OAFF-SSD    0.894292   0.987603   0.987603   0.969752   0.933554   0.916452
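The sketch below (our illustration, not the official VOC code) computes precision, recall and the 11-point interpolated AP exactly as defined in Eqs. (7)–(9), given detections that have already been labelled TP or FP against the ground truth OABBs.

```python
# Sketch of the 11-point interpolated AP of Eqs. (7)-(9). Detections are assumed to be
# pre-labelled as TP/FP by the OA-IOU criterion against ground truth OABBs.
import numpy as np

def eleven_point_ap(scores, is_tp, num_ground_truths):
    """scores: confidence of each detection; is_tp: 1 for TP, 0 for FP; returns AP."""
    order = np.argsort(-np.asarray(scores))          # rank detections by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)           # Eq. (7) at each rank
    recall = cum_tp / max(num_ground_truths, 1)      # Eq. (8) at each rank
    ap = 0.0
    for gamma in np.linspace(0.0, 1.0, 11):          # Eq. (9): 11 recall levels
        mask = recall >= gamma
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 11.0

# Example with 4 detections and 3 ground truths:
print(eleven_point_ap([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], 3))
```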
5.3. Influence of key hyperparameters (β and δ)

At the training stage, the threshold of OA-IOU (β) determines the positive samples among the prior OABBs. From Eq. (2), if β is too large, the number of qualified prior OABBs decreases sharply, leading to fewer positive samples during training; on the contrary, if β is too small, many prior OABBs that differ too much from the ground truth OABBs become positive samples, resulting in poor predictions of the network. As shown in Fig. 11, different values of β have a large impact on the AP. Table 2 shows that the maximum AP is 0.898 when β = 0.2, 0.988 when β = 0.3, 0.909 when β = 0.4 and 0.818 when β = 0.5. Thus, in this article, β is recommended to be set to 0.3.

At the inference stage, the threshold of IOU for NMS (δ) controls the number of predicted OABBs. From Fig. 11, it can be concluded that regardless of the value of β, a smaller δ generally produces a higher AP, because a smaller δ leaves fewer low-confidence OABBs, which confirms the validity of NMS. Table 2 also shows that when β = 0.3, δ = 0.2 (or 0.3) gives the best detection result, indicating that δ should not simply be as small as possible but instead has a suitable range of values. Thus, in this study, δ is recommended to be set to 0.2.

Table 2
APs of OAFF-SSD with different βs.

β\δ    0.1        0.2        0.3        0.4        0.5        0.6
0.2    0.889135   0.898268   0.825817   0.802609   0.769059   0.648948
0.3    0.894292   0.987603   0.987603   0.969752   0.933554   0.916452
0.4    0.909091   0.909091   0.909091   0.909091   0.904656   0.887085
0.5    0.818182   0.818182   0.818182   0.818182   0.777937   0.776761
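For completeness, a sketch of the NMS step controlled by δ is given below (our illustration; it reuses the hypothetical `oa_iou` helper sketched in Section 3.4 rather than any code released with the paper).

```python
# Sketch of OABB non-maximum suppression with IOU threshold delta (our illustration;
# oa_iou is the hypothetical helper sketched earlier, not an API from the paper).
def nms_oabb(boxes, scores, delta=0.2):
    """boxes: list of (cx, cy, w, h, angle); scores: confidence per box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-confidence remaining OABB is kept
        keep.append(best)
        # suppress remaining OABBs that overlap the kept one by more than delta
        order = [i for i in order if oa_iou(boxes[best], boxes[i]) <= delta]
    return keep
```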
6. Conclusions

In this study, a deep learning and CNN based end-to-end approach for the precise detection of dense multiple construction vehicles in UAV images is proposed, which turns this challenging issue into a rotated object detection problem. This idea is introduced into the vehicle detection field by this paper for better performance. The following conclusions are obtained from the study:

• A novel OABB proposal and regression module is presented for dense multiple object detection. This module abandons the traditional four-parameter regression model for vehicle detection and proposes a five-parameter regression approach for rotated vehicle detection. The results show that this module solves the problem of inefficient separation caused by traditional bounding boxes and object rotation, and separates objects more effectively.

• A feature fusion module is added to the network for more precise detection. The results show that, whether for single or multiple, sparse or dense objects, and especially for small objects, this module contributes greatly to the improvement of detection accuracy, reaching an AP of 0.988.

• Two key hyperparameters (β at the training stage and δ at the inference stage) are discussed for better value selection. Through experiments, it is found that the proper value of β lies in 0.25–0.35 and that of δ in 0.2–0.3.
The end-to-end network proposed in this study, OAFF-SSD, can be applied not only to construction vehicle detection but also to other dense multiple object detection problems in civil engineering, and it can also identify the orientation of the target objects (which is useful for motion tracking and estimation). However, the main limitation of the presented method lies in the complexity of the deep learning model, which makes it hard to deploy on UAVs for real-time online detection. Thus, using state-of-the-art model reduction methods to simplify the detection model for a smaller resource footprint and better deployability will be explored in future work.
Declaration of competing interest

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Acknowledgements

This study was financially supported by National Key Research and Development Program of China [2018YFB1600200], NSFC [51922034, 51678204, 51638007], Heilongjiang Natural Science Foundation for Excellent Young Scholars [YQ2019E025] and Guangxi Science Base and Talent Program [710281886032].

References

[1] J. Seo, S. Han, S. Lee, H. Kim, Computer vision techniques for construction safety and health monitoring, Adv. Eng. Inform. 29 (2) (2015) 239–251, https://doi.org/10.1016/j.aei.2015.02.001.
[2] E. Rezazadeh Azar, B. McCabe, Automated visual recognition of dump trucks in construction videos, J. Comput. Civ. Eng. 26 (6) (2011) 769–781, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000179.
[3] H. Tajeen, Z. Zhu, Image dataset development for measuring construction equipment recognition performance, Autom. Constr. 48 (2014) 1–10, https://doi.org/10.1016/j.autcon.2014.07.006.
[4] T. Omar, M.L. Nehdi, Data acquisition technologies for construction progress tracking, Autom. Constr. 70 (2016) 143–155, https://doi.org/10.1016/j.autcon.2016.06.016.
[5] J. Park, K. Kim, Y.K. Cho, Framework of automated construction-safety monitoring using cloud-enabled BIM and BLE mobile tracking sensors, J. Constr. Eng. Manag. 143 (2) (2016) 5016019, https://doi.org/10.1061/(ASCE)CO.1943-7862.0001223.
[6] C. Zhang, A. Hammad, S. Rodriguez, Crane pose estimation using UWB real-time location system, J. Comput. Civ. Eng. 26 (5) (2011) 625–637, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000172.
[7] J. Teizer, B.S. Allread, C.E. Fullerton, J. Hinze, Autonomous pro-active real-time construction worker and equipment operator proximity safety alert system, Autom. Constr. 19 (5) (2010) 630–640, https://doi.org/10.1016/j.autcon.2010.02.009.
[8] A. Voulodimos, N. Doulamis, A. Doulamis, E. Protopapadakis, Deep learning for computer vision: a brief review, Computational Intelligence and Neuroscience 2018 (2018), https://doi.org/10.1155/2018/7068349.
[9] Y. Xu, Y. Bao, J. Chen, W. Zuo, H. Li, Surface fatigue crack identification in steel box girder of bridges by a deep fusion convolutional neural network based on consumer-grade camera images, Struct. Health Monit. 18 (3) (2019) 653–674, https://doi.org/10.1177/1475921718764873.
[10] M. Golparvar-Fard, A. Heydarian, J.C. Niebles, Vision-based action recognition of earthmoving equipment using spatio-temporal features and support vector machine classifiers, Adv. Eng. Inform. 27 (4) (2013) 652–663, https://doi.org/10.1016/j.aei.2013.09.001.
[11] J. Kim, S. Chi, J. Seo, Interaction analysis for vision-based activity identification of earthmoving excavators and dump trucks, Autom. Constr. 87 (2018) 297–308, https://doi.org/10.1016/j.autcon.2017.12.016.
[12] M. Memarzadeh, M. Golparvar-Fard, J.C. Niebles, Automated 2D detection of construction equipment and workers from site video streams using histograms of oriented gradients and colors, Autom. Constr. 32 (2013) 24–37, https://doi.org/10.1016/j.autcon.2012.12.002.
[13] Z. Zhu, X. Ren, Z. Chen, Integrated detection and tracking of workforce and equipment from construction jobsite videos, Autom. Constr. 81 (2017) 161–171, https://doi.org/10.1016/j.autcon.2017.05.005.
[14] Y. Yang, C. Dorn, T. Mancini, Z. Talken, G. Kenyon, C. Farrar, D. Mascareñas, Blind identification of full-field vibration modes from video measurements with phase-based video motion magnification, Mech. Syst. Signal Process. 85 (2017) 567–590, https://doi.org/10.1016/j.ymssp.2016.08.041.
[15] Y. Yang, C. Dorn, T. Mancini, Z. Talken, G. Kenyon, C. Farrar, D. Mascareñas, Spatiotemporal video-domain high-fidelity simulation and realistic visualization of full-field dynamic responses of structures by a combination of high-spatial-resolution modal model and video motion manipulations, Struct. Control. Health Monit. 25 (8) (2018) e2193, https://doi.org/10.1002/stc.2193.
[16] M.A. Kuddus, J. Li, H. Hao, C. Li, K. Bi, Target-free vision-based technique for vibration measurements of structures subjected to out-of-plane movements, Eng. Struct. 190 (2019) 210–222, https://doi.org/10.1016/j.engstruct.2019.04.019.
[17] C.M. Yeum, J. Choi, S.J. Dyke, Automated region-of-interest localization and classification for vision-based visual assessment of civil infrastructure, Struct. Health Monit. (2018), https://doi.org/10.1177/1475921718765419.
[18] C.M. Yeum, S.J. Dyke, Vision-based automated crack detection for bridge inspection, Computer-Aided Civil and Infrastructure Engineering 30 (10) (2015) 759–770, https://doi.org/10.1111/mice.12141.
[19] X. Kong, J. Li, Vision-based fatigue crack detection of steel structures using video feature tracking, Computer-Aided Civil and Infrastructure Engineering 33 (9) (2018) 783–799, https://doi.org/10.1111/mice.12353.
[20] L. Huang, J. Li, H. Hao, X. Li, Micro-seismic event detection and location in underground mines by using Convolutional Neural Networks (CNN) and deep learning, Tunn. Undergr. Space Technol. 81 (2018) 265–276, https://doi.org/10.1016/j.tust.2018.07.006.
[21] S. Sivaraman, M.M. Trivedi, Looking at vehicles on the road: a survey of vision-based vehicle detection, tracking, and behavior analysis, IEEE Trans. Intell. Transp. Syst. 14 (4) (2013) 1773–1795, https://doi.org/10.1109/TITS.2013.2266661.
[22] Y. Li, M.J. Er, D. Shen, A novel approach for vehicle detection using an AND-OR-graph-based multiscale model, IEEE Trans. Intell. Transp. Syst. 16 (4) (2015) 2284–2289, https://doi.org/10.1109/TITS.2014.2359493.
[23] Z. Chen, H. Li, Y. Bao, N. Li, Y. Jin, Identification of spatio-temporal distribution of vehicle loads on long-span bridges using computer vision technology, Struct. Control. Health Monit. 23 (3) (2016) 517–534, https://doi.org/10.1002/stc.1780.
[24] S. Noh, D. Shim, M. Jeon, Adaptive sliding-window strategy for vehicle detection in highway environments, IEEE Trans. Intell. Transp. Syst. 17 (2) (2016) 323–335, https://doi.org/10.1109/TITS.2015.2466652.
[25] E.R. Azar, B. McCabe, Part based model and spatial–temporal reasoning to recognize hydraulic excavators in construction images and videos, Autom. Constr. 24 (7) (2012) 194–202, https://doi.org/10.1016/j.autcon.2012.03.003.
[26] W. Ji, L. Tang, D. Li, W. Yang, Q. Liao, Video-based construction vehicles detection and its application in intelligent monitoring system, CAAI Transactions on Intelligence Technology 1 (2) (2016) 162–172, https://doi.org/10.1016/j.trit.2016.09.001.
[27] H. Kim, K. Kim, H. Kim, Data-driven scene parsing method for recognizing construction site objects in the whole image, Autom. Constr. 71 (2016) 271–282, https://doi.org/10.1016/j.autcon.2016.08.018.
[28] M.M. Soltani, Z. Zhu, A. Hammad, Skeleton estimation of excavator by detecting its parts, Autom. Constr. 82 (2017) 1–15, https://doi.org/10.1016/j.autcon.2017.06.023.
[29] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436, https://doi.org/10.1038/nature14539.
[30] W. Fang, L. Ding, B. Zhong, P.E. Love, H. Luo, Automated detection of workers and heavy equipment on construction sites: a convolutional neural network approach, Adv. Eng. Inform. 37 (2018) 139–149, https://doi.org/10.1016/j.aei.2018.05.003.
[31] X. Xiang, N. Lv, X. Guo, S. Wang, A. El Saddik, Engineering vehicles detection based on modified faster R-CNN for power grid surveillance, Sensors 18 (7) (2018) 2258, https://doi.org/10.3390/s18072258.
[32] Y. Guo, H. Niu, S. Li, Safety monitoring in construction site based on unmanned aerial vehicle platform with computer vision using transfer learning techniques, 7th Asia-Pacific Workshop on Structural Health Monitoring (APWSHM 2018), NDT.net, Hong Kong, China, 2018, pp. 1052–1060, https://www.engineeringvillage.com/share/document.url?mid=cpx_5a7c6c1a16a8e90e1b3M794910178163167&database=cpx&view=abstract.
[33] Z. Chen, C. Wang, C. Wen, X. Teng, Y. Chen, H. Guan, H. Luo, L. Cao, J. Li, Vehicle detection in high-resolution aerial images via sparse representation and superpixels, IEEE Trans. Geosci. Remote Sens. 54 (1) (2016) 103–116, https://doi.org/10.1109/TGRS.2015.2451002.
[34] V. Hoskere, J.-W. Park, H. Yoon, B.F. Spencer Jr., Vision-based modal survey of civil infrastructure using unmanned aerial vehicles, J. Struct. Eng. 145 (7) (2019) 4019062, https://doi.org/10.1061/(ASCE)ST.1943-541X.0002321.
[35] H. Yoon, J. Shin, B.F. Spencer Jr., Structural displacement measurement using an unmanned aerial system, Computer-Aided Civil and Infrastructure Engineering 33 (3) (2018) 183–192, https://doi.org/10.1111/mice.12338.
[36] S. Razakarivony, F. Jurie, Vehicle detection in aerial imagery: a small target detection benchmark, J. Vis. Commun. Image Represent. 34 (2016) 187–203, https://doi.org/10.1016/j.jvcir.2015.11.002.
[37] L. Wang, F. Chen, H. Yin, Detecting and tracking vehicles in traffic by unmanned aerial vehicles, Autom. Constr. 72 (2016) 294–308, https://doi.org/10.1016/j.autcon.2016.05.008.
[38] Y. Xu, G. Yu, Y. Wang, X. Wu, Y. Ma, A hybrid vehicle detection method based on viola-jones and HOG+SVM from UAV images, Sensors 16 (8) (2016) 1325, https://doi.org/10.3390/s16081325.
[39] N. Audebert, B. Le Saux, S. Lefèvre, Segment-before-detect: vehicle detection and classification through semantic segmentation of aerial images, Remote Sens. 9 (4) (2017) 368, https://doi.org/10.3390/rs9040368.
[40] L. Cao, C. Wang, J. Li, Vehicle detection from highway satellite images via transfer learning, Inf. Sci. 366 (2016) 177–187, https://doi.org/10.1016/j.ins.2016.01.004.
[41] Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, Z. Luo, R2CNN: rotational region CNN for orientation robust scene text detection, arXiv preprint arXiv:1706.09579, https://arxiv.org/abs/1706.09579, (2017).
[42] L. Liu, Z. Pan, B. Lei, Learning a rotation invariant detector with rotatable bounding box, arXiv preprint arXiv:1711.09405, https://arxiv.org/abs/1711.09405, (2017).
[43] Z. Liu, J. Hu, L. Weng, Y. Yang, Rotated region based CNN for ship detection, IEEE International Conference on Image Processing (ICIP), IEEE, Beijing, China, 2017, pp. 900–904, https://doi.org/10.1109/ICIP.2017.8296411.
[44] S. Li, Z. Zhang, B. Li, C. Li, Multiscale rotated bounding box-based deep learning method for detecting ship targets in remote sensing images, Sensors 18 (8) (2018) 2702, https://doi.org/10.3390/s18082702.
[45] H. Zhou, L. Wei, D. Creighton, S. Nahavandi, Orientation aware vehicle detection in aerial images, Electron. Lett. 53 (21) (2017) 1406–1408, https://doi.org/10.1049/el.2017.2087.
[46] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6) (2015) 1137–1149, https://doi.org/10.1109/TPAMI.2016.2577031.
[47] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, SSD: single shot multibox detector, European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, 2016, pp. 21–37, https://doi.org/10.1007/978-3-319-46448-0_2.
[48] Z. Li, F. Zhou, FSSD: feature fusion single shot multibox detector, arXiv preprint arXiv:1712.00960, https://arxiv.org/abs/1712.00960, (2017).
[49] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, https://arxiv.org/abs/1409.1556, (2014).
[50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, IEEE Conference on Computer Vision and Pattern Recognition, 2009, https://doi.org/10.1109/CVPR.2009.5206848.
[51] T. Kurbiel, S. Khaleghian, Training of deep neural networks based on distance measures using RMSProp, arXiv preprint arXiv:1708.01911, https://arxiv.org/abs/1708.01911, (2017).
[52] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167, https://arxiv.org/abs/1502.03167, (2015).
[53] M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The Pascal Visual Object Classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338, https://doi.org/10.1007/s11263-009-0275-4.