Automation in Construction 98 (2019) 146–159
3D tracking of multiple onsite workers based on stereo vision
Yong-Ju Lee, Man-Woo Park⁎
Department of Civil and Environmental Engineering, Myongji University, 13209, 4th Engineering Building, 116 Myongji-ro, Yongin, Gyeonggi-do 17058, South Korea
ARTICLE INFO
Keywords: Construction worker; Computer vision; Tracking; Entity matching; Camera calibration; Occlusion; Site monitoring
ABSTRACT
Various sensing technologies have been explored for positioning workers and equipment on construction sites. Vision-based technology has received growing attention by virtue of its tag-free and inexpensive configuration. One of the core research works in this area was the use of a stereo camera system for tracking the 3D locations of construction resources. However, the previous work was limited to tracking a single entity. To overcome this limitation, this paper presents a new framework for tracking multiple workers. The proposed framework supplements the previous work by embedding an additional step, entity matching, which finds corresponding matches of tracked workers across two camera views. Entity matching takes advantage of the epipolar geometry and workers' motion directions to find correct pairs of a worker's projections on the two image planes. This paper also presents an effective approach to camera calibration for positioning entities located a few tens of meters away from the cameras. The proposed framework is evaluated based on the completeness, continuity, and localization accuracy of the generated trajectories. The evaluation results show that it retrieves 96% of actual movements, with localization errors below 0.821 m at 99.7% confidence.
1. Introduction
Manpower is one of the critical resources contributing to onsite construction activities; thus its effective management is considered a key to productive and successful projects. Considering that construction sites generally involve a considerable number of workers, it is hard for a site manager to manually monitor the behavior of every worker on the site. In this perspective, continuous research efforts have been made on acquiring the location data of multiple onsite workers. As a result, GPS (Global Positioning System) [1,2] and RFID (Radio Frequency Identification) [3–5] technologies are increasingly employed on site to track the locations of equipment and materials, respectively. Moreover, heavy equipment telematics systems, in which various sensors including GPS are embedded, have been adopted for providing detailed information about the location and states of equipment as well as its surrounding environment [6]. The FHWA (Federal Highway Administration) also actively encourages the use of AMG (Automated Machine Guidance), which provides horizontal and vertical guidance for earthwork and paving equipment based on GPS data and 3D models [7]. Despite the strengths of GPS and RFID, they also have weaknesses which may hinder their broader application. Both technologies rely on navigation signals which are subject to radio frequency interference. Also, GPS and RFID systems require attaching tags or sensors to every target entity. This requirement entails time and costs for daily maintenance of the tags, and the unique IDs assigned to the tags may cause privacy problems. In addition, RFID systems are incapable of providing continuous location data, and are limited to collecting tag locations only when a reader appears within a certain range. Because of these issues, only a few research works have presented GPS and RFID applications for tracking onsite workers [8,9]. As an alternative to the GPS and RFID technologies, image processing and computer vision technologies have received growing attention for tracking construction resources. These technologies are attractive in that they do not require any tags attached to objects and can track multiple resources using only cameras. They are free from signal interference problems, and involve fewer privacy issues since it is rarely possible to recognize individual identities when tracking workers from a distance. Though vision-based technologies also have inherent limitations related to occlusion and illumination, it is still worthwhile to exploit their distinctive strengths for delivering location data in circumstances where they can complement and harmonize with GPS and RFID, or where GPS and RFID are not applicable. The continuous development of image processing and computer vision algorithms with higher confidence levels, and the widespread installation of onsite cameras, have increased the chances of employing vision-based technologies in the construction industry [10]. Various algorithms have been tested for detecting construction equipment, workers, and materials [11–16].
⁎ Corresponding author. E-mail address: [email protected] (M.-W. Park).
https://doi.org/10.1016/j.autcon.2018.11.017
Received 19 August 2018; Received in revised form 10 November 2018; Accepted 19 November 2018
0926-5805/ © 2018 Elsevier B.V. All rights reserved.
The object detection algorithms localize all objects of a certain type in each video frame independently but provide no relation between consecutive frames. Therefore, object tracking algorithms are employed to relate the location data between frames and track objects throughout a video [17–22]. However, the trajectory data attainable through the object detection and tracking algorithms are time series of 2D pixel coordinates, which lack depth information. With a mono camera system, it is hardly possible to retrieve the 3D coordinates of moving entities and to clearly understand their behaviors. Two or more cameras are indispensable for retrieving 3D spatial coordinates. Park et al. [23] presented a 3D vision tracking framework that uses two fixed cameras. It employs an object tracking method to localize workers in two camera views independently, and correlates the results based on the camera calibration parameters to calculate the 3D coordinates. However, it is not capable of tracking multiple entities. Also, it used a checker board to calibrate the cameras, which may cause a focus-related problem when the baseline – the distance between the two cameras – is long and the target entities are located far from the cameras. As outlined above, the previous framework is yet to be implemented in practical applications for tracking multiple entities. To overcome these limitations, this paper proposes a vision-based framework that can track multiple workers simultaneously in a 3D spatial coordinate system. This research starts from the previous work [23] and improves it to be suitable for automated construction site monitoring. The proposed framework adds an entity matching step for pairing the 2D objects corresponding to the same worker across the two camera views, which is a crucial procedure for tracking multiple entities. In this paper, an 'entity' refers to a worker as a 3D entity, while an 'object' refers to the 2D rectangular region of the worker's projection in a camera view. Entity matching takes advantage of the epipolar geometry and workers' motion directions. Furthermore, the framework replaces the use of a checkerboard with a rectification based on GCPs (Ground Control Points) filtered by the RANSAC algorithm. To evaluate the tracking performance, the quality of the estimated trajectories is quantified in three ways – completeness, continuity, and localization accuracy.

2. Background

2.1. Vision-based object detection methods
Object detection methods localize all objects of a certain type that appear in a video frame or an image. In general, the methods can be divided into two parts. The first part is to develop effective features that can differentiate the object type from others. The second part is to determine an appropriate machine learning algorithm for classifying the selected features. One of the popular combinations used for detecting construction resources is the HOG (Histogram of Oriented Gradients) feature and the SVM (Support Vector Machine), which was originally proposed for human detection [24]. Rezazadeh Azar and McCabe [11] proposed a dump truck detection method that trained eight SVM models independently for different camera angles with HOG features. It enabled detecting dump trucks captured from various camera angles. Rezazadeh Azar [15] also tested the feasibility of using zoom functions for equipment monitoring. In his research, a heavy machine is detected in zoomed-out views based on HOG features, and its individual identity is recognized in zoomed-in views based on visual tags attached to the machine. On the other hand, Park and Brilakis [12] presented the use of the HOG and SVM combination for detecting workers. Construction workers wearing safety vests are detected through three subsequent steps which apply motion, shape, and color features independently. The motion and shape features are used to detect human bodies in walking motion, whereas the color feature is used to determine whether the detected human bodies wear safety vests. The HOG and SVM combination is also applied to hardhat-wearing identification [25]. In that work, two separate SVM models are trained for detecting the human body and the hardhat. While most research works focused on detecting a single object type, Memarzadeh et al. [13] presented a framework to detect multiple object types such as workers, excavators, and dump trucks simultaneously. The framework involves training an SVM model using HOG features integrated with color histograms. Detecting multiple object types has become more feasible with deep learning algorithms. Luo et al. [16] utilized a deep-learning-based object detection algorithm, the Faster R-CNN (Region-based Convolutional Neural Networks) [26], for detecting twenty-two types of construction resources in far-field site images. The employed Faster R-CNN method is composed of two modules, the region proposal network and the ResNet-50 CNN model [27]. Unlike the traditional feature-based object detection methods, CNN-based methods require no hand-crafted feature models; the effective features are modeled automatically via weight optimization in the process of CNN training.

2.2. Vision-based 2D tracking methods
An object detection method provides no relation between video frames, and each frame's results are independent of other frames' results. Accordingly, no trajectory data are available when using only an object detection method. To relate the results between frames and obtain trajectory data, it is required to introduce a tracking method that infers the new position of each detected or tracked object in the next frame. Park et al. [19] classified various tracking methods into three types based on the way in which the methods represent the target objects to track. In their work, the three types of methods - point-based, contour-based, and kernel-based methods - are compared in terms of stability and accuracy in tracking onsite construction resources. Yang et al. [17] presented a tracking method that involves a machine learning process and tested the feasibility of tracking multiple onsite workers. Though tracking methods can generate trajectory data, they require specifying the region of the target object at an initial stage. Also, occlusion causes instability. These problems can be resolved by integrating with object detection methods. Paying attention to the complementary relationship between tracking and detection methods, Park and Brilakis [20] combined the two types of methods for locating wheel loaders in a way that can proactively manage occlusion cases. A similar approach was presented by Rezazadeh Azar et al. [21] for monitoring an earthwork equipment fleet. While Park and Brilakis [20] used only a kernel-based tracking method, Rezazadeh Azar et al. [21] added a point-based tracking method which allowed maintaining stable states under partial occlusions. An enhanced framework of the integration was presented for continuous localization of construction workers, who generally behave more dynamically than equipment [22]. In their framework, the detection and tracking methods interact intimately to resolve challenging occlusion cases in which workers are indistinguishable from each other. Instead of commercial cameras, Mosberger et al. [28] used a customized camera system that includes an infrared flash to capture the retroreflective tape on workers' safety vests. Their method detects and tracks workers wearing the safety vests relying on the features of the reflective regions.

2.3. Vision-based 3D tracking methods
The output of the detection and tracking methods described in Sections 2.1 and 2.2 is limited to 2D rectangular regions of the objects of interest in the pixel coordinate system. The distance from the camera to the object, i.e., depth information, is lost when an object is projected onto an image plane. The 2D tracking results can be useful for calculating the cycle time of equipment for productivity analysis when the movements of the tracked entities range within very small spaces or exhibit linear routes [18,21]. While Luo et al. [16] presented the use of 2D detection results for classifying construction activities, they acknowledged that 3D spatial information should be a better choice for discovering the relevance between entities. With 2D results alone, it is infeasible to measure the distance among tracked objects as well as the length of their trajectories. Therefore, in general, it is fair to claim that the output from a single camera view cannot be used for clash avoidance or productivity analysis. To overcome the limitations of 2D results, Park et al. [23] proposed a 3D tracking framework that employs a stereo vision system. The framework consists of three steps – camera calibration, 2D localization, and triangulation. The camera calibration reveals the intrinsic camera parameters, including the focal length and the image center, as well as the extrinsic camera parameters such as the relative position and camera angle between the two cameras. A checker board is used for the calibration process. The framework involves two parallel flows of 2D localization, which provide the 2D pixel coordinates of an entity's projections on the two camera views, respectively. The 2D pixel coordinates and calibration outputs are fed into the triangulation process that calculates 3D coordinates by finding the intersection of two projection rays (Fig. 1). The calibration is processed only once at the beginning of the framework, while the 2D localization and triangulation processes are repeated over the frames. Though experimental tests validated the potential performance to locate an entity with around a 1-m distance error, the framework can track only a single entity because it lacks a critical step that matches entities across the two camera views.

3. Problem statement and objectives
As described earlier, two or more fixed cameras are generally required to localize moving objects in 3D. While a lot of research works focused on detection and tracking in 2D pixel coordinates, Park et al. [23] presented the use of stereo vision for onsite tracking. However, the previous work was limited to tracking a single object per entity type. If multiple entities of the same type appear in the camera views, it is not feasible to track the entities without an additional process of matching the correct pairs of 2D objects from the two cameras. Provided that it is common to have multiple workers on a construction site, the previous framework is not applicable to general onsite environments. Consequently, this paper proposes a new framework that extends the authors' previous work by adding an entity matching process. Fig. 1 illustrates the need for and importance of entity matching. In the figure, two worker objects are detected in both views. Two different matching cases result in completely different 3D coordinates, marked as o's and x's in Fig. 1. To apply the triangulation and estimate 3D coordinates, it is necessary to find the correct matching pairs of the worker objects found in the two views. This paper also investigates camera calibration to confirm a proper approach for the environment where target entities are located several tens of meters away from the cameras. While the performance of the overall framework is measured through the quantification of trajectory qualities, the entity matching performance is separately evaluated based on the precision and recall of the matching retrieval. It should be noted that the 2D localization - object detection and tracking - is out of the scope of this paper. Though this paper chooses to use the method proposed by Park and Brilakis [22], the 2D localization processes could be switched without invalidating the proposed framework. Furthermore, this paper focuses on workers, not other types such as excavators and loaders, since there are far more chances of capturing multiple workers in a camera view than capturing multiple excavators or loaders.
Fig. 1. Effect of entity matching on 3D localization.

4. Methodology
Fig. 2 shows the overall framework that this paper proposes for tracking multiple workers. It receives, as main input data, two video streams from cameras located several meters apart. Each video stream is fed into the 2D localization process that involves worker detection and 2D tracking methods. As a result, for every frame, two lists of 2D pixel coordinates - one from the left camera and the other from the right camera - are provided, representing the 2D locations of the worker objects in the two camera views, respectively. The next step, entity matching, is the main contribution of this paper. It finds one-to-one matches between the two lists. The objective of this step is to find the correct matching pairs in such a way that the two coordinates in each pair correspond to the same worker entity. Each pair of pixel coordinates is handed over to the triangulation process, which calculates the final output, 3D coordinates. The entity matching and the triangulation processes rely on the fundamental matrix and the camera projection matrices, respectively. The main components of the matrices are the camera parameters, such as the focal lengths and the relative poses of the two cameras, which are estimated through camera calibration. The camera calibration is processed only once at the beginning of the framework since the camera parameters are not altered during tracking. Fig. 3 illustrates the application of the framework on two consecutive frames. A worker's 3D position at the n-th frame (Pn) is calculated by correlating a pair of matched 2D points pL,n and pR,n based on the camera calibration results fL, fR, R, and T. The 3D position can be tracked in the next frame unless either pL,n+1 or pR,n+1 is lost by the 2D localization process.
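The triangulation step of the framework can be illustrated with a short sketch. The following Python/NumPy code is a minimal example, not the authors' C# implementation: it assumes the rectified projection matrices PL and PR (Section 4.3) are already available and linearly triangulates one matched pair of pixel coordinates via the standard DLT formulation; the placeholder matrices in the usage comment are purely hypothetical.

```python
import numpy as np

def triangulate(P_L, P_R, p_L, p_R):
    """Linear (DLT) triangulation of one matched pixel pair.

    P_L, P_R : 3x4 camera projection matrices of the left/right views.
    p_L, p_R : (u, v) pixel coordinates of the matched 2D worker objects.
    Returns the 3D point in the calibration coordinate system.
    """
    u_L, v_L = p_L
    u_R, v_R = p_R
    # Each view contributes two rows of the homogeneous system A X = 0.
    A = np.array([
        u_L * P_L[2] - P_L[0],
        v_L * P_L[2] - P_L[1],
        u_R * P_R[2] - P_R[0],
        v_R * P_R[2] - P_R[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize

# Hypothetical usage with placeholder matrices and a matched pixel pair:
# P_L = np.eye(3, 4)
# P_R = np.hstack([np.eye(3), [[-6.0], [0.0], [0.0]]])
# print(triangulate(P_L, P_R, (912.0, 540.0), (878.0, 541.0)))
```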
4.1. 2D localization
Fig. 2. Proposed framework for tracking multiple workers with stereo vision.
The 2D localization step generally relies on the previous work [22], which employs a HOG-based detection method and a particle-filter-based tracking method. The two methods are combined to work interactively so that they complement each other. The detection method finds the pixel regions of workers, each of which initiates a tracking instance. While the detection method uses the shape of a whole body in a standing pose, the tracking method takes only the upper half of the body, which is subject to fewer occlusions [22]. The detection and tracking algorithms are processed simultaneously for every frame, and their results are matched together to spot more accurate worker regions as well as to identify occlusion occurrences. A tracking instance terminates when it is identified as an occluded worker. A more detailed description of the two algorithms' integration is given in the corresponding paper [22].
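The worker detector itself is not re-implemented in this paper. Purely for illustration, the sketch below shows the general idea of HOG-based person detection using OpenCV's stock HOG + SVM pedestrian detector as a stand-in for the worker-specific SVM of [12,22]; the confidence threshold and file name are assumptions, not values from the paper.

```python
import cv2

# OpenCV's built-in HOG descriptor with its pre-trained pedestrian SVM.
# The actual framework uses an SVM trained on construction-worker samples [12,22];
# this generic detector only stands in for that model.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_workers(frame):
    """Return candidate worker rectangles (x, y, w, h) for one video frame."""
    rects, weights = hog.detectMultiScale(
        frame, winStride=(8, 8), padding=(8, 8), scale=1.05)
    # Keep only confident detections; each surviving rectangle would
    # initialize (or confirm) a 2D tracking instance in the framework.
    return [tuple(r) for r, w in zip(rects, weights) if float(w) > 0.5]

# Example (hypothetical file name):
# cap = cv2.VideoCapture("left_view.mp4"); ok, frame = cap.read()
# boxes = detect_workers(frame)
```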
Fig. 3. Schematic procedures of the proposed framework processing two consecutive frames.
Fig. 4. Entity matching based on the epipolar geometry.
There are two differences made in this research. First, the color-based classification step, which was the last step of the employed detection method [12], is removed so that the framework tracks people regardless of whether they wear safety vests. Second, the identification of occlusion is partly changed; the framework receives more reliable information about occlusion occurrences from the correlation of the two views, which is treated in Section 4.4.

4.2. Entity matching

4.2.1. Entity matching based on the epipolar geometry
Fig. 4 illustrates how the epipolar geometry contributes to entity matching. The epipolar geometry is defined by the geometric correlation between the two cameras - the relative position of the camera centers (OL and OR) and the difference of their view angles. In Fig. 4, a worker on the left is positioned at P1 in the 3D spatial coordinate system, and its 2D projections onto the left and the right image planes are pL1 and pR1, respectively. The points P1, OL, OR, pL1, and pR1 all lie on the same plane, which is called the epipolar plane. An epipolar plane is determined for every 3D position, and the one for P1 is represented as a shaded triangle in Fig. 4. The intersections of the epipolar plane with the two image planes form the lines lL1 and lR1. The line lR1 is referred to as the epipolar line of pL1, and is determined by the line crossing pR1 and eR. The epipole, eR, is the projection of OL onto the right image plane. These notations work in the opposite way between the left and the right image planes. The entity matching in Fig. 4 can be defined as finding the matching pairs across the two coordinate lists, {pL1, pL2} and {pR1, pR2}. As illustrated in Fig. 4, pL1 and pR1 form a correct matching pair, and each is located on the other's epipolar line: pR1 is on lR1 and pL1 is on lL1. This observation leads to the primary principle that the matching pair of a 2D worker object tracked in the left view should be searched for along its corresponding epipolar line in the right view, and vice versa. In fact, the matching object is not positioned exactly on the epipolar line because of the errors generated in 2D localization and camera calibration. As illustrated in Fig. 5a, the 2D localization may result in a region shifted slightly downward, failing to encompass the hardhat, and the errors of the camera calibration may cause a small shift of the epipolar line from lgt (ground truth) to l. In the proposed framework, the object closest to the epipolar line is considered a candidate match. However, it is still possible that the closest one is not the correct match, especially when multiple objects are found near an epipolar line (Fig. 6). Therefore, the candidate is evaluated further to prevent false matches.
Fig. 5. Significance and definition of the epipolar constraint for the entity matching.
Fig. 6. Distance from 2D objects to an epipolar line.
For this purpose, two conditions are devised in terms of the distance between a 2D object and an epipolar line, expressed in Eqs. (1) and (2). The candidate, i.e., the closest object, is accepted only when both conditions are satisfied in both views.
Epipolar constraint: d1 < Td, where Td = 0.5 W1   (1)
Distance ratio: Dr = d2 / d1 > Tdr (> 1)   (2)
Here, d1 and d2 represent the distances between the epipolar line and the centroids of the first and the second closest objects, respectively (Fig. 6).
Eq. (1) specifies a buffer zone along the epipolar line, as shown in Fig. 5b, and avoids false matches by discarding a candidate outside the buffer zone; it eliminates a candidate that is not close enough to the epipolar line. The threshold Td is set small enough to assure that the tracked object region (the rectangle in Fig. 6a) intersects the epipolar line. Eq. (2) imposes the constraint that the candidate is determined as a correct counterpart only when the distance ratio is larger than the threshold Tdr. In other words, the candidate is accepted only when its distance is significantly smaller than that of the second closest object. This constraint assumes that the distance ratio (Dr) is proportional to the probability that the matching is correct. A distance di lower than 0.5 pixels is considered meaningless and is rounded up to 0.5. Accordingly, if d2 is lower than 0.5, the candidate cannot be selected because d1 = d2 = 0.5 and Dr = 1. The determination of Tdr is discussed in Section 5.1.
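The epipolar test of Eqs. (1) and (2) reduces to point-to-line distances computed from the fundamental matrix. The sketch below is a simplified, hypothetical rendering of this check in one search direction only (the framework applies it in both views); it assumes W1 denotes the width of the closest candidate's rectangle, and the data structures are invented for illustration.

```python
import numpy as np

T_DR = 3.5  # distance-ratio threshold determined in Section 5.1

def epipolar_distance(F, p_left, p_right):
    """Distance (pixels) from p_right to the epipolar line of p_left."""
    x_l = np.array([p_left[0], p_left[1], 1.0])
    x_r = np.array([p_right[0], p_right[1], 1.0])
    a, b, c = F @ x_l                      # epipolar line l_R = F x_L
    return abs(x_r @ np.array([a, b, c])) / np.hypot(a, b)

def candidate_match(F, obj_left, objects_right):
    """Apply Eqs. (1) and (2) for one left-view object.

    obj_left      : dict with 'centroid' (u, v)
    objects_right : list of dicts with 'centroid' and rectangle 'width'
    Returns the matched right-view object, or None if a condition fails.
    """
    # Distances below 0.5 px are considered meaningless and rounded up to 0.5.
    dists = [max(0.5, epipolar_distance(F, obj_left['centroid'], o['centroid']))
             for o in objects_right]
    order = np.argsort(dists)
    d1 = dists[order[0]]
    d2 = dists[order[1]] if len(dists) > 1 else np.inf
    T_d = 0.5 * objects_right[order[0]]['width']
    if d1 >= T_d:                          # Eq. (1): epipolar constraint
        return None
    if d2 / d1 <= T_DR:                    # Eq. (2): distance ratio
        return None
    return objects_right[order[0]]
```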
4.2.2. Entity matching based on the moving direction comparison
While Eqs. (1) and (2) reduce the incorrect matches pertaining to 2D localization errors, the framework is still prone to suffer from false detections – false positives (FPs) and false negatives (FNs). FPs are detected regions which contain no worker, and FNs are undetected worker objects. For example, as shown in Fig. 7, a worker is not detected (FN), and instead a non-worker object is detected (FP) close to an epipolar line, which would result in a false matching. Fig. 8 illustrates another probable case where two different entities lying close to each other's epipolar lines are matched together due to FNs. To address this issue, the 2D worker objects' moving directions on the image planes are compared to each other. The moving directions are marked with arrows in Figs. 7 and 8. Correctly matched objects move in similar directions, as Worker 2 does in Fig. 7. On the other hand, falsely matched objects exhibit different motion directions. For example, in Fig. 7, the 2D object of Worker 1 in the left view is moving to the right while its candidate match (FP) in the right view stays at the same location because the background is static. Similarly, different motion directions can also be observed in the potential false matching illustrated in Fig. 8, in which the two objects are moving in nearly opposite directions. Accordingly, the probable false matchings that can arise from false detection results can be avoided by investigating the disparity of motion directions. Reflecting these observations, the proposed framework discards a candidate match when its velocity (pixels/s) is considerably different from its counterpart's velocity. This comparison condition is expressed as follows.
Fig. 7. Effect of a false 2D localization on entity matching (potential matching with an FP).
Fig. 8. Effect of a false 2D localization on entity matching (potential matching with other entities).
|vL,n − vR,n| < Tv = 2.0 × min(WL,n, WR,n)   (3)
vL,n = (pL,n − pL,n−1)/(tn − tn−1)   (4)
vR,n = (pR,n − pR,n−1)/(tn − tn−1)   (5)
As illustrated in Fig. 9, vL,n and pL,n are the velocity and pixel coordinates of the 2D worker object in the left view at the n-th video frame, while WL is the width of its rectangular region. The velocity, pixel coordinates, and width in the right view are denoted by vR,n, pR,n, and WR, respectively. The velocities are calculated by Eqs. (4) and (5), and their unit is pixels/s. In the equations, tn is the timestamp of the n-th video frame. The widths are employed in Eq. (3) to adjust the value of Tv, based on the fact that workers located closer to the camera exhibit higher 2D velocities when projected to the camera view. Since closely located workers appear larger in the camera view, they are generally marked with wider rectangles. In other words, the velocities (vL,n and vR,n) are in a linear relation with the rectangle widths (WL and WR). Accordingly, Eq. (3) sets higher values of Tv for workers located closer to the cameras. It is worthwhile to note that this condition is valid when the distance between the cameras and the target entities is considerably longer than the cameras' baseline, so that the projection angles are close to each other, which is the normal case on construction sites. If an entity is located close to the cameras, the projections of its movement direction on the two camera views would differ significantly from each other, and the condition would no longer be applicable.
Fig. 9. Calculation of the difference between motion directions.
To summarize, the matching counterpart of an object in the other view is found by (i) searching for the object closest to its epipolar line, and (ii) checking the conditions defined in Eqs. (1), (2), and (3). The conditions in Eqs. (1) and (2) should be satisfied in both views, applied in the opposite way. The candidate closest to the epipolar line is declined if any of the three conditions is not complied with. In this way, the proposed framework minimizes the occurrence of false matches. However, the filtering process of the three conditions may also reject true matches. Therefore, it may result in higher matching precision and lower matching recall. The matching precision and the matching recall are defined as follows.
Matching precision = M_TP / (M_TP + M_FP)
Matching recall = M_TP / (M_TP + M_FN)
M_TP (Matching True Positive) = the count of retrieved matchings which are correct
M_FP (Matching False Positive) = the count of retrieved matchings which are incorrect
M_FN (Matching False Negative) = the count of matchings which are not retrieved
In this research, which relies on video data (a series of image frames), precision is more important than recall. For example, when the frame rate is 10 fps, 100% precision with 50% recall may result in trajectories with a data frequency of 5 Hz containing no false information. On the contrary, 50% precision with 100% recall may result in trajectories with a data frequency of 10 Hz, half of which contains false information.
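As a complement to the epipolar test, the velocity condition of Eqs. (3)–(5) can be sketched as follows. This is an illustrative rendering with hypothetical data structures, not the authors' implementation; it interprets |vL,n − vR,n| as the magnitude of the vector difference of the two pixel velocities.

```python
import numpy as np

def pixel_velocity(p_now, p_prev, t_now, t_prev):
    """Eq. (4)/(5): 2D velocity in pixels/s from two consecutive centroids."""
    return (np.asarray(p_now, float) - np.asarray(p_prev, float)) / (t_now - t_prev)

def velocity_consistent(track_left, track_right):
    """Eq. (3): reject a candidate pair whose motion directions disagree.

    track_left / track_right: dicts with centroids 'p_n', 'p_prev',
    timestamps 't_n', 't_prev', and rectangle width 'W' (pixels).
    """
    v_left = pixel_velocity(track_left['p_n'], track_left['p_prev'],
                            track_left['t_n'], track_left['t_prev'])
    v_right = pixel_velocity(track_right['p_n'], track_right['p_prev'],
                             track_right['t_n'], track_right['t_prev'])
    T_v = 2.0 * min(track_left['W'], track_right['W'])
    return np.linalg.norm(v_left - v_right) < T_v
```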
4.3. Camera calibration and triangulation
As shown in Fig. 2, the camera calibration involves calculating the 3 × 3 fundamental matrix and the 3 × 4 camera projection matrices. The fundamental matrix, which represents the epipolar geometry, can be determined from a large number of point matches across the two views [29]. Among the various feature point descriptors, SIFT [30] is employed in this research because of its capability of generating a large number of matches with high precision [23]. The processing time, which is a weak side of SIFT, is not a significant factor in selecting the feature point descriptor since the proposed framework processes camera calibration only once, after the two cameras are fixed on the site. RANSAC [31] is also used to eliminate outlier point matches. The camera projection matrices can be estimated as an arbitrary function of the fundamental matrix [29]. The arbitrarily estimated matrices are just initial plausible solutions that usually need rectification to remove reconstruction ambiguities: projective, affine, and similarity ambiguities. The rectification can be summarized as a transformation matrix, H, and can be formulated as follows [29].
X2 = H X1   (6)
PiL H−1 = PL   (7)
PiR H−1 = PR   (8)
Here, PiL and PiR are the initial projection matrices of the left and the right cameras, respectively. X1 is the 3D homogeneous coordinate reconstructed by PiL and PiR, and X2 is the rectified coordinate. PL and PR are the rectified camera projection matrices. There are two ways to determine the 4 × 4 matrix H: stratified rectification and direct rectification [29]. The former, in general, requires a few images containing a checker board viewed at various angles from the two fixed cameras (Fig. 10a). However, the use of the checker board is not suitable for a stereo camera system that has a long baseline and focuses on entities at a long distance, for several reasons. First, it is hard to capture a checker board located at a short distance from both views. As demonstrated in Fig. 10b, the projection of the checker board does not fit in the views and tends to lean on a side border. More importantly, the projection is likely to be blurred because the cameras are focused on far regions, significantly degrading the calibration accuracy. Otherwise, if the checker board is located at a farther position, it is projected too small (Fig. 10c), which is also undesirable for achieving precise calibration. On the other hand, the direct rectification simply uses ground truth data of X2 (Control Points, CPs) and their corresponding X1 data. The matrix H can be determined by five or more pairs of X1 and X2 [29]. This research follows the direct rectification, using a total station to collect the CPs. Thirty pairs of X1 and X2 are collected, and RANSAC is also applied to the estimation of H to remove outliers and minimize the influence of the errors in X1 and X2.
Fig. 10. Suitability of using a checker board depending on the baseline and the distance from cameras to target entities.
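A condensed sketch of this calibration chain is given below, using standard OpenCV/NumPy routines: SIFT matching with a ratio test, RANSAC-based estimation of F, the canonical projection-matrix pair derived from F [29], and a DLT-style estimate of the 4 × 4 rectifying transform H from control-point pairs. It is a simplified stand-in under these assumptions (e.g., the ratio-test and RANSAC parameters are generic choices), not the code used in the paper, and it omits the outlier filtering of the CP pairs.

```python
import cv2
import numpy as np
from itertools import combinations

def fundamental_matrix(img_left, img_right):
    """SIFT matching + RANSAC estimation of the 3x3 fundamental matrix F."""
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(img_left, None)
    k2, d2 = sift.detectAndCompute(img_right, None)
    matches = cv2.BFMatcher().knnMatch(d1, d2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # ratio test
    pts_l = np.float32([k1[m.queryIdx].pt for m in good])
    pts_r = np.float32([k2[m.trainIdx].pt for m in good])
    F, _inliers = cv2.findFundamentalMat(pts_l, pts_r, cv2.FM_RANSAC, 1.0, 0.999)
    return F

def initial_projection_matrices(F):
    """Canonical pair PiL = [I | 0], PiR = [[e']_x F | e'] derived from F [29]."""
    P_iL = np.hstack([np.eye(3), np.zeros((3, 1))])
    e = np.linalg.svd(F.T)[2][-1]          # right epipole: F^T e' = 0
    e_x = np.array([[0, -e[2], e[1]],
                    [e[2], 0, -e[0]],
                    [-e[1], e[0], 0]])
    P_iR = np.hstack([e_x @ F, e.reshape(3, 1)])
    return P_iL, P_iR

def rectifying_transform(X1, X2):
    """DLT estimate of the 4x4 H with X2 ~ H X1 (Eq. (6)) from >= 5 CP pairs.

    X1, X2 : (n, 4) arrays of homogeneous 3D coordinates
             (reconstructed points and surveyed control points).
    """
    rows = []
    for x1, x2 in zip(X1, X2):
        for i, j in combinations(range(4), 2):   # 6 equations per correspondence
            row = np.zeros(16)
            row[4 * j: 4 * j + 4] += x2[i] * x1
            row[4 * i: 4 * i + 4] -= x2[j] * x1
            rows.append(row)
    H = np.linalg.svd(np.asarray(rows))[2][-1].reshape(4, 4)
    return H / np.linalg.norm(H)                 # scale is arbitrary
```

The estimated H would then rectify the reconstructed points and the initial projection matrices as in Eqs. (6)–(8), after which triangulation (see the sketch in Section 4) yields metric 3D coordinates.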
Fig. 11. Summary of the entity matching and the occlusion identification. (Flowchart panels: Entity Matching, for not-matched objects – epipolar constraint (Eq. (1)), distance ratio (Eq. (2)), velocity difference (Eq. (3)); Occlusion Identification I, for every detected object – continuous lack of matched detections (Eq. (9)), abnormal acceleration (Eq. (10)); Occlusion Identification II, matching status check for matched pairs of entities – epipolar constraint (Eq. (1)), velocity difference (Eq. (3)).)

4.4. Occlusion handling – identification of unstable states of 2D localization
The most critical problem that hinders vision-based tracking is occlusion. When a tracked worker is occluded by other objects in either of the two camera views, the corresponding 2D tracking process tends to become unstable and to produce erroneous data [22]. Therefore, it is important to identify the occurrence of an occlusion and address it properly to minimize the erroneous data. The framework basically terminates tracking of the occluded object, rather than estimating invisible movements, because of the high uncertainty. The proposed framework considers four cues to identify occlusion events. While two of them are related to 2D localization and are checked for every 2D object in both views, the other two are associated with entity matching and are checked for the 2D objects that have already been matched across the views. The first cue is the absence of detection results for a certain duration, which is also reflected in the previous research on 2D tracking [22]. The detection and the tracking algorithms are concurrently processed every frame. Basically, the 2D tracking algorithm generates a result every frame regardless of occlusion, suggesting the most probable region of an object. If no detection results are found near a tracking region for a certain number of consecutive frames, the chances of the tracked object being occluded and invisible in the camera view are very high.
tnd > Tnd (s)   (9)
Here, tnd is the duration during which a 2D worker object has no overlapping detection results. The determination of the threshold Tnd should be based on the performance of the detection algorithm; the higher the performance, the lower the appropriate threshold. The second cue is an abnormally high acceleration of a 2D tracked object. As mentioned before, occlusion causes instability of the 2D tracking process. The instability entails an abrupt shift or a jump of the tracking region, which can be noticed by exceptional acceleration records [32]. This cue is summarized as follows.
|an| > Ta (pixels/s²), where an = (vn − vn−1)/(tn − tn−1)   (10)
Here, an is the acceleration of a 2D tracked object in either of the two views, and vn is its velocity defined as in Eq. (4) or (5). In this research, Tnd and Ta are set to 2.5 s and 2400 pixels/s², respectively, after a thorough investigation of the results from a preliminary test video.
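The two per-view cues of Eqs. (9) and (10) amount to simple threshold checks on each 2D tracking instance. The following is a hypothetical sketch using the thresholds stated in this section; the function interfaces are invented for illustration.

```python
import numpy as np

T_ND = 2.5     # s, maximum tolerated duration without an overlapping detection
T_A = 2400.0   # pixels/s^2, maximum plausible 2D acceleration

def occluded_by_missing_detections(t_last_detection, t_now):
    """Eq. (9): no detection has overlapped the tracked region for too long."""
    return (t_now - t_last_detection) > T_ND

def occluded_by_abnormal_acceleration(v_n, v_prev, t_n, t_prev):
    """Eq. (10): an abrupt jump of the tracked region shows up as a large |a_n|."""
    a_n = (np.asarray(v_n, float) - np.asarray(v_prev, float)) / (t_n - t_prev)
    return np.linalg.norm(a_n) > T_A

# A tracking instance flagged by either cue would be terminated by the framework.
```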
The third and fourth cues inspect the states of entity matching. When an object is occluded, the instability of its tracking process may result in a breach of the matching conditions described in Section 4.2: the object may locate far from the epipolar line or exhibit an unusual moving direction. The distance ratio, which represents the reliability of the matching, is used only for an initial matching and is no longer a condition of interest in this state-checking stage. Therefore, the matching states are checked based on compliance with Eqs. (1) and (3). When a pair of matched objects does not comply with Eq. (3), a problem remains to find and terminate the occluded one of the two objects. In this case, the one having an overlapped detection result is favored to survive. If both have overlapped detection results, the one with the higher 2D acceleration an is selected to be terminated. The procedures described in this section work not only for occlusion cases, but also for any other instability that occurs during 2D localization. Fig. 11 summarizes the processes of entity matching and occlusion handling.

5. Experiments and results
The proposed framework is implemented using Microsoft Visual C# in a .NET Framework 4.0 environment. The implementation is tested on onsite videos of which the resolution and the frame rate are 1920 × 1080 pixels and 30 fps, respectively. Camcorders fixed at the fourth floor of a building near the site are used to record the videos. With the auto-focus function turned off, the camcorders are focused on the area where the workers move around. The distance from the camcorders to the area is approximately 20–40 m. The videos from the two cameras are synchronized by manually finding the frames corresponding to the same instant.

5.1. Entity matching performance and determination of the distance ratio threshold
An 80-s video of two workers on a site is used for a preliminary test of entity matching, with two camcorders located 6 m apart from each other. Fig. 12 shows the scene captured from the two camcorders. This experiment aims to test the entity matching performance and to find the optimal value of Tdr, which is related to the initial matching. The matching performance was analyzed for values of Tdr ranging from 1.0 to 8.0 with a 0.5 interval to determine its optimal value (Fig. 13). The matching precision was kept over 90% regardless of the value and converged to 99.60% at Tdr = 3.5. The matching precision starts decreasing slightly after Tdr = 3.5, hitting 99.48% at Tdr = 8.0. Therefore, Tdr is determined as 3.5. The matching recall decreases as Tdr increases, hitting 84.01% at Tdr = 3.5. A matching recall over 80% is considered good enough to retrieve the matchings within a second once an entity is detected in both views, provided that the frame rate is higher than 10 fps.
Fig. 12. Stereo view of the videos used for the preliminary test of entity matching.

5.2. Camera calibration results
For evaluating the performance of the proposed framework, a 3-min test video is taken with three camcorders. The camcorders are fixed in a row with a 3 m spacing so that the results from two baselines – 3 m and 6 m – can be compared. Fig. 14 shows the scene captured with the three camcorders: leftmost (Fig. 14a), middle (Fig. 14b), and rightmost (Fig. 14c). The 3-m baseline setup consists of the leftmost camcorder and the one in the middle, while the 6-m baseline setup employs the leftmost and the rightmost camcorders. The videos contain five workers walking along pre-determined routes consisting of straight lines. Ground truth coordinates of the straight lines' end points are surveyed with a total station. Fig. 14a displays the placement of the 30 CPs used for direct rectification. In the figure, the 20 red circles represent the CPs on the pre-determined routes, and the 10 blue circles represent the CPs at other locations. The 3D coordinate axes are also displayed in Fig. 14b. Camera calibration is applied to the first frames of the 3-min test videos, and the results are summarized in Table 1. The fundamental matrix error [29] and the triangulation error are calculated by Eqs. (11) and (12).
Fundamental Matrix Error = (1/n) Σi dF,i², where dF,i = xR,iᵀ F xL,i   (11)
Triangulation Error = (1/m) Σk dCP,k, where dCP,k = ‖X2,k − X1,k‖2   (12)
Here, n is the number of point matchings, and xL,i and xR,i are the homogeneous pixel coordinates of the i-th point matching in the left and the right camera views, respectively. In Eq. (12), m is the number of CPs, which is 30 in this experiment. X1,k is the 3D homogeneous coordinate of the k-th CP estimated by triangulation, and X2,k is that acquired by using a total station. The notation ‖∙‖2 represents the L2-norm. A smaller number of point matchings is found from the long baseline stereo views, which is expected because of the wider view disparity between the two cameras. However, it results in a lower fundamental matrix error than the short baseline stereo views. While all 30 CPs are engaged in the calibration of the long baseline, 29 CPs are used in the calibration of the short baseline, with 1 CP filtered out as an outlier by RANSAC. The long baseline is also preferred regarding the triangulation error, scoring 0.0419 m.
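For reference, the two calibration quality measures of Eqs. (11) and (12) can be computed as below. This is an illustrative NumPy sketch, with the triangulation error evaluated on de-homogenized (Euclidean) coordinates.

```python
import numpy as np

def fundamental_matrix_error(F, pts_left, pts_right):
    """Eq. (11): mean squared algebraic residual x_R^T F x_L over n point matches."""
    n = len(pts_left)
    x_l = np.hstack([pts_left, np.ones((n, 1))])   # homogeneous pixel coordinates
    x_r = np.hstack([pts_right, np.ones((n, 1))])
    d_f = np.einsum('ij,jk,ik->i', x_r, F, x_l)    # x_R,i^T F x_L,i for each i
    return np.mean(d_f ** 2)

def triangulation_error(X_est, X_gt):
    """Eq. (12): mean Euclidean distance between reconstructed and surveyed CPs."""
    return np.mean(np.linalg.norm(X_est - X_gt, axis=1))
```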
Fig. 13. Entity matching performance: precision and recall.
Fig. 14. The 1st frames of the test videos captured from (a) the far left with CPs on it, (b) the middle with 3D axes on it, and (c) the far right.

Table 1. Camera calibration summary.
                                            Short – 3 m     Long – 6 m
Fundamental matrix   # of point matchings   490             183
                     Error                  7.74 × 10−4     0.47 × 10−4
Triangulation        # of CPs               29              30
                     Error (m)              6.36 × 10−2     4.19 × 10−2

5.3. Completeness and continuity of 3D trajectories
Using the determined Tdr value and the calibration results, the 3D tracking framework is tested on all frames of the 3-min video. The framework is evaluated with respect to the quality of the 3D trajectories, which is categorized into completeness, continuity, and localization. The completeness evaluates how large a portion of the actual trajectories the tracking framework retrieved. It is measured based on the retrieval percentage as follows.
Retrieval percentage = (N − Nnr)/N   (13)
Here, N is the number of frames, which is 5395 in this experiment, and Nnr is the number of frames in which a worker's 3D location is not retrieved. Table 2 summarizes the completeness evaluation. The results from the two baselines are very close to each other, signifying that the proposed framework can capture more than 96% of the workers' travel paths.

Table 2. Trajectory completeness evaluation: retrieval percentage (%).
           Short – 3 m   Long – 6 m
Worker 1   99.46         98.81
Worker 2   96.55         94.03
Worker 3   92.03         93.48
Worker 4   99.56         99.70
Worker 5   95.40         95.96
Overall    96.60         96.40

Table 3. 2D localization recall (%, long baseline, 6 m).
           Left camera view   Right camera view
Worker 1   99.28              99.48
Worker 2   97.31              96.55
Worker 3   95.11              96.66
Worker 4   99.83              99.78
Worker 5   97.22              98.83
Overall    97.75              98.26
The missing portions of the trajectories are mostly attributed to the worker detection performance. The detection method occasionally fails to detect a worker located at the farthest part of the routes, around 40 m away from the cameras. In these cases, the workers are projected to the cameras with a resolution of their height ranging from 64 to 72 pixels. Also, it is observed that Worker 3 is not detected as well as the other workers, which is attributed to the white color of the long sleeves worn with the safety vest. In the test video, the white sleeves are relatively indistinguishable compared with other colors, especially when the sleeves are slightly shadowed. Such degraded detection performance increases the chances of a 2D tracking process falling under the condition of Eq. (9) and being terminated. Also, it wastes more frames until the tracking resumes. Table 3 presents the 2D localization recall results when the long baseline is employed. The 2D localization recall is defined as the percentage of the 2D worker objects that are retrieved by the 2D localization step. In the experiments, no false positive detection occurred, which results in 100% precision of 2D localization for both camera views. The retrieval percentages in Table 2 are slightly lower than the 2D localization recall in Table 3. The reason is that the 3D location of an entity can be calculated only when its 2D worker objects in both views are retrieved. The entity matching recall of 84% also contributes to the lower retrieval percentage.
The continuity is evaluated based on the number of breaks in the trajectories. A break occurs when either of the matched 2D tracking processes is terminated. Breaks can be classified into two types depending on whether they entail ID changes or not. When a pair of objects is matched as a new 3D entity, the entity is labelled with a numeric ID. A break associated with the termination of both objects causes an ID change. On the contrary, if only one of the objects is lost, the ID is maintained after the break. Figs. 15 and 16 demonstrate the two break types. In Fig. 15, the worker entity originally labelled as '4' (Fig. 15a) is forced to lose the 2D objects in the left and the right views one by one when occluded (Fig. 15b and c). Once the worker becomes visible again, the proposed framework tracks it again, but assigns a new label, '5', recognizing it as a new entity (Fig. 15d). In other words, the break separates the trajectory into two independent ones, and the framework recognizes a single worker as two different workers with no overlapping time frame.
Fig. 15. Example case of a trajectory break with an ID change.
Fig. 16. Example case of a trajectory break with no ID change.

Table 4. Trajectory continuity evaluation: the number of trajectory breaks and ID changes.
                          Short baseline – 3 m       Long baseline – 6 m
                          Breaks    ID changes       Breaks    ID changes
Worker 1                  2         1                4         1
Worker 2                  7         3                9         1
Worker 3                  6         1                7         1
Worker 4                  2         2                2         1
Worker 5                  10        3                10        2
Total                     27        10               32        6
Frequency (/min/worker)   1.80      0.67             2.13      0.40

Table 5. Trajectory continuity evaluation: break length.
Break length (s)        Short baseline – 3 m   Long baseline – 6 m
Average                 1.121                  1.002
Standard deviation      1.328                  1.200
Max. (Max. deviation)   5.643 (4.522)          5.808 (4.806)
Min. (Min. deviation)   0.099 (−1.022)         0.132 (−0.890)
However, as long as either of the 2D tracking processes is continued, the ID is maintained even though the 3D coordinates are no longer generated and a break supervenes. In Fig. 16, the worker entity labelled as '8' (Fig. 16a) maintains its ID even after the break (Fig. 16c) since the 2D object in the right view sustains the ID while the one in the left is lost (Fig. 16b). Accordingly, the worker is tracked as a single entity despite the break. Tables 4 and 5 summarize the continuity evaluation. The long baseline results in more frequent breaks, but fewer ID changes. With the short baseline, the camera view angles as well as the projected scenes are more analogous to each other. Thus, the 2D tracking processes in the two views also experience analogous states, which increases the chances that both processes fall into unstable states concurrently. This leads to breaks with ID changes. The use of the long baseline reduces these phenomena and handles occlusion more stably. In Table 5, given that the maximum deviation is about five times larger than the minimum deviation, it can be inferred that the distribution is extremely skewed and most breaks are shorter than the average of about 1 s. Fig. 17 represents the completeness and the continuity of the trajectories generated with the long baseline. In the figure, the numbers in brackets are ID labels. It can be easily observed that the trajectories of Workers 2 and 3 contain more missing parts than the others, in accord with their lower retrieval percentages in Table 2.
Fig. 17. Completeness and continuity of the trajectories generated by using the long baseline.

Table 6. Completeness and continuity evaluation: MOTA.
           Short – 3 m   Long – 6 m
Worker 1   99.44%        98.80%
Worker 2   96.50%        94.01%
Worker 3   92.01%        93.46%
Worker 4   99.52%        99.68%
Worker 5   95.35%        95.92%
Overall    96.56%        96.37%

Table 7. Trajectory localization accuracy: error (m).
                 Short baseline – 3 m              Long baseline – 6 m
                 Mean    Std. deviation   Max.     Mean    Std. deviation   Max.
Worker 1         0.471   0.317            1.738    0.284   0.134            2.599
Worker 2         0.546   0.332            1.762    0.314   0.177            1.058
Worker 3         0.532   0.364            2.784    0.352   0.181            1.614
Worker 4         0.406   0.241            1.848    0.212   0.144            0.904
Worker 5         0.516   0.317            2.990    0.308   0.162            0.859
Overall (MOTP)   0.493   0.320            2.990    0.293   0.176            2.599

Fig. 18. Trajectory of Worker 3.
Fig. 19. Trajectory of Worker 5.
Regarding the continuity, the trajectory of Worker 5 is the worst, exhibiting 10 breaks and 2 ID changes. The longest break is observed in Worker 3's trajectory near the timestamp of 160 s, which finally causes the ID change from '2' to '9'. Integrating completeness and continuity into a single metric, the MOTA (Multiple Object Tracking Accuracy) suggested by Bernardin et al. [33] is also calculated. Considering that the five worker entities appear in all frames of the videos, the MOTA can be defined as follows.
MOTA = (N − Nnr − Nfp − Nic − Nmm)/N   (14)
Here, Nfp is the number of frames in which false positive entities are tracked, Nic is the number of ID changes presented in Table 4, and Nmm is the number of cases in which two or more entities are switched with each other. In the conducted experiments, neither false positives nor entity switches are found in the results (i.e., Nfp = 0 and Nmm = 0). Therefore, the MOTA can be calculated for each worker separately, and the only remaining difference from the retrieval percentage defined in Eq. (13) is the subtraction of Nic in the numerator. Table 6 shows the MOTA results. The MOTA results are quite close to the retrieval percentages since ID changes occurred only 10 and 6 times for the short and long baselines, respectively.

5.4. Localization accuracy of 3D trajectories
The localization accuracy is evaluated based on the error between the ground truth paths and the tracked trajectories. The error is defined as follows.
Localization accuracy = (Σ di)/(N − Nnr)   (15)
Here, di is the error for each tracked position, calculated as the distance from the calculated position to its corresponding line segment of the pre-determined paths, whose coordinates are acquired by using a total station. The denominator is the number of retrieved tracking data. The accuracy evaluation is summarized in Table 7, which signifies that the long baseline is more competitive than the short one. The causes of the localization errors can be classified into 2D localization errors and triangulation errors. The triangulation errors of the short and the long baselines are 0.0636 and 0.0419 m (Table 1), which are far lower than the mean localization errors of 0.493 and 0.293 m. Accordingly, it can be inferred that the localization errors are mostly attributed to 2D localization. 2D localization errors are amplified more substantially with the short baseline when transformed into 3D errors through triangulation. When the 6-m baseline is employed, the mean and the standard deviation are 0.293 m and 0.176 m, respectively. Assuming a normal distribution, the errors are lower than 0.821 m with 99.7% confidence. As in the completeness and continuity evaluations in Tables 2 and 3, the trajectories of Workers 2 and 3 exhibit lower quality than the others as far as the mean errors are concerned. It should be noted that the overall accuracy shown in the last row of Table 7 is equivalent to the MOTP (Multiple Object Tracking Precision) proposed by Bernardin et al. [33] as a metric to evaluate multiple object tracking processes. Figs. 18–20 show the trajectories of Workers 3, 5, and 1, respectively, generated with the long baseline. It can be observed in Fig. 18 that the trajectory of Worker 3 has lengthy breaks, consistent with the results presented in Tables 2 and 3 and Fig. 17. In Fig. 19, the trajectory of Worker 5 is comprised of 3 parts with different IDs, indicating that there are 2 breaks entailing ID changes and Worker 5 is identified as 3 separate entities. In Fig. 20, the part marked with the dotted circle corresponds to the maximum error (2.599 m). These large errors are caused by the 2D localization errors shown in Fig. 21. The 2D localization in the right view starts to stray from the object (Fig. 21a), hitting the maximum error (Fig. 21b). Though this instability of 2D localization is related to occlusion, the instability is afterwards identified and eliminated by the occlusion handling procedures described in Section 4.4. The framework accurately localizes the object again in the right view (Fig. 21c) and resumes 3D tracking stably. It is worthwhile to note that the error is kept below 3 m even while the 2D localization undergoes heavily unstable states.
Fig. 20. Trajectory of Worker 1.
Fig. 21. Cause of the maximum localization error (Worker 1).
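To make the evaluation reproducible in principle, the trajectory metrics of Eqs. (13)–(15) reduce to a few lines. The sketch below is an assumption-laden illustration, not the authors' code: it presumes per-frame counts of missed retrievals and ID changes are already available, and it approximates the "corresponding line segment" of Eq. (15) by the nearest surveyed route segment.

```python
import numpy as np

def retrieval_percentage(N, N_nr):
    """Eq. (13): share of frames in which a worker's 3D location is retrieved."""
    return (N - N_nr) / N

def mota(N, N_nr, N_fp, N_ic, N_mm):
    """Eq. (14): MOTA [33]; in these experiments N_fp = N_mm = 0."""
    return (N - N_nr - N_fp - N_ic - N_mm) / N

def point_to_segment(p, a, b):
    """Distance from a tracked 3D position p to the route segment a-b."""
    ab, ap = b - a, p - a
    t = np.clip(ap @ ab / (ab @ ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def localization_error(positions, segments):
    """Eq. (15): mean distance from each retrieved 3D position to its nearest
    pre-determined route segment (end points surveyed with a total station)."""
    d = [min(point_to_segment(p, a, b) for a, b in segments) for p in positions]
    return np.mean(d)
```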
6. Conclusions
Computer vision technologies are actively investigated in the construction domain to make the most of video data that can be easily collected on construction sites. One of the applications on which research efforts are continuously made is the localization of construction resources, including workers and equipment. To obtain the 3D coordinates of moving objects, two or more cameras are required. Though there was a previous study on 3D tracking of onsite resources, it was limited to tracking a single entity. This paper, extending the previous study, proposes a framework that can track multiple workers simultaneously and generate their 3D trajectories. The major contribution lies in the additional step - entity matching - that finds the pair of 2D objects corresponding to the same worker across the stereo views. The components and functions of entity matching also play an important role in occlusion handling. This paper also shows that the use of a checkerboard for camera calibration is not feasible for deployments with long baselines and long distances from the cameras to the target entities. The proposed framework is tested on onsite videos that record five workers' movements, and the generated trajectories are evaluated in terms of completeness, continuity, and localization accuracy. When the 6-m baseline is applied, the trajectories retrieve 96% of the workers' movements with a mean localization error of 0.293 m. However, concerning the continuity evaluation, the framework occasionally recognizes a worker as a few separate entities, which needs to be overcome for broad and stable applications. Future work is to apply state-of-the-art detection algorithms such as the Faster R-CNN (Region-based Convolutional Neural Network), which may stabilize the 2D localization processes, reduce breaks, and thus enhance continuity. Also, code optimization may be required to increase the processing speed, which is limited to 2.1 fps with the current prototype-level implementation.
Acknowledgements

This work was supported by the Basic Science Research Program through the Ministry of Science and ICT and the National Research Foundation of Korea (NRF-2016R1C1B2014997).

References

[1] O. Golovina, J. Teizer, N. Pradhananga, Heat map generation for predictive safety planning: preventing struck-by and near miss interactions between workers-on-foot and construction equipment, Autom. Constr. 71 (2016) 99–115, https://doi.org/10.1016/j.autcon.2016.03.008.
[2] H. Jiang, P. Lin, M. Qiang, Q. Fan, A labor consumption measurement system based on real-time tracking technology for dam construction site, Autom. Constr. 52 (2015) 1–15, https://doi.org/10.1016/j.autcon.2015.02.004.
[3] A. Montaser, O. Moselhi, RFID+ for tracking earthmoving operations, Construction Research Congress 2012, 2012, https://doi.org/10.1061/9780784412329.102.
[4] S. Moon, S. Xu, L. Hou, C. Wu, X. Wang, V.W.Y. Tam, RFID-aided tracking system to improve work efficiency of scaffold supplier: stock management in Australasian supply chain, J. Constr. Eng. Manag. 144 (2) (2018) 04017115, https://doi.org/10.1061/(ASCE)CO.1943-7862.0001432.
[5] X. Su, S. Li, C. Yuan, H. Cai, R.K. Vineet, Enhanced boundary condition–based approach for construction location sensing using RFID and RTK GPS, J. Constr. Eng. Manag. 140 (10) (2014) 04014048, https://doi.org/10.1061/(ASCE)CO.1943-7862.0000889.
[6] A. Shiffler, Telematics systems offer valuable data, American Cranes & Transport, 2018, https://www.americancranesandtransport.com/telematics-systems-offervaluable-data-/131709.article (accessed 1 August 2018).
[7] Federal Highway Administration (FHWA), Automated Machine Guidance with Use of 3D Models, U.S. Department of Transportation, 2013 (FHWA-HIF013-054), https://www.fhwa.dot.gov/construction/pubs/hif13054.pdf (accessed 1 August 2018).
[8] T. Cheng, M. Venugopal, J. Teizer, P.A. Vela, Performance evaluation of ultra wideband technology for construction resource location tracking in harsh environments, Autom. Constr. 20 (8) (2011) 1173–1184, https://doi.org/10.1016/j.autcon.2011.05.001.
[9] J. Fu, E. Jenelius, H.N. Koutsopoulos, Identification of workstations in earthwork operations from vehicle GPS data, Autom. Constr. 83 (2017) 237–246, https://doi.org/10.1016/j.autcon.2017.08.023.
[10] J. Yang, M.-W. Park, P.A. Vela, M. Golparvar-Fard, Construction performance monitoring via still images, time-lapse photos, and video streams: now, tomorrow, and the future, Adv. Eng. Inform. 29 (2) (2015) 211–224, https://doi.org/10.1016/j.aei.2015.01.011.
[11] E. Rezazadeh Azar, B. McCabe, Automated visual recognition of dump trucks in construction videos, J. Comput. Civ. Eng. 26 (6) (2012) 769–781, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000179.
[12] M.-W. Park, I. Brilakis, Construction worker detection in video frames for initializing vision trackers, Autom. Constr. 28 (2012) 15–25, https://doi.org/10.1016/j.autcon.2012.06.001.
[13] M. Memarzadeh, M. Golparvar-Fard, J.C. Niebles, Automated 2D detection of construction equipment and workers from site video streams using histograms of oriented gradients and colors, Autom. Constr. 32 (2013) 24–37, https://doi.org/10.1016/j.autcon.2012.12.002.
[14] L. Hui, M.-W. Park, I. Brilakis, Automated brick counting for façade construction progress estimation, J. Comput. Civ. Eng. 29 (6) (2015) 04014091, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000423.
[15] E. Rezazadeh Azar, Construction equipment identification using marker-based recognition and an active zoom camera, J. Comput. Civ. Eng. 30 (3) (2016) 04015033, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000507.
[16] X. Luo, H. Li, D. Cao, F. Dai, J. Seo, S. Lee, Recognizing diverse construction activities in site images via relevance networks of construction-related objects detected by convolutional neural networks, J. Comput. Civ. Eng. 32 (3) (2018) 04018012, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000756.
[17] J. Yang, O. Arif, P.A. Vela, J. Teizer, Z. Shi, Tracking multiple workers on construction sites using video cameras, Adv. Eng. Inform. 24 (4) (2010) 428–434, https://doi.org/10.1016/j.aei.2010.06.008.
[18] J. Gong, C.H. Caldas, An object recognition, tracking, and contextual reasoning-based video interpretation method for rapid productivity analysis of construction operations, Autom. Constr. 20 (8) (2011) 1211–1226, https://doi.org/10.1016/j.autcon.2011.05.005.
[19] M.-W. Park, A. Makhmalbaf, I. Brilakis, Comparative study of vision tracking methods for tracking of construction site resources, Autom. Constr. 20 (7) (2011) 905–915, https://doi.org/10.1016/j.autcon.2011.03.007.
[20] M.-W. Park, I. Brilakis, Enhancement of construction equipment detection in video frames by combining with tracking, J. Comput. Civ. Eng. (2012), https://doi.org/10.1061/9780784412343.0053.
[21] E. Rezazadeh Azar, S. Dickinson, B. McCabe, Server-customer interaction tracker: computer vision–based system to estimate dirt-loading cycles, J. Constr. Eng. Manag. 139 (7) (2013) 785–794, https://doi.org/10.1061/(ASCE)CO.1943-7862.0000652.
[22] M.-W. Park, I. Brilakis, Continuous localization of construction workers via integration of detection and tracking, Autom. Constr. 72 (2016) 129–142, https://doi.org/10.1016/j.autcon.2016.08.039.
[23] M.-W. Park, C. Koch, I. Brilakis, Three-dimensional tracking of construction resources using an on-site camera system, J. Comput. Civ. Eng. 26 (4) (2012) 541–549, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000168.
[24] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, 2005, pp. 886–893, https://doi.org/10.1109/CVPR.2005.177.
[25] M.-W. Park, N. Elsafty, Z. Zhu, Hardhat-wearing detection for enhancing on-site safety of construction workers, J. Constr. Eng. Manag. 141 (9) (2015) 04015024, https://doi.org/10.1061/(ASCE)CO.1943-7862.0000974.
[26] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1137–1149, https://doi.org/10.1109/TPAMI.2016.2577031.
[27] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2016) 770–778, https://doi.org/10.1109/CVPR.2016.90.
[28] R. Mosberger, H. Andreasson, A.J. Lilienthal, Multi-human tracking using high-visibility clothing for industrial safety, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2013, pp. 638–644, https://doi.org/10.1109/IROS.2013.6696418.
[29] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed., Cambridge University Press, 2003 (ISBN: 978-0521540513).
[30] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110, https://doi.org/10.1023/B:VISI.0000029664.99615.94.
[31] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (6) (1981) 381–395, https://doi.org/10.1145/358669.358692.
[32] E. Palinginis, M.-W. Park, K. Kamiya, J. Laval, I. Brilakis, R. Guensler, Full-body occlusion handling and density analysis in traffic video-surveillance systems, Transp. Res. Rec. 2460 (2014) 58–65, https://doi.org/10.3141/2460-07.
[33] K. Bernardin, R. Stiefelhagen, Evaluating multiple object tracking performance: the CLEAR MOT metrics, J. Imag. Video Process. (2008) 246309, https://doi.org/10.1155/2008/246309.