ISPRS Journal of Photogrammetry and Remote Sensing 77 (2013) 1–20
Detection and 3D reconstruction of traffic signs from multiple view color images

Bahman Soheilian*, Nicolas Paparoditis, Bruno Vallet
Université Paris-Est, IGN/SR, MATIS, 73 avenue de Paris, 94160 Saint Mandé, France
* Corresponding author. E-mail: [email protected]
Article info

Article history: Received 23 February 2011; Received in revised form 12 November 2012; Accepted 22 November 2012; Available online 27 January 2013

Keywords: Traffic sign; Color segmentation; Geometric shape estimation; Template matching; Constrained multi-view reconstruction
Abstract

3D reconstruction of traffic signs is of great interest in many applications such as image-based localization and navigation. In order to reflect reality, the reconstruction process should be both accurate and precise. To reach such a valid reconstruction from calibrated multi-view images, accurate and precise extraction of the signs in every individual view is essential. This paper first presents an automatic pipeline for identifying and extracting the silhouettes of signs in every individual image. Then, a multi-view constrained 3D reconstruction algorithm provides an optimal 3D silhouette for the detected signs. The first step, called detection, uses a color-based segmentation to generate ROIs (Regions of Interest) in the image. The shape of every ROI is estimated by fitting an ellipse, a quadrilateral or a triangle to its edge points. A ROI is rejected if none of the three shapes can be fitted sufficiently precisely. Thanks to the estimated shape, the remaining candidate ROIs are rectified to remove the perspective distortion and then matched with a set of reference signs using textural information. Poor matches are rejected and the types of the remaining ones are identified. The output of the detection algorithm is a set of identified road signs whose silhouettes in the image plane are represented by an ellipse, a quadrilateral or a triangle. The 3D reconstruction process is based on hypothesis generation and verification. Hypotheses are generated by a stereo matching approach taking into account the epipolar geometry as well as the similarity of the sign categories. The hypotheses that plausibly correspond to the same 3D road sign are identified and grouped during this process. Finally, all the hypotheses of the same group are merged to generate a unique 3D road sign by a multi-view algorithm integrating a priori knowledge about the 3D shape of road signs as constraints. The algorithm was assessed on real and synthetic images and reached an average accuracy of 3.5 cm for position and 4.5° for orientation.

© 2013 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
1. Introduction

Traffic signs are visual icons set up at the sides of roads to provide information, navigation rules and warnings to drivers. They are designed with strong visual properties in order to be easily noticed and recognized by drivers. Road signs are thus prominent elements of infrastructure for road traffic safety. They are particularly useful when integrated in road databases for optimal path computation, by providing information about prohibitions and warnings on road sections. In addition to their application in road database generation, traffic signs are useful objects in Advanced Driver Assistance Systems (ADASs), in order to provide advice and warnings to the driver. More recently, successful efforts have been made in the autonomous navigation of vehicles in urban areas using only low cost vision sensors (Charmette et al., 2009; Meilland et al., 2010). Image-based localization using visual
[email protected] (B. Soheilian),
[email protected] (N. Paparoditis),
[email protected] (B. Vallet). URL: http://recherche.ign.fr/labos/matis/~soheilian (B. Soheilian).
landmarks is necessary and complementary in such systems when navigating in dense urban areas, where GPS signals are corrupted with errors. In these systems, a set of landmarks is accurately localized during an off-line step. On-line localization is then performed in real time by registering images to the localized landmarks. Traffic signs, together with road marks, constitute robust and accurate landmarks that are available on most roads. Manually collecting these landmarks is a tedious task that should be automated as much as possible. The most natural way to do so is to use a mobile mapping system (MMS). We presented in (Soheilian et al., 2010) an algorithm for 3D rectangular road mark reconstruction using calibrated stereo pairs acquired in urban areas by a MMS. The 3D reconstruction of road signs should deal with three main issues:

(1) Detection: find the number, location and geometric shape of the road signs within images.
(2) Identification: determine the exact type of each road sign, i.e. identify its meaning.
(3) 3D reconstruction: compute the 3D geometry of each road sign by stereo or multi-view image restitution.
In this paper, we first present a method for the detection and identification of road signs in individual color images that retrieves the silhouettes of signs as simple geometric forms. Then, we show how the detected signs in individual images are used to compute the corresponding 3D signs.

2. Objectives

We aim at reconstructing the silhouettes of road signs from a set of georeferenced color images. More specifically, we aim at retrieving the position, the orientation and also the size of the signs in 3D space. The system should work with images captured in an uncontrolled manner in dense urban areas. Indeed, the required images are most of the time acquired by a MMS moving along the streets and capturing images in all directions at a specified time or distance interval. The signs can therefore be in any position and orientation relative to the viewing axes of the cameras, and will not necessarily be seen in fronto-parallel views. The design of our traffic sign extraction in individual images was mainly guided by the following requirements:

– Accuracy and Level of Details (LoDs): we aim at retrieving the silhouettes of road signs accurately as simple geometric forms (ellipse, quadrilateral or triangle).
– Robustness to: crowded scenes, non-standard positions of signs in relation to the road axis, perspective distortion (non fronto-parallel signs), unknown size of the signs in images, etc. (cf. Fig. 1).
– Independence from any initial solution or search area.

We will now present a survey of the most common approaches to traffic sign extraction from images.

3. Related work

The detection and identification of road signs has been investigated by many authors in the ADAS (Advanced Driver Assistance System) community since the 1990s. Recent review and evaluation papers testify to the large diversity of the proposed algorithms (Fu and Huang, 2010; Li et al., 2010; Belaroussi et al., 2010; Stallkamp et al., 2012). Traffic signs are often manufactured according to a country-dependent standard: their shape, size, color and ideograms are governed by strict specifications. Detection methods use one or several of these characteristics to select objects in images. Despite the large diversity of road sign extraction algorithms, most of them deal with four main issues: color-based
classification, geometric shape estimation, ideogram recognition and, finally, multi-view tracking and/or reconstruction.

3.1. Color-based sign detection

The dominant color of road signs in Europe is red or blue, and the ideograms are mostly black. A large number of authors use this property for road sign detection in color images. In this case, the challenge is to make the algorithm robust to illumination conditions. Several color segmentation approaches have been proposed for that purpose, most of which rely on the HSI (Hue, Saturation, Intensity) color space. For instance, thresholds can be applied to the H and S values, supposing that they are independent of the illumination conditions (Piccioli et al., 1996). However, this is not completely true, as the perceived color of a sign depends on the weather conditions, shadows, the color of the sky, nearby pieces of facades reflecting sunlight, the time of day, the angles that the sun and sensor directions form with the sign normal, etc. The hue value (H) is the least altered by surface reflection, such that it can be used to find seed pixels in the image and then to recover the color mask through a region growing algorithm (Fleyeh, 2008). The ratios R/G and R/B (or B/G and B/R) have also been exploited successfully for color-based sign detection (De La Escalera et al., 1997). More sophisticated color spaces, often based on CIE XYZ, have also been applied to color-based sign detection. For instance, recent approaches rely on the CIELab color space (Reina et al., 2006) or on the CIECAM color appearance model along with the HCJ color space (Gao et al., 2008) to detect road signs. We refer the reader to this latter reference for a more detailed comparison of color spaces for color-based road sign detection. Both of the mentioned color spaces depend on the white balance, and different white references are used according to the weather conditions. However, the real white balance depends on the time of day, the orientation of the sign in relation to the sun and also on shadows. This is the reason why color segmentation alone is almost never sufficient to make road sign detection reliable in complex urban scenes. Consequently, several methods rely on color information to extract ROIs (Regions of Interest) that are then fed to shape detection algorithms.

3.2. Shape based sign detection

Road signs have very specific geometric shapes (circle, rectangle, triangle and octagon), which is an important clue for road sign detection. This is usually exploited by detecting edge points in order to compute geometric characteristics. For instance, equilateral triangles can be detected by studying arrangements of line segments (Piccioli et al., 1996). Similarly, a
Fig. 1. An example of complex urban areas. Road signs are situated at different depths and orientations, so their sizes and shapes are unknown in image space.
circle estimation can be applied in order to validate the candidate regions computed by a color segmentation step (Ishizuka and Hirai, 2004). Other authors have investigated voting approaches. Polygonal road signs may be detected based on a generalized Hough transform in a hypothesis generation and verification paradigm (Habib et al., 1999). The Hough transform can also be used for circular sign detection (Garcia-Garrido et al., 2006). Alternatively, a Chinese transform was proposed to detect rectangular and circular signs (Belaroussi and Tarel, 2009b). The same authors also proposed a specific geometric model for the detection of triangular signs (Belaroussi and Tarel, 2009a). Single-pixel voting schemes such as the Hough transform have also been applied to the detection of circular and regular-polygon shaped signs (Foucher et al., 2009). Another class of shape-based road sign detection methods relies on so-called shape signatures. The FOSTS model, which relies on the analysis of histograms of orientations, has been proposed in this context (Gao et al., 2006). Another method (Lafuente-Arroyo et al., 2005) considers the distance of the detected edge points to a bounding box of the edges. The FFT (fast Fourier transform) has also been applied in order to compare the signature of a hypothesis with the signatures of reference signs (Gil-Jimenez et al., 2005). Finally, an interesting method measuring ellipticity, rectangularity and triangularity (Rosin, 2003) can be applied to road sign detection. Most of the proposed methods detect rectangles, circles and equilateral triangles, ignoring perspective deformation. Perspective effects are indeed quite weak in the case of fronto-parallel images (highway scenes). However, working in urban areas and/or using panoramic imagery invalidates this assumption, such that in general road signs appear as arbitrary convex quadrilaterals, triangles or ellipses in image space.

3.3. Road sign identification

In most approaches, road sign hypotheses are detected using color and shape information. The identification step is then considered as a validation step: a candidate is compared to a reference set of signs and validated only if a sufficiently good match is found. Artificial Neural Networks (ANNs) are often exploited for this purpose (Aoyagi and Asakura, 1996; Priese et al., 1995; De La Escalera et al., 1997). The ANN can be combined with radial basis functions (Zheng et al., 1994) or a self-organizing map (SOM) (Prieto and Allen, 2009) in order to perform accurate road sign recognition. Other classification methods such as AdaBoost (Ruta et al., 2010) and SVMs (Zhu and Liu, 2006) have also been tested in the same context. The main drawback of these methods is their need for a large number of manually classified signs in the learning step. In particular, the learning set must be exhaustive and contain instances of each sign in special cases such as perspective deformations and unfavorable light conditions. Other authors use SIFT (scale-invariant feature transform) descriptors. In this case, identification is done by matching the extracted SIFT features with previously stored features of standard signs (Farag and Abdel-Hakim, 2004). Finally, road sign identification has also been performed with template matching using gray scale images (Piccioli et al., 1994; Hsu and Huang, 2001; De la Escalera et al., 2004). A similarity measure between the candidate and a set of reference signs is computed, and the best similarity determines the type of the candidate sign.
A threshold on the similarity value is usually defined to reject false candidates. The main advantage of using reference signs is that it does not require manually collecting a large set of learning data; the main challenge is to ensure robustness to perspective deformation.
Even though SIFT descriptors are invariant to rotation and translation in the image plane, this is not the case for out-of-plane rotations, as high perspective deformations may disturb the SIFT descriptors.

3.4. Multi-view based road sign extraction

While most traffic sign extraction systems deal with the detection and classification of signs in single images, some systems integrate the detection and classification in a tracking mode (Fang et al., 2003; Lafuente-Arroyo et al., 2007; Meuter et al., 2008). Consequently, more reliable decisions can be made using multi-frame rather than single-frame information. Overall, there is a large amount of research work on the detection and classification of road signs using single or multi-frame image data. In contrast, few authors have investigated the 3D reconstruction and localization of road signs. Recently, a multi-view system for road sign detection, classification and 3D localization was proposed (Timofte et al., 2009). First, a color and shape based detection algorithm provides road sign candidates that are then analyzed by classification methods such as AdaBoost and SVM in order to reject false alarms. Once road signs are detected in individual images, geometric and visual consistencies are used to match the candidates between multiple views and generate 3D candidates. A Minimum Description Length (MDL) approach then selects an optimal subset of 3D road sign hypotheses. The method reaches a 95% correct reconstruction rate with an average localization accuracy of 24 cm, which can deteriorate to 1.5 m in some cases. An inventory system based on stereo vision and tracking was also developed for road sign 3D localization (Wang et al., 2010). Georeferenced color images are acquired by a mobile system. The system starts with a color segmentation on single images providing ROIs (Regions of Interest). The validation of the candidates is performed through an SVM classification using features extracted by PCA (Principal Component Analysis). Pairs of corresponding road signs are deduced by tracking the detected road signs within successive images. Two stereo-based approaches, called single-camera (stereo from motion) and dual-camera (rigid stereo base), are studied. The road sign localization accuracy varies from 1–3 m for the single-camera approach to 5–18 m for the dual-camera one. The localization accuracies obtained by the aforementioned methods (24 cm to a few meters) may be sufficient for road database generation, where road signs should be associated with road sections. However, several applications such as vision based positioning using road landmarks and large scale 3D city modeling require better accuracies.

4. Overview

Our survey of the state of the art in road sign detection systems reveals that most efforts have been directed at on-line driving assistance applications. Within these systems, the main challenge is to provide the traffic rules that are addressed to the driver at specific positions on the road. This amounts to looking for fronto-parallel road signs at standard directions in relation to the vehicle. The main goal is to detect the presence of signs rather than their position and shape. In contrast, mapping applications require detecting and recovering the geometry (position and shape) of road signs in any direction (including non fronto-parallel signs). This task is much more complicated in urban areas due to the high complexity of road networks in those areas.
In this paper we propose an off-line traffic sign extraction method enabling both image indexing and precise topographic 3D reconstruction. The extraction is based on template matching using a
complete set of reference road signs specified by a French inter-ministry document (Ministère de l'Écologie, de l'Énergie, du Développement durable et de l'Aménagement du territoire, 2008). The reference set contains 47 prohibition, 31 warning, 18 obligation and 30 information signs (cf. Fig. 2 for some examples). Section 5 explains the developed method for traffic sign extraction from individual color images, together with its evaluation results on a large set of images. Then, Section 7 demonstrates the possibility of 3D topographic reconstruction of the extracted road signs. Conclusions and perspectives are presented in Section 8.

Fig. 2. Examples of reference road signs. According to Ministère de l'Écologie, de l'Énergie, du Développement durable et de l'Aménagement du territoire (2008).

5. Traffic sign extraction in color images

Our off-line detection algorithm recovers road sign silhouettes anywhere in a color image and identifies them through their internal texture. The strategy is depicted in Fig. 3 and is composed of three main steps:

(1) Focusing: extracts Regions of Interest (ROIs) based on color information (Section 5.1).
(2) Shape estimation: estimates the shape of each ROI by fitting simple geometric forms to edge points (Section 5.2).
(3) Identification: recognizes the road sign candidates by matching their internal textures to those of reference road signs (Section 5.3).

5.1. Color-based segmentation

As mentioned in Section 3.1, color segmentation algorithms are influenced by the weather conditions, the time of day, shadows, the orientation of objects in relation to the sun and many other parameters. These parameters change frequently in dense urban scenes. In addition, many other objects in the street have the same colors as traffic signs (red and blue). This is the reason why, in this work, the color information is only used to generate ROIs, and not to perform classification. Consequently, a high rate of false positives (false alarms) can be tolerated as long as the number of false negatives (undetected signs) stays low. For this purpose, we use a criterion similar to the one used in (De La Escalera et al., 1997). Pixels are classified into the blue or red class according to Eq. (1):

\[
I(x,y) \in
\begin{cases}
C_r & \text{if } \frac{I_r(x,y)}{I_b(x,y)} > T \text{ and } \frac{I_r(x,y)}{I_g(x,y)} > T\\[4pt]
C_b & \text{if } \frac{I_b(x,y)}{I_r(x,y)} > T \text{ and } \frac{I_b(x,y)}{I_g(x,y)} > T\\[4pt]
C_{other} & \text{otherwise}
\end{cases}
\tag{1}
\]

where I(x, y) is the pixel of coordinates x and y in image I; I_r, I_g and I_b are the values of the three channels; C_r and C_b are the red and blue classes; and T is a fixed threshold. The threshold T is chosen empirically on a large set of images so as to minimize the false negative rate, even if this can lead to high false positive rates (cf. Section 6.2). These false candidates will be filtered out in subsequent steps by shape estimation and template matching. A result of this color-based ROI extraction is displayed in Fig. 4. Color segmentation provides two separate masks (red and blue), whose connected components are extracted to form ROIs. This is simply performed by labeling image regions composed of connected pixels of the masks.
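As an illustration, Eq. (1) and the connected-component ROI extraction can be sketched in a few lines of Python; the RGB channel layout, the default threshold value and the use of scipy.ndimage for labeling are our assumptions here, not the paper's C++ implementation.

```python
import numpy as np
from scipy import ndimage

def color_masks(img, T=1.20):
    """Red/blue pixel classification of Eq. (1).

    img: H x W x 3 array with channels (R, G, B), strictly positive values.
    T:   fixed ratio threshold (1.20 is the value tuned in Section 6.2).
    """
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    eps = 1e-6  # guard against division by zero
    red = (r / (b + eps) > T) & (r / (g + eps) > T)
    blue = (b / (r + eps) > T) & (b / (g + eps) > T)
    return red, blue

def extract_rois(mask):
    """ROIs as bounding boxes of the connected components of a binary mask."""
    labels, _ = ndimage.label(mask)
    return ndimage.find_objects(labels)  # list of (row slice, column slice)

# Usage on a random stand-in image:
red, blue = color_masks(np.random.rand(480, 640, 3) + 1e-3)
rois = extract_rois(red)
```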
5.2. Shape detection

The ROIs provided by the previous step may have arbitrary sizes, so we start by filtering the ROIs to keep only plausible road signs (Section 5.2.1). Once this is done, we aim at detecting the geometric shapes of the signs (circles, triangles, squares) within the ROIs. Regions containing an acceptable shape are kept as potential signs and will be analyzed in the identification step, while the other regions are discarded. Generally, in dense urban areas, road signs are not fronto-parallel to the image plane, so they are deformed by perspective. This deformation is rarely taken into account in existing road sign detection systems. Our acquisition system provides multi-camera panoramic images covering 360°, such that important perspective deformations often occur. This is why we aim at detecting arbitrary quadrilaterals, triangles and ellipses. The shape detection uses a variant of the RANSAC algorithm in which the shape estimation from the point samples considers not only the point positions but also a precise estimate of the contour tangents. We explain in Section 5.2.2 how the contour points and tangents are extracted, then present the general RANSAC procedure (Section 5.2.3), and finally explain how the ellipse, quadrilateral and triangle are estimated from the samples (Sections 5.2.4 and 5.2.5).

5.2.1. Geometric filtering

We start by filtering our ROIs based on global geometric characteristics:

(1) Size: we reject regions that are too small (<16 × 16 pixels) or too large (>200 × 200 pixels). The small regions often correspond to noise, and the large regions to objects much larger than road signs, such as facades (cf. Fig. 4).
(2) Aspect ratio: a threshold is applied to the height/width ratio; only regions with a ratio smaller than 4 are kept. This corresponds to a sufficiently high tolerance regarding perspective deformation.

The first line of Fig. 9 depicts the accepted ROIs on our running example (the image of Fig. 4).
Fig. 3. Our global strategy for road sign detection in color image. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 4. Red and blue masks obtained with our method applied to an image acquired in a dense urban area.
5.2.2. Estimation of oriented edge points

A Canny-Deriche edge detector (Deriche, 1987), followed by local non-maxima suppression and hysteresis thresholding, is applied to each remaining region. The second line of Fig. 9 shows the obtained edge map of each ROI. We aim at detecting the outer boundary of signs; however, the edges corresponding to the inner boundary and to the ideograms are also detected. In order to simplify shape detection, the inner boundary and some ideogram edges can be filtered out. This is performed by applying the color mask to the edge map in order to filter out black ideograms. Then, we discard the edges for which the gradient direction points toward the center of the ROI, as depicted in Fig. 5. This filters out the inner edges of danger, obligation and information signs. The third line of Fig. 9 shows the filtered edge map of each ROI. A chaining algorithm is then applied to the filtered edge map to link the edge points, and sub-pixel localization of the edge points is performed with a method similar to the one proposed by Devernay (1995). In order to compute an accurate tangent at each edge point, a neighborhood of 1–2 pixels along the chain is employed: a tangent is computed by fitting a line to these sub-pixel points (3–5 points). This process provides an accurate position and tangent for each edge point. A set of sub-pixel oriented edge points is thus computed for each region and input to the shape estimation algorithms.
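The tangent computation can be sketched as follows, assuming each chain is available as an ordered N × 2 array of sub-pixel points; the PCA-style total least squares fit is one possible realization of the line fit described above.

```python
import numpy as np

def chain_tangents(chain, half_window=2):
    """Unit tangent at every point of an ordered chain of sub-pixel edge points.

    For each point, a line is fitted to the 3-5 neighboring chain points;
    the dominant direction of the centered neighborhood (first right
    singular vector) gives the fitted tangent direction.
    """
    n = len(chain)
    tangents = np.zeros_like(chain, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        pts = chain[lo:hi] - chain[lo:hi].mean(axis=0)
        _, _, vt = np.linalg.svd(pts, full_matrices=False)
        tangents[i] = vt[0]
    return tangents

# Usage: on a sampled circle the tangents are orthogonal to the radii.
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
circle = np.c_[np.cos(theta), np.sin(theta)] * 50.0 + 100.0
t = chain_tangents(circle)
```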
5.2.3. RANSAC based shape estimation

Even after the color gradient based filtering, the edge map contains some outliers (cf. Fig. 9). In order to ensure a robust estimation of the geometric shapes, we implemented a RANSAC based algorithm (Fischler and Bolles, 1981) to detect each shape (ellipse, quadrilateral and triangle):

(1) Randomly select n points.
(2) Estimate a shape (cf. Sections 5.2.4 and 5.2.5).
(3) Find the support points.
(4) Compute the compatibility measure:

\[
C = \frac{\text{support contour length}}{\text{perimeter of estimated shape}} \le 1
\tag{2}
\]

(5) If C < ε, go to step 1.
(6) Otherwise, register the estimated shape.
(7) Repeat steps 1–6 k times.
(8) If any shapes were registered, accept the largest one.
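In outline, the loop reads as the following sketch, in which fit, support_length and perimeter are assumed shape-specific helpers (cf. Sections 5.2.4 and 5.2.5) and "largest" is interpreted as the largest compatibility measure:

```python
import random

def ransac_shape(points, fit, support_length, perimeter, n=3, k=100, eps=0.3):
    """Generic RANSAC loop of Section 5.2.3.

    points:         list of oriented edge points (position + tangent).
    fit:            estimates a shape from n sampled points, or returns None.
    support_length: contour length supporting a shape hypothesis.
    perimeter:      perimeter of the estimated shape.
    Shapes whose compatibility C (Eq. (2)) is below eps are discarded.
    """
    best, best_c = None, eps
    for _ in range(k):                                        # step 7
        sample = random.sample(points, n)                     # step 1
        shape = fit(sample)                                   # step 2
        if shape is None:
            continue
        c = support_length(shape, points) / perimeter(shape)  # steps 3-4
        if c > best_c:                                        # steps 5-6
            best, best_c = shape, c
    return best                                               # step 8 (None if nothing registered)
```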
The number of points n to select depends on the shape. As explained in Section 5.2.4, quadrilaterals and triangles are estimated with 4 and 3 points respectively, and an ellipse is estimated with 3 points (cf. Section 5.2.5). The number of iterations k is computed from the expected outlier rate. Theoretically, in order to reach a 99% success probability with 50% outliers, the algorithm needs 34 iterations for the estimation of the ellipse and triangle (n = 3 points) and 71 iterations for the estimation of the quadrilateral (n = 4 points). In practice we perform more than 100 iterations. Fig. 6 (top) illustrates some iterations of the RANSAC algorithm for shape estimation. According to the road sign specifications (cf. Section 4), red signs are circular or triangular and blue ones are circular or rectangular. Therefore, both corresponding shape estimation algorithms are applied to red and blue ROIs, and the best shape is chosen according to the compatibility measure (cf. Eq. (2)). Fig. 6 (bottom) illustrates the selected shapes. If no model is registered by the RANSAC algorithm, the ROI is rejected. In order to ensure the detection of partially occluded and damaged signs, a coarse threshold is applied to the compatibility measure (ε = 0.3). Once a geometric shape is estimated in a ROI, the sign's category can be determined: it is considered as a hypothetical sign of a given category and will be analyzed by the validation step.

5.2.4. Triangle and quadrilateral from oriented edge points

Triangles can easily be estimated from 3 oriented points: each oriented point defines a line, and the three triangle vertices are defined as the intersections of the three pairs of lines (cf. Fig. 7). For quadrilaterals, four oriented points are grouped into two pairs by associating the two points with the closest orientations (the remaining two forming the other pair). The four corners of the quadrilateral are then simply defined as the intersections of one line from each pair.
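A sketch of these constructions in homogeneous coordinates; the helper names and the angle-based pairing heuristic for the quadrilateral are our assumptions.

```python
import numpy as np

def line_through(p, t):
    """Homogeneous line through 2D point p with direction t."""
    q = p + t
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def meet(l1, l2):
    """Intersection of two homogeneous lines (assumed non-parallel)."""
    x = np.cross(l1, l2)
    return x[:2] / x[2]

def triangle_vertices(pts, tangents):
    """Three oriented points -> three vertices (intersections of line pairs)."""
    lines = [line_through(p, t) for p, t in zip(pts, tangents)]
    return [meet(lines[i], lines[(i + 1) % 3]) for i in range(3)]

def quad_vertices(pts, tangents):
    """Four oriented points, paired by closest orientation -> four corners."""
    lines = [line_through(p, t) for p, t in zip(pts, tangents)]
    angles = [np.arctan2(t[1], t[0]) % np.pi for t in tangents]
    order = np.argsort(angles)            # the two most parallel lines pair up
    pair1, pair2 = (order[0], order[1]), (order[2], order[3])
    # Corners = intersections of one line from each pair
    # (they may still need reordering into a cyclic sequence).
    return [meet(lines[i], lines[j]) for i in pair1 for j in pair2]
```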
5.2.5. Ellipse from oriented edge points

An ellipse is defined as the set of points satisfying:

\[
a(x - p)^2 + 2b(x - p)(y - q) + c(y - q)^2 = 1
\tag{3}
\]

The ellipse parameters a, b, c, p and q can be estimated from five points by solving a non-linear system. It was shown in (Zhang and Liu, 2005; Song and Wang, 2007) that it is possible to accurately estimate the center of an ellipse using only three points and their corresponding tangents, with the method depicted in Fig. 8. The estimation of the center is very sensitive to the accuracy of the tangents, but our sub-pixel edge localization provides sufficient precision in the tangent estimation. Once the center C = (p, q) of the ellipse has been determined, Eq. (3) becomes linear in a, b and c. The coordinates of the three contour points can be injected into Eq. (3), which results in a linear system of three equations. The solution should yield an ellipse that is included in the bounding box of the ROI; if it does not, the RANSAC sample is simply discarded.
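The center construction of Fig. 8 can be sketched directly; in exact arithmetic two of the three lines (I_ij M_ij) suffice, the third serving as a consistency check.

```python
import numpy as np

def line_pq(p, q):
    """Homogeneous line through two 2D points."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def meet(l1, l2):
    x = np.cross(l1, l2)
    return x[:2] / x[2]

def ellipse_center(pts, tangents):
    """Ellipse center from 3 contour points and their tangents (cf. Fig. 8).

    For each pair (i, j): I_ij is the intersection of the two tangent lines
    and M_ij the midpoint of [Pi Pj]; the center lies on the line (I_ij M_ij).
    """
    tline = [line_pq(p, p + t) for p, t in zip(pts, tangents)]
    diameters = []
    for i, j in [(0, 1), (1, 2), (0, 2)]:
        I = meet(tline[i], tline[j])
        M = (pts[i] + pts[j]) / 2.0
        diameters.append(line_pq(I, M))
    return meet(diameters[0], diameters[1])
```

Once the center (p, q) is known, injecting the three points into Eq. (3) gives a 3 × 3 linear system in (a, b, c), solvable for instance with numpy.linalg.solve.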
Fig. 5. Color gradient direction is used to filter out inner boundaries of signs. Black arrows show gradient direction at each point.
Fig. 6. Examples of RANSAC iterations and obtained results on three shapes of road signs.

Fig. 7. Estimation of a quadrilateral (resp. triangle) using 4 (resp. 3) oriented points.

Fig. 8. Estimation of an ellipse's center using three points Pi and their tangents ti. Iij is the intersection of ti and tj and Mij is the midpoint of the line segment [PiPj]. The center C of the ellipse is estimated as the intersection of the lines (IijMij).

5.3. Validation and identification

We use a complete set of reference road signs in order to identify or reject a hypothesis through a template matching process. However, a road sign in image space undergoes a projective deformation that should be corrected prior to this template matching. In order to solve this problem, a local rectification of the image within the detected signs is proposed in Section 5.3.1. The template matching itself is then detailed in Section 5.3.2.

5.3.1. Local image rectification

Theoretically, a projective transformation needs eight parameters to be fully determined, which requires knowing the transforms of four independent points. In the case of rectangular signs, this is simply performed by mapping each corner of the detected sign to a corner of the square template (Fig. 10a). In the case of triangular signs we have only three points, so we will only be able to estimate an affine transformation (Fig. 10b). The error caused by approximating the perspective transform by an affine transform is negligible as long as the sign size (a few tens of centimeters) is negligible compared to its distance to the optical center (usually >10 m), which is almost always the case. Finally, in the case of circular signs (ellipses in the image), we choose three points that fully characterize the ellipse: its center O′ and the intersections A′ (resp. B′) of the most vertical (resp. horizontal) axis with the ellipse. These points are mapped to the template circle center O and to the intersections A (resp. B) of the vertical (resp. horizontal) axis with the circle, in order to estimate an affine transform (Fig. 10c). In addition to the affine approximation, this estimation also relies on assimilating the vertical and horizontal axes of the circle with the main axes of the ellipse obtained by transforming the circle. Once again, this approximation is negligible as long as the plane defined by the viewing direction and the sign normal is close to horizontal, which is always the case in practice. Eq. (4) expresses the applied projective and affine transformations. While computing the transformation parameters (for all three shapes), the coordinates of the re-sampled images are set such that the obtained image has the same size as the reference road sign patterns, so that they are superimposable:

\[
x' = \frac{a_1 x + b_1 y + c_1}{a_0 x + b_0 y + 1},
\qquad
y' = \frac{a_2 x + b_2 y + c_2}{a_0 x + b_0 y + 1}
\tag{4}
\]

where x′ and y′ are the coordinates in the original image and x and y are the coordinates in the re-sampled image. For an affine transformation, a0 = b0 = 0. Once the transformation is estimated, the image is locally re-sampled by bilinear interpolation, resulting in a normalized-size texture pattern (the same size as the reference signs). Fig. 10d shows the interest of rectification in the case of an important perspective deformation. Results of the rectification on our running example are depicted in the fourth line of Fig. 9.
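A sketch of this rectification: Eq. (4) is solved from four template-to-image correspondences (with a0 = b0 = 0 in the affine cases), and the template grid is then re-sampled by bilinear interpolation. The grayscale image layout and the minimal bounds handling are simplifications of ours.

```python
import numpy as np

def fit_projective(src, dst):
    """Parameters of Eq. (4) mapping template points (x, y) to image points
    (x', y'), from four correspondences. src, dst: 4 x 2 arrays."""
    A, rhs = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp]); rhs.append(xp)
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp]); rhs.append(yp)
    a1, b1, c1, a2, b2, c2, a0, b0 = np.linalg.solve(np.array(A, float),
                                                     np.array(rhs, float))
    return a0, b0, a1, b1, c1, a2, b2, c2

def rectify(img, params, size):
    """Re-sample a grayscale image into a size x size template-aligned patch."""
    a0, b0, a1, b1, c1, a2, b2, c2 = params
    y, x = np.mgrid[0:size, 0:size].astype(float)
    w = a0 * x + b0 * y + 1.0
    xp = (a1 * x + b1 * y + c1) / w     # image coordinates per Eq. (4)
    yp = (a2 * x + b2 * y + c2) / w
    x0 = np.clip(np.floor(xp).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(yp).astype(int), 0, img.shape[0] - 2)
    fx, fy = np.clip(xp - x0, 0, 1), np.clip(yp - y0, 0, 1)
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x0 + 1]
    bot = (1 - fx) * img[y0 + 1, x0] + fx * img[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bot    # bilinear interpolation
```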
Fig. 9. Intermediate and final results obtained by our algorithm on the running example of Fig. 4. The relative scale is kept in order to show the efficiency of the algorithm in detecting small signs. The first line depicts the ROIs provided by the color segmentation algorithm. The second line shows the edge map of each ROI. The filtered edge maps are shown in the third line. The fourth line shows the re-sampled regions after shape estimation; these images are perspective-free and of standard size. N/A means that no valid geometric shape was detected. The last line depicts the corresponding reference sign and the correlation score. Candidates with a score lower than 50% are filtered out.

5.3.2. Template matching

Once a detected sign is re-sampled to overlay the reference signs (templates), matching is performed using ZNCC (Zero-mean Normalized Cross-Correlation), which is fairly invariant to illumination conditions. Each candidate is only matched with the reference signs of the category previously inferred from its shape and color. The computed ZNCC values provide useful quality measures for the detected candidates. Values close to 100% correspond to ideal candidates, while values below a threshold (around 50%) are considered false detections and filtered out. This similarity measure decreases when the sign is (partially) occluded or damaged, but also when the rectification process is imperfect, due to imprecisions in the shape estimation and to the intensity interpolation of the re-sampling process. Hence, the choice of the ZNCC threshold is critical and should depend on the application, and in particular on the acceptable false alarm rate. We will discuss the threshold determination in Section 6.3. Fig. 12 displays the ZNCC values for the speed limit sign (30 km/h) of our running example. The maximum value (69%) corresponds to the correct reference sign. The correlation peak is not very strong because, from a texture point of view, the other speed limit signs are quite similar and the differences are fairly weak. We will discuss in Section 6.3 how to set the ZNCC threshold for road sign validation and identification in a single image. In a multi-view detection and reconstruction paradigm, the ZNCC values can be considered as scores for the different types, and the final type of a candidate can then be deduced from a multi-view score (cf. Section 7.2.3). The last line of Fig. 9 shows the identified and rejected signs together with the corresponding ZNCC values (with the ZNCC threshold set to 50%). The identified signs are displayed in context in Fig. 11.
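A minimal sketch of the ZNCC score and of the threshold-based identification, assuming equal-size grayscale patches and a dictionary of reference templates already restricted to the candidate's category:

```python
import numpy as np

def zncc(a, b):
    """Zero-mean Normalized Cross-Correlation of two equal-size patches,
    in [-1, 1]; 1 corresponds to a perfect match up to gain and offset."""
    a = a.astype(float).ravel() - a.mean()
    b = b.astype(float).ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def identify(candidate, templates, threshold=0.5):
    """Best-matching reference sign, or None if the score is below threshold."""
    scores = {name: zncc(candidate, tpl) for name, tpl in templates.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else None), scores[best]
```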
6. Road sign detection results and evaluations

The presented algorithm was applied to a large set of images acquired in the dense city center of Paris by a mobile mapping system (Paparoditis et al., 2012). We start by presenting in Section 6.1 some qualitative results showing the robustness of the method to the difficulties of dense urban scenes. Then, in Section 6.2, we explain how we used an experiment to tune the color threshold of our approach. Finally, in Section 6.3, we present quantitative results for both the detection and the identification of road signs.
6.1. Qualitative results

In contrast with many existing road sign extraction systems, our method does not make any assumption about the position, size or orientation of road signs in image space. We only make one (resp. two) very weak assumption(s) in the case of triangular (resp. circular) signs. This property makes our algorithm effective in dense urban areas, where signs can have very diverse positions and orientations relative to the camera (cf. Fig. 13). Moreover, the precision of the geometric shape estimation and rectification steps makes our algorithm robust to perspective deformations, for both the detection and the identification of signs.
Most of the false alarms come from objects that look like road signs, such as the red «O» letters shown in Fig. 14. Car taillights are also often detected as circular red signs and sometimes identified as «no entry» signs (cf. Fig. 15). Similarly, rectangular and triangular objects in shop windows or on advertising boards may be confused with information and danger signs. The pipeline is implemented in C++ using the Insight Toolkit (Ibanez et al., 2003). The average processing time of our algorithm is 2.3 s for detection and less than 0.5 s for recognition on a 960 × 1080 image on an Intel 2.4 GHz processor.

6.2. Color threshold tuning

As depicted in Fig. 3, our algorithm is composed of three main steps: focusing, shape estimation and identification. The combination of the first two steps provides a road sign detection method. The quality of this detection relies mainly on the color threshold (cf. Eq. (1)) used in the focusing step. The main issue is to balance between a low threshold that will not distinguish some signs from their background, and a high threshold that will miss some signs. In order to evaluate the impact of this color threshold on detection, we used a ground-truth database² of 847 images (960 × 1080 pixels). This database contains 251 road signs of different categories, sizes and types. We detected road signs based on the first two steps of our algorithm, and confronted the results to the reference database using different color thresholds (from T = 1.10 to T = 1.46). Let P be the total number of signs and NbImg the total number of images; for each T value, and for 4 categories of sign sizes, we computed the true positive rate TP% = (TP/P) × 100 and the number of false positives per image FP/NbImg. The results are detailed in Belaroussi et al. (2010), where the detection step of our algorithm was compared with two other road sign detection methods. Note that our method is not designed to minimize the number of false alarms, as most of them will be eliminated in the identification step. The maximum true positive rate was obtained for T = 1.20, and the corresponding results are displayed in Table 1. The best true positive rate is 97%, for road signs larger than 64 pixels, with 0.32 false positives per image. The correct detection rate then naturally decreases with the size of the signs, as the shape estimation precision decreases with the number of oriented edge points available in the ROIs.

² Available at www.itowns.fr/benchmarking.html (accessed 25.10.12) (Belaroussi et al., 2010).
Fig. 12. Sign identification by template matching. The ZNCC values correspond to the detected speed limit sign (30 km/h) of our running example (cf. Fig. 9).
Fig. 10. (a–c) Points used for the local rectification of the image. (d) An example of local rectification.
Fig. 11. Final result of our algorithm on the running example.
Fig. 13. Results obtained in dense urban areas. Signs are in arbitrary positions relative to the camera and may undergo severe perspective deformations.
Table 1. Maximum correct detection rate reached (computed on 847 images). TP: True Positive; FP: False Positive; the threshold T is fixed to 1.20.

Min. size (pixels)   Number of signs   TP (%)   FP/image
64                   30                97       0.32
48                   74                87       0.83
32                   173               82       4.70
16                   251               72       11.50

Fig. 14. An example of false alarm: the letters «O» (a) are accepted as a «no traffic» sign (b).

Table 2. Correct detection rates and false alarm rates for four categories of road sign sizes (computed on 3384 images). TP: True Positive; FP: False Positive; the threshold T is fixed to 1.20.

                                         Detection             Detection + Identification
Min. size (pixels)   Number of signs    TP (%)   FP/image     TP (%)   FP/image
64                   261                82       1.78         71       0.07
48                   502                82       3.05         68       0.10
32                   975                76       6.31         60       0.21
16                   1246               73       15.04        55       0.59

6.3. Quantitative results

In order to evaluate our algorithm, we used a ground-truth database of 3384 images (excluding those used for parameter tuning) containing 1246 road signs of different types, shapes and sizes. The color threshold was fixed to its optimal value T = 1.20 obtained during parameter tuning. Table 2 displays the detection and identification performance of our algorithm on this database. The best true positive rate is obtained for signs larger than 64 pixels (82%), with 1.78 false alarms per image. The true positive rate remains acceptable for the smallest category (TP = 73% and FP/image = 15.04). In order to evaluate the impact of the ZNCC threshold value on the identification step, ROC (Receiver Operating Characteristic) curves are drawn for the identification of each
size category (cf. Fig. 16). These curves are obtained by tuning the ZNCC threshold from 0.5 to 0.9. A high ZNCC threshold decreases the number of mis-identified signs (i.e., a non-road-sign object identified as a road sign, or a road sign wrongly identified), but also decreases the correct identification rate (left end of the curves). It should be used for applications requiring robustness (the database does not contain wrong information). Conversely, a low ZNCC threshold (right end of the curves) should be used for applications requiring completeness (having all existing signs in the database). As an example, the extraction and identification of road signs in individual images may require a high ZNCC value in order to provide confident identification, whereas in a multi-view system even road signs with lower confidence can be involved in a multi-view identification process resulting in reliable recognition. We will explain this in Section 7. Table 2 shows the rates of correctly detected and identified signs (TP), and of wrongly detected and/or wrongly identified ones (FP), using a ZNCC threshold of 0.55. Considering that the identification is performed over a very large set of 126 different reference signs, the obtained results are very promising.
Fig. 15. An example of false alarm: (a) an ellipse is fitted to a car taillight. (b) The resulting rectified image. (c) The «no entry» sign.
Fig. 16. ROC curves for road sign identification for different size categories.

Fig. 17. Example of road signs detected in georeferenced images. The intrinsic and extrinsic parameters enable retracing the 3D rays to the sign corners. The corresponding 3D signs can be reconstructed by intersecting these rays.

Fig. 18. Road signs have known shapes: rectangle of given aspect ratio, equilateral triangle or circle.

7. 3D traffic sign reconstruction

In this section we demonstrate the interest of the presented extraction algorithm for the high precision 3D reconstruction and modeling of road signs. The reconstruction algorithm takes as input the road signs extracted from a set of georeferenced images (cf. Fig. 17) and retrieves the road sign projections corresponding to the same sign. The reconstruction is then performed by a multi-view approach taking into account the geometric constraints related to the shape of road signs. This is performed by a hypothesis generation and verification strategy: a matching step generates 3D hypotheses using hypothetical 2D matches, and a verification step then selects the best matches. Since the matching step uses the 3D reconstruction algorithm, the reconstruction algorithm is explained first, in Section 7.1, and the matching step is then described in Section 7.2.

7.1. Multi-view constrained road sign reconstruction

We aim at computing an optimal 3D road sign from its projections in n georeferenced images. The resulting 3D model should match the geometric specifications of road signs. Therefore, a multi-view constrained reconstruction approach was set up. Such an algorithm handles the eventual residual errors of the 2D road sign shape estimation in images, by increasing the number of observations and by constraining the resulting 3D model. Road signs are planar objects of known shapes (cf. Fig. 18). Rectangular and triangular signs can be modeled by 4 and 3 points, whose projections in images are corners that can be accurately estimated. We thus perform the 3D reconstruction of polygonal signs by estimating the 3D positions of those individual points under a set of constraints. Conversely, circular signs are projected as ellipses in images, on which no specific point can be distinguished; whole ellipses should therefore be matched in order to reconstruct the corresponding circle. This problem is far from trivial and will not be tackled in this paper, which focuses instead on the reconstruction of rectangular and triangular signs. The adopted method relies on the analytical minimization of an energy measuring the adequacy of the 3D shape with its projections in n georeferenced images (data term), under geometric constraints enforcing the 3D shape of the sign. We will start by defining the data attachment energy, then explain how we parametrize rectangular and triangular road signs in order to ease the expression of the geometric constraints.

7.1.1. Data term

The data term measures the adequacy of a 3D road sign s with its projections r_i, i ∈ I, in a set of images I. Each r_i is a 2D quadrilateral (resp. triangle) in the case of a rectangular (resp. triangular) sign. Let p⃗_ik be the view ray of the kth polygon vertex in image i (cf. Fig. 19). The data term measures the distances of these rays from the corresponding 3D vertices:

\[
E = \sum_{i \in I,\ k \in K} \left\| \overrightarrow{O_i P_k} \wedge \vec{p}_{ik} \right\|^2
\tag{5}
\]

where K = {1, 2, 3} for a triangle and K = {1, 2, 3, 4} for a rectangle, O_i is the projection center of image i, and P_k is the kth vertex of the 3D polygon (road sign). The best solution from the data attachment point of view is the one minimizing E. The optimum value for each P_k is obtained by the following closed-form expression:
\[
P_k = \left[ \sum_{i \in I} \left[\tilde{p}_{ik}\right]^2 \right]^{-1} \sum_{i \in I} \left[\tilde{p}_{ik}\right]^2 O_i
\tag{6}
\]

with

\[
\left[\tilde{p}_{ik}\right] \overset{def}{=}
\begin{bmatrix}
0 & -v_z & v_y\\
v_z & 0 & -v_x\\
-v_y & v_x & 0
\end{bmatrix}
\quad \text{for } \vec{p}_{ik} = \left[v_x, v_y, v_z\right]^t
\tag{7}
\]

However, this solution cannot ensure that the resulting points P_k respect the geometric specifications of road signs. For example, the four 3D points of a rectangular sign will not necessarily be coplanar.
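Eqs. (6) and (7) amount to a closed-form least-squares intersection of view rays. A short sketch, assuming unit ray directions:

```python
import numpy as np

def skew(v):
    """Cross-product matrix [p~] of Eq. (7)."""
    vx, vy, vz = v
    return np.array([[0.0, -vz, vy], [vz, 0.0, -vx], [-vy, vx, 0.0]])

def vertex_from_rays(centers, rays):
    """Closed-form solution of Eq. (6): the 3D point minimizing the sum of
    squared distances to the rays (O_i, p_ik)."""
    N = np.zeros((3, 3))
    u = np.zeros(3)
    for O, p in zip(centers, rays):
        S2 = skew(p) @ skew(p)          # [p~]^2
        N += S2
        u += S2 @ np.asarray(O, float)
    return np.linalg.solve(N, u)

# Usage: two cameras observing the point (0, 0, 10).
P = np.array([0.0, 0.0, 10.0])
O1, O2 = np.array([-1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
r1 = (P - O1) / np.linalg.norm(P - O1)
r2 = (P - O2) / np.linalg.norm(P - O2)
assert np.allclose(vertex_from_rays([O1, O2], [r1, r2]), P)
```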
7.1.2. Constraints

In order to ensure a solution in agreement with the geometric specifications of road signs, geometric constraints should be introduced into the optimization process. In the case of a rectangular sign, the four points should be coplanar and form a rectangle with a given aspect ratio m; the three points of a triangular sign should form an equilateral triangle. In order to simplify the application of these constraints, we propose a parametrization of rectangular and triangular signs: they are modeled by one 3D point Q and two 3D vectors v⃗₁ and v⃗₂ (cf. Fig. 20). The vertices of each polygon can be expressed as:

\[
P_k = Q + \alpha_1(k)\, \vec{v}_1 + \alpha_2(k)\, \vec{v}_2
\tag{8}
\]

Rectangle: α₁(k) = {0, 0, 1, 1}, α₂(k) = {0, 1, 1, 0}
Triangle: α₁(k) = {0, 0.5, 1}, α₂(k) = {0, 1, 0}

Each road sign is thus parameterized by x̃ = (Q, v⃗₁, v⃗₂), which ensures the co-planarity of the points in the case of rectangular signs. In order to guarantee the shape of the signs, two constraints are considered:

\[
C_1(\tilde{x}) = \vec{v}_1 \cdot \vec{v}_2 = 0
\tag{9}
\]

\[
C_2(\tilde{x}) = m^2 \left\| \vec{v}_1 \right\|^2 - \left\| \vec{v}_2 \right\|^2 = 0
\tag{10}
\]

In the rectangular case, m is given by the specification (known aspect ratio), while in the triangular case m = √3/2 (the triangle is equilateral).

7.1.3. Constrained minimization

We look for the minimizer of E(x̃) under the two constraints C₁(x̃) = 0 and C₂(x̃) = 0. The solution of this problem is given by cancelling all derivatives of the Lagrangian:
Fig. 19. Reconstruction of a rectangular road sign from its projections in n images.

Fig. 20. Rectangle and triangle are coded by a 3D point Q and two 3D vectors v⃗₁ and v⃗₂.

\[
L(\tilde{x}, \tilde{\lambda}) = E(\tilde{x}) + \lambda_1 C_1(\tilde{x}) + \lambda_2 C_2(\tilde{x})
\tag{11}
\]

Derivatives along x̃ are linear, but derivatives along λ̃ = (λ₁, λ₂) are not, so we linearize the problem into:

\[
N \tilde{x} + D^t \tilde{\lambda} = \tilde{u}, \qquad D \tilde{x} = 0,
\qquad \text{with } N \overset{def}{=} \frac{\partial^2 L}{\partial \tilde{x}\, \partial \tilde{x}^t},
\quad D \overset{def}{=} \left( \frac{\partial^2 L}{\partial \tilde{x}\, \partial \lambda_i} \right)_i
\tag{12}
\]

where ũ is a constant vector depending only on the observations. The solution of the linearized problem (12) is:

\[
\hat{x} = \tilde{x}^{(1)} - N^{-1} D^t \left( D N^{-1} D^t \right)^{-1} D\, \tilde{x}^{(1)},
\qquad \tilde{x}^{(1)} = N^{-1} \tilde{u}
\tag{13}
\]

We then iterate Eq. (13) to compute finer estimates of the solution of the non-linear system. The initial iterate x̃⁽¹⁾ is the solution of the system without taking the constraints C₁ and C₂ into account. Note that only D has to be re-estimated, as N, ũ, and thus x̃⁽¹⁾, are constant. The corresponding term of Eq. (13) is a corrective term arising from the enforcement of the constraints. The process is repeated until the variation of this corrective term is smaller than a fixed threshold. Fig. 21 depicts an example of a triangular sign reconstructed from multiple views under geometric constraints.
Fig. 21. An example of multi-view constrained reconstruction.
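The iteration of Eq. (13) can be transcribed schematically as follows; the constraint Jacobian callback, the dimensions and the convergence test are our assumptions, not the authors' exact implementation.

```python
import numpy as np

def constrained_solve(N, u, constraint_jacobian, n_iter=50, tol=1e-10):
    """Iterative constrained minimization of Section 7.1.3.

    N, u:                normal matrix and constant vector of Eq. (12), fixed.
    constraint_jacobian: x -> D(x), the 2 x dim Jacobian of (C1, C2) at x.
    """
    N_inv = np.linalg.inv(N)
    x1 = N_inv @ u                  # unconstrained initial iterate x^(1)
    x = x1.copy()
    prev = np.zeros_like(x)
    for _ in range(n_iter):
        D = constraint_jacobian(x)  # only D is re-estimated at each iteration
        corr = N_inv @ D.T @ np.linalg.solve(D @ N_inv @ D.T, D @ x1)
        x = x1 - corr               # Eq. (13)
        if np.linalg.norm(corr - prev) < tol:
            break
        prev = corr
    return x
```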
7.2. Matching road signs within images

The 3D reconstruction of a road sign given its projections in a set of georeferenced images was described in Section 7.1. We now explain how, given a set of detected road signs in georeferenced images (cf. Fig. 22a), we look for subsets of detected signs corresponding to the same 3D road sign. Each subset then serves as input for the reconstruction of a single 3D road sign. This is based on a hypothesis generation and validation paradigm. We first explain in Section 7.2.1 the constraints applied for matching a pair of 2D detected road signs. Section 7.2.2 then describes how these constraints are used to generate a set of hypotheses. Finally, the validation of the hypotheses is explained in Section 7.2.3.

Fig. 22. Main steps of the matching.

7.2.1. Matching constraints

As classically done to solve stereo-matching problems, two kinds of constraints are applied for matching a pair of features:

Geometric constraints:
– Epipolar geometry: both detected sign centers should lie on the same epipolar line of the camera pair.
– Size: the size of the resulting 3D road sign should lie within some range (specified in the road sign manual).
– Visibility: the resulting 3D road sign should face both camera centers.

Similarity constraint: both detected signs should have the same visual characteristics. In order to compare the visual characteristics of two detected signs, we could compare the types identified in the detection step. However, identification is not 100% accurate:

(1) Road signs of different types falsely identified as the same type may lead to reconstructing a non-existing road sign.
(2) Road signs of the same type falsely identified as different types may lead to not reconstructing an existing road sign.

In order to reduce the impact of mis-identification, only the main category of the matching candidates (warning = triangular or information = rectangular) is taken into account, such that matching disambiguation is mainly based on geometry.

7.2.2. Hypotheses generation

The method for hypothesis generation is detailed in Algorithm 1 (a sketch is given below). The input is a set R of unorganized 2D detected road signs. The output consists of subsets of R, where every subset plausibly corresponds to an individual 3D road sign. For every 2D detected road sign, the algorithm tests the matching constraints against all other 2D road signs. If the matching constraints are verified, the pair is considered compatible and the corresponding 3D road sign is reconstructed. Whenever a new 3D road sign is reconstructed, it is compared to the already reconstructed signs by computing their distance. We define the 3D distance between two hypotheses as the max of the distances between corresponding points. If this distance is below a certain threshold, which we call the clustering threshold, the two hypotheses are considered close enough to belong to the same 3D sign. They are then replaced by a single new 3D hypothesis computed using the union of their supporting 2D signs. Conversely, if no hypothesis is close enough to the new 3D hypothesis, the latter is added to the hypothesis set S. The process stops when the compatibility of all possible pairs has been verified.

Algorithm 1. Algorithm for generating 3D road sign hypotheses
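A compact sketch of the loop described above; is_compatible (the matching constraints), reconstruct (the multi-view reconstruction of Section 7.1) and dist (the max distance between corresponding vertices) are assumed helper callbacks.

```python
def generate_hypotheses(signs, is_compatible, reconstruct, dist, cluster_thr):
    """Sketch of Algorithm 1: 3D road sign hypothesis generation.

    signs: the set R of 2D detections. Returns the hypothesis set S as
    (3D hypothesis, set of supporting detection indices) pairs.
    """
    S = []
    for i in range(len(signs)):
        for j in range(i + 1, len(signs)):
            if not is_compatible(signs[i], signs[j]):
                continue
            support = {i, j}
            candidate = reconstruct([signs[k] for k in support])
            for idx, (h, sup) in enumerate(S):
                if dist(candidate, h) < cluster_thr:
                    # close enough: merge and re-reconstruct from the union
                    sup = sup | support
                    S[idx] = (reconstruct([signs[k] for k in sup]), sup)
                    break
            else:
                S.append((candidate, support))  # no close hypothesis: add new
    return S
```

The validation step of Section 7.2.3 then assigns each 2D sign exclusively to its best-supported hypothesis and keeps the hypotheses with more than three supports.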
Fig. 23. A horizontal panoramic composed of 8 images, plus forward-looking and rear-looking stereo pairs, is used in the road sign reconstruction process.
7.2.3. Hypotheses validation

The hypothesis generation does not prevent a single 2D sign from participating in the reconstruction of several 3D hypotheses; thus the process results in a large set of 3D hypotheses that may contain false positives (cf. Fig. 22b). In order to filter out these 3D false positives, we apply the obvious constraint that every 2D sign can explain at most one 3D sign. For this purpose, every 2D sign is associated exclusively with the hypothesis having the highest number of supports, and only the 3D hypotheses associated with more than three 2D supports are validated to form our final reconstructed road sign models (cf. Fig. 22c). Finally, the type of each 3D model is determined by majority voting over its supporting 2D signs.

7.3. Evaluations and results

In this section, we present the experiments that we conducted (Section 7.3.1) and then discuss their results (Section 7.3.2).

7.3.1. Evaluation setting

The aim of our experiments is to evaluate:

(1) The detection rate of our matching algorithm: the number of false positives (reconstructed signs not present in the real scene) and of false negatives (real road signs not reconstructed).
(2) The accuracy of our 3D reconstruction: the geometric error between the real signs and their 3D reconstructions.

This evaluation relies on running our full pipeline on two datasets: one real and one synthetic. The real dataset consists of a large set of georeferenced images acquired by a MMS called Stereopolis (Paparoditis et al., 2012) along a 1 km path in the city center of Paris. The MMS is equipped with direct georeferencing devices (GPS/INS/odometer), and its imaging system captures 16 images every 4 m: 12 combined in a panoramic (from which we only use the 8 horizontal views), plus rear-looking and forward-looking stereo rigs (cf. Fig. 23). This configuration ensures a high redundancy and minimizes the blind areas. Around 400 poses are considered for this study, resulting in 400 × 12 = 4800 images, for which the intrinsic parameters are determined by a calibration method and the extrinsic orientations are obtained by the direct georeferencing system. A 3D ground truth is needed in order to precisely evaluate the accuracy of our 3D reconstruction, but building one would require considerable time and expensive surveying measurements on road signs. A much cheaper alternative consists in simulating an acquisition with a known geometry. We built this synthetic dataset by creating a set of 66 3D road signs of various types and sizes. These signs were embedded in a 3D model containing 3D buildings and a Digital Terrain Model (DTM) textured with aerial images. Using the method presented by Vallet and Houzay (2011), a virtual camera with given intrinsic and extrinsic parameters can be defined to capture virtual images of the scene for given poses (cf. Fig. 24). The simulation was run using the same 400 poses as in the real acquisition, resulting in 400 × 12 synthetic images with known poses. Such a simulated image can be seen in Fig. 25b.
Fig. 24. Simulated 3D road signs embedded in a 3D city model. Camera parameters can be defined to capture virtual images for given poses (displayed in white).
7.3.2. Road sign reconstruction results and evaluations Our full reconstruction pipeline produced a result set of 24 3D road signs from the real data, and 63 road signs were reconstructed from the synthetic data. 54 out of 63 correspond to correct 3D road signs. As we aim at evaluating the matching process and not the 2D detection that is already evaluated in Section 6, we will only consider the signs that are detected in at least three 2D images. Signs seen in less than 3 images are discarded in the matching step (cf. Section 7.2). For real (resp. synthetic) signs, we found 2 (resp. 9) false positives, and 4 (resp. 1) false negatives: False positives: they correspond to multiple responses occuring when geometric imprecision is higher than the clustering threshold, such that hypothesis corresponding to the same 3D real sign are not clustered in a single hypothesis (cf. Fig. 26a). This imprecision is mostly cause by the 2D detection step that may locate the sign borders inaccurately. In the real case, inaccuracy of the geopositionning add to this imprecision. False negatives: 11 missing signs in the synthetic case were not considered false negatives as they were detected in less than 3 single images. The only false negative come from false match which is caused by geometric ambiguity (cf. Fig. 26b). Geometric ambiguities occur when 3D cones corresponding to two different signs intersect, which often happens when two or more 3D road signs of same category are close. Lowering the clustering threshold helps reducing the number of false matches, but does not solve all of them as they can be undistinguishable from the global inaccuracy. Texture of road signs can help disambiguate such tricky cases if the signs are of different types. We however did not take into account type identification in order be more robust to possible imperfections of the 2D road sign detection step. Conversely, the residuals of the reconstruction process can also serve to solve such ambiguities: in the case of Fig. 26b, the mean residual of hypothesis C is higher than that of hypothesis A.
Fig. 25. (a) An example of an image used for 3D road sign reconstruction. (b) Reconstructed road signs seen from the same point of view. (c) 3D rays involved in the 3D reconstruction.
In conclusion, the clustering threshold is the pertinent parameter for tuning the detection. It offers a tradeoff between false positive and false negative rates: increasing the clustering threshold reduces the false negative rate but causes more false matches, leading to more false positives. For the real dataset, the precision of the reconstruction can be evaluated qualitatively by visually comparing real images with renderings of the reconstructed 3D signs from the same point of view. This inspection reveals a very high precision, as seen in Fig. 25. A first quantification of this precision can be obtained by looking at the image residuals of the constrained optimization. They reach a mean value of 5.05 pixels (cf. Fig. 27b for the residual histogram). If the constraints are not enforced, the residuals reach a mean value of 4.75 pixels (cf. Fig. 27a). This means that the constraints have little influence on the image back-projection error, or in other words that our model is compatible with the measurements. In both cases, however, the residual values seem quite high compared to the 2D shape estimation errors, which are 1–2 pixels. We suspect that this is caused by inaccuracies in the extrinsic parameters of the cameras obtained from the direct georeferencing system. Nevertheless, the introduction of constraints in the optimization should lead to a finer reconstruction, as they convey pertinent geometric information on the observed signs. We can validate this idea on the synthetic data, for which the ground truth exists and can be employed to precisely measure the accuracy in both 2D and 3D. Fig. 29 shows the histograms of the 2D image residuals and of the 3D planimetric, altitude and orientation errors for the constrained and unconstrained reconstructions. Naturally, the image residuals are slightly higher for the constrained solution than for the unconstrained one (1.8 pixels vs. 1.77 pixels).
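As an illustration of how such image residuals can be computed, the sketch below reprojects reconstructed 3D sign corners into one view and averages the pixel distances to the detected 2D corners. The helper is hypothetical and assumes the pinhole model of the earlier sketch; it is not the constrained optimizer itself.

```python
# Mean reprojection error (pixels) of one reconstructed sign in one view.
import numpy as np

def image_residual(K, R, t, corners_3d, corners_2d):
    """corners_3d: list of 3D sign corners (world frame);
    corners_2d: matching detected 2D corners (pixels)."""
    errs = []
    for X, x_obs in zip(corners_3d, corners_2d):
        x_cam = K @ (R @ X + t)          # project into the view
        x_proj = x_cam[:2] / x_cam[2]    # perspective division
        errs.append(np.linalg.norm(x_proj - x_obs))
    return float(np.mean(errs))
```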
Fig. 26. Main causes of detection errors.
The mean residual value for the synthetic data (1.8 pixels) supports our hypothesis that the higher residual for the real data (5.05 pixels) is due to camera calibration errors. Concerning the 3D errors, the constrained solution reached a planimetric mean error of 36 mm, whereas the unconstrained one reached 48 mm. The improvement is more substantial for the orientation error: the constrained solution reached a mean error of 4.55°, whereas the unconstrained solution reached 9.12°. Fig. 28 depicts the contribution of the constraints to the correct estimation of a road sign's plane. The height component error is quite small in both cases (10 mm), which can be explained by the fact that the height component is nearly orthogonal to the mean epipolar axis.
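These 3D error measures can be made explicit with a minimal sketch, assuming each reconstructed and reference sign is summarized by a center point and a plane normal (hypothetical helper, not the exact evaluation code of Fig. 29):

```python
# Planimetric, altitude and orientation errors between a reconstructed
# sign and its ground-truth reference.
import numpy as np

def sign_errors(center_rec, normal_rec, center_ref, normal_ref):
    d = center_rec - center_ref
    planimetric = float(np.hypot(d[0], d[1]))  # horizontal offset (m)
    altitude = float(abs(d[2]))                # vertical offset (m)
    # Angle between plane normals, sign-insensitive.
    cos_a = abs(np.dot(normal_rec, normal_ref) /
                (np.linalg.norm(normal_rec) * np.linalg.norm(normal_ref)))
    orientation = float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))
    return planimetric, altitude, orientation
```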
8. Conclusions and perspectives
The main contribution of this paper is a pipeline for road sign detection and estimation in images enabling topographic 3D reconstruction in complex urban areas. Thanks to geometric shape estimation, the method is robust to perspective distortions for both detection and identification of road signs in individual images. The evaluation on a large set of complex reference data revealed its stability and an acceptable detection and identification rate.
Fig. 27. Histograms of image residuals: (a) without geometric constraints, mean residual = 4.75 pixels; (b) with geometric constraints, mean residual = 5.05 pixels.
Fig. 28. Orientation error with and without taking the constraints into account.
A comparison of our method with two recent methods from the state of the art (Belaroussi et al., 2010) showed competitive results in terms of detection rate while providing the basis required for accurate 3D reconstruction. Note that we did not optimize computation time in our implementations, since it is not critical in mapping applications. Most of the computation time goes into the shape estimation and recognition steps, which could easily be accelerated through parallel computing (see the sketch below).
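A sketch of this parallelization follows; the function names are hypothetical, and the point is only that the per-ROI processing is independent and therefore maps naturally onto a process pool.

```python
# Each ROI is processed independently (shape fitting, then template
# matching), so a process pool gives a near-linear speedup on the
# dominant steps of the pipeline.
from multiprocessing import Pool

def estimate_and_recognize(roi):
    # Hypothetical per-ROI work standing in for the shape estimation
    # and recognition steps described earlier; stub result.
    return {"roi": roi, "type": None}

def process_rois(rois, workers=8):
    with Pool(workers) as pool:
        return pool.map(estimate_and_recognize, rois)
```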
The application of the detection pipeline to topographic 3D reconstruction is also demonstrated on real and synthetic images. The integration of geometric constraints related to road sign shapes in a multi-view paradigm enables the reconstruction to reach centimetric accuracy. To the best of our knowledge, this is the first system reaching such an accuracy for automatic 3D reconstruction and modeling of road signs. The matching part of the reconstruction algorithm, which generates 3D hypotheses by clustering 2D detected road signs, could be improved by formulating the matching problem as a maximum clique search in a correspondence graph (a sketch of this idea is given below). The presented approach can easily be implemented in an online mode: road signs can be detected in individual images and injected progressively into the reconstruction process. Implausible 3D sign candidates situated too low or too high relative to the road can be discarded; these candidates correspond most of the time to other objects, such as «O» letters on shop name plates (cf. Fig. 14). The remaining reconstructed road signs can be fed back to the detection step in order to define ROIs (Regions of Interest) for detecting the same road signs in other images. The shape detection and identification would then be run only on the provided ROIs.
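A minimal sketch of the maximum clique formulation mentioned above, assuming a hypothetical pairwise test `compatible` that combines the epipolar and category criteria of Section 7.2; this is an illustration of the idea, not an implemented part of our pipeline.

```python
# Correspondence graph: nodes are 2D detections across views, edges link
# geometrically compatible pairs; maximal cliques are candidate
# multi-view matches, each corresponding to one 3D sign.
import networkx as nx

def match_by_cliques(detections, compatible):
    g = nx.Graph()
    g.add_nodes_from(range(len(detections)))
    for i in range(len(detections)):
        for j in range(i + 1, len(detections)):
            if compatible(detections[i], detections[j]):
                g.add_edge(i, j)
    # Enumerate maximal cliques, largest first. A full solution would
    # also resolve overlaps between cliques, omitted here for brevity.
    return sorted(nx.find_cliques(g), key=len, reverse=True)
```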
Fig. 29. Geometric evaluation: difference between reconstructed and reference road signs.
When no road sign is detected in such an ROI, a tag can be generated so that an operator is directed only to the plausible regions. The entire process would thus be accelerated and the completeness of the road sign detection improved. Finally, coupling the detection and reconstruction steps enables the use of image texture in the matching step and contributes to the robustness of both the 3D reconstruction and the identification of road signs.
Circular traffic signs are often the most frequent signs on roads. In order to achieve complete road sign database generation, our system should also deal with this category of traffic signs. Our road sign detection algorithm detects and estimates the 2D ellipses corresponding to sign borders in every individual image. In contrast to polygonal road signs, circular signs cannot be modeled by individual points, and the whole ellipse must be matched in order to reconstruct the corresponding circle. Such a conic matching and reconstruction method is investigated by Quan (1996), but it is limited to the stereo case, and the integration of geometric constraints and the extension of the method to a multi-view context are far from trivial. This problem constitutes an interesting research direction from a theoretical point of view.
Acknowledgements
The authors gratefully acknowledge the contribution of Philippe Nicolle, Philippe Foucher and Rachid Belaroussi in building the reference data. Our grateful thanks also go to Mathieu Brédif for interesting discussions and help. This work was financed by the ANR (French National Research Agency) through the CityVIP project (http://projet_cityvip.byethost33.com, accessed 29.10.12).
References
Aoyagi, Y., Asakura, T., 1996. Detection and recognition of traffic sign in scene image using genetic algorithms and neural networks. In: Proc. Society of Instrument and Control Engineers Annual Conference, 24–26 July 1996, SICE, Tottori, Japan, pp. 1343–1348.
Belaroussi, R., Tarel, J.P., 2009a. Angle vertex and bisector geometric model for triangular road sign detection. In: Proc. Workshop on Applications of Computer Vision, 7–8 December 2009, IEEE, Snowbird, Utah, USA, pp. 1–7.
Belaroussi, R., Tarel, J.P., 2009b. A real-time road sign detection using bilateral chinese transform. In: Proc. International Symposium on Visual Computing, 30 November–2 December 2009, IEEE, Las Vegas, Nevada, USA, pp. 1161–1170.
Belaroussi, R., Foucher, P., Tarel, J., Soheilian, B., Charbonnier, P., Paparoditis, N., 2010. Road sign detection in images: a case study. In: Proc. International Conference on Pattern Recognition, 23–26 August 2010, IAPR, Istanbul, Turkey, pp. 484–488.
Charmette, B., Royer, E., Chausse, F., 2009. Matching planar features for robot localization. In: Proc. International Symposium on Advances in Visual Computing, 30 November–2 December 2009, IEEE, Las Vegas, Nevada, USA, pp. 201–210.
De La Escalera, A., Moreno, L., Salichs, M., Armingol, J., 1997. Road traffic sign detection and classification. IEEE Transactions on Industrial Electronics 44 (6), 848–859.
De la Escalera, A., Armingol, J., Pastor, J., Rodriguez, F., 2004. Visual sign information extraction and identification by deformable models for intelligent vehicles. IEEE Transactions on Intelligent Transportation Systems 5 (2), 57–68.
Deriche, R., 1987. Using Canny's criteria to derive a recursively implemented optimal edge detector. International Journal of Computer Vision 1 (2), 167–187.
Devernay, F., 1995. A Non-Maxima Suppression Method for Edge Detection with Sub-Pixel Accuracy. Tech. Rep. RR-2724, INRIA (accessed 26.10.12).
Fang, C.-Y., Chen, S.-W., Fuh, C.-S., 2003. Road-sign detection and tracking. IEEE Transactions on Vehicular Technology 52 (5), 1329–1341.
Farag, A., Abdel-Hakim, A., 2004. Detection, categorization and recognition of road signs for autonomous navigation. In: Proc. Advanced Concepts in Intelligent Vision Systems, 31 August–3 September 2004, Brussels, Belgium, pp. 125–130.
Fischler, M.A., Bolles, R.C., 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), 381–395.
Fleyeh, H., 2008. Traffic and Road Sign Recognition. Ph.D. Thesis, Napier University, Edinburgh, Scotland.
Foucher, P., Charbonnier, P., Kebbous, H., 2009. Evaluation of a road sign predetection system by image analysis. In: Proc. International Conference on
Computer Vision Theory and Applications, 5–8 February 2009, Lisbon, Portugal, pp. 362–367.
Fu, M., Huang, Y., 2010. A survey of traffic sign recognition. In: Proc. International Conference on Wavelet Analysis and Pattern Recognition, 11–14 July 2010, IEEE, Qingdao, China, pp. 119–124.
Gao, X., Podladchikova, L., Shaposhnikov, D., Hong, K., Shevtsova, N., 2006. Recognition of traffic signs based on their colour and shape features extracted using human vision models. Journal of Visual Communication and Image Representation 17 (4), 675–685.
Gao, X., Hong, K., Passmore, P., Podladchikova, L., Shaposhnikov, D., 2008. Colour vision model-based approach for segmentation of traffic signs. EURASIP Journal on Image and Video Processing, 7 (accessed 25.10.12).
Garcia-Garrido, M.A., Sotelo, M.A., Martín-Gorostiza, E., 2006. Fast traffic sign detection and recognition under changing lighting conditions. In: Proc. Conference on Intelligent Transportation Systems, 17–20 September 2006, IEEE, Toronto, Canada, pp. 811–816.
Gil-Jimenez, P., Lafuente-Arroyo, S., Gomez-Moreno, H., Lopez-Ferreras, F., Maldonado-Bascon, S., 2005. Traffic sign shape classification evaluation II: FFT applied to the signature of blobs. In: Proc. Intelligent Vehicles Symposium, 6–8 June 2005, IEEE, Las Vegas, Nevada, USA, pp. 607–612.
Habib, A.F., Uebbing, R., Novak, K., 1999. Automatic extraction of road signs from terrestrial color imagery. Photogrammetric Engineering and Remote Sensing 65 (5), 597–601.
Hsu, S.-H., Huang, C.-L., 2001. Road sign detection and recognition using matching pursuit method. Image and Vision Computing 19 (3), 119–129.
Ibanez, L., Schroeder, W., Ng, L., Cates, J., 2003. The ITK Software Guide. Kitware, Inc., ISBN 1-930934-10-6 (accessed 25.10.12).
Ishizuka, Y., Hirai, Y., 2004. Segmentation of road sign symbols using opponent-color filters. In: Proc. Intelligent Transportation Systems World Congress, 18–21 October 2004, Nagoya, Japan, 8p. (CDROM).
Lafuente-Arroyo, S., Gil-Jimenez, P., Maldonado-Bascon, R., Lopez-Ferreras, F., Maldonado-Bascon, S., 2005. Traffic sign shape classification evaluation I: SVM using distance to borders. In: Proc. Intelligent Vehicles Symposium, 6–8 June 2005, IEEE, Las Vegas, Nevada, USA, pp. 557–562.
Lafuente-Arroyo, S., Maldonado-Bascon, S., Gil-Jimenez, P., Acevedo-Rodriguez, J., Lopez-Sastre, R.J., 2007. A tracking system for automated inventory of road signs. In: Proc. Intelligent Vehicles Symposium, 13–15 June 2007, IEEE, Istanbul, Turkey, pp. 166–171.
Li, Y., Pankanti, S., Guan, W., 2010. Real-time traffic sign detection: an evaluation study. In: Proc. International Conference on Pattern Recognition, 23–26 August 2010, IAPR, Istanbul, Turkey, pp. 3033–3036.
Meilland, M., Comport, A.I., Rives, P., 2010. A spherical robot-centered representation for urban navigation. In: Proc. International Conference on Intelligent Robots Systems, 18–22 October 2010, IEEE/RSJ, Taipei, Taiwan, pp. 5196–5201.
Meuter, M., Kummert, A., Muller-Schneiders, S., 2008. 3D traffic sign tracking using a particle filter. In: Proc. 11th International Conference on Intelligent Transportation Systems, 12–15 October 2008, IEEE, Beijing, China, pp. 168–173.
Ministère de l'Écologie, de l'Énergie, du Développement durable et de l'Aménagement du territoire, 2008. Instruction interministérielle sur la signalisation routière – Version consolidée – Deuxième – Cinquième parties. Paris, France.
Paparoditis, N., Papelard, J.P., Cannelle, B., Devaux, A., Soheilian, B., David, N., Houzay, E., 2012. Stereopolis II: a multi-purpose and multi-sensor 3D mobile mapping system for street visualisation and 3D metrology. Revue Française de Photogrammétrie et de Télédétection 200, 69–79.
Piccioli, G., Micheli, E.D., Campani, M., 1994. A robust method for road sign detection and recognition. In: Proc. European Conference on Computer Vision, 2–6 May 1994, IEEE, Stockholm, Sweden, pp. 493–500.
Piccioli, G., Micheli, E.D., Parodi, P., Campani, M., 1996. Robust method for road sign detection and recognition. Image and Vision Computing 14 (3), 209–223.
Priese, L., Lakmann, R., Rehrmann, V., 1995. Ideogram identification in a real-time traffic sign recognition system. In: Proc.
Intelligent Vehicles Symposium, 25–26 September 1995, IEEE, Detroit, Michigan, USA, pp. 310–314.
Prieto, M.S., Allen, A.R., 2009. Using self-organising maps in the detection and recognition of road signs. Image and Vision Computing 27 (6), 673–683.
Quan, L., 1996. Conic reconstruction and correspondence from two views. IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (2), 151–160.
Reina, A.V., Sastre, R.J.L., Arroyo, S.L., Jiménez, P.G., 2006. Adaptive traffic road sign panels text extraction. In: Proc. International Conference on Signal Processing, Robotics and Automation, 15–17 February 2006, WSEAS, Madrid, Spain, pp. 295–300.
Rosin, P., 2003. Measuring shape: ellipticity, rectangularity, and triangularity. Machine Vision and Applications 14 (3), 172–184.
Ruta, A., Li, Y., Liu, X., 2010. Real-time traffic sign recognition from video by class-specific discriminative features. Pattern Recognition 43 (1), 416–430.
Soheilian, B., Paparoditis, N., Boldo, D., 2010. 3D road marking reconstruction from street-level calibrated stereo pairs. ISPRS Journal of Photogrammetry and Remote Sensing 65 (4), 347–359.
Song, G., Wang, H., 2007. A fast and robust ellipse detection algorithm based on pseudo-random sample consensus. Computer Analysis of Images and Patterns 4673, 669–676.
Stallkamp, J., Schlipsing, M., Salmen, J., Igel, C., 2012. Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Networks 32, 323–332.
Timofte, R., Zimmermann, K., Van Gool, L., 2009. Multi-view traffic sign detection, recognition, and 3D localisation. In: Proc. Workshop on Applications of Computer Vision, 7–8 December 2009, IEEE, Snowbird, Utah, USA, pp. 1–8.
Vallet, B., Houzay, E., 2011. Fast and accurate visibility computation in urban scenes. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences 38 (Part 3/W22), 77–82.
Wang, K.C., Hou, Z., Gong, W., 2010. Automated road sign inventory system based on stereo vision and tracking. Computer-Aided Civil and Infrastructure Engineering 25 (6), 468–477.
Zhang, S., Liu, Z., 2005. A robust, real-time ellipse detector. Pattern Recognition 38 (2), 273–287.
Zheng, Y.-J., Ritter, W., Janssen, R., 1994. An adaptive system for traffic sign recognition. In: Proc. Intelligent Vehicles Symposium, 24–26 October 1994, IEEE, Paris, France, pp. 165–170.
Zhu, S., Liu, L., 2006. Traffic sign recognition based on color standardization. In: Proc. International Conference on Information Acquisition, 20–23 August 2006, IEEE, Shandong, China, pp. 951–955.