Context-based automatic reconstruction and texturing of 3D urban terrain for quick-response tasks

Dimitri Bulatov *, Gisela Häufel, Jochen Meidow, Melanie Pohl, Peter Solbrig, Peter Wernerus

Fraunhofer Institute of Optronics, System Technologies and Image Exploitation, Gutleuthausstr. 1, 76275 Ettlingen, Germany

Article history: Available online xxxx

Keywords: Building detection; Building reconstruction; Pose estimation; Simulation; Texturing; Urban terrain

Abstract

Highly detailed 3D urban terrain models are the basis for quick-response tasks with indispensable human participation, e.g., disaster management. Thus, it is important to automate and accelerate the process of urban terrain modeling from sensor data such that the resulting 3D model is semantic, compact, recognizable, and easily usable for training and simulation purposes. To provide essential geometric attributes, buildings and trees must be identified among the elevated objects in digital surface models. After building ground-plan estimation and roof detail analysis, images from oblique airborne imagery are used to cover the building faces with up-to-date texture, thus achieving a better recognizability of the model. The three steps of the texturing procedure are sensor pose estimation, assessment of the polygons projected into the images, and texture synthesis. Free geographic data, providing additional information about streets, forest areas, and other topographic object types, suppress false alarms and enrich the reconstruction results.

© 2014 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.

1. Introduction

1.1. Motivation

The importance of accurate and realistic modeling of urban terrain has been demonstrated in various applications such as automatic navigation, urban planning, disaster management, and security-related rapid-response tasks. More than the others, the latter two applications are time-critical. A precise plan requires up-to-date knowledge about the terrain, including its recent and perhaps sudden changes. Though, from a certain moment on, field training becomes indispensable for such time-critical applications, the advantages of simulation are numerous and evident: In many cases, it is relatively easy to tailor the simulated environment to a concrete situation, to manipulate weather and time-of-day conditions, and to experiment with new, expensive sensors and equipment, which, contrary to the real world, do not need to be prepared, transported, and maintained. Therefore, simulation is continuously gaining popularity as a training tool for decision-taking within a short time and under enormous pressure. From the point of view of the abstraction of a simulated environment, it can be stated that the smaller the training unit is and the lower in the chain of command it is, the less abstract the simulated training environment usually has to be.

* Corresponding author. Tel.: +49 7243992136. E-mail address: [email protected] (D. Bulatov).

This allows training small unit leaders under realistic conditions like stress and unclear situations. Commercial simulators give a good example: Strategy games often make use of more abstract 2D maps, whereas flight simulators show detailed and realistic virtual 3D environments the aircraft crew can interact with. Our system of choice is Virtual Battlespace 2 (VBS2).1 Although chiefly a training simulator for military applications, VBS2 can also be used to demonstrate the characteristics of automatically instantiated 3D models and provides convenient access to our results, including for evaluation purposes (see Fig. 1). As reported in Bulatov et al. (2012a), VBS2 offers relatively user-friendly and clearly structured tools, which allow importing not only imagery and free geographic data (shapefiles) but also reconstruction results such as elevation maps and building models.

However, time-critical applications and their simulation also put certain requirements on the 3D models and the underlying algorithms. First of all, context information is essential for a plausible interaction with the environment, even though models may become less accurate than 2D manifolds in space without context information. Expert users who want to find their way in an unknown urban terrain will rate the simulation as the more convincing the more information of the scene it includes, like small details of buildings, texture of walls and roofs, and information about vegetation and streets.

1 Virtual Battle Space 2 by Bohemia Interactive, http://products.bisimulations.com/products/vbs2, May 2013.


For example, in order to make vehicles move faster on streets than in other areas of the terrain, or to make some trees lose their leaves in winter while others do not, it is necessary to identify and to model the relevant topographic object types "streets" and "vegetation". The same argument holds for buildings: To customize collision geometry and damage models, it is important to identify buildings and their parts. Context information is important both within a simulation process and for a faster update of the terrain if changes have happened. For example, if, after certain construction work, a wall has been removed, a semantic model can be updated by deleting a couple of lines in a text document, whereas a model without semantics would probably require mesh-processing software.

Closely related to the previous point is the level of compression of a model. Effort must be made to generalize structures and to increase the level of abstraction. For example, in many applications it is sufficient to describe a building wall by a single polygon. A tree model at the approximate position of the trunk is more efficient, since it can be instantiated over and over again. It is also more visually appealing than fitting a shape around the data points describing this tree, which leads to a "tree-mushroom" with many small textured polygons. Finally, for time-critical applications, urban terrain models must be instantiated and integrated into the simulation environment within a reasonable time. This motivated us to develop a rather modular procedure and to avoid, for instance, graph-based methods that are computationally expensive and risk ending up in a local minimum.

Also, the evolution of the available data has to be considered. At the beginning of an operation, often only the availability of satellite/aerial images or airborne laser scans may be assumed, which leads to the assumption that the scene has a rather 2.5D structure and that the requirements on the sensor data must be kept to a minimum. As the operation advances, data of higher detail, taken from lower altitudes, is acquired. Thus, as soon as an additional source of information becomes available, the first thing we must be able to do is to enrich the available model by this information – without the need to recompute the model – and to make the simulated world more realistic and suitable for training and planning purposes. An inspiring publication about how to increase the level of detail of 3D models by sensor data fusion has been written by Haala (2005). In our paper, oblique UAV images are used to cover building walls with real textures, aiming for a better recognizability of the scene. The second possibility, not followed in this paper, is to update and to improve the geometry of the reconstruction. For example, it was shown in Bulatov et al. (2012b) how streets imported from free geographic data can be used to filter out spurious hypotheses for buildings caused by moving objects.

1.2. Previous work

In recent years, the advantages of 3D reconstruction from optical data, such as aerial images, including and especially images and videos captured by unmanned aerial vehicles, have become evident. Besides the relatively low cost and easier availability of optical images compared to data collected by traditional active sensors, there is also rapid progress in the quality of imaging sensors and sensor platforms.
In addition, there is a broad spectrum of algorithms that can process optical data, even if it is noisy, contains many outliers (Bulatov and Lavery, 2010), or covers a large urban area at a high resolution (Hirschmüller, 2008). There is a large number of excellent algorithms that do not presuppose a dense point cloud as input, but rely on edge matching in images (Baillard and Zisserman, 2000), color segmentation (Henricsson et al., 1996), or hierarchical assessment (Fischer et al., 1998). However, these methods are less suitable for building detection in large scenes than for model instantiation of a few single buildings.

Fig. 1. A view of a scene rendered by the VBS2 simulation environment.

Most state-of-the-art approaches for context-based modeling of buildings therefore rely on dense point clouds, see (Elberink and Vosselman, 2009; Sohn et al., 2012; Huang et al., 2013; Lafarge et al., 2013; Lafarge and Mallet, 2012; Brenner, 2000; Gross et al., 2005; Rottensteiner, 2006), preferably resampled into an elevation map. This is the main input of a typical 2.5D approach: The vertical direction is the dominant one and the elevation is interpreted as a function of longitude and latitude. If optical images are given, such point clouds and elevation maps can be computed from depth maps obtained from multi-view configurations (Bulatov et al., 2011a; Haala and Rothermel, 2012).

The authors of the related contributions mostly agree that, because of the huge variety of roof types, the roof detail analysis is the most challenging and complicated step. Their approaches can be classified into model-driven and data-driven approaches. The model-driven approaches make use of libraries of models into which the input data should be fitted. The key idea of the approaches of Elberink and Vosselman (2009) and Verma et al. (2006), and also of Brenner (2000), is the matching of noisy graphs. The vertices of the graph are roof segments and the weights of the edges represent relations between neighboring roofs (step edges, dormer edges, concave edges, orthogonal edges, etc.). While Verma et al. (2006) emphasize the subdivision of complex buildings and introduce symmetry relations as soft constraints for fitting the data to a model, one important contribution of Elberink and Vosselman (2009) is the handling of uncovered details. The number of uncovered details depends on the input data, the result of the graph matching algorithm, and the variety of objects in the library. For example, if only planar roofs are contained in the library, modeling of non-planar roofs is impossible. Brenner (2000) reports shortcomings for the class of buildings with flat roofs. Other difficulties of model-driven approaches are the initialization (Huang et al., 2013) and the requirement for additional input, such as accurate building footprints (Brenner, 2000).

In contrast, the data-driven approaches exploit the data in combination with reasonable priors in order to extract certain model properties. For man-made objects such as buildings, examples of these priors are parallelism, orthogonality, and symmetry. Rottensteiner (2006) introduces soft constraints for regularizing building shapes after the computation of dominant planes, while Sohn et al. (2012) evaluate a cost function that, among other things, takes into account the model complexity for several hypotheses for the roof structure. Soft constraints often cannot completely correct the inconsistencies, while hard constraints can lead to linear dependencies and, due to data noise, even to self-contradicting conditions. Also, it is not always clear how to choose the weights for these constraints.


The approaches discussed above are suitable for reconstructing several single, albeit complicated, buildings. However, what we need for quick-response applications is a simple procedure that goes all the way from the raw data to the urban terrain model. As mentioned above, this model should not be too complicated and – in contrast to the previously mentioned approaches – should not only contain buildings but also other kinds of objects, like trees and ground. Two such procedures are given in Lafarge and Mallet (2012) and Gross et al. (2005). Lafarge and Mallet (2012) work with very dense and accurate 3D point clouds and make twofold use of graph cut techniques on Markov Random Fields: first, to classify the terrain into ground, buildings, vegetation, and outliers (clutter), and second, to group the points into dominant roof primitives, which are not necessarily planes but can also be spherical and conical surfaces. The results depend strongly on the accuracy of the points, because in point clouds obtained by photogrammetric methods, a reliable computation of the normal vector is hardly possible in small 5 × 5 neighborhoods unless the terrain is considerably smoothed. Another characteristic of the approach of Lafarge and Mallet (2012) is that they do not consider the superposition of roof segments. That is, neighboring roof segments "do not know" about each other and can overlap. In contrast, Gross et al. (2005) consider pairs of roofs and align the roof polygons with the intersection lines of segments corresponding to pairs of roof planes. Flat roofs are then handled as different buildings, since jumps of heights are used to subdivide buildings. However, this kind of subdivision may fail in data sets with many outliers. Both Lafarge and Mallet (2012) and Gross et al. (2005) use the same sequence of steps for building detection, and, after the roof detail reconstruction step, buildings, trees, and ground can be modeled in an appropriate way.

1.3. Contributions

The main contribution of this paper is a bottom-up algorithm that automatically processes large amounts of real data recorded within a restricted time interval – for example, during UAV missions – with the goal of using the data to produce realistic 3D models of urban terrain. The steps of the algorithm were implemented in such a way that the shortcomings of one step are taken care of by the following step. The resulting models contain only a few large polygons with a small number of vertices and can be exported into an interactive simulation environment. Additional information, e.g., free geographic data, can enrich the level of detail of the simulation. In this paper, we describe the following improvements over our first algorithm (Gross et al., 2005):

• The computation of the Digital Terrain Model (DTM) has been improved for data sets with many outliers, see Section 2.1.
• The vegetation areas are further classified into the subclasses grass, isolated trees, and larger forest regions. These subclasses are processed and modeled in an appropriate way, as described in Section 2.2.
• In addition to the subdivision (labeling) of buildings along step-lines, a new method based on Random Walks has been used for the subdivision of complex buildings that are connected by narrow structures, see Section 2.3.
• In the computation of the building boundaries in Section 3.1, an assessment of rectangularity is carried out. Furthermore, the contours are improved by considering segmentation results.
• For the dominant plane estimation in Section 3.2, a global procedure based on J-Linkage has turned out to be a powerful tool that is superior to the widely used RANSAC-based procedure and to local methods.
• The most important new contribution is a texturing procedure that is simple, fast, and fully automatic as long as certain assumptions hold. Building polygons receive realistic and up-to-date textures derived from video material. The three steps of this process, i.e., pose estimation, assessment of images, and texture synthesis, are explained in Section 4.


With respect to the texturing step, it is important to emphasize that both the geometry and the visual appearance of the facades can be improved in certain configurations by means of oblique images and videos gathered by miniaturized UAVs (Meixner et al., 2011). However, since it is risky to rely on this data for dense reconstruction – due to the narrow baseline, the high depth range, and the abundance of slanted surfaces and occlusions – these videos are used in our work only for texturing.

1.4. Organization

The paper is structured as follows: In Sections 2 and 3, we describe our procedures for the detection of buildings in large scenes and for the reconstruction of a single building, respectively. One main focus is on the texturing step, which presupposes the integration of images and video frames obtained from the ground and from UAVs into the model; a separate Section 4 is dedicated to this texturing procedure. Section 5 presents quantitative and qualitative results for the reconstruction procedure, while Section 6 summarizes our work and outlines ideas for future research.

2. Data preparation and building detection

The input for our building detection procedure is an elevation map or, if geo-referenced, a Digital Surface Model (DSM) that either results from an Airborne Laser Scan (ALS) or is sampled from a dense point cloud resulting from one or several depth maps. If the data is not geo-referenced, an interactive transformation into a metric coordinate system must be performed by rescaling and rotating the point cloud, forcing the z-axis to coincide with the physical vertical direction. In order to identify buildings, the difference between the DSM and the DTM must be computed. We denote this difference by nDSM (normalized DSM) and assume that the larger elevated regions of the nDSM correspond either to buildings or to vegetation. As shown in Fig. 2, left, the process of building detection thus consists of three main steps: DSM and DTM computation (Section 2.1), discarding vegetation (Section 2.2), and filtering out outliers followed by labeling building regions (Section 2.3). The process of modeling vegetation is visualized in Fig. 2, right, and outlined in Section 2.2.

2.1. Computation of DSM and DTM

As noted above, the 3D point cloud is rasterized into a 2D grid D by choosing a unique elevation value for each cell. Hence, the elevation map can be treated as an image and its cells as pixels. Pixels corresponding to walls will contain several points, so for 2.5D applications, an appropriate elevation value, e.g., the median, should be chosen. Pixels that lie in occluded regions are not assigned any elevation values. To fill these pixels with meaningful values, Gross et al. (2005) propose solving a modified Neumann partial differential equation (PDE) by state-of-the-art algorithms for sparse matrices. The modification of the PDE over conventional procedures concerns a border condition: Boundary pixels with undefined values are permitted. In case a high-resolution, almost-nadir image is available, as for the data set Vaihingen used in Section 5, the elevation computed from the corresponding depth map can already be considered a surrogate for a DSM, since it has an analogous structure, namely rows and columns in pixels and elevation in meters. The advantage is that there is no need for additional work to solve a PDE, which is rather unstable in the presence of outliers, and to resample the image(s). However, the results of building outlining are slightly less accurate.
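As an illustration, the per-cell median selection can be sketched as follows (a minimal Python sketch with hypothetical names; the PDE-based filling of occluded cells is omitted):

```python
import numpy as np

def rasterize_point_cloud(points, cell_size):
    """Rasterize a 3D point cloud (N x 3 array with columns x, y, z)
    into a 2D elevation grid D, taking the median z-value per cell.
    Empty (occluded) cells stay NaN and must be filled afterwards,
    e.g., by the PDE-based method of Gross et al. (2005)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    cols = ((x - x.min()) / cell_size).astype(int)
    rows = ((y.max() - y) / cell_size).astype(int)   # row 0 = top of the map
    width = cols.max() + 1
    grid = np.full((rows.max() + 1, width), np.nan)
    order = np.lexsort((cols, rows))                 # sort points cell by cell
    r, c, zz = rows[order], cols[order], z[order]
    cell_id = r * width + c
    starts = np.r_[0, np.flatnonzero(np.diff(cell_id)) + 1]
    for chunk, ri, ci in zip(np.split(zz, starts[1:]), r[starts], c[starts]):
        grid[ri, ci] = np.median(chunk)              # robust per-cell elevation
    return grid
```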


If the data is contaminated by outliers, the elevation of a pixel identified as an outlier is replaced by the median of the elevations of the neighboring pixels. Afterwards, the elevation map is smoothed by a Gaussian or median filter. While some small details usually disappear in this process, it allows obtaining building outlines in a more reliable manner.

In order to calculate the DTM, we first have to identify points that are part of the ground. Around each pixel of D, a square filter with a fixed side length w is considered, and the lowest point in the box is marked as a ground point. The filter size should have an order of magnitude that corresponds to the size of the largest elevated region, for example, the largest building. If w is too small, points of a building can be spuriously included in the list of ground points; if w is too large, smaller elevated regions of the DTM may get lost. Since this approach may have problems in densely built-up regions and also when there is a high percentage of outliers in the elevation map, the functional that approximates the ground points should be a robust one. We chose the 2.5D cubic spline surface computed by minimizing a functional as in Bulatov and Lavery (2010) in the L1-norm, because of the robustness of this norm against heavy-tailed, non-Gaussian noise and large percentages of outliers. The functional considers the ground points within a data term and the smoothness of the underlying surface. Its minimization is carried out by the primal-affine method formulated in Lavery (2004). In Fig. 3, we show the improvements of our current approach over the original approach of Gross et al. (2005) for the data set Bonnland, captured during a UAV flight. The outliers in the DSM and the large forest region in the bottom left part of the image are robustly handled by our method.
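The ground-point marking that precedes the spline fit can be sketched with a sliding minimum filter (the robust L1-spline fit itself is beyond a few lines and is only indicated in a comment):

```python
import numpy as np
from scipy import ndimage

def mark_ground_points(dsm, w):
    """Mark pixels whose elevation equals the minimum of the w x w
    window centered on them; w should roughly match the footprint of
    the largest elevated region (e.g., the largest building)."""
    local_min = ndimage.minimum_filter(dsm, size=w, mode="nearest")
    return dsm <= local_min + 1e-6   # boolean mask of ground candidates

# The marked points are then approximated by the robust 2.5D L1-spline
# surface (Bulatov and Lavery, 2010) to obtain the DTM; the nDSM used
# for building detection is simply the difference dsm - dtm.
```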

2.2. Removing and modeling vegetation

While there exist methods that allow separating vegetation from buildings in high-quality point clouds (Lafarge and Mallet, 2012), in noisy point clouds one should make use of additional information besides the DSM. For example, if the near infrared channel is available, thresholding the Normalized Difference Vegetation Index (NDVI) is equivalent to a test on a quotient of channels: r_nir / max(g, b) > α, where r_nir, g, and b are the near infrared, green, and blue channels, respectively, and α > 1 is a scalar. This measure works well in the majority of cases.

However, if only RGB images are available, one should be able to extract several isolated trees and use the RGB values of the pixels corresponding to these isolated trees as training data for the propagation of their color information over the whole domain D. To identify isolated trees, it has been proposed by Bulatov et al. (2011b) to assign to each elevated region R the so-called lineness measure:

    lm(R) = Σ_{l ∈ L(R)} length(l)² / area(R),    (1)

where L(R) is the set of lines lying entirely within a morphological dilation of R. Straight line segments in the orthophoto are determined by the method of Burns et al. (1986) and are projected into the DSM. Typically, regions with a low value of the lineness measure correspond to isolated groups of trees. The shortcomings of this approach are the dependence of the results on the seasonal appearance of trees and the potentially incorrect identification of buildings without many straight lines as trees. In order to obtain more stable results and to model larger tree regions, an alternative is to extract large forest areas from geographic data.2 We extract shapefiles3 for larger forest regions and integrate them into the reconstruction and into the simulation environment4 of Section 1.

We conclude this section with the modeling aspect of vegetation. For isolated trees, an approximate position of the trunk as well as the overall height and diameter can be extracted from the NDVI image, the elevation map, and the DTM. To identify isolated trees, we start with a height- and NDVI-value-thresholded binary image V in which the components are filtered by their area (between 0.5 m² and 10 m²) and eccentricity (below 0.8), so that they match a tree. All other components of V are either too small (noise) or are taken for a configuration of several trees. In the latter case, the position of a tree is determined at the highest point of the component. Then, pixels in a circular region of constant diameter around this point are left out of consideration, and the procedure is applied to the remaining pixels of the component. The procedure terminates when no pixels of interest are left in the component. Once position, altitude, and diameter of a tree have been extracted, a generic model of a tree is placed into the reconstruction result. Its visual aspect can be adjusted according to the seasonal appearance (Bulatov et al., 2011b). For those trees that stem from shapefiles, approximate tree positions, heights, and types are provided by random numbers that depend on the probabilities of tree types and the density of afforestation. Analogously, all non-elevated regions with a high NDVI value are modeled as grass.
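The iterative seeding of tree positions inside an oversized component might be sketched as follows (hypothetical names; the exclusion radius is given in pixels):

```python
import numpy as np

def seed_trees(component_mask, dsm, radius_px):
    """Iteratively pick tree positions inside one connected component
    that is too large or too elongated to be a single tree: take the
    highest remaining pixel, then blank out a circle around it."""
    remaining = component_mask.copy()
    rr, cc = np.indices(component_mask.shape)
    trees = []
    while remaining.any():
        z = np.where(remaining, dsm, -np.inf)
        r, c = np.unravel_index(np.argmax(z), z.shape)
        trees.append((r, c, dsm[r, c]))          # trunk position and height
        remaining &= (rr - r) ** 2 + (cc - c) ** 2 > radius_px ** 2
    return trees
```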

2 Available under http://download.geofabrik.de/europe/germany/.
3 ESRI Shapefile Technical Description, an ESRI White Paper, July 1998, http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf.
4 In an analogous way, shapefiles for streets allow mitigating the effects of larger cars that can be confused with buildings (Bulatov et al., 2012b). To export streets into simulation environments, the parameters road width, slope, and type must be specified.

Fig. 2. Overview of the procedure for building detection in large scenes. The option of using shapefiles from free geographic data to extract large areas of vegetation is not included in the figure in order to enhance readability.


Fig. 3. Comparison of two procedures for DTM extraction from noisy elevation maps (on the left). Middle: PDE solution computed from the ground points, as proposed in Gross et al. (2005). Right: L1-spline-based surface computed from the ground points.

2.3. Filtering and labeling of building hypotheses

According to our assumption, the larger elevated regions of the difference between the DSM and the DTM correspond either to buildings or to vegetation. After the suppression of vegetation, filtering of the remaining regions based on size, height, and eccentricity is necessary to eliminate false alarms caused by outliers. Another important issue with respect to building labeling is the subdivision of large and complex buildings. Though it is also possible to perform the subdivision after the building ground plan has been estimated, there are several reasons to perform this step at the raster level: first, the smaller number of points when considering single buildings; second, the better handling of topological inconsistencies; and, finally, the faster convergence at the stage of building contour computation. Another important reason is explained in Section 3.2: Simplification of the topology allows avoiding spurious hypotheses for roof plane extraction. However, in Section 3.2, we also outline a fast, recursive algorithm for subdividing a building along its diagonal once its ground-plan polygon has been estimated.

There are two ways to perform the subdivision of buildings. The first way, implemented in Gross et al. (2005), is to compute a planarity measure (Gross and Thönnessen, 2006) from smoothed gradient images and to divide regions along jumps of heights. The advantage of this method is that buildings with many flat roofs will ideally all be modeled separately; as a consequence, the risk of over-segmentation due to parallelism is low when clustering normal vectors for the extraction of dominant planes. But there are some disadvantages: The threshold parameter for jumps of heights, Δh, cannot be chosen too low for data with many outliers or fine details, since these would then be filtered out. But if Δh is too large, a building with many flat roofs is in danger of being modeled as a building with one shed roof. "Disruptive" structures on a roof, such as a spiral staircase, can additionally degrade the results. We therefore reserve this approach for a high value of Δh ≈ 2 m. On the one hand, we take additional care of planes with identical normal vectors during the roof detail analysis step; on the other hand, we apply an additional subdivision module to buildings with complex structures that are connected by narrow structures.

The subdivision of a building is performed by a strong morphological erosion followed by a new labeling of the terrain that yields, in general, more regions. Border pixels are lost during this procedure, and we strive to assign each lost pixel one of the new labels in a reasonable way. There are two intuitive ways to do so: to assign each pixel the label of the closest component, or to dilate the resulting image until all missing pixels are assigned to one component. Unfortunately, this often leads to wrong, ambiguous labels, because geometric neighborhoods do not always coincide with topological neighborhoods. To take topological properties into account as well, we focus on other methods. Fixing the center of gravity of every evolved label, clustering with one of the existing approaches (Hartigan and Wong, 1979; Martinetz and Schulten, 1991) can be carried out for all missing pixels. However, since these methods do not consider the elevation of the points to be assigned, the above-mentioned problem cannot be completely solved. A method that does consider the elevations is the Random Walks algorithm introduced in Grady (2006). As an additional advantage, it is very robust against noise in elevation maps and performs well in the case of noisy region boundaries. Among all considered methods, the Random Walks method achieved the best performance for topology preservation, although at the cost of an increased computing time.

As depicted in Fig. 4, the height-thresholded nDSM (without vegetation) is subjected to erosion. We define at least one pixel p_k of every eroded building part as a seed with label k; an additional pixel p_0 in the background gets label 0. The algorithm computes, for every unassigned pixel p, the probability of belonging to each label. For this computation, the image is regarded as a graph: The vertices of the graph are pixels and the edges are defined by the 4-neighborhood. A random walk, which can be modeled by solving a Dirichlet problem, starts in p and runs along the weighted edges of the graph, e.g., weighted by gradient norms of the DSM, before it ends in a seeded pixel p_k. The probability of a path is high if the weights of its edges are small, and vice versa. For every k, the probability of such a path is computed and the label with the highest probability is chosen.
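Grady's algorithm is available, for instance, in scikit-image; a small sketch under the assumption that the nDSM and the building mask are given as arrays (parameter values are illustrative):

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import random_walker

def subdivide_building(ndsm, building_mask, erosion_size=7):
    """Split one building region into parts: erode the mask, label the
    eroded parts as seeds, and let random walks on the elevation data
    assign every border pixel lost by the erosion to one of the parts."""
    eroded = ndimage.binary_erosion(
        building_mask, structure=np.ones((erosion_size, erosion_size)))
    seeds, n_parts = ndimage.label(eroded)   # part labels 1 .. n_parts
    seeds[~building_mask] = n_parts + 1      # explicit background seed
    # Pixels still labeled 0 (the eroded-away borders) are assigned by
    # the walker; beta steers how strongly elevation jumps block it.
    labels = random_walker(ndsm.astype(float), seeds, beta=130, mode="bf")
    labels[labels == n_parts + 1] = 0        # map background back to 0
    return labels
```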

3. Building reconstruction

The main result of the previous section is the labeled image in which the label of a pixel corresponds to the ground plan of the corresponding building. The two steps of the algorithm for building geometry reconstruction are building outlining and roof detail analysis. These steps are described in the following two subsections; we apply them to every labeled segment. The only input required for this section is the elevation map; however, using an orthophoto and a segmentation algorithm can further improve the results.

3.1. Building outlining

The building outlining step consists of four substeps, of which the first and second are optional. At the beginning, the results of a segmentation algorithm due to Wassenberg et al. (2009) can be used in order to enhance the height-thresholded building mask: All pixels of a segment will either belong to a building or not, depending on whether the majority of its pixels does so. For a pixel p:

    B̂(p) = 1,    if p ∈ S and #(S ∩ B) > γ₁ · #S,
    B̂(p) = 0,    if p ∈ S and #(S ∩ B) < (1 − γ₂) · #S,
    B̂(p) = B(p), otherwise,    (2)

where B(p) is true if the pixel belongs to the building and false otherwise, S is the segment label, and γ₁, γ₂ ≈ 0.8 is a pair of scalars. The effects of using segmentation results are shown in Fig. 5, top. In the image on the right, those pixels which belong to the class "building" after taking the segments into account are shown in yellow (that is, B̂(p) = 1 and B(p) = 0), and those which fall out of this class are shown in cyan (that is, B̂(p) = 0 and B(p) = 1). By the blue and red segments, we denote pixels for which B̂(p) = B(p) = 0 and B̂(p) = B(p) = 1, respectively.
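Eq. (2) translates almost literally into code; in the following sketch, building_mask is a boolean image B and segments an integer-labeled segmentation (names are ours):

```python
import numpy as np

def refine_building_mask(building_mask, segments, gamma1=0.8, gamma2=0.8):
    """Majority vote per segment (Eq. 2): a segment is flipped entirely
    to 'building' if more than gamma1 of its pixels already are building
    pixels, and entirely to 'non-building' if fewer than 1 - gamma2 are."""
    refined = building_mask.copy()
    for s in np.unique(segments):
        in_seg = segments == s
        frac = building_mask[in_seg].mean()   # fraction of building pixels
        if frac > gamma1:
            refined[in_seg] = True
        elif frac < 1.0 - gamma2:
            refined[in_seg] = False
        # otherwise keep the original per-pixel decision
    return refined
```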


[Figure: flowchart with the blocks "Input: RGB Image & Depth Map", "Threshold", "Binary Mask", "Erosion and Seeding", "Seed Points / Labels (0, 1, 2, 3)", "Random Walk on Depth Map", and "Overlay: Orthophoto & Segmentation Result".]

Fig. 4. A schematic representation of the Random Walk algorithm for subdivision of complex buildings. The colors of the segments in the bottom image correspond to those of the seeded points on the right.

In the next step, the dominant direction of clusters of orthogonal lines is extracted. Straight lines (Burns et al., 1986) are obtained in the slightly smoothed DSM (see Fig. 5, bottom left) and their slopes are stored in a histogram. The entries of the histogram modulo 90° are weighted by the lengths of the line segments. An assessment of orthogonality is performed for each building: If the histogram has a strong peak, assessed by the ratio of the best to the second-best entry, the dominant direction corresponds to this peak; otherwise, there is no dominant direction. The histogram can also be supported by straight lines detected in the low-pass-filtered orthophoto, thus allowing the determination of the correct building orientation even if the elevation map is very noisy. However, since many straight lines on building roofs in the optical images may have an orientation different from that of the building, their lengths should be weighted by a factor smaller than one.

The goal of the third step is to obtain the ground-plan polygon from the binary mask at the highest resolution. The starting value for a ground-plan polygon is given by the minimal bounding rectangle of the binary mask encoding the building. In the case of orthogonal polygons, the axes of the rectangle are given by the dominant directions of the building orientation. Using the binary mask, the contours are refined for each blob by recursively adding and removing rectangular subparts until the area of the remaining blob lies under a threshold. This threshold, like the other thresholds for building reconstruction, reflects the maximal deviation of areas that one is ready to accept and depends on the resolution, the quality of the data, etc. For non-orthogonal polygons, pixel-wise polygonization is performed, followed by the exploitation of line constraints and orthogonality constraints in order to reduce the number and improve the positions of the vertices. The result is shown by the two red polygons in Fig. 5, bottom left. For orthogonal polygons, the final step of generalization reduces the number of vertices by neglecting parts that are too small or too narrow (see the orange polygon in Fig. 5, bottom left). A visual example that shows the necessity of both modules (polygonization and generalization) for our procedure is given by a T-shaped building binary mask whose horizontal cross-piece is long and narrow: During the recursive refinement, both side rectangles underneath the cross-piece are removed, while during the generalization step, both parts of the horizontal cross-piece are deleted and the building's minimal bounding box is adjusted according to its current extent.
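The peak test on the orientation histogram could be sketched as follows (bin count and peak ratio are illustrative values, not those of the paper):

```python
import numpy as np

def dominant_direction(angles_deg, lengths, n_bins=90, peak_ratio=2.0):
    """Estimate the dominant building orientation from line slopes:
    accumulate line lengths in a histogram of angles modulo 90 degrees
    and accept the peak only if it clearly beats the second-best bin."""
    hist, edges = np.histogram(angles_deg % 90.0, bins=n_bins,
                               range=(0.0, 90.0), weights=lengths)
    best, second = np.sort(hist)[-1], np.sort(hist)[-2]
    if second > 0 and best / second < peak_ratio:
        return None                                  # no dominant direction
    return edges[np.argmax(hist)] + 45.0 / n_bins    # bin center, in degrees
```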

3.2. Roof detail analysis

The roof detail analysis has two main tasks: the computation of dominant planes and the subdivision of the ground-plan polygon into several roof polygons with corresponding plane equations. As the main result, we obtain 2D polygons embedded in 3D space. To better handle the number of points and the number of dominant planes, the point cloud can be reduced by subdividing building complexes along their diagonals (see, for example, Haala et al. (1998) and Bulatov et al. (2012b)) and by detecting small roof segments, such as chimneys (Meixner et al., 2011; Bulatov and Pohl, 2013). Among all diagonals of the building ground polygon, we select the shortest one that lies completely inside the building and for which the area of each subpart exceeds a threshold. This threshold is about 5000 pixels for the considered data set, while the length of the diagonal should not exceed 50 pixels. The procedure iterates until no sufficiently short diagonals remain. Chimneys are extracted as hot-spots in the rescaled elevation map subject to several filters (Bulatov and Pohl, 2013). They are modeled as vertical cuboids with a height given by the DSM and a square cross-section of side length given by the area of the hot-spot (in pixels) and the ground sampling distance (GSD) of the DSM. These points are not considered in the subsequent step of dominant plane extraction.

Here, we differentiate between local and global methods. The local methods, like the clustering of gradient values originally proposed by Gross et al. (2005), work well for DSMs computed from ALS point clouds with few outliers. In the case of DSMs stemming from depth maps in images, elevation values often suffer from discretization artifacts that result in a step structure of the 3D points and hence in systematic errors for the normal vectors. Bulatov et al. (2012b) propose using global methods applied to the whole point cloud of the building, or at least to a part of the building. The subdivision of the building into parts, mentioned in Section 2.3 and in the previous paragraph, is done to avoid "ghost planes", e.g., those that have a few inliers in several parts of the building. Examples of global methods are RANSAC (Fischler and Bolles, 1981) and the more recently introduced procedure called J-Linkage (Toldo and Fusiello, 2008). There are two basic differences in the methodology of RANSAC and J-Linkage. First, the choice of the triples of points needed to form a plane hypothesis is not completely random in J-Linkage, since adjacent points are preferred to be included in the same hypothesis: The first point is chosen using the uniform distribution, while the second and the third point are chosen using a normal distribution with an empirical parameter σ. Second, while RANSAC, after forming a new hypothesis, abolishes the old one if it had fewer inliers and thus proceeds iteratively for several dominant planes, J-Linkage strives to compute all dominant planes at once: A fixed number of hypotheses is stored, and the points are clustered according to their code-vectors and to the so-called Jaccard distance between clusters. The larger clusters are kept, and the plane parameters in these clusters, together with the inlier sets, are iteratively recomputed and refined by a Least Squares optimization.
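The localized sampling stage of J-Linkage can be sketched as follows (a simplified illustration; the subsequent clustering by Jaccard distances is omitted):

```python
import numpy as np

def sample_plane_hypotheses(points, n_hypotheses=500, sigma=2.0, rng=None):
    """J-Linkage-style localized sampling: the first point of each triple
    is drawn uniformly, the other two with probability decaying like a
    Gaussian of the distance to the first point (parameter sigma).
    Returns plane parameters (unit normal n, offset d) with n.p + d = 0."""
    rng = rng or np.random.default_rng()
    planes = []
    for _ in range(n_hypotheses):
        i = rng.integers(len(points))
        d2 = np.sum((points - points[i]) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * sigma ** 2))
        w[i] = 0.0                                # never resample the anchor
        j, k = rng.choice(len(points), size=2, replace=False, p=w / w.sum())
        n = np.cross(points[j] - points[i], points[k] - points[i])
        norm = np.linalg.norm(n)
        if norm > 1e-9:                           # skip degenerate triples
            n /= norm
            planes.append((n, -np.dot(n, points[i])))
    return planes
```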
The regions are labeled according to the dominant planes they belong to and are preprocessed by morphological operations. Then, they are filtered by area and eccentricity and sorted by area (within one building). These steps compensate for noise and outliers in the elevation data that leave holes in the inlier sets. The preselection of candidates for the pairs of intersecting planes takes place by morphological dilation. For a pair of neighboring segments with plane equations π₁, π₂, the projection l of their intersection line into an image J with projection matrix P is


Fig. 5. Top: Improving building mask by segmentation. From left to right: Part of the elevation map for the data set Vaihingen area 3, the corresponding part of the reference image, the segmentation result, and building masks, which should be post-processed by morphological operations. Bottom: Assessment of the histogram of straight lines for improving building outlines (without step-lines). On the left: Part of the smoothed elevation map with Burns lines (green) and generalized building outlines (orange and red for orthogonal and non-orthogonal outlines, respectively). On the right: Corresponding part of one of the input images.

    l = (P v₁) × (P v₂),    (3)

where v_i denotes the i-th column of the orthogonal complement of the vector subspace spanned by π₁ and π₂. Here, all variables are given in homogeneous coordinates. In case J is D (the DSM), the matrix P has a canonical form: Its third column is filled with zeros and the other three columns form the identity matrix. By comparing the 2D distances from the inliers of both planes to l, it is possible to assess whether the intersection line corresponds to a building edge. If it does, the search range for the positions of the end points of the line is narrowed according to the extension of the inlier sets of both planes. The current implementation allows at most two end points of the straight line segment; they correspond to the maximal and minimal extension of the inlier set within a user-specified distance to the line. A similar procedure for the computation of end points is carried out to determine the intersections of the inlier sets of dominant planes with the building outlines. As soon as one end point lies near an already computed line, it can be slightly extended so that the roof becomes topologically consistent. The lines for every dominant roof plane are stored and, finally, completed to polygons.
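A small sketch of Eq. (3): the two points v₁, v₂ on the intersection line are obtained as a null-space basis via SVD (plane and point coordinates are homogeneous; names are ours):

```python
import numpy as np

def project_cut_line(pi1, pi2, P):
    """Project the intersection line of two planes (Eq. 3): v1 and v2
    span the orthogonal complement of span(pi1, pi2), i.e., they are
    two points on the 3D intersection line; their images under the
    3 x 4 projection matrix P define the homogeneous 2D line l."""
    A = np.vstack([pi1, pi2])            # 2 x 4, planes as row 4-vectors
    _, _, vt = np.linalg.svd(A)
    v1, v2 = vt[2], vt[3]                # null-space basis of A
    return np.cross(P @ v1, P @ v2)      # line through the two image points
```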

So far, we have a subdivision of a building into roof polygons. By subtracting them from the ground plan, we obtain uncovered building polygons. Mostly, these are regions partly occluded by trees and their shadows, with rather unreliable elevation values. Nevertheless, they must be filled in a reasonable way. In Bulatov et al. (2012b), four possibilities were outlined: single plane fitting (f), merging with an adjacent polygon (m), plane synthesis from the edges of all adjacent polygons (s), and ignoring (i), that is, assigning the region to the building exterior. The evaluation step (e) compares the range z_min(P), z_max(P) of the elevations of the vertices of the uncovered polygon P with the range z_min(B), z_max(B) of the remaining vertices of the building B. By definition, the hypothesis passes the evaluation test if

    z_min(P) > z_min(B) − ε  and  z_max(P) < z_max(B) + ε,    (4)

where ε > 0 is a small scalar. By morphological dilation, we estimate the number k of neighboring planes near the uncovered polygon. There are three different configurations: For an isolated polygon (k = 0), we consider the strategy (f, e, i), i.e., try single plane fitting and, if the evaluation test is not passed, ignore the polygon. In the case k = 1, the strategy is (m, e, f, e, i). Otherwise, when k ≥ 2, the strategy is (s, e, m, e, f, e, i).

After simplifying and rescaling (geo-referencing) the roof polygons, we create wall polygons: For each edge of every roof segment, the corresponding wall is modeled by a vertical trapezoid; one of its arms connects two neighboring vertices of the roof polygon and the other one connects the corresponding points on the ground, with altitudes given by the DTM.
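The fall-back logic of these strategies can be sketched as follows (the callables fit, merge, synth, and passes stand for the respective modules and for the test of Eq. (4); all names are illustrative):

```python
def fill_uncovered_polygon(polygon, k_neighbors, fit, merge, synth, passes):
    """Apply the chains (f,e,i), (m,e,f,e,i), and (s,e,m,e,f,e,i): each
    candidate hypothesis is kept only if it passes the evaluation test
    of Eq. (4); otherwise the next option is tried, ending with 'ignore'."""
    if k_neighbors >= 2:
        chain = [synth, merge, fit]
    elif k_neighbors == 1:
        chain = [merge, fit]
    else:
        chain = [fit]
    for make_hypothesis in chain:
        hypothesis = make_hypothesis(polygon)
        if hypothesis is not None and passes(hypothesis):
            return hypothesis
    return None   # ignore: assign the polygon to the building exterior
```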


4. Pose estimation and texturing

For realistic building visualization and representation, we want to assign textures to the faces of the polyhedral model instances. Often, image information has to be extracted from several images, i.e., from an unordered set of images or from a video stream. For a proper texture synthesis, we need the camera orientations and a visibility analysis, because occlusions may occur. The relative orientations of the images and their registration in the model coordinate system (COSY) can be obtained by the image registration techniques described in Section 4.1 and visualized in Fig. 6. Given one or several images, we wish to assign realistic texture to every polygon that is at least partly visible in one of these images. To do this, we need an assessment of which polygon should be textured by which images. The key algorithm of this step (see Section 4.2) is the occlusion analysis that extracts the foreground for each pixel of each input image. The texturing of building polygons, which often extend over several images, is described in Section 4.3.

4.1. Pose estimation

For the orientation of a single image, a fully automatic, image-based registration procedure based on the appearance of a dense 3D point cloud enriched by radiometric properties was implemented (Bodensteiner and Arens, 2012). Preferably, this point cloud should be computed from terrestrial and airborne laser scanning providing reflectance values. The main idea of the method is to render a synthetic image from the dense point cloud and an initialization of the camera pose. In this synthetic image, key points are detected and matched to key points in the input image J. A state-of-the-art method for establishing point correspondences between images rather different in their radiometry and geometry is described in Shechtman and Irani (2007), where clusters of characteristic points are matched. Alternatively, especially if the configuration of the input data does not allow extracting a sufficient number of key points in the synthetic image, the whole content of both images can be compared by a cost function that is robust against outliers and against high-frequency radiometric oscillations. One example of such a function is the mutual information function used in Hirschmüller (2008). Since a 3D coordinate is available for every pixel of a synthetic image, the pose can be estimated by means of the PnP algorithm (Lepetit et al., 2009) followed by RANSAC (Fischler and Bolles, 1981). After this, a new synthetic image is rendered and the procedure is repeated until there is no significant change in the parameter values or the number of iterations reaches a fixed threshold.

In the previous paragraph, we made two important assumptions: Both a dense 3D point cloud with reflectance values and a reliable initialization of the camera pose must be available. If this is not the case, the pose can be obtained by a spatial resection whereby the correspondences between the model and the image are established by a human operator. Because of the calibrated camera, three points are sufficient; but in order to obtain a more reliable result for the camera pose, 6–7 points well-distributed across the input image should be selected in this image and in the 3D model. The pose in the world COSY is estimated by the method of Lepetit et al. (2009) and refined by means of a Least Squares algorithm.
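For the spatial resection, a PnP solution with a RANSAC wrapper is available, e.g., in OpenCV; a minimal sketch (not the authors' implementation; array names are ours):

```python
import numpy as np
import cv2

def estimate_pose(points_3d, points_2d, K):
    """Estimate a camera pose from N model/image correspondences:
    points_3d (N x 3) are model points, points_2d (N x 2) their image
    observations, K the 3 x 3 calibration matrix of the camera."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32), points_2d.astype(np.float32),
        K, distCoeffs=None)
    if not ok:
        raise RuntimeError("pose estimation failed")
    R, _ = cv2.Rodrigues(rvec)       # rotation vector -> 3 x 3 matrix
    return R, tvec                   # camera pose: x ~ K [R | t] X
```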

Given a sequence of images – captured by a calibrated single camera mounted on a UAV, for instance – we can process this strip by means of a simultaneous localization and mapping (SLAM) algorithm (see, e.g., Beder and Steffen (2008)). This yields camera poses with an unknown global scale in a local COSY. Fixing the first image in the world COSY by one of the techniques described above allows obtaining the relative rotation and the relative translation between the COSYs. The unknown scale is computed by estimating the most frequent quotient of the depth values of characteristic 3D points in the image and in the model. The estimated motion of the camera within this sequence is subject to an inevitable drift caused by measurement noise and remaining systematic errors. Therefore, we also determine the orientation of the last image by spatial resection (see Fig. 6). Then, the drift can be compensated by enforcing the first and last orientations to fit their independent estimates, either by using the trajectory bending proposed by Dubbelman et al. (2010) or by an approach that considers the estimated uncertainties of the poses (Meidow, 2012).

4.2. Assessment of images

According to El Hakim et al. (1998), the visual appearance of a textured 3D model is enhanced if a separate texture image J(P) is prepared for each polygon P. By an orthogonal projection of P into the xy-plane of the model COSY followed by a suitable metric transformation, we allocate J(P) with a desired resolution. To perform a more rapid occlusion analysis, we need a binary mask M(P) of the same size as J(P), which indicates which pixels of J(P) correspond to an interior point of P. A fast computation of M(P) is carried out after decomposing P into triangles by applying the constrained 2D Delaunay triangulation to its projected vertices. To transform pixels from J(P) into 3D space and from 3D space into the input image J, we use the formulas of Chapter 13.1 of Hartley and Zisserman (2000), which describe the relations of corresponding points in two images, given a scene plane π, by homographies. In our case, π is the support plane containing the vertices of P. For each polygon, its decomposition into triangles, its binary mask M(P), its support plane π, the homography induced by π and its inverse, as well as other relevant properties are stored for the next steps.

The occlusion analysis is employed to determine the foreground for each pixel in each input image. This is done by traditional ray-casting techniques. Polygons are sorted by the distance d_k(P) of their center of gravity to the projection center, after which the content of the bounding box of P in J is projected into M(P). Depth values of points for which M(P) = 1 holds are compared, and the foreground is updated. After projecting all polygons and computing a synthetic depth map, the visibility rate v(P) is calculated for each polygon P. This value has to be considered when assessing by which images P should be textured. Several assessment functions were implemented and tested. For example, given an image J_k, the assessment function proposed in (Bulatov and Lavery, 2010) is

    C_k = (1 − cos β_k(P)) · d_k(P),    (5)

where β_k(P) is the incidence angle. This term is suitable for texturing many small triangles that are completely visible in J_k. However, to adjust (5) in a way that also allows handling polygons extending over several images, we must extend the equation by a monotonically decreasing function of v_k(P). In addition, we note that the equation above is not applicable if an orthophoto is available, because an orthogonal projection corresponds to a camera at infinity. In this case, the values β_k(P) and d_k(P) should be replaced by a resolution function w_k(P), which denotes the number of pixels actually covering the projection of P divided by the number of points for which M(P) = 1, scaled uniformly over k. Clearly, w_k(P) depends on β_k(P) and d_k(P), and for any pair f₁, f₂ of functions monotonically decreasing on [0, 1], the product

    C_k = f₁(v_k(P)) · f₂(w_k(P))    (6)

can be chosen for assessment.

Fig. 6. Pose estimation algorithm for single images and videos. By "*", we refer to the alternative of computing a transformation from one image only (using depth values of points from the SLAM algorithm).


4.3. Texture synthesis

The process of texture synthesis extracts the image J_k with the maximum value of the assessment function (6) and textures all pixels of J(P) for which M(P) = 1 and which are visible in J_k by its content. Then, the second-best image is taken and the procedure is repeated iteratively for the remaining pixels until no untextured pixels are left or no images remain for which the assessment measure lies below a user-defined threshold. In the latter case, the remaining pixels can be textured by a user-specified color. Alternatively, the most frequent color can be sampled from a 16-bin histogram and propagated for texturing. For color images, the RGB values of the covered pixels of J are sampled into a 3-dimensional histogram, which can be interpreted as a cube C with 16 × 16 × 16 cells, and there are three options for color propagation: The first option to determine the most frequent color is to choose the cell with the largest number of entries in C. The two other options presuppose extracting the most frequent color either from the histograms of all three channels separately, that is, by projections onto the three axes of C, or by a projection onto the diagonal of C followed by a separate refinement of the color for each channel.

Our algorithm for texturing is less sophisticated than the related procedures of Böhm (2004) and Mayer (2007). The most important reason for this is that the camera registration and the model polygons are not always sufficiently accurate for pixel-wise averaging. It is also possible that differences in luminance and reflectivity appear if P is textured by several images: Theoretically, it can happen that one part of P is covered by an infrared image while another part is covered by a daylight image. The best way to avoid this is to modify Eq. (6) by filtering in such a way that it supports texturing P by neighboring frames of the same video. Other reasons to keep the texturing process simple are the computing time and the fact that we only strive to cover the model polygons with a recognizable texture. Finally, polygons not sufficiently observed in any of the input images are either covered by a fixed color with a fixed transparency or are assigned a characteristic texture of the region to be modeled.
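The first of the three color-propagation options can be sketched as follows (a minimal illustration with hypothetical names):

```python
import numpy as np

def most_frequent_color(pixels, bins=16):
    """Option 1 from the text: find the fullest cell of the 16x16x16
    RGB histogram cube C and return its center color. 'pixels' is an
    N x 3 uint8 array of the covered RGB values."""
    hist, _ = np.histogramdd(pixels.astype(float),
                             bins=(bins,) * 3, range=[(0, 256)] * 3)
    r, g, b = np.unravel_index(np.argmax(hist), hist.shape)
    cell = 256 // bins
    return np.array([r, g, b]) * cell + cell // 2   # cell-center color
```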

5. Results

The first data set we present stems from the village of Bonnland, a widely used urban training facility in Southern Germany. The input data is a DSM computed from an ALS point cloud as well as a corresponding digital orthophoto. The 3D building polygons were generated by the methods of Gross et al. (2005). The purpose is to demonstrate that an existing model can be updated, enriched by new information, and exported into simulation software. In another measurement campaign five years later, several UAV videos covering building walls were recorded. A video sequence with around 200 frames was processed by a SLAM method (Beder and Steffen, 2008). Then, building walls and roofs were textured with the video frames and the orthophoto, followed by the export of all relevant information into our test simulation environment VBS2. We show in Fig. 7 the DSM generated from the laser point cloud and the orthophoto in which roof polygons and shapes are depicted. Two exemplary views of the 3D model are shown in Figs. 8 (VRML viewer) and 1 (simulation environment). All building roofs and walls visible in Fig. 8 that do not bear geo-typical texture were textured automatically, while, for Fig. 1, camera poses for a few additional ground views were obtained interactively, after which a couple of walls in the central square became textured.

Despite the good recognizability of the scene, Fig. 9 exposes deviations in the automatically textured polygons. There are two reasons for these deviations: incorrect positions of 3D polygons and pose estimation errors.


For the dotted lines, only the first frame of the video was registered by means of the model polygons. One can see that the results of relative orientation for this sequence are quite accurate, since the 3D points are projected onto approximately the same image points in the second frame as well. The small deviations in the second frame are probably due to drift errors, which are almost negligible here, as are the registration errors. The deviations are mostly caused by insufficiencies of the 3D model. For example, the roof segment marked by the black polygons actually consists of two plane segments. Since the method of Gross et al. (2005) amalgamated them into one plane segment, the vertices are slightly displaced. Another example is the cyan polygon: Since one of its vertices was used for the interactive pose estimation, the left side (dotted) matches almost perfectly in both frames. However, since the corresponding 3D polygon actually lies slightly farther away from the camera, there is a deviation at its right margin. The solid and dashed lines also indicate, though in a different way, that the position of the camera is too close to this polygon. For the dashed lines, the inaccurate model was used to correct the SLAM-based trajectory of the camera. For the solid lines, where the camera positions were reconstructed without involvement of the model, statistical errors of pose estimation influence the results. The quality of pose estimation is, as expected, the same in both frames for both the dashed and the solid lines, and it is slightly worse than that of the dotted lines, at least for the selected polygons.

The second data set (Vaihingen, area 3) is an ISPRS WG III/4 benchmark (Rottensteiner et al., 2012) for urban terrain reconstruction. It represents a purely residential area in the city of Vaihingen (Germany) and contains 56 small, detached houses and many trees. The essential steps of depth-map extraction from multiple images and building reconstruction correspond to the description of Bulatov et al. (2012b); however, an implementation of a method based on graph cuts (Delong et al., 2012) was used to perform a non-local optimization for the depth-map computation. Additional steps considering segmentation results and extracting chimneys were integrated as well. The quantitative evaluation of the building detection and reconstruction results for this data set was carried out using the methodology of Rutzinger et al. (2009). This method is based on the comparison of two labeled images, one corresponding to the detection/reconstruction results and the other to the ground truth, i.e., ground-plan labels and roof-segment labels, respectively. After establishing correspondences and topological clarification, the completeness, the correctness, and the quality of the results are determined both on a per-area and on a per-object level. In addition, root mean square (RMS) errors of the planimetric distances between the points of the extracted building boundaries and their nearest correspondences on the reference were calculated, and differences of elevations for the reconstructed roof planes were measured. To remove the impact of outliers, only distances smaller than a threshold were used to compute the RMS error; a minimal sketch of these metrics is given below. The results of building detection, building reconstruction, and the distribution of the deviations of altitudes are shown in Fig. 10. The bottom right of the figure shows a detailed (ground) view of the model.
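The sketch below assumes binary label images and a precomputed list of boundary distances; the outlier threshold t_max stands in for the (unspecified) value used in the evaluation, and the topological clarification step is omitted.

import numpy as np

def per_area_metrics(detected, reference):
    # detected, reference: binary label images (True = building pixel)
    tp = np.count_nonzero(detected & reference)    # true positives
    fp = np.count_nonzero(detected & ~reference)   # false positives
    fn = np.count_nonzero(~detected & reference)   # false negatives
    completeness = tp / (tp + fn)
    correctness = tp / (tp + fp)
    quality = tp / (tp + fp + fn)
    return completeness, correctness, quality

def thresholded_rms(distances, t_max):
    # RMS of planimetric boundary distances, ignoring outliers beyond t_max
    d = np.asarray(distances, dtype=float)
    d = d[d < t_max]
    return np.sqrt(np.mean(d ** 2))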
Although the RMS errors for building reconstruction (0.68 m in total and 0.52 m for plane correspondences) indicate that our method provides reasonable geometric accuracy (the height error is the same as for single points if the parallax accuracy is about one image pixel), the results of building detection are less precise. They have a magnitude of over 1 m, which is much larger than the ground sample distance. This is partly due to the quite generously selected thresholds for generalization and fill-rate, and partly due to the insufficient accuracy of the depth maps in shadowy areas. Also, the results show that using the DSM as the reference image for depth-map extraction usually leads to deviations of the positions of the vertices of the 3D polygons.


Fig. 7. Input and intermediate results for the sequence Bonnland. Left: Input DSM. Right: Corresponding part of the orthophoto. The building outlines (cyan transparent overlays) and isolated trees (yellow stars) were modeled from sensor data (results courtesy of Gross et al., 2005) while streets (red lines) and forest areas (closed yellow polygons) were imported from shapefiles.

Fig. 8. A view of the textured 3D model for the data set Bonnland. The camera path is indicated by red pyramids in which the corresponding images were placed. Walls that are visible to less than 20% in every input image are given the geo-typical texture of the region.

Finally, the last data set is area 4 of the Toronto data set, which is also an ISPRS benchmark. The input is a DSM with a grid spacing of 25 cm that was interpolated from the ALS points corresponding to the last echo of each pulse. The four optical images in which at least a part of the scene is visible were used for texturing only. Two reasons for not including them in the reconstruction pipeline are deficits in the accuracy of the estimation of their poses and our wish to perform the reconstruction with minimal input. After the DTM was extracted and the elevated regions were assessed by the lineness measure (1), we discarded tree regions and performed a finer thresholding by height for the remaining elevated regions (a sketch of this screening step is given below). The polygonization of the ground plan and the reconstruction of roof planes took place according to Section 3. Finally, the input images were used to texture the visible building polygons as explained in Sections 4.2 and 4.3. Two views of the textured model and quantitative results of the reconstruction are shown in Figs. 11 and 12, respectively. As one can see in Fig. 12, top right, the detection results are acceptable with respect to both completeness and correctness: The quantitative analysis shows values per area of around 97% and 91%, respectively, and per object both close to 100%. With around 1.2 m, the RMS deviations of the building outlines have the same order of magnitude as for the Vaihingen data set; these errors are mostly due to the shadowy regions where the elevation values are incorrect.
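The following sketch assumes a normalized DSM (DSM minus DTM) and a precomputed vegetation mask; the two height thresholds are illustrative rather than the values actually used, and the lineness measure (1) itself is not reproduced.

import numpy as np
from scipy import ndimage

def building_candidates(dsm, dtm, tree_mask, h_min=2.5, h_fine=4.0):
    # dsm, dtm:  elevation rasters of equal shape; tree_mask: True = vegetation
    ndsm = dsm - dtm                            # height above ground
    elevated = (ndsm > h_min) & ~tree_mask      # discard tree regions
    labels, n = ndimage.label(elevated)         # connected elevated regions
    keep = np.zeros_like(elevated)
    for r in range(1, n + 1):
        region = labels == r
        # finer thresholding by height: keep regions whose median height
        # above ground is large enough for a building
        if np.median(ndsm[region]) > h_fine:
            keep |= region
    return keep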

Fig. 9. Results for pose estimation for the sequence Bonnland. Top: A frame at the beginning of the video and blended outlines of five exemplary polygons using camera poses from the automatic method that uses airborne and terrestrial laser scan as input (solid line), the SLAM method with interactive registration of two frames (one at the beginning and one at the end of the sequence, dashed line) and interactive selection of points (dotted line). Bottom: A frame in the middle of the sequence. Solid and dashed polygons are as above and dotted polygons are obtained using the camera pose of the SLAM method registered using only one frame.

The results of the roof detail analysis are not as good: Although the RMS errors of the DSM difference are relatively low compared to the building heights, they amount to 2.65 m for corresponding planes and 2.93 m for other planes. Note that the procedure of roof detail analysis was replaced by a simple polygonization of the regions obtained by the J-linkage module, since the method of Gross et al. (2005) exhibits problems when there are many horizontal roof planes.


Fig. 10. Results of the evaluation for the Vaihingen data set. Top left: Ground truth for the roof plane segmentation, to be compared to the results of the roof plane segmentation (bottom left). Top row, middle: Evaluation of the building detection results on a per-pixel basis; yellow, red, and blue denote true positive, false positive, and false negative pixels, respectively. To be compared to the building outline polygons projected into the reference image (bottom row, middle). Top right: Absolute difference between the ground truth and the DSM obtained after the building reconstruction algorithm. Bottom right: Textured (close-up) view of the 3D model.

Originally, the authors presupposed a partition of buildings by means of erosion around jumps of elevation (such that a building with two horizontal planes is cut into at least two parts) followed by the reconstruction of each part. But this approach oversmooths fine details and fails for data sets contaminated by outliers. If step-edges are ignored and the normal vectors are clustered, the vectors of the horizontal planes fall into the same cluster and are modeled as a single plane. Similarly, since no intersection lines between pairs of neighboring segments are detected (all planes being parallel), the whole building is modeled by a single plane, and so the RMS errors of building reconstruction for all planes grow by around 50%, while they fall to around 1.6 m for the corresponding planes. However, there are not many corresponding planes when there is too much over-segmentation. Compared to other algorithms submitted for evaluation (Rottensteiner et al., 2012), the results presented in this paper are better than those of Rau and Lin (2011) but not as good as those of Sohn et al. (2012), probably because modeling step-lines is enormously advantageous in the Toronto data set with its many flat roofs. The extraction of step-lines can be performed by the method of Gross and Thönnessen (2006), an approach that we will consider in the future. In addition to the geometric configuration, which was not favorable for our method, the following difficulties caused problems during the reconstruction of the scene: Since the data set was acquired in winter, there is no characteristic color for the tree crowns. Hence, the propagation of characteristic colors for the detection of trees near buildings is hardly possible.

Building heights and building sizes vary within the data set. While the variation of heights causes problems during the detection of chimneys, large complex buildings can be a problem for the DTM extraction and exhibit outliers in closed courtyards and similar areas. Finally, errors in pose estimation hardly allow using the images in a straightforward manner to improve the reconstruction results. Some deviations within the texturing process are visible in Fig. 11, right.

In closing this section, we provide information about the computing time. All steps of our procedure were implemented in the MATLAB and C/C++ programming languages without major attention to the minimization of computing time. The steps are executed automatically as soon as the necessary parameters have been set. The total computing time without the two most time-consuming steps – dominant plane estimation by J-Linkage and building subdivision by Random Walks – is approximately 15 min. Mainly because of the clustering module (see Section 3.2), J-Linkage requires more than two hours for one data set, while building subdivision by Random Walks takes 15 min for 15–20 simple erosions, due to the solution of a Dirichlet problem for each unassigned pixel (see Section 2.3). For a very time-critical application, J-Linkage can be replaced by RANSAC, because there is no significant difference in performance for simple buildings; a sketch of such a replacement is given below. Alternatively, parallel processing of buildings can be implemented in the future. Random Walks can be replaced by the procedure of step-edge determination of Gross et al. (2005) or by the subdivision of polygons. Three other time-consuming steps, namely the DTM extraction, the dominant plane extraction by RANSAC, and the occlusion analysis for a video of several dozens of frames, take around one minute. Computing times for all other steps lie significantly below one minute.
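This is a generic RANSAC plane fit in the sense of Fischler and Bolles (1981), not our actual module; the iteration count and the inlier tolerance are illustrative values. For several roof planes, the function would be applied sequentially, removing the inliers of each detected plane.

import numpy as np

def ransac_plane(points, n_iter=200, tol=0.15):
    # points: (N, 3) roof points; tol: inlier distance in meters
    rng = np.random.default_rng()
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                  # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal @ sample[0]
        inliers = np.abs(points @ normal + d) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers                  # refit by least squares afterwards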


Fig. 11. Two views of the textured 3D model for the Toronto data set. The appearance of the tree models corresponds to winter. Building walls and building roofs that are visible to less than 20% in every input image are given constant, neutral colors.

Fig. 12. Results of the evaluation for the Toronto data set. Top left: Ground truth for the roof plane segmentation, to be compared to the results of the roof plane segmentation (bottom left). Top right: Evaluation of the building detection results on a per-pixel level. Yellow, red, and blue denote true positive, false positive, and false negative pixels, respectively. Bottom right: Absolute difference between the ground truth and the DSM obtained after building reconstruction.

6. Conclusions and future work

We presented a robust, modular algorithm for context-based urban terrain modeling from sensor data. Except for the choice of parameter values and the integration of additional sources of information, the algorithm is automatic and requires only a 2.5D elevation map as input. The elevation map may stem from an airborne laser scan or may be computed by a dense matching algorithm.

In Bulatov et al. (2011a), a discretized depth value is determined for each pixel of a reference image. Obtaining dense point clouds from such images and superimposing them on an xy-grid delivers a DSM. In our future work, we wish to obtain the DSM simultaneously with the depth map: Instead of parametrizing the distance of a 3D point to the image plane of the reference image, the unknown height of a point above ground is calculated by means of a plane sweep algorithm, as sketched below. This approach will probably be more advantageous if the number of input images is small.
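The sketch follows the plane-sweep strategy in the spirit of Baillard and Zisserman (2000); it assumes grayscale images with known 3 × 4 projection matrices and uses the intensity variance across the views as a placeholder photo-consistency measure. Nearest-neighbor sampling and the exhaustive sweep are simplifications.

import numpy as np

def sweep_heights(xy_grid, z_candidates, images, cameras):
    # xy_grid: (N, 2) ground coordinates; cameras: 3x4 projection matrices
    best_z = np.full(len(xy_grid), np.nan)
    best_cost = np.full(len(xy_grid), np.inf)
    for z in z_candidates:                       # sweep horizontal planes
        pts = np.column_stack([xy_grid,
                               np.full(len(xy_grid), z),
                               np.ones(len(xy_grid))])
        samples = []
        for img, P in zip(images, cameras):
            uvw = pts @ P.T                      # project plane points
            u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
            v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
            ok = (u >= 0) & (u < img.shape[1]) & (v >= 0) & (v < img.shape[0])
            s = np.full(len(xy_grid), np.nan)
            s[ok] = img[v[ok], u[ok]]
            samples.append(s)
        cost = np.nanvar(np.vstack(samples), axis=0)   # photo-consistency
        better = cost < best_cost
        best_cost[better], best_z[better] = cost[better], z
    return best_z                                # height above ground per cell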


The important assumption that elevated objects correspond either to buildings or to trees enables detecting and reconstructing buildings as well as extracting the positions and heights of isolated trees. However, in winter, for example, trees near buildings can barely be extracted by the propagation of color information. In order to improve the recognition of vegetation, and also in order to find a better bound for positioning tree trunks for modeling purposes, we will make efforts to propagate texture properties, for example entropy, both in 2D and in 3D.

The process of building reconstruction consists of boundary extraction and the modeling of roof polygons. For most data sets, roofs are piecewise planar, and the two important subtasks of the roof detail analysis step, namely the extraction of dominant planes and the polygonization of roof segments, were tackled by the procedure described in Section 3.2. The advantage of this procedure is its ability to reconstruct arbitrarily complicated roof structures. However, it has problems if there are many flat roofs and many outliers in the data. In the future, it will be investigated to what extent the explicit modeling of step-edges can improve the performance. Moreover, we will improve the plane equations by considering man-made constraints, such as orthogonality, symmetry, and horizontal cut-lines, or mixed 2D/3D constraints, e.g., that characteristic edges in images often coincide with the cut-lines of roof segments.

Finally, in many applications, a user must orient him/herself in an unknown urban terrain. The geometry and the visual appearance of facades are very important for this purpose. In this article, we proposed and described an automatic and a semi-automatic method for pose estimation. The automatic method requires a dense point cloud with reflectances as well as approximate camera positions, and it is the more challenging of the two. To make the automatic registration more robust with respect to more general input data of possibly lower quality, several challenges have to be overcome. For example, point clouds obtained from multi-view matching of almost-nadir aerial images only rarely contain building walls in sufficient resolution and often contain outliers. Handling such point clouds requires a modification of the method.

Our visibility analysis and texturing processes are fast and fully automatic, though less sophisticated than those proposed by other state-of-the-art methods. The reason is that the assumptions of a precise model and an accurate registration do not always hold. One could see in Fig. 9 that errors in the initial registration, the drift of the camera poses, and the model geometry influence each other. As a consequence, it is difficult to clearly identify the cause of an incorrect mapping of the model polygons into the images. Also, Böhm (2004) argues that for eliminating static objects, e.g., trees in front of walls, additional views should be taken into account for a better background estimation. However, in time-critical applications, this is barely possible. Instead, it is possible to detect a tree in the DSM, even if it is immediately in front of a building wall. Instantiating a tree model in front of the wall is sufficient in some applications, because those parts of the wall texture that are contaminated by this tree are occluded by it. This is an example of how insufficiencies in one module can be corrected by the following module.
Additionally, in order to tackle this problem, experiments with depth maps computed in the input images are currently being carried out. Depth maps not only help to extract foreground objects and to clean up the texture of scene polygons, but also help to improve the positions of walls and the geometry of facades. Increasing the level of detail in the reconstruction of facades by extracting planar sub-segments induced by window and door niches, balconies, and similar structures will improve the geometry. Together with edge detection or a segmentation of facade regions in the images, we should be able to obtain the number of floors and the positions of doors, windows, and balconies, as well as their outlines.


Depending on the application at hand, all of these new objects should be modeled in a suitable way, for example, by integrating them into the simulation software VBS2 described in Bulatov et al. (2012a).

Acknowledgment

The Vaihingen data set was provided by the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF) (Cramer, 2010): http://www.ifp.uni-stuttgart.de/dgpf/DKEPAllg.html. The reference for Vaihingen was generated by RAG Steinkohle AG and SIRADEL (www.siradel.com). The Toronto data set was provided by Optech Inc., First Base Solutions Inc., and York University. The reference for Toronto was created by York University. The authors thank the reviewers and John Lavery for reading the manuscript and providing fruitful comments.

References

Baillard, C., Zisserman, A., 2000. A plane-sweep strategy for the 3D reconstruction of buildings from multiple images. Int. Arch. Photogramm. Rem. Sensing Spatial Inf. Sci. 33 (Part B2), 56–62.
Beder, C., Steffen, R., 2008. Incremental estimation without specifying a-priori covariance matrices for the novel parameters. In: Proc. of VLMP Workshop on IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, pp. 1–6.
Bodensteiner, C., Arens, M., 2012. Real-time 2D video/3D LiDAR registration. In: IEEE 21st International Conference on Pattern Recognition, Tsukuba (Japan), pp. 2206–2209.
Böhm, J., 2004. Multi-image fusion for occlusion-free façade texturing. Int. Arch. Photogramm. Rem. Sensing Spatial Inf. Sci. 35 (5), 867–872.
Brenner, C., 2000. Towards fully automatic generation of city models. Int. Arch. Photogramm. Rem. Sensing Spatial Inf. Sci. 33 (Part B3/1), 84–92.
Bulatov, D., Lavery, J.E., 2010. Reconstruction and texturing of 3D urban terrain from uncalibrated monocular images using L1 splines. Photogramm. Eng. Rem. Sensing 76 (4), 439–449.
Bulatov, D., Pohl, M., 2013. Detection of small roof details in image sequences. In: Proc. of Scandinavian Conference on Image Analysis, Espoo (Finland), 23–26 June, pp. 601–610.
Bulatov, D., Wernerus, P., Heipke, C., 2011a. Multi-view dense matching supported by triangular meshes. ISPRS J. Photogramm. Rem. Sensing 66 (6), 907–918.
Bulatov, D., Solbrig, P., Gross, H., Wernerus, P., Repasi, E., Heipke, C., 2011b. Context-based urban terrain reconstruction from UAV-videos for geoinformation applications. Int. Arch. Photogramm. Rem. Sensing Spatial Inf. Sci. 38 (Part 1/C22), 75–80.
Bulatov, D., Solbrig, P., Wernerus, P., 2012a. Ad-hoc model acquisition for combat simulation in urban terrain. In: Conference on Earth Resources and Environmental Remote Sensing/GIS Applications, Edinburgh (United Kingdom), Proceedings of SPIE 8538, 85380G, 10 p.
Bulatov, D., Rottensteiner, F., Schulz, K., 2012b. Context-based urban terrain reconstruction from images and videos. ISPRS Ann. Photogramm. Rem. Sensing Spatial Inf. Sci. 1 (3), 185–190.
Burns, J., Hanson, A., Riseman, E., 1986. Extracting straight lines. IEEE Trans. Pattern Anal. Mach. Intell. 8 (4), 425–455.
Delong, A., Osokin, A., Isack, H.N., Boykov, Y., 2012. Fast approximate energy minimization with label costs. Int. J. Comput. Vision 96 (1), 1–27.
Dubbelman, G., Esteban, I., Schutte, K., 2010. Efficient trajectory bending with applications to loop closure. In: Proc. of IEEE Conference on Intelligent Robots and Systems, Taipei (Taiwan), pp. 1–7.
Elberink, S.O., Vosselman, G., 2009. Building reconstruction by target based graph matching on incomplete laser data: analysis and limitations. Sensors 9 (8), 6101–6118.
El Hakim, S., Brenner, C., Roth, G., 1998. A multi-sensor approach to creating accurate virtual environments. ISPRS J. Photogramm. Rem. Sensing 53 (6), 379–391.
Fischer, A., Kolbe, T., Lang, F., Cremers, A., Förstner, W., Plümer, L., Steinhage, V., 1998. Extracting buildings from aerial images using hierarchical aggregation in 2D and 3D. Comput. Vision Image Understand. 72 (2), 185–203.
Fischler, M., Bolles, R., 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24 (6), 381–395.
Grady, L., 2006. Random walks for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 28 (11), 1768–1783.
Gross, H., Thönnessen, U., 2006. Extraction of lines from laser point clouds. Int. Arch. Photogramm. Rem. Sensing Spatial Inf. Sci. 36 (Part 3/W49), 86–91.
Gross, H., Thönnessen, U., Hansen, W.v., 2005. 3D-modeling of urban structures. Int. Arch. Photogramm. Rem. Sensing 36 (Part 3/W24), 137–142.
Haala, N., 2005. Multi-Sensor-Photogrammetrie – Vision oder Wirklichkeit? Habilitation, Deutsche Geodätische Kommission, München.


Haala, N., Rothermel, M., 2012. Dense multi-stereo matching for high quality digital elevation models. Photogramm. Fernerkundung Geoinf. 2012 (4), 331–343.
Haala, N., Brenner, C., Anders, K.-H., 1998. 3D urban GIS from laser altimeter and 2D map data. Int. Arch. Photogramm. Rem. Sensing Spatial Inf. Sci. 32 (Part 3/1), 339–346.
Hartigan, J., Wong, M., 1979. Algorithm AS 136: a k-means clustering algorithm. Appl. Statist. 28 (1), 100–108.
Hartley, R., Zisserman, A., 2000. Multiple View Geometry in Computer Vision. Cambridge University Press.
Henricsson, O., 1996. Analysis of image structures using color attributes and similarity relations. Ph.D. thesis, ETH Zürich, No. 11663.
Hirschmüller, H., 2008. Stereo processing by semi-global matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 30 (2), 328–341.
Huang, H., Brenner, C., Sester, M., 2013. A generative statistical approach to automatic 3D building roof reconstruction from laser scanning data. ISPRS J. Photogramm. Rem. Sensing 79, 29–43.
Lafarge, F., Mallet, C., 2012. Creating large-scale city models from 3D-point clouds: a robust approach with hybrid representation. Int. J. Comput. Vision 99 (1), 69–85.
Lafarge, F., Keriven, R., Brédif, M., Vu, H., 2013. A hybrid multi-view stereo algorithm for modeling urban scenes. IEEE Trans. Pattern Anal. Mach. Intell. 35 (1), 5–17.
Lavery, J.E., 2004. Shape-preserving approximation of multiscale univariate data by cubic L1 spline fits. Comput. Aided Geom. Des. 21 (1), 43–64.
Lepetit, V., Moreno-Noguer, F., Fua, P., 2009. EPnP: an accurate O(n) solution to the PnP problem. Int. J. Comput. Vision 81 (2), 155–166.
Martinetz, T.M., Schulten, K.J., 1991. A neural-gas network learns topologies. Artif. Neural Networks, 397–402.
Mayer, H., 2007. 3D reconstruction and visualization of urban scenes from uncalibrated wide-baseline image sequences. Photogramm. Fernerkundung Geoinf. 2007 (3), 167–176.
Meidow, J., 2012. Efficient multiple loop adjustment for computer vision tasks. Photogramm. Fernerkundung Geoinf. 2012 (5), 501–510.

Meixner, P., Leberl, F., Brédif, M., 2011. Interpretation of 2D and 3D building details on facades and roofs. Int. Arch. Photogramm. Rem. Sensing Spatial Inf. Sci. 38 (Part 3/W22), 137–142.
Rau, J.-Y., Lin, B.-C., 2011. Automatic roof model reconstruction from ALS data and 2D ground plans based on side projection and the TMR algorithm. ISPRS J. Photogramm. Rem. Sensing 66 (6), 13–27.
Rottensteiner, F., 2006. Consistent estimation of building parameters considering geometric regularities by soft constraints. Int. Arch. Photogramm. Rem. Sensing Spatial Inf. Sci. 36 (Part 3), 13–18.
Rottensteiner, F., Sohn, G., Jung, J., Gerke, M., Baillard, C., Benitez, S., Breitkopf, U., 2012. The ISPRS benchmark on urban object classification and 3D building reconstruction. ISPRS Ann. Photogramm. Rem. Sensing Spatial Inf. Sci. 1 (3), 293–298.
Rutzinger, M., Rottensteiner, F., Pfeiffer, N., 2009. A comparison of evaluation techniques for building extraction from airborne laser scanning. IEEE J. Selected Topics Appl. Earth Observation Rem. Sensing 2 (1), 11–20.
Shechtman, E., Irani, M., 2007. Matching local self-similarities across images and videos. In: Proc. of Conference on Computer Vision and Pattern Recognition, Minneapolis, pp. 1–8.
Sohn, G., Jwa, Y., Kim, H.B., Jung, J., 2012. An implicit regularization for 3D building rooftop modeling using airborne LiDAR data. ISPRS Ann. Photogramm. Rem. Sensing Spatial Inf. Sci. 2 (3), 305–310.
Toldo, R., Fusiello, A., 2008. Robust multiple structures estimation with J-linkage. In: Proc. of European Conference on Computer Vision, Marseille (France), vol. 1, pp. 537–547.
Verma, V., Kumar, R., Hsu, S., 2006. 3D building detection and modeling from aerial LiDAR data. In: IEEE Conf. on Computer Vision and Pattern Recognition, New York, vol. 2, pp. 2213–2220.
Wassenberg, J., Middelmann, W., Sanders, P., 2009. An efficient parallel algorithm for graph-based image segmentation. In: Proc. of Computer Analysis of Images and Patterns, Münster (Germany), pp. 1003–1010.
