ISPRS Journal of Photogrammetry and Remote Sensing 85 (2013) 32–43
Contents lists available at ScienceDirect
ISPRS Journal of Photogrammetry and Remote Sensing journal homepage: www.elsevier.com/locate/isprsjprs
Object detection in remote sensing imagery using a discriminatively trained mixture model Gong Cheng, Junwei Han ⇑, Lei Guo, Xiaoliang Qian, Peicheng Zhou, Xiwen Yao, Xintao Hu Department of Control and Information, School of Automation, Northwestern Polytechnical University, 127 Youyi Xilu, Xi’an 710072, PR China
a r t i c l e
i n f o
Article history: Received 23 November 2012 Received in revised form 20 July 2013 Accepted 12 August 2013
Keywords: Object detection Remote sensing imagery Part-based model Mixture model
a b s t r a c t Automatically detecting objects with complex appearance and arbitrary orientations in remote sensing imagery (RSI) is a big challenge. To explore a possible solution to the problem, this paper develops an object detection framework using a discriminatively trained mixture model. It is mainly composed of two stages: model training and object detection. In the model training stage, multi-scale histogram of oriented gradients (HOG) feature pyramids of all training samples are constructed. A mixture of multi-scale deformable part-based models is then trained for each object category by training a latent Support Vector Machine (SVM), where each part-based model is composed of a coarse root filter, a set of higher resolution part filters, and a set of deformation models. In the object detection stage, given a test imagery, its multi-scale HOG feature pyramid is firstly constructed. Then, object detection is performed by computing and thresholding the response of the mixture model. The quantitative comparisons with state-of-the-art approaches on two datasets demonstrate the effectiveness of the developed framework. Ó 2013 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS) Published by Elsevier B.V. All rights reserved.
1. Introduction Object detection in remote sensing imagery (RSI) is very important for a wide range of applications such as environment monitoring (Durieux et al., 2008), image analysis and classification (Blaschke, 2010; Mallinis et al., 2008; Tzotsos et al., 2011; Xu et al., 2010), change detection (Tong et al., 2012; Walter, 2004), and geographic image retrieval (Xie et al., 2008). With the development of remote sensing technology, a large number of remote sensing imageries with high spatial resolutions have become available, which facilitates building a superior object detector. However, it is still a challenging problem to achieve reliable object detection in RSI not only due to object appearance, orientation, scale variations, but also because of object non-rigid deformation and occlusion. During the past decades, object detection in RSI has been extensively studied. Some researchers have performed object detection using wavelet multi-resolution analysis (Tello et al., 2005; Li et al., 2010b). For example, Li et al. (2010b) developed an algorithm for straight road edge detection from high resolution RSI based on the ridgelet transform with the revised parallel-beam Radon transform. A number of object detectors were built using scale invariant feature transform (SIFT) features (Sirmacek and Unsalan, 2009) or SIFT-based bag-of-visual-words (BOVW) features (Cheng et al., ⇑ Corresponding author. Tel./fax: +86 29 88431318. E-mail address:
[email protected] (J. Han).
2013; Sun et al., 2012; Xu et al., 2010). Specifically, Sirmacek and Unsalan (2009) proposed to detect urban areas and buildings from very high resolution (VHR) satellite imagery using the SIFT keypoints and graph theory. Sun et al. (2012) presented an automatic target detection framework using a spatial sparse coding bag-ofwords model. Other groups of researchers have applied image segmentation techniques to detect a variety of geospatial objects such as man-made objects in aerial imageries (Cao and Yang, 2007) and small target detection from high resolution panchromatic satellite imageries (Segl and Kaufmann, 2001). In addition, Tournaire and Paparoditis (2009) proposed a geometric stochastic approach based on marked point processes for road mark detection from high resolution aerial imageries and the experimental results have demonstrated its effectiveness. Recently, several object detectors have also been investigated such as building detection (Aytekın et al., 2012; Kim and Muller, 2011), and ship detection (Corbane et al., 2010; Tello et al., 2005). Most of the above described approaches are non-learning models. They may be effective for detecting objects with simple appearance and small variations. However, the prior knowledge acquired from the training stage cannot be used in these methods, which severely limited their detection performance. With the advance of machine learning techniques, many approaches regarded object detection as a classification problem. In contrast to traditional non-learning methods, learning-based methods can obtain useful prior knowledge in advance from training samples via constructing and training supervised classifiers.
0924-2716/$ - see front matter Ó 2013 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS) Published by Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.isprsjprs.2013.08.001
G. Cheng et al. / ISPRS Journal of Photogrammetry and Remote Sensing 85 (2013) 32–43
These trained detectors are therefore more reliable. A variety of supervised classifiers have been utilized such as Support Vector Machines (SVMs) (Inglada, 2007; Li et al., 2010a; Sun et al., 2012), Gaussian Mixture Models (GMMs) (Bhagavathy and Manjunath, 2006), boosting classifiers (Grabner et al., 2008), Quadratic Discriminant Analysis (QDA) (Eikvil et al., 2009), and Hough Forests (Lei et al., 2012). To be specific, Li et al. (2010a) proposed to detect the building damage in urban environments from multitemporal VHR imagery using the one-class SVM which was trained from the damaged building samples. Bhagavathy and Manjunath (2006) proposed a method to learn a GMM from training samples using texture motifs and then detect compound objects based on the learned model. Grabner et al. (2008) developed an online boosting algorithm for car detection from large-scale aerial imageries. Eikvil et al. (2009) proposed a vehicle detection approach in high resolution satellite imagery combining image segmentation with two stages of object classification. Lei et al. (2012) presented a novel colour-enhanced rotation-invariant Hough Forest method for detecting geospatial objects in RSI. With the help of prior information obtained from training samples, most of these methods have achieved good detection performance. Recently, the availability of more and more remote sensing imageries with high spatial resolution makes it possible to train more refined object detectors. The part-based models (Bar-Hillel et al., 2005; Crandall and Huttenlocher, 2006; Felzenszwalb and Huttenlocher, 2005; Felzenszwalb et al., 2008, 2010; Kumar et al., 2009) that represent each object category by a collection of parts arranged in a deformable configuration have offered a good solution to this problem. Each part of the model captures local appearance properties of an object and the spatial relationships between parts are represented by spring-like connections between pairs of parts. In addition, as pointed out by Felzenszwalb et al. (2010), the model can be trained using a weakly supervised learning method in which it is unnecessary to provide part locations in the training data. Weakly supervised learning has the potential to achieve better detection performance by automatically finding effective parts from training data. Although part-based models (Felzenszwalb et al., 2008, 2010) have achieved impressive success in the detection of persons, cars, horses, and other objects in ground-shot images, these approaches cannot be directly used to detect objects from remote sensing imageries because they are incapable of effectively handling the problem of target rotation variations. Essentially, this problem is not critical in detecting persons, cars, horses, etc. from ground-shot images because these objects are typically in an upright
33
orientation due to the Earth’s gravity and orientation variations across images are generally small. On the contrary, geospatial objects in RSI such as airports, airplanes, ships, and vehicles, usually have many different orientations since remote sensing imageries are taken from the upper airspace and arbitrary viewpoints. In order to address this problem, inspired by the existing partbased models (Felzenszwalb et al., 2008, 2010), this paper develops a geospatial object detection framework using a discriminative mixture of multi-scale deformable part-based models. Each partbased model can detect objects in a certain range of orientation. The combination of a number of independent part-based models into a mixture model can result in a rotation-invariant object detector. To our best knowledge, this work is amongst the earliest efforts to improve and apply the part-based models to geospatial object detection. The reminder of the paper is organized as follows. Section 2 briefly describes the developed object detection framework. Sections 3 and 4 detail the mixture model and its training process, respectively. Section 5 introduces the object detection using the trained mixture model. Section 6 presents experimental results. Finally, conclusions are drawn in Section 7. 2. Framework overview The flowchart of our developed framework is illustrated in Fig. 1. It is mainly composed of two stages: model training and object detection. In the first stage, we train a mixture model for each object category using a weakly supervised learning method in which the positive training samples are obtained by drawing bounding boxes around the objects of interest. To be specific, multi-scale histogram of oriented gradients (HOG) feature pyramids of all training samples are constructed first. A mixture of multi-scale deformable part-based models, as illustrated in Fig. 4, is then trained for each object category by learning a latent SVM. Each part-based model is composed of a coarse root filter, a set of higher resolution part filters (the HOG feature map used for the part filters are computed at finer resolution (half the pixel size) relative to the feature map corresponding to the root filter, which is illustrated in Fig. 2), and a set of deformation models. The root filter is designed to capture the global shape of the object. The part filters capture local appearance properties of the object. The deformation models define the location of each part filter relative to the root filter. In the object detection stage, given a test imagery, its HOG feature pyramid is constructed to describe the imagery. Then, the responses of the trained mixture model at each position and each
Fig. 1. Framework flowchart for our developed object detection in RSI.
34
G. Cheng et al. / ISPRS Journal of Photogrammetry and Remote Sensing 85 (2013) 32–43
Fig. 2. An image pyramid and its corresponding HOG feature pyramid. Note that the feature map used for the part filters (blue rectangles) are computed at finer resolution (half the pixel size) relative to the feature map corresponding to the root filter (red rectangle). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
level of the HOG feature pyramid are computed. Finally, the object detection is implemented by thresholding the responses and eliminating repeated detections via non-maximum suppression.
3. Mixture model 3.1. HOG feature extraction Feature extraction plays a critical role in developing an object detector. Recent papers have demonstrated that texture, edge and local shape features are reliable and robust for object detection in RSI (Bhagavathy and Manjunath, 2006; Cao and Yang, 2007; Lei et al., 2012; Li et al., 2010a,b). As a dense version of the SIFT feature (Lowe, 2004), HOG feature (Dalal and Triggs, 2005) has shown great success in object detection (Dalal and Triggs, 2005; Felzenszwalb et al., 2008, 2010; Wang et al., 2009; Zhu et al., 2006), and has been widely acknowledged as one of the best features to capture the edge or local shape information of the object. Consequently, we use the HOG as the visual feature for building the mixture model. We implement the HOG feature extraction by following the work of Dalal and Triggs (2005). To be specific, given an imagery, (1) it is firstly divided into 8 8 non-overlapping pixel regions which are called ‘cells’ hereinafter. (2) For each cell, we accumulate a local one-dimensional histogram of gradient orientations over the pixels in the cell. The gradient at each pixel is quantized into one of nine orientation bins. Each pixel votes into the corresponding orientation bin with a voting weight based on the gradient magnitude. For colour images, we compute the gradient of each colour channel and choose the highest gradient magnitude at each pixel from all three channels (Felzenszwalb et al., 2008; Santosh et al., 2010). The method can handle multispectral/hyperspectral imagery containing more than three channels by following the works of Benediktsson et al. (2005) and Huang and Zhang (2013). We firstly adopt principal component analysis (PCA) technology (Landgrebe, 2003) to generate a new PCA image with three principal components (i.e. a false colour image with three channels). As demonstrated by the work of Huang and Zhang (2013), the PCA image can contain over 99% of the information of the multispectral/ hyperspectral imagery. Then, the one-dimensional histogram of gradient orientations of the PCA image is extracted by the same procedure as used for colour image. (3) Each 2 2 neighbourhood of cells is grouped into one block and a robust normalization process based on 2-norm is run on each block to provide strong illumi-
nation invariance, which finally forms a 36-dimensional HOG feature vector. The normalization process can be formulated as qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi e ¼ V= kVk2 þ n2 , where V denotes the unnormalized descriptor V 2 e denotes the normalized descriptor vector, k k denotes vector, V 2 2-norm, n is a small regularization constant and the results are insensitive to the value of n over a large range (Dalal and Triggs, 2005). We then construct an 11-level HOG feature pyramid for each imagery as shown in Fig. 2. To be specific, we (1) repeatedly smooth the original imagery I(x, y) using a variable-scale Gaussian function ðx 1 Gðx; y; rl Þ ¼ 2pr 2 e
2 þy2 Þ=2 2 l
r
, where rl is the scale factor of the lth le-
l
vel and we set rl ¼ 1:6 2ðl1Þ=k by following the work of Lowe (2004). Here, k is the number of levels in an octave and it is set to 5 in terms of the work of Felzenszwalb et al. (2010); (2) sub-sample the Gaussian smoothed images by a sampling factor of 2ðl1Þ=k to obtain a standard image pyramid; (3) compute the HOG feature maps from each level of the image pyramid to construct a HOG feature pyramid. Thus, the features at the top of the pyramid capture coarse gradients over large areas, and the features at the bottom of the pyramid capture finer gradients over small areas. 3.2. Mixture model description The mixture model is actually a mixture of m deformable partbased models. Each part-based model (called ‘sub-model’ hereinafter) consists of a root filter, a set of part filters, and a set of deformation models (Felzenszwalb et al., 2010). Fig. 3 illustrates the structure and Table 1 lists a few key mathematical notations of the mixture model. Let F be a w h filter (e.g. root filter or part filter), i.e. a multidimensional matrix specifying the weights for a sliding-window, H be a HOG feature pyramid, d = (x, y, l) specify a position (x, y) in the lth level (where x corresponds to the column number in the imagery and y corresponds to the row number), and U(H, d, w, h) denote the HOG features in the w h sliding-window with top-left corner at d. The response of the filter F at d can be computed through the dot product of the filter F and U(H, d, w, h), i.e.
ResponseðF; dÞ ¼ FUðH; d; w; hÞ X ¼ Fðx0 ; y0 ÞU½H; dðx þ x0 ; y 16x0 6w;16y0 6h
þ y0 ; lÞ; w; h
ð1Þ
35
G. Cheng et al. / ISPRS Journal of Photogrammetry and Remote Sensing 85 (2013) 32–43
Fig. 3. The structure of the mixture model.
Table 1 A few key mathematical notations of the mixture model. Sign
Description
M = (M1, . . ., Mm)
M is a mixture model; Mi is the ith sub-model; m is the total number of sub-models in M, 1 6 i 6 m
M i ¼ ðFi0 ; Fi1 ; . . . ; Fin ; Q i1 ; . . . ; Q in ; ai Þ
Fi0 is the root filter of Mi; Fij is the jth part filter of Mi; Q ij is the jth deformation model of Mi; n is the total number of part filters in each sub-model, 1 6 j 6 n; ai is a real-valued bias term that makes a number of sub-models comparable (Felzenszwalb et al., 2010)
Q ij
¼
i ðbj ; cij Þ
i
bj is a two-dimensional vector specifying the anchor position of the jth part filter relative to Fi0 ; cij is a four-dimensional vector specifying the coefficients of a quadratic function that is used to define the deformation cost for each possible placement of the part filter relative to the anchor position
In the following text, we will use FU(H, d) to denote FU(H, d, w, h) because the sliding-window dimensions are implicitly specified by the dimensions of F. A placement of mixture model M in a HOG feature pyramid is i i i given by D = (D1, D2, . . ., Dm), where Di ¼ ðd0 ; d1 ; . . . ; dn Þ (1 6 i 6 m) i i i i is the placement of sub-model Mi, and dj ¼ ðxj ; yj ; lj Þ is the location of the root filter when j = 0 and the location of the jth part filter when j P 1. The response of the mixture model at placement D is defined by the maximum response over all sub-models:
Response ðDÞ ¼ max ResponseðDi Þ
ð2Þ
i¼1;...;m
where ResponseðDi Þ is the response of Mi at placement Di, which is defined by the responses of each filter at their respective locations minus a deformation cost that depends on the relative positions of each part filter with respect to the root filter, plus the bias term: i
i
i
ResponseðDi Þ ¼ Responseðd0 ; d1 ; . . . ; dn Þ ¼
n n X X i Fij UðH; dj Þ c ij /ðDxij ; Dyij Þ þ ai j¼0
ð3Þ
j¼1
where
ð7Þ
is the concatenation of the parameter vectors of each sub-model, and
bi ¼ ðFi0 ; . . . ; Fin ; ci1 ; . . . ; c in ; ai Þ
ð8Þ
is the parameter vector of sub-model Mi (1 6 i 6 m), and
wðH; DÞ ¼ ð0; . . . ; 0; wi ðH; Di Þ; 0; . . . ; 0Þ
ð9Þ
is sparse, with non-zero entry depending on the sub-model whose response is maximum over all sub-models, and i
i
wi ðH; Di Þ ¼ ðUðH; d0 Þ; . . . ; UðH; dn Þ; /ðDxi1 ; Dyi1 Þ; . . . ; /ðDxin ; Dyin Þ; 1Þ ð10Þ is the concatenation of HOG features from feature pyramid H and part deformation features of sub-model Mi. The Eqs. (6)–(10) illustrate the connection between the mixture model and linear classifiers. We use the relationship to learn the mixture model parameters through a latent SVM framework that will be described in the next subsection. 3.3. Latent SVM
i
ðDxij ; Dyij Þ ¼ ðxij ; yij Þ ½2ðxi0 ; yi0 Þ þ bj
ð4Þ
gives the displacement of the jth part filter relative to the root filter location and 2
2
/ðDxij ; Dyij Þ ¼ ðDxij ; Dyij ; ðDxij Þ ; ðDyij Þ Þ
ð5Þ
is the jth part deformation feature. Furthermore, the response of the mixture model at placement D can also be expressed in terms of the following dot product:
ResponseðDÞ ¼ bwðH; DÞ ¼ bi wi ðH; Di Þ where
b ¼ ðb1 ; . . . ; bi ; . . . ; bm Þ
ð6Þ
The training samples partially labelled by the weakly supervised manner (the part locations are not provided) are given by S ¼ ðhx1 ; y1 i; . . . ; hxK ; yK iÞ, where yk 2 {1, 1}, 1 6 k 6 K, K is the total number of training samples. To infer the mixture model parameters using the partially labelled training samples, we use the latent SVM framework which can be viewed as a latent variable formulation of multiple-instance SVM proposed by Andrews et al. (2002). In the latent SVM, the response of each training sample xk can be computed by a function as:
fb ðxk Þ ¼ max bwðHðxk Þ; zÞ z2Zðxk Þ
ð11Þ
36
G. Cheng et al. / ISPRS Journal of Photogrammetry and Remote Sensing 85 (2013) 32–43
Fig. 4. Two mixture models for the airport and airplane categories learned on training datasets: (a) the mixture model for the airport category; (b) the mixture model for the airplane category. The visualization of the root filters and part filters shows the positive weights at different orientations, and the visualization of the deformation models reflects the deformation cost of placing the center of a part filter at different locations relative to the root filter (darker areas represent smaller cost and vice versa).
where b is the vector of mixture model parameters defined by Eq. (7), w(H(xk), z) is a concatenation of HOG features from feature pyramid H(xk) and part deformation features (similar to Eq. (10)), and z are latent values specifying the placement of mixture model in the feature pyramid H(xk). The set Z(xk) defines the possible latent values for the training sample xk. We intend to train b from partially labelled training samples S ¼ ðhx1 ; y1 i; . . . ; hxK ; yK iÞ by optimizing the following objective function: K X 1 b ðSÞ ¼ arg min kbk2 þ maxð0; 1 yk fb ðxk ÞÞ 2 k¼1
ð12Þ
where max (0, 1 ykfb(xk)) is the standard hinge loss. As can be seen from Eq. (12), by restricting Z(xk) to a single choice for sample xk, fb(xk) becomes linear in b, and we can obtain a linear SVM as a special case of a latent SVM.
them into m groups P1, . . ., Pm, where aspect ratio is used as an indicator of intra-class variations. We then train m root filter F10 ; . . . ; Fm 0 (each root filter for one sub-model) using a standard SVM by following Dalal and Triggs (2005). The dimension of Fi0 (1 6 i 6 m) is defined according to the mean aspect ratios of the bounding boxes in Pi.
4.2. Updating root filters We firstly combine the initial m root filters into a mixture model with no parts. In this case the sub-model labels and the root locations are the latent variables for each sample. We then re-train the parameters of the mixture model with the full positive samples and negative samples using the stochastic coordinate descent algorithm (Felzenszwalb et al., 2008, 2010).
4. Mixture model training
4.3. Initializing part filters
For a particular object category, the training samples S ¼ ðhx1 ; y1 i; . . . ; hxK ; yK iÞ are composed of positive samples P and negative samples N. The positive samples are small image patches obtained by drawing bounding boxes around the targets in the training images. The negative samples are a set of image patches that do not contain the instance of the object category. In this section we will briefly describe how to train a mixture model for a particular object category. Its details can be found in Felzenszwalb et al. (2008, 2010).
The part filters in each sub-model are initialized using a simple heuristic method. To be specific, for a sub-model with n part filters, we find the region with the highest energy in its root filter first. The region dimension is equal to half of the size of the part filter, and the energy of the region Eregion is computed by the 2-norm of the positive weights in this region using qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2 2 2 1 2 T where Eregion ¼ kwopt k2 ¼ ðwopt Þ þ ðwopt Þ þ þ ðwopt Þ ,
4.1. Initializing root filters To train a mixture model with m sub-models, we firstly sort the image patches in P according to their aspect ratios and split
wopt ¼ ðw1opt ; . . . ; wtopt ; . . . ; wTopt Þ is a vector of the positive weights, and wtopt is the tth positive weight (1 6 t 6 T). Then, we place the first part filter to cover the highest energy region and set the energy of the covered portion to be zero. In this analogy, we search the next highest energy region, until n parts have been placed for each sub-model.
G. Cheng et al. / ISPRS Journal of Photogrammetry and Remote Sensing 85 (2013) 32–43
4.4. Updating the mixture model We combine the root filters with the part filters to obtain a mixture model. In this case, the sub-model labels and the part locations are all the latent variables for each sample. To update the model we should construct new training samples. For each image patch in the positive samples, we apply the existing model at all positions and levels with at least a 50% overlap with the given image patch. Among these sliding-windows we select the placement with the highest response as the new positive sample. Negative samples are selected by finding sliding-windows in images, which have high response, but do not contain the target object. A new mixture model is then retrained by running stochastic coordinate descent algorithm (Felzenszwalb et al., 2008, 2010) on the new positive and negative samples. We repeatedly update the model using the scheme until Eq. (12) converges to a local minimum (after about 10 times of updating) to obtain the final mixture model. Based on the above procedure, we trained two mixture models for the airport and airplane categories on our training datasets (datasets are described in Section 6.1) respectively, which are shown in Fig. 4. In Fig. 4a and b, each mixture model consists of six sub-models and each row corresponds to a sub-model. Each sub-model is composed of a coarse root filter, eight higher resolution part filters, and eight deformation models. 5. Object detection 5.1. Object detection in RSI using mixture model Given a test remote sensing imagery, the object detection is performed using the following steps: (1) Extract the HOG features of the test imagery using the technique described in Section 3.1. (2) Compute the response of the mixture model. Specifically, compute the response of each sub-model independently and look for the maximum responses over all sub-models at each position and each level of the HOG feature pyramid. Then, object detection is carried out by thresholding the response. (3) Apply the post-processing to eliminate repeated detections via non-maximum suppression. As can be seen from Eq. (2), once the response of each sub-model is computed, we can obtain the final response of the mixture model. Consequently, we will next focus on how to compute the response of a single sub-model. To acquire the response of sub-model Mi, we need to compute the responses at each root location. The root location is specified by a sliding-window whose dimension equals the size of the root filter. For a fixed root location, each part filter can be regarded as a function of the root location specified by its deformation model. Thus, the responses at each root location can be obtained via searching the best placement of all part filters, which can be formulated as: i
i
i
i
Responseðd0 Þ ¼ max Responseðd0 ; d1 ; . . . ; dn Þ di1 ;...;din i
i
i
ð13Þ
where Responseðd0 ; d1 ; . . . ; dn Þ is the response of Mi at placement i i i ðd0 ; d1 ; . . . ; dn Þ which can be computed using Eq. (3), and i i Responseðd0 Þ is the response of Mi at root location d0 . The i Responseðd0 Þ can be computed using dynamic programming and the generalized distance transform algorithm (Felzenszwalb and Huttenlocher, 2004, 2005), which spreads the highest responses of each part filter to nearby locations to obtain their maximum contrii bution to Responseðd0 Þ. The maximum response over all sub-models is assigned to the corresponding mixture model. In this way, the responses of the mixture model at each position and each level of the HOG feature
37
pyramid can be obtained. Finally, object detection is carried out by thresholding the response and each detection is defined by a response and a bounding box derived from corresponding root filter location. 5.2. Post-processing In practice, when we use the above described detection approach solely, a number of sliding-windows (image patches) near each instance of an object are likely to be detected as the targets, which results in multiple overlapping detections for a single object. We therefore adopt a post-processing procedure to eliminate repeated detections by using a non-maximum suppression strategy (Felzenszwalb et al., 2010). We sort the detections by their responses and greedily select the detections with highest responses. Those detections that are at least 50% covered by a bounding box of a previously selected detection are skipped. 6. Results and discussion 6.1. Datasets We evaluated the proposed work using two different types of RSI databases: a low spatial resolution airport imagery database (http://datamirror.csdb.cn/index.jsp) from Landsat-7 Enhanced Thematic Mapper Plus (ETM+) sensor with a repeat interval of 16 days and a high spatial resolution airplane imagery dataset from Google Earth. The first database is from Landsat-7 satellite, which consists of 65 30-m-spatial-resolution shortwave-infrared (SWIR) imageries and 31 15-m-spatial-resolution panchromatic imageries of China. Among these 96 imageries, 60 SWIR imageries and 26 panchromatic imageries contain airport targets and the rest 10 imageries do not contain airport targets. These 96 imageries were taken with a time spanning from December 9, 1999 to March 19, 2003 and have gone through standard terrain correction (Level 1T) (i.e. systematic radiometric and geometric accuracy by incorporating ground control points) (LPD, 2013). In our experiment, 25 imageries containing airports were selected as the training data and the remaining 71 imageries were used for testing. From the training imageries, we manually labeled 56 airports as positive samples and randomly selected 200 image patches not containing airports as negative samples for the airport model training. In addition, 125 airports from test imageries were manually labeled to use as ground truth. The second database is from Google Earth, which consists of 71 imageries in which 61 imageries contain airplane targets. These 71 imageries are not radiometrically corrected and they are satellite imageries of a number of airports such as LHR (London, UK), CDG (Paris, France), FRA (Frankfort, Germany), LAX (Los Angeles, USA), ATL (Atlanta, USA), DEN (Denver, USA) and HND (Tokyo, Japan). Among them, 18 imageries were randomly selected for training and the rest 53 imageries were used for testing. From the training imageries, we manually labeled 160 airplanes as positive samples and randomly selected 200 image patches not containing airplanes as negative samples for the airplane model training. Moreover, 366 airplanes from 53 test imageries were manually labeled as ground truth. Most of these airplane imageries have different orientations, colours and sizes. Fig. 5 shows some target samples from these two RSI datasets. 6.2. Evaluation criterions By following the works of Santosh et al. (2010) and Felzenszwalb et al. (2010), a detection is considered to be correct if its bounding box overlaps more than 50% with the ground truth
38
G. Cheng et al. / ISPRS Journal of Photogrammetry and Remote Sensing 85 (2013) 32–43
bounding box, otherwise the detection is considered as a false positive. In addition, if several bounding boxes overlap with a same single ground truth bounding box, only one is considered as true positive and the others are considered as false positives. Similar to Bhagavathy and Manjunath (2006) and Felzenszwalb et al. (2008), we adopted the standard Precision–Recall curve (Buckland and Gey, 1994) to quantitatively evaluate the developed framework. The precision measures the fraction of detections that are true positive and the recall measures the fraction of positive examples that are correctly identified. Let Nc, Nf, and Nt denote the number of true positives (i.e. the number of correctly detected targets), the number of false positives (i.e. the number of false alarms), and the number of total positives (i.e. the number of actual targets) in test imageries. The Precision (‘‘1 Precision’’ is the false alarm rate) and Recall (‘‘1 Recall’’ is the miss alarm rate) can be formulated as:
Precision ¼ Nc =ðNc þ Nf Þ
ð14Þ
Recall ¼ Nc =Nt
ð15Þ
In addition, we also adopted the Average Precision (AP) (Everingham et al., 2007) to evaluate an object detection approach, which is a standard metric used by PASCAL challenge and can be
obtained by computing the area under the Precision–Recall curve. The higher the AP value is, the better the performance and vice versa. 6.3. Parameter analysis In the developed framework, there are two critical parameters: m (number of sub-models) and n (number of part filters). We constructed experiments on the airplane and airport imagery databases, respectively, to evaluate how detection performance is affected by the values of these two parameters. Fig. 6a shows the airplane and airport detection results when varying m (m = 2, 4, 6, 8, 10) and fixing the value of n to be 8. Fig. 6b shows the detection results when varying n (n = 4, 6, 8, 10) and fixing the value of m to be 6. As can be seen from Fig. 6a and b, the values of parameters m and n influence the detection performance moderately. Specifically, the detection results measured by AP values were improved in a certain range with the increase of m and n, and then dropped off. When m = 6 and n = 8, the best performance can be achieved in both airplane detection and airport detection. Consequently, we empirically set m = 6 and n = 8 in our all evaluations. In addition, as can be seen from Fig. 6, there is an obvious tradeoff between Recall and Precision because the threshold used for thresholding the responses of the mixture model can considerably affect the number of correct detections and the number of false
Fig. 5. A number of target samples from the airport and airplane imagery databases: (a) airport samples with their locations and scales specified using white texts; (b) airplane samples from Google Earth. Most of these airplane targets have different orientations, colours and sizes. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
G. Cheng et al. / ISPRS Journal of Photogrammetry and Remote Sensing 85 (2013) 32–43
39
Fig. 6. Precision–Recall curves by varying m and n: (a) airplane and airport detection results when varying m (m = 2, 4, 6, 8, 10) while fixing the value of n to be 8; (b) airplane and airport detection results when varying n (n = 4, 6, 8, 10) while fixing the value of m to be 6.
detections. Specifically, a low threshold can achieve a good Recall but a poor Precision and vice versa. The optimal threshold is considered to have the highest F1-measure and F1-measure can be calculated as:
F1-measure ¼
2 Recall Precision Recall þ Precision
ð16Þ
The developed object detection framework was implemented on a 24-core Lenovo Server with 2.8 GHz Intel Xeon CPU, 64 GB RAM and LINUX operation system. Our programming platform is MATLAB R2010b. A fixed set of optimal parameter values were utilized for all test imageries. For a remote sensing imagery with the size of 1000 800, the running time is about 6 s. The running time of object detection is dominated by the cost of matching each part filter to the imagery. Therefore, we can accelerate the detection speed by (1) sharing parts between sub-models to reduce the overall number of model parameters; (2) improving the computational efficiency by using other programming platform such as Visual C++. 6.4. Comparison with previous works The object detection framework was evaluated on the airport RSI database and the airplane RSI database, respectively. Fig. 7
shows a number of airport detection examples and corresponding sub-models. The three sub-models correspond to airports with approximately vertical, horizontal, and diagonal orientation, respectively. The first column (from left to right) of Fig. 7 shows the placements of root filters and part filters, where the red rectangles correspond to root filters shown in the third column and the blue rectangles correspond to part filters shown in the fourth column. The second column shows the final detection results. Fig. 8 shows a number of airplane detection results using the developed framework. As can be seen from Fig. 8, although the airplanes have different orientations and different sizes, the developed framework has successfully detected and located most of them. However, there is a miss alarm in Fig. 8a because the contrast between the target and the background is quite low. There is a false alarm which is actually a passenger terminal shown in Fig. 8b because the building is quite similar to airplanes in structure and shape. To quantitatively evaluate the proposed work, we compared it with some state-of-the-art classification-based algorithms: Tao’s method (Tao et al., 2011), Xu’s method (Xu et al., 2010), Sun’s method (Sun et al., 2012), Cheng’s method (Cheng et al., 2013). It should be pointed out that Tao’s method (Tao et al., 2011) was only used for airport detection in our comparison experiments because this algorithm was customized for the specific target of airports in which some unique texture and shape features derived from
40
G. Cheng et al. / ISPRS Journal of Photogrammetry and Remote Sensing 85 (2013) 32–43
Fig. 7. Airport detection examples (locations are specified using white text): (a) vertical airport detection and corresponding sub-model; (b) horizontal airport detection and corresponding sub-model; (c) diagonal airport detection and corresponding sub-model.
airports were utilized. For fair comparison, we (1) adopted the same training dataset and test dataset for various approaches; (2) used the same libsvm toolbox (downloaded from http:// www.csie.ntu.edu.tw/~cjlin/libsvm/) to train SVMs for all four methods (K-nearest neighbor was replaced by SVM for Cheng’s method to obtain Precision–Recall curve); (3) implemented these four algorithms by adopting multi-scale scanning windows for each imagery which was similar to our multi-scale image pyramid. According to the references (Xu et al., 2010; Sun et al., 2012; Cheng et al., 2013), the vocabulary size was set to 450, 400, and 1800 for Xu’s method, for Sun’s method, and for Cheng’s method, respectively. Fig. 9a and b show the quantitative comparison results for airport and airplane targets, respectively, and the AP values are denoted in parentheses. 6.5. Discussion It can be seen from Fig. 9 that the developed framework outperforms the other four state-of-the-art approaches on both target classes and improves the AP values significantly (on average our method improved previous algorithms by 19.05% and 22.89% for airport detection and airplane detection, respectively). With the same Recall, the Precision of our method is higher than that of the other four approaches, which means that the false alarm rate of our method is lower with the same true positives. With the same Precision, the Recall of our method is also higher than that of the other four approaches, which means that our method can detect more actual targets with the same false alarm rate. The promising comparison results of our developed object detection framework may be explained using the following reasons: (1) Tao’s approach locates the possible regions of candidates first, and then performs airport detection using a texture classier for each candidate regions. Note that the candidates detection is based on one-to-one SIFT feature matching between the training samples and the test imageries. It is effective to detect airports in imageries with simple background. However, for imageries with cluttered background, a large number of false non-airport features are produced, which significantly reduces the matching accuracy. Moreover, the one-to-one matching may fail to handle the case
that the target is partially covered by the matched SIFT points. These factors severely reduce the detection performance. (2) The common mechanism of Xu’s method, Sun’s method and Cheng’s method is that they represent each image patch as a histogram vector using the statistics of the occurrence of visual words (i.e. BOVW model). A SVM classifier (Xu et al., 2010; Sun et al., 2012) or probability Latent Semantic Analysis (pLSA) model (Cheng et al., 2013) is then trained based on the obtained histogram to implement object detection. However, most of BOVW-modelbased methods ignore the spatial contextual relationships among SIFT features. Consequently, their detection performances are severely limited for objects with significant appearance variations. (3) Our developed framework performs object detection using a mixture of part-based models. Each part-based model represents an object by two levels of filters: a lower-resolution root filter and a set of higher-resolution part filters arranged in a flexible spatial configuration. Comparing with the above-mentioned classification-based methods, the proposed method has three major advantages: First, two layers of HOG features are used in each part-based model. The first layer corresponds to part filters that capture local appearance properties of an object while the second layer corresponds to root filter that captures the global shape of an object. The special structure of combination of two layers of filters provides more detailed and powerful description for a geospatial object, which can lead to a better detection performance. Second, the part filters in each part-based model are arranged in a deformable spatial configuration, which can effectively handle target nonrigid deformations, appearance variations, and occlusions. Third, each part-based model detects object in a certain range of orientation. The mixture model by combining of a number of independent part-based models is robust to target rotation variations. 7. Conclusion An effective object detection framework based on a discriminatively trained mixture model has been developed in this paper. The mixture model is composed of a number of independent multiscale deformable part-based models and each part-based model can detect objects in a certain range of orientation. Thus, the
G. Cheng et al. / ISPRS Journal of Photogrammetry and Remote Sensing 85 (2013) 32–43
41
Fig. 8. Some airplane detection results (red rectangle corresponds to correct detection, yellow rectangle corresponds to false alarm, and green rectangle corresponds to miss alarm): (a) airplane detection results of Denver international airport from Google Earth dated 8 October 2012. The scene centre is 39°510 3100 N, 104°400 0700 W; (b) airplane detection results of London Heathrow international airport from Google Earth dated 5 March 2006. The scene centre is 51°280 2500 N, 0°270 4100 W. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 9. Quantitative comparisons of the proposed framework and some state-of-the-art approaches: (a) Precision–Recall curves of the developed framework and four classification-based approaches for airport detection; (b) Precision–Recall curves of the developed framework and three classification-based approaches for airplane detection.
42
G. Cheng et al. / ISPRS Journal of Photogrammetry and Remote Sensing 85 (2013) 32–43
combination of all independent part-based models into a mixture model can reduce the problem that conventional methods are not robust to target rotation variations. The performance of the developed framework has been evaluated qualitatively and quantitatively using two different types of RSI datasets. Comprehensive comparisons with state-of-the-art approaches have demonstrated the effectiveness of the developed object detection framework. Although our developed method has been shown to be effective to handle target rotation problem and obtained promising detection performance, some problems still exist: (1) Some geospatial objects, having similar appearances to targets in structure and shape (e.g. straight roads and coastlines for airport, passenger terminals for airplane), may result in the false alarms; (2) cluttered background and low contrast between the target and the background may lead to miss alarms; (3) since each sub-model in the mixture model has independent part filters, the number of model parameters and the computational resources increase linearly with the number of part filters and sub-models. Our future work may include the following issues: (1) integrate contextual cues with the developed framework to help reduce misclassification and improve object detection performance; (2) share parts between sub-models to reduce the overall number of model parameters and improve the computational efficiency; (3) extend the developed framework to a large number of target classes (e.g. building, ship, vehicle, etc.) to form an ‘‘object bank’’, which can be regarded as a high-level image representation for semantic feature extraction and scene understanding. Acknowledgements The authors appreciate the constructive suggestions from Dr. Alistair Sutherland. J. Han was supported by the National Science Foundation of China under Grant 61005018 and 91120005, NPUFFR-JC20120237, and Program for New Century Excellent Talents in University under Grant NCET-10-0079. X. Hu was supported by the National Science Foundation of China under Grant 61103061, the China Postdoctoral Science Foundation under Grant 20110490174, and Special Grade of the Financial Support from the China-Postdoctoral Science Foundation under Grant 2012T50819. References Andrews, S., Tsochantaridis, I., Hofmann, T., 2002. Support vector machines for multiple-instance learning. In: Proceedings of Neural Information Processing Systems, Vancouver, Canada, 9–14 December, pp. 561–568. _ Düzgün, Sß., 2012. Unsupervised building detection Aytekın, Ö., Erener, A., Ulusoy, I., in complex urban environments from multispectral satellite imagery. International Journal of Remote Sensing 33 (7), 2152–2177. Bar-Hillel, A., Hertz, T., Weinshall, D., 2005. Object class recognition by boosting a part-based model. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), San Diego, CA, 20–26 June, pp. 702–709. Benediktsson, J.A., Palmason, J.A., Sveinsson, J.R., 2005. Classification of hyperspectral data from urban areas based on extended morphological profiles. IEEE Transactions on Geoscience and Remote Sensing 43 (3), 480–491. Bhagavathy, S., Manjunath, B., 2006. Modeling and detection of geospatial objects using texture motifs. IEEE Transactions on Geoscience and Remote Sensing 44 (12), 3706–3715. Blaschke, T., 2010. Object based image analysis for remote sensing. ISPRS Journal of Photogrammetry and Remote Sensing 65 (1), 2–16. Buckland, M., Gey, F., 1994. The relationship between recall and precision. Journal of the American society for information science 45 (1), 12–19. Cao, G., Yang, X., 2007. Man-made object detection in aerial images using multistage level set evolution. International Journal of Remote Sensing 28 (8), 1747– 1757. Cheng, G., Guo, L., Zhao, T., Han, J., Li, H., Fang, J., 2013. Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA. International Journal of Remote Sensing 34 (1), 45– 59. Corbane, C., Najman, L., Pecoul, E., Demagistri, L., Petit, M., 2010. A complete processing chain for ship detection using optical satellite imagery. International Journal of Remote Sensing 31 (22), 5837–5854.
Crandall, D., Huttenlocher, D., 2006. Weakly supervised learning of part-based spatial models for visual object recognition. In: Proceedings of the ninth European Conference on Computer Vision (ECCV 2006), Graz, Austria, 7–13 May, pp. 16–29. Dalal, N., Triggs, B., 2005. Histograms of oriented gradients for human detection. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), San Diego, CA, 20–26 June, pp. 886–893. Durieux, L., Lagabrielle, E., Nelson, A., 2008. A method for monitoring building construction in urban sprawl areas using object-based analysis of Spot 5 images and existing GIS data. ISPRS Journal of Photogrammetry and Remote Sensing 63 (4), 399–408. Eikvil, L., Aurdal, L., Koren, H., 2009. Classification-based vehicle detection in highresolution satellite images. ISPRS Journal of Photogrammetry and Remote Sensing 64 (1), 65–72. Everingham, M., Van, G.L., Williams, C.K.I., Winn, J., Zisserman, A., 2007. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.
. Felzenszwalb, P., Huttenlocher, D., 2004. Distance Transforms of Sampled Functions. Cornell University Computing and Information Science Technical Reports, TR2004-1963. Felzenszwalb, P.F., Huttenlocher, D.P., 2005. Pictorial structures for object recognition. International Journal of Computer Vision 61 (1), 55–79. Felzenszwalb, P., McAllester, D., Ramanan, D., 2008. A discriminatively trained, multiscale, deformable part model. In: Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), Anchorage, AL, 24–26 June, pp. 1–8. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D., 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), 1627–1645. Grabner, H., Nguyen, T.T., Gruber, B., Bischof, H., 2008. On-line boosting-based car detection from aerial images. ISPRS Journal of Photogrammetry and Remote Sensing 63 (3), 382–396. Huang, X., Zhang, L., 2013. An SVM ensemble approach combining spectral, structural, and semantic features for the classification of high-resolution remotely sensed imagery. IEEE Transactions on Geoscience and Remote Sensing 51 (1), 257–272. Inglada, J., 2007. Automatic recognition of man-made objects in high resolution optical remote sensing images by SVM classification of geometric image features. ISPRS Journal of Photogrammetry and Remote Sensing 62 (3), 236– 248. Kim, J., Muller, J.P., 2011. Tree and building detection in dense urban environments using automated processing of IKONOS image and LiDAR data. International Journal of Remote Sensing 32 (8), 2245–2273. Kumar, M.P., Zisserman, A., Torr, P.H.S., 2009. Efficient discriminative learning of parts-based models. In: Proceedings of the twelfth IEEE International Conference on Computer Vision (ICCV 2009), Kyoto, Japan, 27 September–4 October, pp. 552–559. Landgrebe, D.A., 2003. Signal Theory Methods in Multispectral Remote Sensing. Wiley, Hoboken, NJ. Lei, Z., Fang, T., Huo, H., Li, D., 2012. Rotation-invariant object detection of remotely sensed images based on Texton forest and Hough voting. IEEE Transactions on Geoscience and Remote Sensing 50 (4), 1206–1217. Li, P., Xu, H., Guo, J., 2010a. Urban building damage detection from very high resolution imagery using OCSVM and spatial features. International Journal of Remote Sensing 31 (13), 3393–3409. Li, X., Zhang, S., Pan, X., Dale, P., Cropp, R., 2010b. Straight road edge detection from high-resolution remote sensing images based on the ridgelet transform with the revised parallel-beam Radon transform. International Journal of Remote Sensing 31 (19), 5041–5059. Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), 91–110. LPD (Landsat Processing Details), 2013. (accessed 18.07.13). Mallinis, G., Koutsias, N., Tsakiri-Strati, M., Karteris, M., 2008. Object-based classification using Quickbird imagery for delineating forest vegetation polygons in a Mediterranean test site. ISPRS Journal of Photogrammetry and Remote Sensing 63 (2), 237–250. Santosh, K.D., Larry, Z., Ashish, K., Simon, B., 2010. Detecting Objects using Unsupervised Parts-based Attributes. Carnegie Mellon University Robotics Institute Technical, Report, CMU-RI-TR-11-10. Segl, K., Kaufmann, H., 2001. Detection of small objects from high-resolution panchromatic satellite imagery based on supervised image segmentation. IEEE Transactions on Geoscience and Remote Sensing 39 (9), 2080–2083. Sirmacek, B., Unsalan, C., 2009. Urban-area and building detection using SIFT keypoints and graph theory. IEEE Transactions on Geoscience and Remote Sensing 47 (4), 1156–1167. Sun, H., Sun, X., Wang, H., Li, Y., Li, X., 2012. Automatic target detection in highresolution remote sensing images using spatial sparse coding bag-of-words model. IEEE Geoscience and Remote Sensing Letters 9 (1), 109–113. Tao, C., Tan, Y., Cai, H., Tian, J., 2011. Airport detection from large IKONOS images using clustered SIFT keypoints and region information. IEEE Geoscience and Remote Sensing Letters 8 (1), 128–132. Tello, M., López-Martínez, C., Mallorqui, J.J., 2005. A novel algorithm for ship detection in SAR imagery based on the wavelet transform. IEEE Geoscience and Remote Sensing Letters 2 (2), 201–205.
G. Cheng et al. / ISPRS Journal of Photogrammetry and Remote Sensing 85 (2013) 32–43 Tong, X., Hong, Z., Liu, S., Zhang, X., Xie, H., Li, Z., Yang, S., Wang, W., Bao, F., 2012. Building-damage detection using pre-and post-seismic high-resolution satellite stereo imagery: a case study of the May 2008 Wenchuan earthquake. ISPRS Journal of Photogrammetry and Remote Sensing 68, 13–27. Tournaire, O., Paparoditis, N., 2009. A geometric stochastic approach based on marked point processes for road mark detection from high resolution aerial images. ISPRS Journal of Photogrammetry and Remote Sensing 64 (6), 621–631. Tzotsos, A., Karantzalos, K., Argialas, D., 2011. Object-based image analysis through nonlinear scale-space filtering. ISPRS Journal of Photogrammetry and Remote Sensing 66 (1), 2–16. Walter, V., 2004. Object-based classification of remote sensing data for change detection. ISPRS Journal of Photogrammetry and Remote Sensing 58 (3), 225– 238.
43
Wang, X., Han, T.X., Yan, S., 2009. An HOG-LBP human detector with partial occlusion handling. In: Proceedings of the twelfth IEEE International Conference on Computer Vision (ICCV 2009), Kyoto, Japan, 27 September–4 October, pp. 32–39. Xie, Z., Roberts, C., Johnson, B., 2008. Object-based target search using remotely sensed data: a case study in detecting invasive exotic Australian Pine in south Florida. ISPRS Journal of Photogrammetry and Remote Sensing 63 (6), 647–660. Xu, S., Fang, T., Li, D., Wang, S., 2010. Object classification of aerial images with bagof-visual words. IEEE Geoscience and Remote Sensing Letters 7 (2), 366–370. Zhu, Q., Yeh, M.C., Cheng, K.T., Avidan, S., 2006. Fast human detection using a cascade of histograms of oriented gradients. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), New York, 17–22 June, pp. 1491–1498.