Neurocomputing 139 (2014) 56–64

L1-norm latent SVM for compact features in object detection

Min Tan a, Gang Pan a, Yueming Wang b,*, Yuting Zhang a, Zhaohui Wu a

a Department of Computer Science, Zhejiang University, Hangzhou 310027, P.R. China
b Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou 310027, P.R. China
* Corresponding author. E-mail addresses: [email protected] (M. Tan), [email protected] (G. Pan), [email protected] (Y. Wang), [email protected] (Y. Zhang), [email protected] (Z. Wu).

Article history: Received 12 April 2013; Received in revised form 30 August 2013; Accepted 15 September 2013; Available online 3 April 2014

Abstract

The deformable part model is one of the most effective methods for object detection. However, it simultaneously computes the scores of a holistic filter and several part filters in a relatively high-dimensional feature space, which leads to low computational efficiency. This paper proposes an approach to select compact and effective features by learning a sparse deformable part model using an L1-norm latent SVM. A stochastic truncated sub-gradient descent method is presented to solve the L1-norm latent SVM problem, and the convergence of the algorithm is proved. Extensive experiments are conducted on the INRIA and PASCAL VOC 2007 datasets. A highly compact feature in our method reaches state-of-the-art performance. The feature dimensionality is reduced to 12% of the original on the INRIA dataset and to less than 30% in most categories of the PASCAL VOC 2007 dataset. Compared with the features used in L2-norm latent SVM, the average precision (AP) shows almost no drop with the reduced features. With our method, the detection score computation is three times faster than that of the L2-norm latent SVM method. When the cascade strategy is applied, it can be further sped up by about an order of magnitude. © 2014 Elsevier B.V. All rights reserved.

Keywords: Object detection; Feature reduction; Sparse L1 optimization; Stochastic gradient descent; Cascade

1. Introduction

Object detection is one of the key problems in computer vision research. It plays an important role in image retrieval, video surveillance, smart cars, etc. [1]. The goal of object detection is to decide whether or not an object is present in an image and, if so, to give its bounding box. It is a challenging task, as most objects in images have large intra-class variations due to illumination changes, view changes, occlusion, and different intrinsic appearances [2]. Many efforts have been made to design discriminative features and powerful model structures to improve detection performance [3–8]. Recently, part-based deformable models have achieved strong performance on the difficult PASCAL datasets [6,7] and have become one of the most effective kinds of models for object detection. However, when the sliding sub-window scheme is used, detection has to be evaluated on a large number of windows, and the holistic detector and multiple part detectors in part-based deformable methods make this computation highly expensive.

Feature reduction is one scheme to improve computational efficiency. Felzenszwalb et al. [8] proposed to project the Histograms of Oriented Gradients (HOG) features and the weight vectors in the part/root filters to a low-dimensional space by Principal Component Analysis (PCA), which requires additional projection computation. Directly selecting the effective elements of the feature vectors and the corresponding elements of the weight vectors is another way of feature reduction. We are interested in the following question: how sparse can the features and the weight vectors in the part/holistic filters be while the sparse deformable part model still performs comparably to the state-of-the-art methods? Recently, Tan et al. [9] preliminarily proposed the idea of using an L1-norm latent SVM (ℓ1-LSVM) in the deformable part model to construct compact part/root filters for efficient detection. This paper is an extension of [9], and the contribution of our work is three-fold:

• We propose ℓ1-LSVM by formulating an ℓ1-regularization term into the latent SVM [6], and use it to learn a deformable part model.
• A stochastic truncated sub-gradient descent method is presented to solve the ℓ1-LSVM problem. It requires only a small amount of memory and thus can be applied to large-scale problems. A proof of its convergence is also given.
• We discuss the tradeoff between the speedup and the overhead of indexing the compact part/root filters. The speedup can be controlled by the degree of model sparsity.

We test our method on the INRIA and PASCAL VOC 2007 datasets. All parameters in our method are evaluated in detail. The experimental results show that

• Keeping performance comparable to the state-of-the-art, the features can be very sparse. Compared with the HOG features used in L2-norm latent SVM (ℓ2-LSVM) [6], our model needs only 12% of the feature elements without sacrificing performance on the INRIA dataset. Over the categories of the PASCAL VOC 2007 dataset, the proportions of effective feature elements are less than 20% for 12 categories, 20–30% for 5 categories, and 30–50% for 2 categories.
• The detection procedure can be greatly sped up due to sparseness. The sparse weight vectors of the filters make the detection score computation more than three times faster on the INRIA dataset compared with [6]. Our method is also complementary to the cascade method for improving computational efficiency: when a cascade is constructed from our sparse filters, the speedup factor in computing detection scores can be more than 40, twice that of the recent cascaded deformable part model [8], which represents the state-of-the-art in object detection.
• The feature shrinkage by our ℓ1-LSVM model does trivial harm to detection performance. The APs of our method are among the best of existing work.

2. Related work

Existing research on object detection can be summarized as follows:

• Powerful features: Powerful features have been extensively studied for object detection. Viola and Jones proposed the "Haar + AdaBoost" algorithm for face detection [10], which made real-time face detection possible. For pedestrian detection, the Haar feature has been shown to be ineffective [3]. Various effective features were proposed to improve performance, such as edgelets [11], covariance matrices [12], HOG [3], and the augmented HOG-LBP feature [13,14].
• Divide and conquer scheme: In multi-view face detection, faces are categorized into different subclasses for training by manual assignment [15]. However, most natural objects are not easy to label in this way. Accordingly, several works have been proposed to automatically categorize objects during learning [16,4,17]. Other methods combining classifiers to handle variations include mixtures of boosted classifiers [18] and the SVM-KNN method [19].
• Part-based models: This scheme first attempts to detect object parts and then learns a second-stage classifier to estimate the co-occurrences of these parts [20,21]. Recently, two effective part-based approaches were proposed: the deformable part model [6] and Poselets [5]. The success of the deformable part model inspired several extensions [7,22,23].

While the progress of object detection research has significantly improved detection accuracy, the methods have become more and more complex, which leads to relatively low computational efficiency. There are at least three schemes to improve computational efficiency. The first is to reduce the number of windows evaluated in the search space, such as the "Branch-and-Bound Search" in [24]. The second is to apply a cascade of part/holistic detectors to reject most simple non-object windows quickly, including one extension of the deformable part model [7]. Besides the above schemes, one can improve the efficiency of computing feature vectors with a more efficient algorithm [25], or use a feature reduction scheme to simplify the part/holistic detectors [26]. This paper focuses on reducing the feature dimension in the deformable part model to help improve computational efficiency.


Selecting an appropriate subset of feature elements is one way of feature reduction. It is valuable not only for computational efficiency, but also for the model's separability power, as many feature elements might be ineffective [27]. There are at least two feature selection schemes. One is to discard the ineffective feature elements based on user-defined criteria, as in the AdaBoost algorithm, where the criteria are based on the classification error [28]; however, such criteria may not generalize well [29]. The other scheme is to construct a sparse model, which is usually related to L1 optimization. Recently, a cascade structure of L1-SVMs was proposed to learn a sparse human detector, solved by interior point and integer optimization [26]. Its main drawback is heavy time and memory consumption. Moreover, since it does not involve latent variables, the method is only suitable for classical model structures that do not contain part filters. In this paper, we construct a sparse deformable part model using latent SVM (LSVM) to reduce the feature dimension and solve it with an efficient algorithm.

3. Review of deformable part model

A deformable part model (DPM) [6] is a star-structured model consisting of one root filter and several part filters. The score of a test sample x is computed by

$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z),$    (1)

where β is a weight vector composed of the root filter, the part filters, and the deformation cost weights; z is the latent value specifying the object configuration (the location of each part filter relative to the root filter); Z(x) is the configuration set for x; and Φ(x, z) is a concatenation of holistic features, part features, and part deformation features determined by z. A mixture model may contain several components, each being a model as in (1). Given training data (x_i, y_i), where y_i = ±1 denotes the class label, the DPM is learned by solving the following latent SVM problem:

$\beta^* = \arg\min_\beta \left( \frac{1}{2}\,\|\beta\|_2^2 + C \sum_{i=1}^{N} \max(0,\, 1 - y_i f_\beta(x_i)) \right).$    (2)

As stated in [6], the above problem is convex once the latent information of the positive samples is specified, and it can be solved by gradient descent. We call this problem ℓ2-LSVM. The training framework of the DPM is shown in Fig. 1; refer to [6] for details. The DPM is a complex model consisting of several filters, so the dimension of the feature vector Φ(x, z) can be high. If the HOG feature is used and eight parts are specified in a deformable part model, the dimension of Φ(x, z) is more than 20,000. This high dimension makes detection computationally expensive. However, the high-dimensional feature vectors may be redundant, as shown in [30], where experiments demonstrated that many elements of the SVM weight vectors in ℓ2-LSVM are small, and the smallest 50% of the elements typically carry only about 20% of the total weight. That is to say, many elements in the feature vector are ineffective, and it is important to suppress the small elements, which may represent noise.
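To make the scoring in (1) concrete, the sketch below computes the combined score of one candidate root placement. It is an illustration only: the function name `score_window`, the array layout, and the quadratic deformation cost are assumptions for the sketch, and the brute-force inner loops stand in for the generalized distance transform [34] used in the actual DPM implementation.

```python
import numpy as np

def score_window(root_resp, part_resps, anchors, defo_weights):
    """Score one root placement of a star-structured DPM (cf. Eq. (1)).

    root_resp    : float, root filter response at this window
    part_resps   : list of 2-D arrays, each part filter's response map in a
                   search region around its anchor
    anchors      : list of (row, col) anchor positions inside each map
    defo_weights : list of 4-vectors (w_dx, w_dy, w_dx2, w_dy2), the
                   deformation cost weights of each part
    Returns the root response plus, for every part, the best response minus
    deformation cost over all displacements z (the max over configurations).
    """
    score = root_resp
    for resp, (ar, ac), w in zip(part_resps, anchors, defo_weights):
        best = -np.inf
        rows, cols = resp.shape
        for r in range(rows):
            for c in range(cols):
                dy, dx = r - ar, c - ac                      # displacement z
                cost = w[0]*dx + w[1]*dy + w[2]*dx*dx + w[3]*dy*dy
                best = max(best, resp[r, c] - cost)          # max over z
        score += best
    return score
```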

[Fig. 1. The learning framework of the deformable part model: the feature map of the input is convolved with the root and part filters; the part responses are transformed with the deformation costs and combined with the root response into a combined score; latent SVM learning iterates relabeling the positive samples and mining hard negative samples.]

4. Our method: ℓ1-LSVM

To reduce the dimension of the feature vectors in ℓ2-LSVM, we design a sparse version of the LSVM [6] for both computational efficiency and detection accuracy. The common way of sparse modeling is to construct an L1-regularized optimization directly. We apply this strategy to construct a sparse DPM, and we call our model ℓ1-LSVM.

4.1. ℓ1-LSVM

Our ℓ1-LSVM is constructed by formulating an ℓ1-regularization term into (2). Replacing the regularization term in (2), we learn a sparse DPM by solving the following objective:

$\beta^* = \arg\min_\beta \left( \|\beta\|_1 + C \sum_{i=1}^{N} \max(0,\, 1 - y_i f_\beta(x_i)) \right).$    (3)

Let

$\varphi_i(\beta) = \Phi\!\left(x_i,\ \arg\max_{z \in Z(x_i)} \beta \cdot \Phi(x_i, z)\right)$    (4)

be the feature vector of the sample x_i having maximal score among all z, and

$L(\beta, \zeta_i(\beta)) = \max(0,\, 1 - y_i f_\beta(x_i)),$    (5)

where ζ_i(β) = (φ_i(β), y_i). Then (3) can be rewritten as

$\beta^* = \arg\min_\beta R(\beta, C), \qquad R(\beta, C) = \|\beta\|_1 + C \sum_{i=1}^{N} L(\beta, \zeta_i(\beta)).$    (6)

We would like to learn a sparse deformable model by (6) using the framework of [6] (see Fig. 1). The problem is semi-convex, as it becomes convex once the latent information is specified for the positive training samples. To solve (6), we use a sub-gradient descent method, since (6) is not differentiable everywhere. Because the training data and feature dimensions in the learning framework are large, a stochastic scheme may be the best choice, as it reduces the memory requirements and makes training possible on a common PC. However, common stochastic sub-gradient descent methods cannot achieve sparsity in β*, because the elements of β* are floating-point values and very few operations between float values result in exactly zero (or any other default value). Recently, with the extensive studies on sparse models, there has been growing research on algorithms that produce sparse solutions for L1-regularized models [31–33], including the L1-regularized hinge-loss SVM [31]. However, none of the existing models takes latent variables into consideration. Thus, we extend [31] to our model and present an efficient stochastic truncated sub-gradient descent (STSGD) method to solve our ℓ1-LSVM problem.

4.2. STSGD algorithm: solving ℓ1-LSVM

STSGD is a modification of the stochastic sub-gradient descent (SSGD) method. SSGD works in an iterative manner: in each step, a subset of the training set is selected to estimate the sub-gradient, and the variables are updated in the negative direction of the sub-gradient. The sub-gradient of the objective function in (6) is

$\nabla_\beta R(\beta, C) = \mathrm{sign}(\beta) + C \sum_{i=1}^{N} \nabla L_\beta(\beta, \zeta_i(\beta)),$    (7)

where

$\nabla L_\beta(\beta, \zeta_i(\beta)) = \begin{cases} 0, & y_i f_\beta(x_i) \ge 1 \\ -\,y_i \varphi_i(\beta), & \text{otherwise.} \end{cases}$    (8)

We select one sample to estimate the sub-gradient and separate the update based on (7) into two stages, according to the hinge-loss term and the regularization term:

$\begin{cases} \hat{\beta}_t = \beta_t - \eta_t\, C N\, \nabla L_\beta(\beta_t, \zeta_{i_t}(\beta_t)) \\ \beta_{t+1} = \hat{\beta}_t - \eta_t\, \mathrm{sign}(\hat{\beta}_t). \end{cases}$    (9)

Along with the iterations, some elements of β approach zero progressively, but they cannot become exactly zero because of numerical precision. In practice, we find that simply rounding small elements to zero does not work well, because an element may be small merely due to too few updates; such simple rounding is too aggressive and lacks a theoretical guarantee [31]. An alternative is to update the entries of β_{t+1} in the second stage by a less aggressive truncation function. For the jth entry,

$\beta_{t+1}^{j} = H(\hat{\beta}_t^{j}, \eta_t, \theta) = \begin{cases} \max(0,\, \hat{\beta}_t^{j} - \eta_t), & \hat{\beta}_t^{j} \in [0, \theta] \\ \min(0,\, \hat{\beta}_t^{j} + \eta_t), & \hat{\beta}_t^{j} \in [-\theta, 0] \\ \hat{\beta}_t^{j}, & \text{otherwise.} \end{cases}$    (10)

If the sign of an entry changes after the update, it is set to zero; this is less aggressive than the simple rounding method. In practice, the truncation is performed once every K steps to avoid a degenerate local optimum. When the algorithm terminates, a simple rounding of the entries of β is applied because, at this time, rounding small values to zero is relatively safe:

$f(\beta^{j}) = \begin{cases} \beta^{j}, & |\beta^{j}| > \alpha M \\ 0, & \text{otherwise,} \end{cases}$    (11)

where M is the largest absolute value among all entries of β and α is a weight tuning the degree of sparsity. This STSGD method works well both in achieving the sparsity of β and in producing good classifiers. We apply it to solve the ℓ1-LSVM and learn a sparse DPM, as shown in Algorithm 1.

Algorithm 1. Learning of ℓ1-LSVM by STSGD.

Input:
  P = {(x_i, z_pi) | 1 ≤ i ≤ n}: positive training samples with determined latent variables z_pi;
  N = {(x_i, z_ni1, …, z_nik_i) | 1 ≤ i ≤ m}: negative training samples with undetermined latent variables z_nij; k_i is the number of latent variables for sample i;
  K: the truncation stride; θ: the parameter of the truncation function;
  T: the number of iterations; η0: the initial learning rate.
1:  Initialization: β := β0.
2:  Learning:
3:  for t = 1, 2, …, T do
4:      Randomly select a sample x_i.
5:      Set the learning rate η := η0/√t.
6:      if x_i is positive then φ := Φ(x_i, z_pi), y := +1
7:      else φ := Φ(x_i, arg max_{1 ≤ j ≤ k_i} (β · Φ(x_i, z_nij))), y := −1.
8:      if y · (β · φ) < 1 then β := β + C N η y φ.
9:      if mod(t, K) = 0 then β := H(β, Kη, θ).
10: end for
11: Use (11) to round small entries of β to zero.
Output: the sparse deformable part model β.
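A minimal NumPy sketch of Algorithm 1 follows. It assumes the latent feature vectors Φ(x, z) are supplied by a user-provided callback (`phi` below is a hypothetical name), and it omits the relabeling/hard-negative-mining loop of Fig. 1; it illustrates the update rules (9)–(11), not the authors' full trainer.

```python
import numpy as np

def truncate(beta, eta, theta):
    """Truncation operator H of Eq. (10): shrink entries in [-theta, theta]
    toward zero by eta, clamping at zero; larger entries are untouched."""
    out = beta.copy()
    small = np.abs(beta) <= theta
    out[small] = np.sign(beta[small]) * np.maximum(np.abs(beta[small]) - eta, 0.0)
    return out

def stsgd(samples, phi, dim, C, K, theta, T, eta0, alpha=1e-3, seed=0):
    """samples: list of (x, y, latents); y = +1 with a single fixed latent
    value, y = -1 with a set of candidate latent values.
    phi(x, z) -> 1-D feature vector of length dim.
    Returns a sparse weight vector beta (sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    N = len(samples)
    beta = np.zeros(dim)
    for t in range(1, T + 1):
        x, y, latents = samples[rng.integers(N)]
        eta = eta0 / np.sqrt(t)
        if y > 0:
            feat = phi(x, latents[0])                 # latent fixed for positives
        else:
            feat = max((phi(x, z) for z in latents),  # highest-scoring latent value
                       key=lambda f: beta @ f)
        if y * (beta @ feat) < 1:                     # hinge-loss sub-gradient step
            beta += C * N * eta * y * feat
        if t % K == 0:                                # truncation every K steps
            beta = truncate(beta, K * eta, theta)
    M = np.max(np.abs(beta))                          # final rounding, Eq. (11)
    beta[np.abs(beta) <= alpha * M] = 0.0
    return beta
```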

4.3. Convergence proof

At first sight, the truncation in (10) appears to be an ad hoc way to force coefficients to zero. However, we show its convergence to the optimal solution by giving a regret bound that approaches zero as the number of iterations grows. The proof of convergence of our algorithm is inspired by [31], where the L1-regularized objective function is

$\beta^* = \arg\min_\beta \left( g\,\|\beta\|_1 + \frac{1}{N} \sum_{i=1}^{N} L(\beta, \zeta_i) \right).$    (12)

If the weight g is set to 1/(CN), this objective is similar to our ℓ1-LSVM objective (6), except that L(·) here is a general convex loss function and ζ_i = (Φ(x_i), y_i) does not involve a latent variable. The objective (12) is solved via stochastic truncated gradient descent (STGD), whose convergence was proved provided that L is convex in β. In the following, we show that L(β, ζ_i(β)) in (6) is also convex although it involves a latent variable. For positive samples, the latent variables are determined, so ζ_i(β) = (Φ(x_i, z_pi), 1), and L(β, ζ_i(β)) = max(0, 1 − β · Φ(x_i, z_pi)) is convex due to linearity. For negative samples, L(β, ζ_i(β)) = max(0, 1 + max_{1 ≤ j ≤ k_i} (β · Φ(x_i, z_nij))) is also convex in β. Thus L(β, ζ_i(β)) is a convex loss function. Similar to [31], we establish a regret bound for this method, showing that it converges to the optimal solution:

$E_{i_1,\dots,i_T}\!\left[ R(\tilde{\beta}_T, C) - R(\bar{\beta}, C) \right] \le \frac{\eta_t\, B\, (NC)^2}{2} + \frac{\|\bar{\beta}\|^2\, NC}{2\,\eta_t\, T},$    (13)

where B = max{‖φ_i(β)‖²}, β̄ is the optimal solution, and E_{i_1,…,i_T}[·] denotes the expectation over iterations 1 to T. Eq. (13) implies convergence of the algorithm if lim_{t→∞} η_t = 0 and lim_{t→∞} η_t t = ∞, e.g., taking η_t = η_0/√t. Thus β̃_T is an approximate solution of our optimization; we can also take β_T as an approximation [31].
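To make the convergence claim concrete, the short derivation below abbreviates the two terms on the right-hand side of (13) as η_t A/2 and D/(2η_t T), where A and D collect the factors that do not depend on η_t or T. It is an illustrative worked step under this abbreviation, not part of the original proof.

```latex
% Horizon-tuned constant step size \eta_t \equiv \eta_0/\sqrt{T}:
\eta_t\,\frac{A}{2} + \frac{D}{2\,\eta_t\,T}
  = \frac{\eta_0 A}{2\sqrt{T}} + \frac{D\sqrt{T}}{2\,\eta_0\,T}
  = \frac{1}{\sqrt{T}}\left(\frac{\eta_0 A}{2} + \frac{D}{2\eta_0}\right)
  \;\longrightarrow\; 0 \quad (T \to \infty),
% so the expected regret on the left-hand side of (13) vanishes, in line with
% the stated conditions \lim_{t\to\infty}\eta_t = 0 and \lim_{t\to\infty}\eta_t\,t = \infty.
```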

5. Analysis of sparsity and speedup

In this section, we discuss the relationship between model sparsity and the speedup in detection, and we study how the feature dimension influences this relationship. The speedup comes from faster convolution, since a sparse β needs fewer multiplication operations when computing the detection score. We store both the dense weight vector of ℓ2-LSVM and the sparse one of ℓ1-LSVM in an ordinary array, and for the sparse one we keep an additional array to index its non-zero entries; therefore, an additional cost is required to index the non-zero elements. For ease of explanation, we define the model sparsity as γ = 1 − N1/N, namely the proportion of zero entries in a weight vector, where N and N1 are the numbers of total and non-zero feature elements, respectively. For a model with sparsity γ, the speedup factor is defined as the ratio of the detection time of the traditional convolution to that of the faster one:

$S_\gamma = \frac{T_{dt} + T_{dot}}{T_{dt} + (1 - \gamma)\,(T_{dot} + T_{index})},$    (14)

where T_dt, T_dot, and T_index denote the time consumed by the distance transform [34], the filter convolutions, and the nonzero-element indexing, respectively. Although T_dt is independent of the feature dimension, T_dot and T_index depend on it because of memory references; therefore, the speedup factor S_γ is affected by the dimension of the feature vector. Thus, we conduct experiments on the INRIA dataset using three different features and generate three curves describing the trend of the speedup against the sparsity. For each feature, we run five models on testing images with different sparsity levels (0, 0.2, 0.4, 0.6, and 0.8), and the three timing parameters are estimated by averaging the corresponding time costs. The three features are HOG, Local Binary Pattern (LBP), and the combination of HOG and LBP, arranged in ascending order of dimensionality. Fig. 2 shows the speedup-sparsity curves. We find that: (1) for each feature, there is a minimum sparsity rate γ0, the starting sparsity at which an improvement of computational efficiency is obtained (S(γ) ≥ 1); that is, the proposed ℓ1-LSVM performs faster than ℓ2-LSVM if and only if γ ≥ γ0; (2) there is an upper bound on the speedup, even when the sparsity rate reaches 1, because the distance transform takes up a small but fixed portion of the computational cost; (3) the higher the feature dimension, the larger γ0 becomes, and the same happens to the upper bound of the speedup.

As shown in Fig. 2, the trend of the speedup against feature dimension may differ under different sparsity levels. The reason is as follows. Since we do not speed up the distance transform procedure, there is an inverse correlation between the speedup and the relative importance of the distance transform during detection. T_dot and T_index change with the dimensionality. If the sparsity is high, (1 − γ)(T_dot + T_index) becomes small, its change becomes less important, and T_dt is dominant; increasing the feature dimensionality then weakens the contribution of T_dt, and thus a higher dimensionality contributes to a more visible speedup. However, if the sparsity level is low, (1 − γ)(T_dot + T_index) remains large, T_dt becomes less important, and T_index becomes the main hindrance to the speedup; in this case, we obtain the opposite result. In addition, as γ0 is located at low sparsity levels, the higher the feature dimension, the larger γ0 will be.
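The sketch below shows the indexed sparse dot product that motivates (14), together with the speedup factor and the break-even sparsity γ0 obtained by setting S(γ0) = 1. The helper names (`sparse_score`, `speedup`, `min_useful_sparsity`) and the timing constants in the usage lines are illustrative placeholders, not measured values from the paper.

```python
import numpy as np

def sparse_score(weights, nz_index, features):
    """Detection-score dot product using only the non-zero filter entries.
    `weights` is the dense weight array and `nz_index` the extra index array
    of its non-zero positions; the extra lookup is the overhead T_index."""
    return float(np.dot(weights[nz_index], features[nz_index]))

def speedup(gamma, t_dt, t_dot, t_index):
    """Speedup factor S(gamma) of Eq. (14) for a model with sparsity gamma."""
    return (t_dt + t_dot) / (t_dt + (1.0 - gamma) * (t_dot + t_index))

def min_useful_sparsity(t_dot, t_index):
    """Smallest gamma_0 with S(gamma_0) = 1: below it, the indexing overhead
    outweighs the skipped multiplications (T_dt cancels out of the equality)."""
    return t_index / (t_dot + t_index)

# Usage with made-up timing constants (illustrative only):
t_dt, t_dot, t_index = 0.2, 1.0, 0.25
print(min_useful_sparsity(t_dot, t_index))      # gamma_0 = 0.2 for these constants
print(speedup(0.878, t_dt, t_dot, t_index))     # e.g. at the INRIA model sparsity
```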

6. Experiments

In this section, we evaluate the detection accuracy and efficiency of our ℓ1-LSVM based DPM. A mixture DPM is learned and tested on the INRIA person and PASCAL VOC 2007 datasets. First, a two-component DPM using HOG features is built and tested on the INRIA person dataset [3] for pedestrian detection. Based on this setup, we illustrate the sparseness of the ℓ1-LSVM coefficients, investigate several important parameters, and compare the proposed approach with that in [6]. Then, we employ the cascade technique of [8] in our approach and report the experimental results. Finally, we build six-component DPMs on all 20 categories of the PASCAL dataset [35] to make a more comprehensive comparison with other algorithms. Our method can also be deployed multi-threaded, similar to ℓ2-LSVM; to make a fair comparison, we conduct all experiments in a single-thread setup on a PC with one Intel Core i3 530 2.93 GHz CPU and 3 GB RAM. The baseline code is downloaded from [36].

[Fig. 2. The speedup-sparsity curves S(γ) using three different features: HOG (γ0 = 0.191), LBP (γ0 = 0.262), and HOG+LBP (γ0 = 0.330). Note the different starting points γ0, at which S(γ) = 1.]

[Fig. 3. Visualization of the learned DPMs by (a) ℓ2-LSVM and (b) ℓ1-LSVM. From left to right in each sub-figure: the root filter, part filters, and deformation cost weights. The numbers of non-zero feature elements are 23,232 and 2,882, respectively.]

[Fig. 4. Precision-recall curves of our method with different parameters on the INRIA person dataset. The three numbers in parentheses are the AP, sparsity, and tradeoff score. (a) Varying C (θ = 1, η0 = 10⁻⁶): C = 1/32 (0.834, 0.754, 1.211), 1/16 (0.848, 0.601, 1.148), 1/8 (0.876, 0.554, 1.153), 1/4 (0.875, 0.417, 1.083), 1/2 (0.877, 0.324, 1.039), 1 (0.863, 0.223, 0.974). (b) Varying θ (C = 0.125, η0 = 10⁻⁶): θ = 1 (0.876, 0.554, 1.153), 0.1 (0.862, 0.525, 1.125), 0.01 (0.854, 0.521, 1.114), 0.001 (0.840, 0.322, 1.001). (c) Varying η0 (C = 0.125, θ = 1): η0 = 1e−3 (0.863, 0.819, 1.273), 5e−4 (0.868, 0.875, 1.305), 1e−4 (0.873, 0.878, 1.312), 5e−5 (0.877, 0.859, 1.306), 1e−5 (0.881, 0.790, 1.276), 5e−6 (0.873, 0.736, 1.241), 1e−6 (0.876, 0.554, 1.153).]

6.1. Coefficient properties of ℓ1-LSVM

Fig. 3 illustrates the DPMs trained on the INRIA dataset using ℓ2- and ℓ1-LSVM. The root filter, part filters, and deformation cost weights of a DPM are visualized as a triple. We draw the SVM filter coefficients at the same positions and orientations as their corresponding HOG features, where the stroke strength represents the coefficient magnitude. Unlike ℓ2-LSVM, the filter coefficients of ℓ1-LSVM are very sparse, i.e., only a small portion (about 12%) of the HOG feature elements is selected. It can be seen that most of the selected elements lie at the border of the human shape, which is commonly acknowledged to be the most discriminative region for pedestrian detection.

6.2. Parameters of our approach

Our approach has four adjustable parameters: C, θ, η0, and α. We investigate how they impact the detection accuracy and the sparsity of our approach on the INRIA person dataset. We first study C, θ, and η0 with fixed α = 10⁻³. Fig. 4 shows the precision-recall curves for pedestrian detection under different parameter combinations. The AP and the model sparsity (γ, the ratio of zero entries) are also listed in the legend (the first and second numbers in parentheses). As shown in Fig. 4a, smaller C results in higher sparsity, and the detection accuracy is highest when C is in about [1/8, 1/2]. As shown in Fig. 4b, θ has little impact on sparsity when it is greater than 0.01; however, a large θ is preferred for the sake of detection accuracy. Unlike θ, detection accuracy is not sensitive to η0, but the impact of η0 on sparsity is noticeable, as shown in Fig. 4c. The detection accuracy and the sparsity usually show an inverse correspondence, so a tradeoff has to be made between them. We define a tradeoff score as

AP + 0.5 × Sparsity.    (15)

The tradeoff scores are also listed in the legend (the third number in parentheses). To maximize the tradeoff score, we choose C = 0.125, θ = 1, and η0 = 10⁻⁴ as the default parameters of our approach. In Table 1, we investigate the shrinkage α under several different combinations of (C, θ, η0). It can be seen that α hardly affects detection accuracy and does not have a very significant impact on sparsity either. We empirically choose α = 0.001.

Table 1. Impact of the shrinkage α on AP and sparsity under different (C, θ, η0) on the INRIA person dataset.

(C, θ, η0)   (0.25, 1, 10⁻⁶)            (0.125, 1, 10⁻⁶)           (0.125, 1, 10⁻⁴)
α            0.0001   0.001   0.01      0.0001   0.001   0.01      0.0001   0.001   0.01
AP           0.877    0.875   0.872     0.869    0.876   0.869     0.877    0.873   0.872
γ            0.407    0.417   0.502     0.544    0.554   0.581     0.794    0.878   0.894
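As a small check of the default-parameter choice in Section 6.2, the snippet below restates the (AP, sparsity) pairs from the Fig. 4c legend and picks the setting with the highest tradeoff score (15); the dictionary keys are just labels for this sketch.

```python
# (AP, sparsity) pairs restated from the Fig. 4c legend.
candidates = {
    "eta0=1e-3": (0.863, 0.819),
    "eta0=5e-4": (0.868, 0.875),
    "eta0=1e-4": (0.873, 0.878),
    "eta0=5e-5": (0.877, 0.859),
    "eta0=1e-5": (0.881, 0.790),
    "eta0=5e-6": (0.873, 0.736),
    "eta0=1e-6": (0.876, 0.554),
}

def tradeoff(ap, sparsity):
    """Tradeoff score of Eq. (15): AP + 0.5 * sparsity."""
    return ap + 0.5 * sparsity

best = max(candidates, key=lambda k: tradeoff(*candidates[k]))
print(best, round(tradeoff(*candidates[best]), 3))   # eta0=1e-4, 1.312
```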


6.3. Comparisons with ℓ2-LSVM on INRIA

We compare the detection accuracy and efficiency of DPMs learned with ℓ2- and ℓ1-LSVM for pedestrian detection.

6.3.1. Accuracy

As shown in Fig. 5, the two precision-recall curves are similar to each other, so there is no significant difference between the accuracy of the ℓ1- and ℓ2-LSVM detectors.

[Fig. 5. Precision-recall curves of DPMs on the INRIA person dataset; the number in parentheses in the legend is the AP: INRIA-L2 (0.884), INRIA-L1 (0.873).]

6.3.2. Computational efficiency

As shown in Table 2, the APs of the two DPMs are also similar. While the ℓ2-LSVM filter holds a dense set of coefficients, more than 85% of the ℓ1-LSVM filter coefficients are zero. Due to the sparsity of our model, it is four times as computationally efficient as ℓ2-LSVM. Besides, the training time of our method is close to that of ℓ2-LSVM, although ℓ1 optimization usually needs more iterations to converge; the main reason is that the relabeling step (refer to Fig. 1) of our method works much faster than that of ℓ2-LSVM.

Table 2. Comparisons with the ℓ2-LSVM based DPM on the INRIA person dataset.

Method            AP      Time a (s)   Sparsity
ℓ2-LSVM           0.884   5.015        0
ℓ1-LSVM (ours)    0.873   1.251        0.878

a Time is the detection time consumption per image (for all the tables).

6.4. Performance of cascade DPMs

In [8], Felzenszwalb et al. proposed a cascade technique to improve the detection efficiency of the star-structured DPM while keeping its high detection accuracy. Concerned about feature redundancy, they further use PCA to reduce the HOG feature of each cell to 5 dimensions. As previously mentioned, these techniques can be naturally integrated into our method. We compare the performance of different DPMs using the cascade technique with and without PCA. Table 3 shows the results. To ensure consistency among different methods, the final classifier always uses the precision-equals-recall threshold, which leads to experimental results for the non-cascade baseline different from Section 6.3 (which uses a fixed classifier threshold); note that this strategy is also used in [8]. From the table, we can see that the APs of the different approaches are at the same level. As expected, the cascade technique boosts the efficiency in both the ℓ2 and ℓ1 cases. Unlike ℓ2-LSVM, for ℓ1-LSVM the cascade model without PCA is more efficient than that with PCA. One possible reason is that transforming the filter coefficients with PCA may break the sparseness. Thus, the ℓ1-LSVM detector should only be integrated with the non-PCA cascade strategy. The cascade ℓ1-LSVM detector without PCA is about five times as computationally efficient as the cascade ℓ2-LSVM detector without PCA, and still more than two times as efficient as the cascade ℓ2-LSVM detector with PCA.

Table 3. Comparisons with ℓ2-LSVM using cascade DPMs on the INRIA person dataset.

                               ℓ2-LSVM   ℓ1-LSVM (ours)   Speed-up a
Non-cascade        AP          0.804     0.809
                   Time (s)    4.638     1.229            3.774
Cascade (no PCA)   AP          0.805     0.809
                   Time (s)    0.462     0.093            4.966
Cascade (PCA)      AP          0.805     0.809
                   Time (s)    0.203     0.177            1.149

a Speed-up is the speed-up factor of ℓ1-LSVM with regard to ℓ2-LSVM.

6.5. Comparisons on the PASCAL 2007 dataset

We test ℓ1- and ℓ2-LSVM with both non-cascade and cascade DPMs on all 20 categories of the PASCAL 2007 dataset. All the DPMs are trained with six components. For ℓ2-LSVM, we directly use the PASCAL 2007 models available in [36]. Table 4 summarizes the AP and model sparsity of both ℓ1- and ℓ2-LSVM with non-cascade DPMs, as well as the speedup of ℓ1-LSVM. At a similar AP level, ℓ1-LSVM shows a model sparsity higher than 0.7 for most categories and obtains a 2–3.5 times speedup over ℓ2-LSVM on these categories. Note that a fixed classifier threshold is used, as in Section 6.3. In Table 5, we compare the best cascade strategies for ℓ2-LSVM (with PCA) and ℓ1-LSVM (without PCA). The non-cascade ℓ2-LSVM based DPM is taken as the baseline; for the other methods, detection accuracy is measured by the AP gap to the baseline, and efficiency is measured by the speedup ratio to the baseline. As on the INRIA person dataset, the APs of the two DPMs are similar, and the cascade technique significantly benefits both of them. In particular, the cascade DPM learned by ℓ1-LSVM is still about twice as computationally efficient as that learned by ℓ2-LSVM on average. Note that the classifier threshold is chosen to make precision equal to recall, as in Section 6.4.

Table 4. Comparisons between ℓ1- and ℓ2-LSVM with non-cascade DPMs on the PASCAL VOC 2007 dataset: γ is the sparsity, ℓp denotes ℓp-LSVM, Δ is the AP gap of ℓ1-LSVM relative to ℓ2-LSVM (ℓ2 AP minus ℓ1 AP), and speedup is the speedup factor of ℓ1-LSVM with regard to ℓ2-LSVM. The classifier uses a fixed threshold.

Method       Aero    Bike    Bird    Boat    Bottle  Bus     Car     Cat     Chair   Cow     Table
γ   ℓ2       .002    .000    .000    .001    .000    .001    .000    .001    .000    .001    .001
    ℓ1       .889    .898    .728    .707    .821    .906    .772    .505    .803    .815    .839
AP  ℓ2       .289    .595    .100    .152    .255    .496    .579    .193    .224    .252    .233
    ℓ1       .291    .593    .102    .136    .237    .491    .573    .171    .224    .192    .225
    Δ        −.002   .002    −.002   .017    .019    .005    .006    .022    .000    .060    .008
Speedup      3.71    4.32    2.39    2.37    3.17    4.12    2.88    1.70    2.89    3.30    3.54

Method       Dog     Horse   Mbike   Person  Plant   Sheep   Sofa    Train   TV      Avg.
γ   ℓ2       .001    .001    .002    .000    .001    .002    .000    .000    .000    .001
    ℓ1       .731    .859    .820    .496    .682    .810    .830    .846    .846    .780
AP  ℓ2       .111    .568    .487    .419    .121    .172    .336    .448    .416    .322
    ℓ1       .115    .554    .441    .383    .121    .195    .280    .429    .393    .307
    Δ        −.004   .014    .046    .036    .001    −.023   .056    .019    .023    .015
Speedup      2.50    3.93    3.06    1.68    2.15    2.88    3.23    3.19    3.27    3.01

Table 5. Comparisons between ℓ1- and ℓ2-LSVM with cascade DPMs on the PASCAL VOC 2007 dataset: the AP gap (Δ AP) and the speed-up are both relative to non-cascade ℓ2-LSVM; ℓ2 denotes non-cascade ℓ2-LSVM, ℓ2-C denotes cascade ℓ2-LSVM with PCA, ℓ1-C denotes cascade ℓ1-LSVM without PCA, and ℓ1/ℓ2 is the speed-up ratio of ℓ1-C to ℓ2-C. The classifier uses a threshold that makes precision equal to recall.

Method               Aero    Bike    Bird    Boat    Bottle  Bus     Car     Cat     Chair   Cow     Table
AP        ℓ2         .279    .513    .091    .147    .223    .419    .497    .164    .193    .203    .182
Δ AP      ℓ2-C       −.001   .001    .000    .018    −.001   −.001   .000    −.004   −.001   .028    −.002
          ℓ1-C       −.018   −.003   .000    −.022   −.014   −.017   .000    −.040   .003    −.035   −.009
Speed-up  ℓ2-C       11.2    24.1    10.0    10.7    20.1    10.6    8.69    16.4    15.5    7.18    14.9
          ℓ1-C       48.0    45.8    21.1    29.9    56.6    14.5    24.1    23.8    41.2    16.1    15.7
          ℓ1/ℓ2      2.42    1.90    2.11    2.78    2.82    1.38    2.78    1.45    2.66    2.24    1.05

Method               Dog     Horse   Mbike   Person  Plant   Sheep   Sofa    Train   TV      Avg.
AP        ℓ2         .091    .510    .411    .373    .115    .183    .288    .380    .373    .282
Δ AP      ℓ2-C       .000    .000    .046    .000    .000    −.003   −.006   −.001   −.001   .004
          ℓ1-C       .011    −.012   −.019   −.218   −.005   .011    −.040   .005    −.009   −.022
Speed-up  ℓ2-C       13.3    21.4    10.4    13.9    19.0    24.0    11.3    6.78    12.1    9.91
          ℓ1-C       32.3    33.1    24.7    27.7    29.9    31.8    16.7    11.1    20.6    16.2
          ℓ1/ℓ2      2.43    1.55    2.37    1.96    1.57    1.33    1.48    1.64    1.70    1.63

7. Conclusion

This paper proposes a method, ℓ1-LSVM, to learn a sparse deformable part model for efficient object detection. It is solved by a stochastic truncated sub-gradient descent method, and a proof of its convergence is presented. Besides, the tradeoff between the speedup and the overhead of constructing and indexing the sparse deformable part model is also discussed. The experimental results on the INRIA and PASCAL VOC 2007 datasets show that our model is highly sparse and that a small number of feature elements can reach detection performance comparable to that of ℓ2-LSVM. The detection score computation of our method is three times faster than that of the state-of-the-art ℓ2-LSVM method. Besides, our method is complementary to the cascade method for further speedup. We believe that our method is valuable not only for detection efficiency, but also for an insightful understanding of feature capability. Our work can be extended to many other fields, e.g., 3D face detection/recognition [37,38], object tracking, and human parsing.

Acknowledgment

This work was partly supported by the National 973 Program (2013CB329504), the National Natural Science Foundation of China (No. 61103107), and the Research Fund for the Doctoral Program of Higher Education of China (No. 20110101120154).

References

[1] J. Sun, Z. Wu, G. Pan, Context-aware smart car: from model to prototype, J. Zhejiang Univ. Sci. A 10 (7) (2009) 1049–1059.
[2] D. Geronimo, A. Lopez, A. Sappa, T. Graf, Survey of pedestrian detection for advanced driver assistance systems, IEEE Trans. Pattern Anal. Mach. Intell. 32 (7) (2010) 1239–1258.
[3] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR, 2005.
[4] Z. Tu, Probabilistic boosting-tree: learning discriminative models for classification, recognition, and clustering, in: ICCV, 2005.
[5] L. Bourdev, J. Malik, Poselets: body part detectors trained using 3D human pose annotations, in: ICCV, 2009.
[6] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. (2009) 1627–1645.


[7] L. Zhu, Y. Chen, A. Yuille, W. Freeman, Latent hierarchical structural learning for object detection, in: CVPR, 2010.
[8] P. Felzenszwalb, R. Girshick, D. McAllester, Cascade object detection with deformable part models, in: CVPR, 2010.
[9] M. Tan, Y. Wang, G. Pan, Feature reduction for efficient object detection via L1-norm latent SVM, in: IScIDE, 2012.
[10] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: CVPR, 2001.
[11] B. Wu, R. Nevatia, Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors, in: ICCV, 2005.
[12] O. Tuzel, F. Porikli, P. Meer, Human detection via classification on Riemannian manifolds, in: CVPR, 2007.
[13] X. Wang, T. Han, S. Yan, An HOG-LBP human detector with partial occlusion handling, in: ICCV, 2009.
[14] J. Zhang, K. Huang, Y. Yu, T. Tan, Boosted local structured HOG-LBP for object localization, in: CVPR, 2011.
[15] C. Huang, H. Ai, Y. Li, S. Lao, High-performance rotation invariant multiview face detection, IEEE Trans. Pattern Anal. Mach. Intell. 29 (4) (2007) 671–686.
[16] Y. Shan, F. Han, H.S. Sawhney, R. Kumar, Learning exemplar-based categorization for the detection of multi-view multi-pose objects, in: CVPR, 2006.
[17] B. Wu, R. Nevatia, Cluster boosted tree classifier for multi-view, multi-pose object detection, in: ICCV, 2007.
[18] J. Meynet, V. Popovici, J.-P. Thiran, Mixtures of boosted classifiers for frontal face detection, Signal Image Video Process. 1 (1) (2007) 29–38.
[19] H. Zhang, A. Berg, M. Maire, J. Malik, SVM-KNN: discriminative nearest neighbor classification for visual category recognition, in: CVPR, 2006.
[20] R. Ronfard, C. Schmid, B. Triggs, Learning to parse pictures of people, in: ECCV, 2006.
[21] B. Leibe, E. Seemann, B. Schiele, Pedestrian detection in crowded scenes, in: CVPR, 2005.
[22] M. Pedersoli, A. Vedaldi, J. Gonzalez, A coarse-to-fine approach for fast deformable object detection, in: CVPR, 2011.
[23] Y. Yang, D. Ramanan, Articulated pose estimation with flexible mixtures-of-parts, in: CVPR, 2011.
[24] C. Lampert, M. Blaschko, T. Hofmann, Beyond sliding windows: object localization by efficient subwindow search, in: CVPR, 2008.
[25] Y. Zhang, Y. Wang, G. Pan, Z. Wu, Efficient computation of histograms on densely overlapped polygonal regions, Neurocomputing 118 (2013) 141–149.
[26] R. Xu, B. Zhang, Q. Ye, J. Jiao, Cascaded L1-norm minimization learning (CLML) classifier for human detection, in: CVPR, 2010.
[27] Z. Sun, G. Bebis, R. Miller, Object detection using feature subset selection, Pattern Recognit. 37 (11) (2004) 2165–2176.
[28] Y. Wang, X. Tang, J. Liu, G. Pan, R. Xiao, 3D face recognition by local shape difference boosting, in: ECCV, 2008.
[29] Y. Chen, C. Lin, Combining SVMs with various feature selection strategies, in: I. Guyon (Ed.), Feature Extraction, Foundations and Applications, Springer, Berlin, 2006.
[30] S. Hussain, W. Triggs, et al., Feature sets and dimensionality reduction for visual object detection, in: BMVC, 2010.
[31] J. Langford, L. Li, T. Zhang, Sparse online learning via truncated gradient, J. Mach. Learn. Res. 10 (2009) 777–801.
[32] G. Yuan, K. Chang, C. Hsieh, C. Lin, A comparison of optimization methods and software for large-scale L1-regularized linear classification, J. Mach. Learn. Res. 11 (2010) 3183–3234.
[33] J. Zhu, S. Rosset, T. Hastie, R. Tibshirani, 1-norm support vector machines, in: NIPS, 2003.
[34] P.F. Felzenszwalb, D.P. Huttenlocher, Distance transforms of sampled functions, Theory Comput. 8 (2012) 415–428.
[35] M. Everingham, L. Van Gool, C. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.
[36] P. Felzenszwalb, R. Girshick, D. McAllester, Discriminatively Trained Deformable Part Models, 〈http://www.cs.brown.edu/~pff/latent-release4/〉, 2008.
[37] G. Pan, Z. Wu, Y. Pan, Automatic 3D face verification from range data, in: ICASSP, 2003.
[38] G. Pan, S. Han, Z. Wu, Y. Wang, 3D face recognition using mapped depth images, in: CVPR Workshops, vol. 3, June 2005, pp. 175–181.

Min Tan received the B.S. degree in School of Mathematical Science and Computing Technology from Central South University, Changsha, China, in 2009. She is currently working toward the Ph.D. degree at the CCNT Biometrics Lab in the college of Computer Science and Technology, Zhejiang University, China. Her research interests include computer vision, pattern recognition and machine learning.

Gang Pan received the B.Sc. and Ph.D. degrees in computer science from Zhejiang University, Hangzhou, China, in 1998 and 2004, respectively. He is currently a Professor with the College of Computer Science and Technology, Zhejiang University. He was with the University of California, Los Angeles as a visiting scholar during 2007–2008. His research interests include pervasive computing, computer vision, and pattern recognition. Dr. Pan co-authored more than 100 refereed papers. He has served as a Program Committee Member for more than ten prestigious international conferences and as a reviewer for various leading journals.

Yueming Wang received the Ph.D. degree from Zhejiang University, China, in 2007. From 2007 to 2010, he was a postdoctoral fellow in the Department of Information Engineering, the Chinese University of Hong Kong. Since 2011, he has been an Associate Professor in the Qiushi Academy for Advanced Studies, Zhejiang University, China. His research interests include 3D face processing and recognition, object detection, brainmachine interface, and statistical pattern recognition.

Yuting Zhang is a Ph.D. candidate in the Department of Computer Science at Zhejiang University, P.R. China, advised by Gang Pan. He was also a Junior Research Assistant in the Advanced Digital Sciences Center (Singapore), University of Illinois at Urbana-Champaign, during 2012. He received his B.E. degree in Computer Science from Zhejiang University, P.R. China in 2009. His research interests include computer vision and pattern recognition.

Zhaohui Wu received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, in 1993. From 1991 to 1993, he was with the German Research Center for Artificial Intelligence (DFKI) as a joint Ph.D. student in the area of knowledge representation and expert system. Currently he is a Professor of computer science with Zhejiang University and the Director of the Institute of Computer System and Architecture. He has authored 5 books and more than 200 refereed papers. His major interests include intelligent systems, semantic grid, and ubiquitous embedded systems. He is on the editorial boards of several journals and has served as PC member for various international conferences.