Neurocomputing 149 (2015) 473–482

Training mixture of weighted SVM for object detection using EM algorithm

De Cheng, Jinjun Wang*, Xing Wei, Yihong Gong

The Institute of Artificial Intelligence and Robotics, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xian Ning West Road No. 28, Shaanxi 710049, PR China

* Corresponding author.

Article history: Received 22 March 2014; Received in revised form 19 May 2014; Accepted 7 August 2014; Communicated by Yongzhen Huang; Available online 3 September 2014.

Abstract

Inspired by the divide-and-conquer principle and by discriminatively trained SVM models for object detection, we introduce a method for training a mixture of weighted SVM models using the EM algorithm. Specifically, we introduce a new part-weighted SVM with a logistic function that converts its prediction score into a pseudo-probability. The part weights are computed by an energy estimation method to reflect the discriminative power of different object parts, and the conversion of prediction scores into probabilities allows each input to be assigned to a proper SVM based on unbiased prediction scores across multiple SVM models. More importantly, these two modifications fit the joint training of multiple SVMs into the EM framework, where we iteratively reassign the object examples to different sub-regions of the entire input space and then retrain the SVM model corresponding to each sub-region. In this way, the mixture of SVM models becomes a set of “experts” that form the mixture of DPMs. Experimental results show that our proposed method makes noticeable improvements over the baseline method, which demonstrates the advantage of our proposed method for training MDPM-based models for object detection. Crown Copyright © 2014 Published by Elsevier B.V. All rights reserved.

Keywords: Mixture of weighted SVMs; EM; Detection

1. Introduction

Over the past several years, the Mixture of Deformable Part Model (MDPM) algorithm [9] has emerged as the leading method for object detection. The MDPM consists of two major components: first, a single Deformable Part Model (DPM) that describes an object using one coarse root filter and several smaller part filters together with their relative geometrical layout; and second, a discriminative training strategy that uses several component SVM classifiers to train the multiple DPMs. Improvements have been made from various perspectives: some works focus on proposing better configurations of the parts [11], while a few others focus on the training process to obtain better mixture models [8]. The discriminatively trained method uses the idea of “divide-and-conquer”, where a complex problem is divided into a set of sub-problems, each of which may be simpler to solve than the original one. This architecture allows a set of “experts” to specialize in different regions of the input space, with their outputs selectively combined. The “divide-and-conquer” principle has led to many elegant yet efficient solutions in practice, such as for classification and regression problems [1,7,13,6].


When applying the “divide-and-conquer” idea to object detection, the network of “experts” in the MDPM method exists at two different levels of the task. First, within any single DPM, the collaboration between the root filter and the part filters forms a set of “experts”: each filter focuses on either the global appearance of the object (the root filter) or a salient object part (a part filter). Second, the combination of multiple DPMs in the mixture forms another set of “experts”: each DPM focuses on a different appearance, shape, pose, or other attribute of the same object class. For example, in the seminal work of [9], HOG features were extracted at the root filter level and at the part filter level under different resolutions, and the features from these regions were concatenated to train the root filter and the part filters together, as well as the optimal geometrical layout of the parts. In general, however, only a small number of works have applied this idea to object detection. Part of the reason is that this form of mixture model needs a gate function as a local constraint to partition the input space, whereas popular discriminative classifiers such as the SVM operate on the entire space. In this paper, we are interested in improving the training process for MDPM. As explained above, for the object detection problem, the “divide-and-conquer” principle leads to the MDPM with two different levels of “expert” networks. Any image sent for detection is first assigned to the region of a proper “expert”, and then the filter response is calculated to measure the confidence in the existence of the stated object.


To optimize the set of “experts”, it is crucial to properly partition the entire input space and to train powerful discriminative classifiers in each partition so that they can best complement each other. As explained later, existing methods, such as the LSVM algorithm [9], usually apply heuristics for the data partition, which may not give optimal results. To tackle this problem, we propose several methods in this paper to fit the MDPM training problem into the EM framework, where both the assignment of data points to “experts” and the parameter learning of each “expert” are jointly optimized. As shown in our experiments, without incorporating other strategies to improve the MDPM method, performing EM training of the mixture of weighted SVMs alone already yields significantly improved object detection results. Our contributions in this paper are threefold. First, at the level of the mixture of multiple DPMs, we propose to use a logistic function to convert the SVM scores into pseudo-probabilities, and then combine the scores of multiple subcategory detectors in a Bayesian way, so that the entire input space can be better partitioned by the multiple DPMs. Second, at the single DPM level, we propose to re-weight different object parts based on their discriminative power, so that the multiple part filters can distinguish the object from the background more discriminatively. Third, we formulate the above two improvements under a strict EM framework so that the partition of the input space and the learning of the classifier in each region are jointly optimized for best performance. The remainder of the paper is organized as follows: in Section 2 we review some closely related work; in Section 3 we begin the introduction of our method by revisiting the MDPM algorithm; Section 4 presents the major ideas of the paper, with supporting experimental results listed in Section 5. Finally, in Section 6 we conclude this work and discuss some future research issues.

2. Related work

In the literature on discriminatively trained part-based object detection, improvements have been reported from various aspects. In this section we briefly review some representative works in the following four broad categories:

• Discriminatively trained SVM models. Since Pedro Felzenszwalb et al. proposed the discriminatively trained object detection model [9], many works have followed this method, such as [29,21]. This discriminatively trained method can cope with different shapes and poses of an object, and leads to state-of-the-art detection performance. Extensions to this work include adding more components of linear SVMs to form mixture models [10,8], or applying nonlinear cascaded SVMs [25].

• Part-based models. Recently, one of the successful directions in object detection is learning part-based models [2,20,14]. Although these works have made good use of learning methods [4], the field still lacks an in-depth study of what are good parts and good part configurations. Since the part configuration can strongly affect the performance of a part-based model, some researchers have aimed at finding more powerful configurations among the parts, such as the grammar model [11], where each part is configured elaborately, and the geometric And–Or quantization method recently proposed for scene modeling [26,28]; in addition, [22,21] adopt the idea of generative learning and propose And–Or Tree models for object detection. Other researchers instead learn what good parts are; for example, [17] used shared parts and designed weights for each part in their part-based model, which led to a large improvement in object detection.

• Strongly supervised learning procedures. Since the deformable parts are one of the major factors improving detection performance, some extensions of the DPM method explore strong supervision for the parts instead of treating them as latent variables, for example using annotated parts to optimize the model structure and to model occlusion [12], or the keypoint annotations used by the poselets in [3].

• Incorporating additional features to form a stronger description of an object. Some works extended DPM by combining several features, such as HOG and LBP (local binary pattern) features, and other color attributes or contextual information [24,23,5]; some even studied the description of object structure at the feature level [27]. These methods have greatly improved detection performance; however, they inarguably increase the detection computation by extracting more complex features.

In this paper, we propose a mixture of weighted SVM models trained with an EM algorithm for object detection, and obtain better performance than [8].

3. Mixture of Deformable Part Model (MDPM)

Currently, state-of-the-art object detection performance is achieved by the MDPM method [9]. In this section, we briefly introduce the MDPM method and discuss one key limitation of its training process. To begin with, we first revisit the DPM algorithm.

3.1. Deformable Part Model (DPM)

An object in an image can be detected by calculating the response of the image features to the DPM filters. A DPM is defined by one coarse root filter that approximately covers the entire object and several higher-resolution part filters that cover multiple smaller parts of the object. Denoting the image region covered by the current detection window as I, the score f(I) of the detection window is given by the scores of each filter at its respective location minus a deformation cost. Specifically,

f(I) = \sum_{k=0}^{K} w_k^\top \phi(I, p_k) - \sum_{k=1}^{K} d_k^\top \phi_d(dx_k, dy_k) + b,   (1)

where K is the number of parts (the root filter is indexed by k = 0), w_k is the weight vector of the k-th filter (either root or part), \phi(I, p_k) are image features such as HOG [16], and \phi_d(dx_k, dy_k) are deformation features based on (dx_k, dy_k), i.e. the displacement of the k-th part relative to its anchor position. In Eq. (1), d_k holds the trained deformation coefficients, and b is the bias term. For simplicity, Eq. (1) can be expressed as a dot product,

f(I) = w^\top \psi(I),   (2)

where w denotes the concatenation of model parameters and \psi(I) the object features of Eq. (1). Specifically,

w = [w_0, \ldots, w_K, d_1, \ldots, d_K, b],   (3)

and

\psi(I) = [\phi(I, p_0), \ldots, \phi(I, p_K), -\phi_d(dx_1, dy_1), \ldots, -\phi_d(dx_K, dy_K), 1].   (4)
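To make Eqs. (1)–(4) concrete, below is a minimal sketch of the score computation, assuming the HOG features and part displacements have already been extracted. The 4-dimensional deformation feature (dx, dy, dx², dy²) is an assumption following the common DPM convention, since Eq. (1) leaves \phi_d generic.

```python
import numpy as np

def dpm_score(w_filters, d_coeffs, b, phi, displacements):
    """Eq. (1): sum of filter responses minus deformation costs plus bias.

    w_filters[k], phi[k] : weight and feature vectors of the k-th filter
                           (k = 0 is the root, k = 1..K are parts)
    d_coeffs[k]          : deformation coefficients d_k, used for k >= 1
    displacements[k]     : (dx_k, dy_k) offset of part k from its anchor
    """
    score = b + sum(w @ f for w, f in zip(w_filters, phi))
    for k in range(1, len(w_filters)):
        dx, dy = displacements[k]
        phi_d = np.array([dx, dy, dx * dx, dy * dy])  # assumed deformation features
        score -= d_coeffs[k] @ phi_d
    return score
```

Equivalently, concatenating the parameters and features as in Eqs. (3)–(4) reduces the whole computation to the single dot product of Eq. (2).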

3.2. Mixture of Deformable Part Model (MDPM)

To further improve the detection performance of the DPM, the MDPM method was proposed [9] to extend the capacity of a single DPM. MDPM utilizes a mixture of multiple DPMs, each focusing on a different appearance, shape, pose, or other aspect of the same object class; e.g. in Fig. 1 an aeroplane is represented by three DPMs of different appearance.


Fig. 1. An aeroplane represented with three different prototype templates (PASCAL VOC2007), each modeled by a DPM.

Fig. 2. A single linear model cannot separate the data well into two classes; we therefore partition similar instances into subcategories, so that a good model can be learned per subcategory, and the subcategory models are then combined to separate the two classes well. In our multiple-SVM model, we combine the subcategory models to obtain better performance.

The score of the MDPM is calculated as

\hat{f}(I) = \max_{m = 1, \ldots, M} f_m(I) = \max_{m = 1, \ldots, M} w_m^\top \psi_m(I),   (5)

where M is the total number of components (DPMs) used and the subscript m denotes the m-th component. The reason for using multiple SVM models is to train a good model for each subcategory, as Fig. 2 shows; the subcategory models are then combined to better separate the two classes. For any positive example there exists at least one classifier that classifies it as positive, while every negative example is classified as negative by all the classifiers. In this way we can learn better-performing classifiers.

To train the M DPMs, an iterative method was proposed in [9]: first the entire set of training examples is divided into M subsets according to the aspect ratio of the object's bounding box, and then one linear SVM model is trained from each subset. In the next iteration, the object examples are reassigned based on the prediction scores of the SVM models from the previous iteration: an object example is assigned to the m-th SVM when f_m(I) is the greatest of all the SVM scores and the predicted bounding box overlaps the ground truth beyond a certain threshold. The M SVMs are then retrained using the new subsets of object examples. The iteration terminates after a fixed number of steps.

It should be noted that the example reassignment in this training process requires the SVM scores to be comparable across the different SVM models. Although the above iterative process learns the weights of the multiple sub-categories jointly, and there are interactions among the sub-category models, there is no guarantee that the different SVM models are comparable in their prediction scores; only [8] used a max-component regularization to make the margin requirements of different examples more compatible and to avoid a component model occasionally receiving no training samples. Moreover, it is often observed in experiments that, due to various reasons such as different ratios of positive/negative training examples or different SVM dimensions, the M SVM prediction scores are biased, as shown in Fig. 3. This motivated us to propose a method that makes the SVM prediction scores comparable and trains the multiple SVM models jointly. We borrow the idea from the Gaussian Mixture Model (GMM) training process [18], which also solves the data assignment and model retraining problems iteratively under the EM framework. In this paper, we present a new mixture-of-SVM training algorithm that focuses on two aspects of the problem: the assignment of training examples based on comparable SVM prediction scores among multiple SVM models, and a modified SVM objective function that considers the differences among multiple object parts. The two proposed improvements are formulated in an EM framework, as presented in the next section.

4. Training mixture of weighted SVM

Applying the EM algorithm to train a GMM is relatively easy, since the prediction of each Gaussian component in a GMM is directly a generative probability. Applying the same strategy to train a mixture of SVMs poses two major challenges: first, the SVM is discriminative rather than generative, and second, its prediction score is not a probability. This section discusses how the mixture-of-SVM training process can be fit into the EM framework by addressing these two challenges.

4.1. Mixture of weighted SVM model

For an image I, the mixture of SVM models calculates

\Psi(I) = \sum_{m=1}^{M} \pi_m f'_m(I),   (6)

where \pi_m is the importance of the corresponding SVM component, with

\sum_{m=1}^{M} \pi_m = 1.   (7)


Fig. 3. The bias between the scores of different SVMs. Two component SVMs scored the same positive training data (HOG features of the aeroplane class): (a) SVM1, whose mean score is 1.8, and (b) SVM2, whose mean score is 2.7. Because the scores are biased, more positive examples are often assigned to SVM2, whose scores are generally higher.

Fig. 4. SVM scores normalized by the logistic regression method. When the scores of the two component SVMs from Fig. 3 are converted into probabilities on the same positive examples, they become comparable.

The term f'_m(I) in Eq. (6) differs from f_m(I) in Eq. (5) because, as explained before, the latter is biased across different components, as shown in Fig. 3. To make the f_m(I) comparable among the multiple SVMs, we need to normalize them into the same range; to further use them for data assignment in the EM algorithm, we require them to be probabilities.

4.2. Transforming SVM score into probability

Several previous works have proposed solutions for converting the SVM prediction score into a pseudo-probability. In this paper we adopt the idea of [15,19] and utilize a logistic function:

f'_m(I) = \frac{1}{1 + e^{\beta_m f_m(I) + \gamma_m}}.   (8)

The parameters \beta_m and \gamma_m can be obtained through a logistic regression process by minimizing the following negative log-likelihood function:

\beta^*_m, \gamma^*_m = \arg\min_{\beta, \gamma} -\sum_{i=1}^{N} \left( \sigma_i y_i (\beta f_m(I_i) + \gamma) - \log\left(1 + e^{\beta f_m(I_i) + \gamma}\right) \right),   (9)

where I_i denotes the i-th training example, y_i is the corresponding label, f_m(I_i) denotes the SVM prediction score of the m-th component, and \sigma_i is the weight of the example. By using the logistic function to normalize the SVM predictions into probability scores, the bias of the SVM predictions shown in Fig. 3 is normalized into the same range [0, 1], so that the different SVM predictions become comparable, as shown in Fig. 4. In this way, the assignment of a training example is purely based on how well it can be recognized by a specific SVM model. This allows the entire feature space to be partitioned by the multiple SVM models, where each data point is assigned to the SVM model that best distinguishes it from the opposite class. As can be seen from our experiments, even without the EM framework to iteratively improve the trained models, utilizing the proposed score normalization strategy alone already achieves noticeably improved object detection performance. A minimal sketch of this calibration step is given below.
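The following sketch fits (\beta_m, \gamma_m) of Eq. (8) by gradient descent on the negative log-likelihood of Eq. (9); as an assumption, the labels y_i are in {0, 1} and the example weights \sigma_i are uniform.

```python
import numpy as np

def fit_platt(scores, labels, lr=0.01, iters=2000):
    """Fit f'(s) = 1 / (1 + exp(beta * s + gamma)) to scores with 0/1 labels."""
    beta, gamma = -1.0, 0.0  # beta < 0 so the probability grows with the score
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(beta * scores + gamma))  # Eq. (8)
        g = labels - p       # gradient of the NLL w.r.t. z = beta * s + gamma
        beta -= lr * np.mean(g * scores)
        gamma -= lr * np.mean(g)
    return beta, gamma

# Usage: probs = 1 / (1 + np.exp(beta * svm_scores + gamma)) now lie in [0, 1]
# and can be compared across the M component SVMs, as in Fig. 4.
```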


Before presenting our EM training framework, we next discuss a modified weighted-part SVM objective function that accounts for the different discriminative power of the parts in a DPM.

4.3. Training weighted part model

In the traditional DPM [9], all part filters are treated equally, so that even in an image where a certain part is missing due to occlusion or a partial object, the corresponding part filter will still find a location that gives some response. As observed in our experiments, this happens very often for some object parts but not for others; in other words, some parts are more discriminative than others. Hence, if the SVM objective function takes this difference into account, the obtained DPM becomes more discriminative. In this section we present a modified SVM objective function for this purpose. To begin with, we rewrite Eqs. (3) and (4) as

w = [[w_0, b], [w_1, d_1], \ldots, [w_K, d_K]] = [w_0, w_1, \ldots, w_K],   (10)

and

\psi(I) = [[\phi(I, p_0), 1], [\phi(I, p_1), -\phi_d(dx_1, dy_1)], \ldots, [\phi(I, p_K), -\phi_d(dx_K, dy_K)]] = [\psi_0(I), \psi_1(I), \ldots, \psi_K(I)].   (11)

We define a modified SVM objective function:

\arg\min_{w} \frac{1}{2} \sum_{k=0}^{K} \alpha_k \|w_k\|^2 + c \sum_{i=1}^{N} \max(0, 1 - y_i w^\top \psi(I_i)), \quad \text{s.t.} \quad \sum_{k=1}^{K} \alpha_k = K.   (12)

Note that in Eq. (12) the constraint is only imposed on the part filters (i.e. k \in [1, K]). In Eq. (12) we divide the SVM parameters into K+1 parts, and the weight \alpha_k of each part is computed according to the importance of part k. \alpha_k serves as a balance between how much we rely on the hinge loss and how much on the max-margin penalty: when \alpha_k \le 1, the corresponding part is a more important part, since its parameter w_k can grow in order to minimize the objective function. Based on this property, we can highlight the salient parts by setting a lower \alpha_k and thereby improve the detection performance. In Fig. 5 we show a comparison with and without the above part-weighting strategy. As can be seen, for the “aeroplane” class, where the salient object parts are the aerofoil and the head of the aeroplane, the obtained part filters correctly emphasize the corresponding parts. Similarly, for the “car” class, where the most salient parts are the wheels, the learned part filters for the wheels become more important than the other parts.

In order to calculate a suitable part weight \alpha_k, we evaluated different importance measures, including energy estimation, an ROC-based measure, and the T-test. We found experimentally that the energy-estimation-based criterion gives performance similar to the ROC-based measure, while the T-test gives very poor results; hence in this paper we adopt the energy-estimation-based measure. To calculate \alpha_k, we first use the un-weighted SVM parameters \tilde{w} to derive the energy \rho_k, the normalized squared L2 norm of the filter:

\rho_k = \frac{\|\tilde{w}_k\|^2}{\sum_{j=1}^{K} \|\tilde{w}_j\|^2}.   (13)

Then we normalize it by

\alpha_k = \frac{K e^{\lambda \rho_k}}{\sum_{j=1}^{K} e^{\lambda \rho_j}},   (14)

where \lambda is a negative parameter; in the experiments we set \lambda = -10. For the root filter, we fix its importance to \alpha_0 = 1. A small sketch of this computation follows.

Based on the logistic function to convert the SVM prediction score into a pseudo-probability, and the energy estimation based part weighting scheme, in this section we integrate everything together under the EM framework. Initialization: Before the EM iterations, we divide the positive training examples into M subsets according to the aspect ratios of the bounding boxes. Our experimental results show that such data partition criteria are not very sensitive while the value of M is more critical. All the M subsets use the entire background examples as the negative training samples. Each subset training sample is used to train the initial SVM. Our experiments are based on [8] and the initialization is just the same as [8], we set the number of parts K ¼8, and the number of components M ¼3. Then for each positive example Ii: E-step: Evaluate the responsibility pm ðI i Þ on the M classifiers. > Firstly, compute the SVM scores f m ðI i Þ ¼ wm ψ m ðI i Þ; m ¼ 1; …; M, then transform each SVM score f m ðI i Þ into a probability using 0 Eq. (8) to obtain f m ðI i Þ, finally we compute each classifier response pm ðI i Þ by pm ðI i Þ ¼

0 m  f m ðI i Þ 0 M ∑m ¼ 1 m  f m ðI i Þ

π

π

ð15Þ
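For concreteness, a minimal sketch of the E-step of Eq. (15), assuming the mixing weights \pi_m and the calibrated probabilities f'_m(I_i) are given as arrays:

```python
import numpy as np

def responsibilities(pi, probs):
    """pi: (M,) mixing weights; probs: (M,) calibrated scores f'_m(I_i)."""
    weighted = pi * probs              # numerator of Eq. (15)
    return weighted / weighted.sum()   # p_m(I_i) over the M experts
```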

Fig. 5. Illustration of the property of the part-weighted SVM model. The top pictures in (a) and (b) are templates trained without part weighting, and the bottom ones are part-weighted templates. For the aeroplane in (a), the part-weighted SVM classifier highlights the aerofoil and the head of the plane, while in (b) it highlights the wheels of the car; these are the salient parts of each object.


M-step: Assign example I_i to the expert with the greatest p_m(I_i). Then update \pi_m by

\pi_m = \frac{N_{P_m}}{N_P},   (16)

where N_{P_m} = \sum_{i=1}^{N_P} p_m(I_i) and N_P is the total number of positive examples. After updating the training-example subsets, we retrain the SVM model parameters w' = (w_1, \ldots, w_M) and \beta' = ((\beta_1, \gamma_1), \ldots, (\beta_M, \gamma_M)) for the corresponding subsets using Eqs. (12) and (9). The EM training process stops when the negative log-likelihood stops decreasing.

To show the effectiveness of the above EM training procedure, in Fig. 6(a)–(c) we show the predictions of two DPMs on the positive training examples at several iteration steps. The prediction probabilities of the first DPM are plotted in red and those of the second DPM in green, with the probabilities sorted by the first DPM so that they form a red curve. As can be seen from Fig. 6(a), in the first iteration the trends of the two DPMs are very similar, as the green dots are distributed near the red curve. After 25 iterations, Fig. 6(c) shows that the predictions of the two DPMs have become highly complementary: examples that obtain a very low score from the first DPM can be reliably recognized by the second DPM (e.g. the first 100 examples), while examples 100–300 are reliably recognized by the first DPM. In Fig. 8 we see an overall increase of the log-likelihood during the EM iterations. The detailed implementation of the EM process is shown in Algorithm 1.

Algorithm 1. The training process for object detection.
1: Input: positive examples S_p = {I_1, \ldots, I_{N_P}}; negative images S_n = {J_1, \ldots, J_{N_N}}; initial model w' = (w_1, \ldots, w_M), \beta' = ((\beta_1, \gamma_1), \ldots, (\beta_M, \gamma_M))
2: Output: new model w'_new = (w_1, \ldots, w_M), \beta'_new = ((\beta_1, \gamma_1), \ldots, (\beta_M, \gamma_M))
3: Initialize hard negative sets N_m \gets \emptyset and positive sets P_m \gets \emptyset for all m = 1, \ldots, M; set \pi_m = 1/M
4: for Iter = 1 to MaxIteration do
5:   for m = 1 to M do
6:     N_m \gets hard negative examples in S_n [9] using the current trained model
7:   end for
8:   E-step:
9:   for i = 1 to N_P do
10:    compute p_m(I_i) using Eq. (15), m = 1, \ldots, M
11:  end for
12:  M-step:
13:  for i = 1 to N_P do
14:    assign the positive example I_i to the subset P_m with the greatest p_m(I_i)
15:  end for
16:  for m = 1 to M do
17:    compute the weights \alpha_k based on Eqs. (13) and (14)
18:    train the weighted part SVM parameters w' = (w_1, \ldots, w_M) based on {N_m, P_m} using Eq. (12)
19:    train the logistic regression parameters (\beta_m, \gamma_m) based on {N_m, P_m} using Eq. (9)
20:    update \pi_m using Eq. (16)
21:  end for
22: end for

To summarize, by fitting the mixture-of-weighted-SVM training process into the EM framework, we obtain multiple DPMs that complement each other for object detection. We have also proposed a weighted SVM training objective function that considers the different discriminative power among object parts. As demonstrated in the next section, without incorporating other recent improvements on top of the traditional MDPM algorithm, simply improving the training process with our proposed method already gives superior object detection performance. A compact sketch of the overall training loop is given below.
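As referenced above, here is a compact structural sketch of the loop in Algorithm 1. The helpers `train_weighted_svm` (Eq. (12)), `svm_scores`, and `mine_hard_negatives` are hypothetical placeholders, `fit_platt` is the calibration sketch from Section 4.2, `calib_x`/`calib_y` is an assumed fixed calibration set of examples and 0/1 labels, and latent part placement is omitted; this is an outline under those assumptions, not the authors' implementation.

```python
import numpy as np

def train_mixture(positives, neg_images, calib_x, calib_y, M=3, max_iter=25):
    subsets = [positives[m::M] for m in range(M)]          # crude initial split
    pi = np.full(M, 1.0 / M)
    models = [train_weighted_svm(P_m, neg_images) for P_m in subsets]
    platt = [fit_platt(svm_scores(mdl, calib_x), calib_y) for mdl in models]
    for _ in range(max_iter):  # or stop when the NLL stops decreasing
        negs = [mine_hard_negatives(mdl, neg_images) for mdl in models]  # step 6
        # E-step (steps 9-11): responsibilities from calibrated scores, Eq. (15)
        probs = np.array([1 / (1 + np.exp(b * svm_scores(mdl, positives) + g))
                          for mdl, (b, g) in zip(models, platt)])        # (M, N_P)
        resp = pi[:, None] * probs
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step (steps 13-21): hard-assign positives, retrain, update pi
        assign = resp.argmax(axis=0)
        for m in range(M):
            P_m = [x for x, a in zip(positives, assign) if a == m]
            models[m] = train_weighted_svm(P_m, negs[m])                 # Eq. (12)
            platt[m] = fit_platt(svm_scores(models[m], calib_x), calib_y)  # Eq. (9)
        pi = resp.sum(axis=1) / resp.shape[1]                            # Eq. (16)
    return models, platt, pi
```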

5. Experiments

We evaluated our method on the 20-category detection benchmark of the PASCAL VOC2007 dataset, following the experimental protocols and evaluation criteria used in the detection contest. A detection is considered correct if the intersection-over-union (IoU) between its bounding box and the ground truth is greater than 50%; a minimal helper implementing this criterion is sketched below.
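The boxes below are given as (x1, y1, x2, y2); this is the standard IoU computation of the VOC protocol, not code from the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# A detection is counted as correct when iou(pred, gt) > 0.5.
```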

Fig. 6. Illustration of the complementary property of the classifiers on the training data, using two SVM classifiers on the aeroplane feature data. Panels (a), (b), and (c) show iterations 1, 4, and 25 of the EM training procedure; the horizontal axis indexes the positive examples, ordered by their scores on classifier 1, the vertical axis is the classifier output, red points are scores on SVM1, and green points are scores on SVM2. The two classifiers tend to cluster to different sub-classes and complement each other in the classification result. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Table 1. Performance comparison using average precision (AP) for the 20 object categories in the PASCAL VOC2007 dataset. The five models below use the HOG feature only; our proposed model is implemented on the fourth release of the publicly available DPM software [8], and the performance is obtained with post-processing, such as the bounding-box prediction that DPM, voc-r4, voc-r5, and our method all use.

Object class   aero bike bird boat bottle bus  car  cat  chair cow  tble dog  horse mbike pers plant sheep sofa train tv   Avg.
DPM [9]        28.7 55.1  0.6 14.5 26.5  39.7 50.2 16.3 16.5  16.6 24.5  5.0 45.2  38.3  36.2  9.0  17.4  22.8 34.1  38.4 26.8
voc-r5 [10]    33.2 60.3 10.2 16.1 27.3  54.3 58.2 23.0 20.0  24.1 26.7 12.7 58.1  48.2  43.2 12.0  21.1  36.1 46.0  43.5 33.7
3-layer [29]   29.4 55.8  9.4 14.3 28.6  44.0 54.2 21.3 20.0  19.3 25.2 12.5 50.4  38.4  36.6 15.1  19.7  25.1 36.8  39.3 29.6
voc-r4 [8]     28.9 59.5 10.0 15.2 25.5  49.6 57.9 19.3 22.4  25.2 23.3 11.1 56.8  48.7  41.9 12.2  17.8  33.6 45.1  41.6 32.3
DPM+Wpart      32.9 59.6 10.4 17.7 25.3  51.7 58.2 23.9 22.4  24.8 24.7 11.5 58.7  49.1  42.1 13.4  22.2  35.2 46.0  42.4 33.6
EM+DPM         33.5 60.3 10.4 15.9 27.9  53.3 58.3 25.4 22.6  25.3 28.4 11.8 59.0  49.8  42.5 14.5  18.6  34.0 45.8  42.1 34.0
EM+Wpart+DPM   33.3 60.3 10.5 17.4 29.1  53.4 58.6 27.1 22.9  26.5 27.9 11.5 60.5  49.4  42.6 15.1  19.5  35.7 46.2  42.9 34.5

Table 2. Performance comparison using average precision (AP) for the 20 object categories in the PASCAL VOC2010 dataset. The two models below use the HOG feature only; our proposed model is implemented on the fourth release of the publicly available DPM software [8], and the performance is obtained with post-processing, such as the bounding-box prediction that voc-r4 and our method both use.

Object class   aero bike bird boat bottle bus  car  cat  chair cow  tble dog  horse mbike pers plant sheep sofa train tv   Avg.
voc-r4 [8]     43.9 49.0  9.2 12.3 27.2  49.2 43.2 24.0 15.1  21.6 11.8 10.8 41.6  43.5  41.5  9.1  27.4  17.5 41.1  33.2 28.6
Ours           44.7 49.5  9.7 13.1 27.6  49.3 43.5 25.1 15.7  21.9 12.8 17.1 42.1  43.6  41.9  9.7  27.7  17.9 41.6  33.1 29.4

Fig. 7. Illustration of the discriminating capacity of the part-weighted SVM. Considering each part filter and the root filter as an independent classifier, we plot ROC curves on the same aeroplane test data: (c) is the result of training with our proposed part-weighted SVM, (b) is without it, and (a) magnifies the final DPM classifiers in a local region. The discriminating power differs between parts, and the overall discriminative capacity of the DPM improves with the part-weighted SVM. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Finally, we computed Precision–Recall (PR) curves and scored the average precision (AP) on the dataset, and compared our results with the original MDPM algorithm [9] as well as several other works, including [8–10,29]. The results are listed in Table 1. Since our implementation is based on the public code from [8], we regard it as the comparison baseline. As can be seen from Table 1, by considering the different discriminative power of object parts in our proposed SVM objective function (DPM+Wpart in Table 1), we improve the mAP score by 1.3%. By incorporating the proposed EM framework for training the MDPM (EM+DPM in Table 1), our detection mAP score is 1.7% higher than the baseline. Further combining the weighted part model yields another 0.5% mAP improvement, resulting in a 34.5% detection mAP score (EM+Wpart+DPM in Table 1). According to the results, our performance is also superior to the other benchmark works [10] (Table 2). As explained before, our contribution mainly concerns the proposed EM framework for jointly training multiple SVM models in the MDPM algorithm, rather than considering the grammar topology of [11] or utilizing hierarchical deformable part models and structural SVM learning [29,4]. To provide more insight into our method, the following paragraphs discuss various aspects of our framework.

Part-weighted SVM: As explained in Section 4.3, by analyzing the discriminative power of different object parts, we introduce a modified SVM objective function that allows more important parts to be more responsive during detection. As shown in Fig. 5, the adopted strategy leads to a more discriminative DPM and improves the detection performance by 1.3% mAP. To measure the importance of a part, we evaluated the energy-estimation-based criterion, which proved effective. To further analyze whether a more important part is indeed made more discriminative, in Fig. 7 we plot the individual classification ROC curves of the different filters.

As can be seen from Fig. 7(b), without the part-weighted SVM the ROC curves of the different object parts are very similar, while in Fig. 7(c) the differences between the curves become larger. For instance, the dotted blue line (wpart2) in Fig. 7(b) corresponds to an object part identified as important in our experiments: in Fig. 7(b) its curve is mixed with the others, while in Fig. 7(c) it is made significantly more discriminative. In contrast, the solid yellow line (wpart5) in Fig. 7(b) corresponds to an object part with inconsistent appearance; as shown in Fig. 7(c), it is made less discriminative, so that the response from this part does not confuse the overall detection. In this way, the overall MDPM trained by the part-weighted SVM is more discriminative, as seen in Fig. 7(a), leading to improved detection performance.

Training the mixture model using the EM algorithm: The EM algorithm is based on strict Bayesian theory, where in each iteration of either the E-step or the M-step the overall log-likelihood should monotonically increase. As explained before, our focus in this paper is to fit the training process of multiple SVM models into the EM framework. To see whether applying a logistic function to convert the SVM scores into pseudo-probabilities satisfies this purpose, we calculate the overall log-likelihood score on the training samples in Fig. 8. Note that since different hard-negative examples are re-sampled in each iteration of Algorithm 1, we had to collect a fixed set of positive and negative examples to calculate the log-likelihood scores during the iterations. As can be seen from Fig. 8, although training for both the aeroplane and the car classes converged after 17 iterations, in Fig. 8(b) the log-likelihood score does not always increase, which means that the proposed training process is not a strict EM process. This is due to two reasons: first, there is no locality constraint for each SVM, so the actual assignment of each training example is not exclusive to any single SVM; second, the logistic function only produces a pseudo-probability. Hence the actual improvement comes from the normalization effect of the logistic function, which makes the different SVM models comparable in their prediction scores. In this way, the proposed training process helps obtain a better partition of the feature space, so that the “experts” located in different regions can jointly describe the input data more discriminatively.

Fig. 8. Illustration of the EM training framework via the log-likelihood computed at each iteration, on the aeroplane (a) and car (b) feature data. The horizontal axis is the EM iteration number and the vertical axis is the log-likelihood, which increases overall during training.

6. Conclusion

This paper presented a method for training a mixture of deformable part-weighted SVM models using an EM algorithm for object detection. We presented two major contributions: a part-weighted SVM model that highlights locally salient features, which greatly helps improve the detection performance, and an EM framework to train the mixture model. Experimental results validate the superiority of our method. In future work, we plan to extend our mixture of SVM models to other classification and regression settings, and to investigate better strategies for selecting object parts.

Acknowledgements

This work is supported by the National Basic Research Program of China (973 Program) under Grant no. 2012CB316400, and the National Science Foundation of China (NSFC) under Grant no. 61332018.

References

[1] O. Aghazadeh, H. Azizpour, J. Sullivan, S. Carlsson, Mixture component identification and learning for visual recognition, in: Proceedings of ECCV, 2012, pp. 115–128.
[2] X. Bai, X. Wang, L.J. Latecki, W. Liu, Z. Tu, Active skeleton for non-rigid object detection, in: ICCV, 2009, pp. 575–582.
[3] T. Brox, L. Bourdev, S. Maji, et al., Detecting people using mutually consistent poselet activations, in: ECCV, 2010, pp. 168–181.
[4] C.-N.J. Yu, T. Joachims, Learning structural SVMs with latent variables, in: ICML, 2009, pp. 1062–1069.
[5] R.G. Cinbis, S. Sclaroff, Contextual object detection using set-based classification, in: ECCV, 2012, pp. 43–57.
[6] R. Collobert, S. Bengio, Y. Bengio, A parallel mixture of SVMs for very large scale problems, Neural Computation, 2002, pp. 1105–1114.
[7] D. DeSieno, Adding a conscience to competitive learning, in: IEEE International Conference on Neural Networks, 1988, pp. 117–124.
[8] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, Discriminatively trained deformable part models, release 4, <http://people.cs.uchicago.edu/~pff/latent-release4/>.
[9] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010) 1627–1645.
[10] R.B. Girshick, P.F. Felzenszwalb, D. McAllester, Discriminatively trained deformable part models, release 5, <http://people.cs.uchicago.edu/~rbg/latent-release5/>.
[11] R.B. Girshick, P.F. Felzenszwalb, D.A. McAllester, Object detection with grammar models, in: Advances in Neural Information Processing Systems, 2011, pp. 442–450.
[12] H. Azizpour, I. Laptev, Object detection using strongly supervised deformable part models, in: ECCV, 2012, pp. 836–849.
[13] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, G.E. Hinton, Adaptive mixtures of local experts, Neural Comput. 3 (1) (1991) 79–87.
[14] A.K. Jain, Y. Zhong, S. Lakshmanan, Object matching using deformable templates, IEEE Trans. Pattern Anal. Mach. Intell. 18 (3) (1996) 267–278.
[15] H.-T. Lin, C.-J. Lin, R.C. Weng, A note on Platt's probabilistic outputs for support vector machines, Mach. Learn. 68 (3) (2007) 267–276.
[16] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR, 2005, pp. 886–893.
[17] P. Ott, M. Everingham, Shared parts for deformable part-based models, in: CVPR, 2011, pp. 1513–1520.
[18] D.A. Reynolds, Gaussian mixture models, Encyclopedia of Biometrics (2009) 659–663.
[19] S.K. Divvala, A.A. Efros, M. Hebert, How important are “deformable parts” in the deformable parts model? in: Computer Vision – ECCV Workshops and Demonstrations, 2012, pp. 31–40.
[20] P. Schnitzspan, M. Fritz, S. Roth, B. Schiele, Discriminative structure learning of hierarchical representations for object detection, in: CVPR, 2009, pp. 2238–2245.
[21] X. Song, T. Wu, Y. Jia, S.C. Zhu, Discriminatively trained and-or tree models for object detection, in: Proceedings of the Conference on Computer Vision and Pattern Recognition, 2013, pp. 3278–3285.
[22] X. Song, T. Wu, Y. Xie, Y. Jia, Learning global and reconfigurable part-based models for object detection, in: 2012 IEEE International Conference on Multimedia and Expo (ICME), 2012, pp. 13–18.
[23] Z. Huang, Z. Song, Q. Chen, et al., Contextualizing object detection and classification, in: CVPR, 2011, pp. 1585–1592.
[24] A. Torralba, K.P. Murphy, W.T. Freeman, Contextual models for object detection using boosted random fields, in: Advances in Neural Information Processing Systems, 2004, pp. 1401–1408.
[25] M. Varma, A. Vedaldi, V. Gulshan, Multiple kernels for object detection, in: Proceedings of ICCV, 2009, pp. 603–613.
[26] S. Wang, Y. Wang, S.C. Zhu, Hierarchical space tiling for scene modeling, in: Proceedings of ACCV, 2013, pp. 796–810.
[27] Y. Yu, J. Zhang, K. Huang, et al., Boosted local structured HOG-LBP for object localization, in: CVPR, 2011, pp. 1393–1400.
[28] J. Zhu, T. Wu, S.-C. Zhu, X. Yang, W. Zhang, Learning reconfigurable scene representation by tangram model, in: IEEE Workshop on Applications of Computer Vision (WACV), 2012, pp. 449–456.
[29] L. Zhu, Y. Chen, A. Yuille, W. Freeman, Latent hierarchical structural learning for object detection, in: CVPR, 2010, pp. 1062–1069.

De Cheng received the B.S. degree in automation control from Xi'an Jiaotong University, Xi'an, China, in 2011, and is currently a Ph.D. candidate in the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. His research interests include pattern recognition and machine learning, specifically in the areas of object detection and image classification.

Jinjun Wang (M'10) received the B.E. and M.E. degrees from Huazhong University of Science and Technology, China, in 2000 and 2003 respectively, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2006. Currently he is a professor at Xi'an Jiaotong University (XJTU), China. Prior to joining XJTU, he worked at NEC Laboratories America and Epson Research USA from 2006 to 2013 as a research scientist and senior research scientist. His research interests include computer vision, pattern classification, image/video enhancement and editing, and content-based multimedia retrieval.

Xing Wei received the B.S. degree in automation control from Xi'an Jiaotong University, Xi'an, China, in 2013, and is currently a Ph.D. candidate in the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. His research interests include machine learning and pattern recognition, object detection and 3D reconstruction.

Yihong Gong (SM'10) received the B.S., M.S., and Ph.D. degrees in electronic engineering from the University of Tokyo, Tokyo, Japan, in 1987, 1989, and 1992, respectively. In 1992, he joined Nanyang Technological University, Singapore, as an assistant professor with the School of Electrical and Electronic Engineering. From 1996 to 1998, he was a Project Scientist with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA. From 1999 to 2009 he was the Head of the Department of Information Analysis and Management, NEC Laboratories America, Inc., Cupertino, CA, USA. Currently he is with Xi'an Jiaotong University (XJTU), China, as a professor. His research interests include image and video analysis, multimedia database systems, and machine learning.