
Masked Face Detection via A Modified LeNet

Shaohui Lin a,b, Ling Cai a,b, Rongrong Ji a,b,∗

a Fujian Key Laboratory of Sensing and Computing for Smart City, Fujian, P. R. China
b School of Information Science and Engineering, Xiamen University, P. R. China

Abstract

Detecting masked faces in the wild has been an emerging topic recently, with rich applications ranging from violence video retrieval to video surveillance. Accurate detection remains an open problem, mainly due to the difficulties of low resolution and arbitrary viewing angles, as well as the limited availability of training samples. Such difficulties have significantly challenged the design of effective handcrafted features as well as robust detectors. In this paper, we tackle this problem by proposing a learning-based feature design and classifier training paradigm. More particularly, a modified LeNet, termed MLeNet, is presented, which changes the number of units in the output layer of LeNet to suit the target classification task. Meanwhile, MLeNet further increases the number of feature maps while using a smaller filter size. To further reduce overfitting and improve the performance with a small quantity of training samples, we first enlarge the training set by horizontal reflection and then learn MLeNet by combining pre-training and fine-tuning. We evaluate the proposed model on a real-world masked face detection dataset. Quantitative evaluations against several state-of-the-art and alternative solutions demonstrate the accuracy and robustness of the proposed model.

Keywords: Masked Face Detection, Sliding Window, Convolutional Neural Networks, Non-maximum Suppression, LeNet, Pre-training

∗ Corresponding author. Email addresses: [email protected] (Shaohui Lin), [email protected] (Ling Cai), [email protected] (Rongrong Ji)


1. Introduction

Detecting video clips related to potential terrorists remains a fundamental demand in the management of massive-scale video corpora, and is highly beneficial to public security applications. A variety of definitions exist for identifying a person as a terrorist in a given video clip, among which one obvious cue is a masked face. As a specific task of face detection, the detection of masked faces poses significant difficulties, and it differs from traditional face detection (potentially with partial occlusions), which has been studied intensively for decades. On one hand, it encompasses challenges such as pose and lighting variations that have historically hampered traditional face detection paradigms. On the other hand, its severe occlusion significantly challenges existing face detection algorithms, since much of the facial structure is missing.

Tracking back through the literature, previous works in face detection mainly rely on handcrafted feature designs, such as the well-known Fisherface [1], Haar-like features with a cascade detector [2], and Gabor-like high-dimensional features with an AdaBoost detector [3]. One essential limitation lies in the need for a sufficient amount of training samples to achieve satisfactory detection accuracy. Recently, exemplar-based face detection [4] has been shown to be effective, because a large exemplar database is leveraged to cover all possible visual variations. However, it requires a large face database for detection and tends to produce false alarms in the presence of highly cluttered backgrounds. In order to reduce the number of required exemplars, the efficient boosted exemplar-based face detector [5] was proposed to further improve detection accuracy and make the detector faster and more memory efficient, by discriminatively training and selectively assembling exemplars as weak detectors in a boosting framework. However, these methods fail when only a small face training set is available. Recently, deep learning architectures have been studied as well, which use CNNs with GPU-based computing to bring breakthroughs in benchmark evaluations, such as Labeled Faces in the Wild (LFW) [6][7][8] and the Face Detection Data Set and Benchmark (FDDB) [9][10]. In particular, a convolutional network can automatically learn effective feature representations of objects from training data [11][12]. Most notably, AlexNet [13] showed ground-breaking performance on the ImageNet 2012 classification challenge. Since then, CNNs have led the ImageNet classification and object detection benchmarks, e.g., GoogLeNet [14] with about 6.8M parameters, ResNet-18 [15] with about 11.6M parameters, and VGG-19 [16] with about 144M parameters. However, models with such large numbers of parameters overfit when trained on a small dataset, especially on our real-world masked face detection dataset with about 1000 training samples. To tackle the challenge induced by limited training data, Hinton and Salakhutdinov [17] introduced pre-training to generate a good initialization for large deep neural networks. In contrast, the LeNet introduced in [18] shows good performance in recognizing hand-written digit characters with relatively few parameters. However, the need for a large amount of training data still hinders its direct application in our scenario of masked face detection. In view of this issue, in this paper we introduce a modified LeNet, termed MLeNet, which modifies the number of units in the output layer of LeNet to suit a specific classification task with a small quantity of training samples. Meanwhile, MLeNet further increases the number of feature maps with a smaller filter size, as shown in Table 1, which further improves classification performance with a network overhead comparable to LeNet. Combining MLeNet with a sliding window, the detection of masked faces is done in a multi-scale fashion.

The parameters of MLeNet can be learned via stochastic gradient descent, and we combine pre-training and fine-tuning to prevent MLeNet from overfitting. Notably, pre-training is done by directly borrowing the model weights from LeNet, while fine-tuning adapts the network to a very limited number of training instances. In addition, we double the dataset via horizontal reflection. Well-known schemes such as the sliding window and non-maximum suppression [19] are also integrated into the proposed MLeNet-based detector. Quantitatively, experimental comparisons to a set of state-of-the-art (e.g., LeNet [18], RFD [4]) and classic (e.g., Haar-like features with AdaBoost [2]) detectors demonstrate that the proposed model achieves superior performance in detecting masked faces.

Figure 1: The model of our proposed modified convolutional neural network (MLeNet). MLeNet has two units in the output layer instead of ten, and further increases the number of feature maps with a smaller filter size in all convolutional layers.

The rest of this paper is organized as follows: Section 2 describes the proposed model (MLeNet); Section 3 presents the masked face detection pipeline; Section 4 presents the detailed quantitative evaluation with comparisons to a set of state-of-the-art methods; Section 5 concludes this paper and discusses future work.

2. The Proposed Method

In this section, we introduce MLeNet for detecting the faces of possible terrorists. First, we describe the structure and weight learning of MLeNet, and how it differs from LeNet. Second, we combine pre-training and fine-tuning with data augmentation to further improve the performance of MLeNet given a very limited number of training samples. Finally, masked faces are detected by combining a sliding window with non-maximum suppression.

2.1. MLeNet

LeNet [18] comprises five layers, combining convolution and sub-sampling operations. Considering the binary classification of masked faces, we modify the filter size of the convolutional layers, the number of feature maps, and the number of nodes of the fully-connected layers in LeNet. Compared to LeNet, the filter size of MLeNet in every convolutional layer is reduced to 3 × 3, while the number of feature maps increases layer by layer. In particular, the number of nodes in the first fully-connected layer (FC4) increases from 84 to 500, and the number of nodes in the output layer is reduced from 10 to 2, which suits our binary classification task. The details of each layer of MLeNet and LeNet are shown in Table 1, and the resulting MLeNet model is depicted in Figure 1.

We train MLeNet using a set of $N$ labeled images $\{X_i, y_i\}_{i=1}^{N}$, where the label $y_i$ is a discrete variable indicating the class category (in our case either 0 or 1). The softmax loss function measures the loss between the true label $y_i$ and the predicted label $\hat{y}_i$ based on a probability distribution over the 2 classes. It is defined as

$$J(W, b) \triangleq -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{2} \ell\{y_i = j\}\log \hat{y}_i, \qquad (1)$$

where $\ell(x = i)$ is the indicator function

$$\ell(x = i) = \begin{cases} 1, & x = i \\ 0, & \text{otherwise,} \end{cases}$$

and $W$ and $b$ are the sets of filters and biases of all layers, respectively. The network parameters (i.e., the filters of the convolutional layers, the weight matrices of the fully-connected layers, and all biases) are learned by Stochastic Gradient Descent (SGD), back-propagating (BP) the derivative of the loss with respect to the parameters throughout the whole network.

At the subsampling layer S1, the output $z^{(1)}$ is calculated as

$$z^{(1)} = f(W^{(1)} * x^{(1)} + b^{(1)}), \qquad (2)$$

where $x^{(1)}$ is the original image, $W^{(1)}$ and $b^{(1)}$ are the bank of multi-dimensional filters and the biases, respectively, and $f$ and $*$ are the max-pooling operator and the convolutional operator, respectively.
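For concreteness, the loss of Eq. (1) can be computed as follows. This is a minimal sketch in PyTorch; the framework is an assumption, since the authors used MatConvNet [24], and in practice a library loss such as torch.nn.CrossEntropyLoss folds the softmax and the log-loss together.

```python
import torch

def softmax_loss(z, y):
    """Eq. (1): z is an (N, 2) tensor of network outputs, y an (N,) tensor of labels in {0, 1}."""
    y_hat = torch.softmax(z, dim=1)             # predicted class probabilities
    picked = y_hat[torch.arange(len(y)), y]     # probability assigned to the true class
    return -torch.log(picked).mean()            # average negative log-likelihood
```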

Table 1: The MLeNet and LeNet models. Each model contains three convolutional layers and two fully-connected layers. The details of each convolutional layer are given in three sub-rows: the first specifies the number of feature maps and their filter size as "num × size × size"; the second indicates the convolution stride ("st.") and spatial padding ("pad"); the third indicates the max-pooling downsampling factor. FC4 and FC5 are fully-connected layers; FC4 has 84 nodes in LeNet and 500 in MLeNet, while FC5 acts as a softmax classifier.

Model    conv1          conv2          conv3           FC4         FC5
LeNet    6 × 5 × 5      16 × 5 × 5     120 × 5 × 5     84 nodes    10, softmax
         st. 1, pad 0   st. 1, pad 0   st. 1, pad 0
         ×2 pool        ×2 pool
MLeNet   20 × 3 × 3     50 × 3 × 3     500 × 3 × 3     500 nodes   2, softmax
         st. 1, pad 0   st. 1, pad 0   st. 1, pad 0
         ×2 pool        ×2 pool

At the subsampling layer S2, the input is the output map of the previous subsampling layer. Analogously, the output $z^{(2)}$ of the subsampling layer S2 is computed as

$$z^{(2)} = f(W^{(2)} * z^{(1)} + b^{(2)}). \qquad (3)$$

Recursively, we can compute the outputs of all remaining layers as

$$z^{(3)} = \mathrm{ReLU}(W^{(3)} * z^{(2)} + b^{(3)}) = \max(W^{(3)} * z^{(2)} + b^{(3)}, 0), \qquad (4)$$

$$z^{(4)} = W^{(4)} * z^{(3)} + b^{(4)}, \qquad (5)$$

$$\hat{y}_i = \mathrm{softmax}(z_i^{(4)}) = \frac{e^{z_i^{(4)}}}{\sum_{j=1}^{2} e^{z_j^{(4)}}}. \qquad (6)$$

We drop the subscript $i$ for simplicity when describing Eq. (2) to Eq. (5).
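To make the architecture concrete, below is a minimal sketch of MLeNet following Table 1 and Eqs. (2)-(6). PyTorch is an assumption (the authors used MatConvNet [24]); the spatial sizes in the comments are derived for the 28 × 28 inputs used in Section 4, and the activation between FC4 and FC5, which the paper does not specify, is omitted.

```python
import torch
import torch.nn as nn

class MLeNet(nn.Module):
    """Sketch of MLeNet per Table 1; sizes in comments assume 28x28 grayscale inputs."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=3),    # conv1: 20 x 3 x 3, st. 1, pad 0 -> 26x26
            nn.MaxPool2d(2),                    # S1: x2 max pooling, f in Eq. (2) -> 13x13
            nn.Conv2d(20, 50, kernel_size=3),   # conv2: 50 x 3 x 3 -> 11x11
            nn.MaxPool2d(2),                    # S2: x2 max pooling, Eq. (3) -> 5x5
            nn.Conv2d(50, 500, kernel_size=3),  # conv3: 500 x 3 x 3 -> 3x3
            nn.ReLU(),                          # Eq. (4)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(500 * 3 * 3, 500),        # FC4: 500 nodes
            nn.Linear(500, num_classes),        # FC5: z^(4), Eq. (5)
        )

    def forward(self, x):
        # The softmax of Eq. (6) is folded into the training loss, Eq. (1).
        return self.classifier(self.features(x))
```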

The details of learning the parameters are given in Section 4.

2.2. Pre-training and Fine-tuning with Data Augmentation

In this section, we introduce a strategy to further improve the performance of MLeNet with a very limited number of training instances. We first apply data augmentation to increase the number of samples, and then use pre-training and fine-tuning to reduce overfitting and the error rate. Krizhevsky et al. [13] introduced a form of data augmentation consisting of image translations and horizontal reflections; considering the symmetric structure of masked faces, we use only horizontal reflections, which doubles the dataset.
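As a concrete illustration, the horizontal-reflection augmentation can be implemented as below (a minimal PyTorch sketch; the framework and tensor layout are assumptions):

```python
import torch

def augment_with_flips(images, labels):
    """Double the training set by horizontal reflection (Sec. 2.2).

    images: (N, 1, 28, 28) tensor; labels: (N,) tensor of {0, 1} labels.
    """
    flipped = torch.flip(images, dims=[3])  # mirror along the width axis
    return torch.cat([images, flipped]), torch.cat([labels, labels])
```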

We learn MLeNet via SGD and combine pre-training and fine-tuning to further improve the accuracy of the classifier. First, we train LeNet on the MNIST database [18] and initialize the parameters of MLeNet with the parameters of the learned LeNet. Second, we fine-tune the parameters of MLeNet via stochastic gradient descent.
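The paper does not detail how LeNet parameters initialize MLeNet's layers, whose filter shapes differ (5 × 5 vs. 3 × 3); one plausible reading, sketched below under that assumption, is to copy pre-trained tensors wherever shapes agree, leave the rest at their random initialization, and then fine-tune everything with SGD. The checkpoint path is hypothetical.

```python
import torch

def init_from_pretrained(model, pretrained_state):
    """Borrow pre-trained parameters wherever tensor shapes agree (an assumption;
    the paper does not specify how differently-shaped LeNet filters are reused)."""
    own = model.state_dict()
    for name, tensor in pretrained_state.items():
        if name in own and own[name].shape == tensor.shape:
            own[name] = tensor.clone()
    model.load_state_dict(own)

mlenet = MLeNet()  # as sketched in Section 2.1
init_from_pretrained(mlenet, torch.load("lenet_mnist.pth"))  # hypothetical LeNet/MNIST checkpoint
optimizer = torch.optim.SGD(mlenet.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)  # fine-tuning, see Sec. 4.1
```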

3. Detecting Masked Faces

Based on the MLeNet described in Section 2.2, we build a detector that classifies whether a given fixed-size window contains a masked face. But how are such candidate windows generated? Recently, two frameworks have been used to generate these candidates. One follows R-CNN [20], in which selective search region proposals [21] are generated; the other follows DPM [22], in which the candidates of masked faces are generated by a sliding window. The first framework (R-CNN) can accelerate object detection, but the detector may fail if the generated proposals do not contain masked faces. The other framework (DPM) generates candidates by a sliding window, which is simple and all-inclusive, especially for our low-resolution training dataset. However, with a sliding window alone, neither the multi-scale problem nor the detection-overlap problem is directly handled. Following a classic setting, we resort to a pyramid matching scheme with non-maximum suppression based post-processing to handle these problems. In brief, to conduct pyramid matching, we sample the target image at multiple scales and positions, each of which is sent to MLeNet for masked face detection.
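A minimal sketch of multi-scale sliding-window candidate generation follows (Python with PyTorch for resizing; the scale factors, window size, and stride are illustrative assumptions, not values specified by the paper):

```python
import torch
import torch.nn.functional as F

def sliding_windows(image, scales=(1.0, 0.75, 0.5), win=28, stride=4):
    """Yield (box, 28x28 crop) candidates over an image pyramid.

    image: (1, 1, H, W) tensor; boxes are mapped back to original coordinates.
    """
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        _, _, h, w = scaled.shape
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                crop = scaled[:, :, y:y + win, x:x + win]
                box = (x / s, y / s, (x + win) / s, (y + win) / s)
                yield box, crop
```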

For a given window, MLeNet generates the softmax output as the detection score. We then apply non-maximum suppression, which rejects a sub-window if its intersection-over-union overlap [23] with a higher-scoring window exceeds a given threshold.
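The post-processing step can be sketched as follows (plain Python; the box format and the 0.5 threshold are illustrative assumptions, and the overlap is the intersection-over-union criterion [23] of Eq. (7) below):

```python
def iou(a, b):
    """Intersection-over-union of boxes (x1, y1, x2, y2), as in Eq. (7)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep the highest-scoring windows and reject any window whose
    IoU with an already-kept, higher-scoring window exceeds the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]
```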

4. Experiments

We verify the proposed work on the masked man dataset cropped from violence videos. The dataset consists of 1140 images, including 240 positive and 900 negative ones. We randomly select 150 positive and 750 negative samples as the training set, 50 positive and 50 negative samples as the validation set, and use the remaining 140 images as the test set. To reduce overfitting and the detector's error rate, we double the number of training instances via horizontal reflection, as described in Section 2.2. To be compatible with the input of MLeNet, we crop the original images and resize them to 28 × 28. Figure 2 illustrates some positive and negative training samples. In the detection step, we use the original test images to evaluate the performance of the proposed model.

Figure 2: Partial training images for MLeNet. Top row: positive images; bottom row: negative images.

4.1. Training the MLeNet with pre-training

We employ stochastic gradient descent (batch size 20, momentum 0.9, weight decay 0.0005, learning rate 0.001) to train our model for 100 epochs. The update rules for the weights $W$ and biases $b$ are

$$V_{i+1} = 0.9\,V_i - 0.0005 \cdot 0.001 \cdot W_i - 0.001 \cdot \Big\langle \frac{\partial J}{\partial W}\Big|_{W_i} \Big\rangle_{D_i},$$
$$u_{i+1} = 0.9\,u_i - 0.0005 \cdot 0.001 \cdot b_i - 0.001 \cdot \Big\langle \frac{\partial J}{\partial b}\Big|_{b_i} \Big\rangle_{D_i},$$
$$W_{i+1} = W_i + V_{i+1}, \qquad b_{i+1} = b_i + u_{i+1},$$

where $i$ is the iteration index, $V$ and $u$ are the momentum variables, and $\langle \frac{\partial J}{\partial W}|_{W_i} \rangle_{D_i}$ is the derivative of the objective over the $i$-th batch $D_i$ with respect to $W$, evaluated at $W_i$ (and analogously for $b$).

We initialize the weights by directly borrowing the model from LeNet, and fine-tune the parameters of MLeNet via stochastic gradient descent with the above parameter settings (batch size, momentum, weight decay, etc.) for 100 epochs. Without using a GPU, training takes about 10 minutes on an ordinary laptop with 6GB memory and a 1.90GHz AMD A8-4500M APU. We use the MatConvNet [24] toolbox to train MLeNet.
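For clarity, the update above is standard SGD with momentum and weight decay, and can be transcribed directly (a sketch on PyTorch tensors; the authors' actual implementation uses MatConvNet [24]):

```python
import torch

@torch.no_grad()
def sgd_step(W, V, grad, lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One update: V <- 0.9*V - (wd*lr)*W - lr*<dJ/dW>_Di ; W <- W + V."""
    V.mul_(momentum).add_(W, alpha=-weight_decay * lr).add_(grad, alpha=-lr)
    W.add_(V)
```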

4.2. Results and Analysis

To evaluate the proposed model, we compare it with two baselines: LeNet and MLeNet without pre-training. Changing the number of units in the output layer of LeNet from 10 to 2 makes it suit binary classification, yielding the first baseline. In addition, we use MLeNet without pre-training, trained directly via stochastic gradient descent, as the other baseline. The classification results on the masked man dataset are shown in Figure 3. It can be seen that MLeNet without pre-training makes the network more stable and achieves a lower error rate than LeNet, because reducing the filter size and increasing the number of feature maps lets the network learn more discriminative features without increasing the model capacity. Moreover, pre-training MLeNet further reduces the error rate compared with MLeNet without pre-training, because pre-training provides the model with better initial weights.

Figure 3: The results of LeNet, MLeNet without pre-training, and MLeNet with pre-training. Left column: objective (loss) on the training and validation sets; right column: error rate on positive and negative images. (a) Result of LeNet; (b) result of MLeNet without pre-training and fine-tuning; (c) result of MLeNet with pre-training and fine-tuning.

Table 2: The results of detecting masked faces on the masked man dataset. "Ours" means MLeNet with pre-training; "MLeNet" means MLeNet without pre-training.

Model         Recall  Precision  F1-score
Ours          0.925   0.71       0.803
AdaBoost [2]  0.75    0.6        0.667
RFD [4]       0.87    0.71       0.782
LeNet [18]    0.82    0.64       0.719
MLeNet        0.85    0.68       0.756

We employ the detection criterion of PASCAL VOC [23] to evaluate the predicted bounding boxes:

$$\text{score} = \frac{\text{area}(R_T \cap R_G)}{\text{area}(R_T \cup R_G)}, \qquad (7)$$

where $R_T$ is the predicted bounding box and $R_G$ is the ground-truth bounding box. If score > 0.5, the predicted bounding box is labeled as correct.

Compared with the AdaBoost [2] approach based on Haar-like features, RFD [4], and LeNet [18], we achieve the best performance of 92.5% recall and 71% precision on the masked man dataset, while the AdaBoost approach only obtains 75% recall and 60% precision; RFD [4] obtains 87% recall and 71% precision; LeNet obtains 82% recall and 64% precision; and MLeNet without pre-training obtains 85% recall and 68% precision. The experimental results demonstrate that Haar-like features cannot describe the facial structure of a masked man well, and that the AdaBoost detector needs a sufficient amount of training samples. RFD [4] requires many exemplar face images under different viewpoints, poses, expressions, and lighting conditions; on our masked man dataset it is difficult to collect many exemplar images, so RFD may fail. Moreover, compared with LeNet, MLeNet without pre-training is more robust for detecting masked faces, thanks to its increased number of convolution filters and reduced receptive filter size. Our detector, combining pre-training and fine-tuning, shows the highest robustness and accuracy compared with the other baselines: pre-training via directly borrowing from LeNet provides our model with good initial weights, and fine-tuning adapts the model to the specific dataset. The detailed results are summarized in Table 2, and partial test results of masked face detection are shown in Figure 4.

Figure 4: Partial results of face detection of masked terrorists. (In order to protect privacy, the face region of each masked man has been processed with a mosaic.)

4.3. Efficiency

Running time: AdaBoost [2] achieves the fastest detection speed, about 40 fps, because the integral image is used to rapidly compute the Haar-like rectangular features. Our method also obtains a preferable running speed of about 35 fps, which attains real time, while RFD [4] runs at nearly 10 fps.

Memory usage: Given the same test image, our detector requires around 3MB of memory, the AdaBoost [2] detector requires around 10MB, and the RFD [4] detector requires 87MB. Obviously, our detector is more practical for real-world applications.

5. Conclusion

In this paper, we propose a novel model for localizing masked faces in images. Our model is based on a convolutional neural network (MLeNet) and a sliding window, and achieves satisfactory performance on detecting masked faces. In addition, to further reduce overfitting and improve performance with a small quantity of training samples, we first enlarge the training set by horizontal reflection and then learn MLeNet by combining pre-training and fine-tuning. Quantitative evaluations against several state-of-the-art and alternative solutions demonstrate the accuracy and robustness of the proposed model.

In the future, we will design a novel network architecture to further improve the detection performance. In addition, we will explore the inherent redundancy in neural networks to achieve drastic reductions in model size, so that deeper networks can see wider application with a small amount of training samples.

References

[1] P. N. Belhumeur, J. P. Hespanha, D. J. Kriegman, Eigenfaces vs. fisherfaces: Recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 711–720.
[2] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, IEEE, 2001, pp. I–511.
[3] C. Liu, H. Wechsler, Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition, IEEE Transactions on Image Processing 11 (4) (2002) 467–476.
[4] X. Shen, Z. Lin, J. Brandt, Y. Wu, Detecting and aligning faces by image retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3460–3467.
[5] H. Li, Z. Lin, J. Brandt, X. Shen, G. Hua, Efficient boosted exemplar-based face detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1843–1850.
[6] Y. Sun, X. Wang, X. Tang, Deep learning face representation from predicting 10,000 classes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2014, pp. 1891–1898.
[7] Y. Sun, X. Wang, X. Tang, Deeply learned face representations are sparse, selective, and robust, arXiv preprint arXiv:1412.1265.
[8] Y. Sun, X. Wang, X. Tang, Hybrid deep learning for face verification, in: Proceedings of the IEEE International Conference on Computer Vision, IEEE, 2013, pp. 1489–1496.
[9] V. Jain, E. G. Learned-Miller, FDDB: A benchmark for face detection in unconstrained settings, UMass Amherst Technical Report.
[10] H. Li, Z. Lin, X. Shen, J. Brandt, G. Hua, A convolutional neural network cascade for face detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5325–5334.
[11] Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8) (2013) 1798–1828.
[12] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Proceedings of the European Conference on Computer Vision, Springer, 2014, pp. 818–833.
[13] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[15] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, arXiv preprint arXiv:1512.03385.
[16] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
[17] G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[18] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[19] A. Neubeck, L. Van Gool, Efficient non-maximum suppression, in: Proceedings of the International Conference on Pattern Recognition, Vol. 3, IEEE, 2006, pp. 850–855.
[20] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[21] J. R. Uijlings, K. E. van de Sande, T. Gevers, A. W. Smeulders, Selective search for object recognition, International Journal of Computer Vision 104 (2) (2013) 154–171.
[22] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9) (2010) 1627–1645.
[23] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, International Journal of Computer Vision 88 (2) (2010) 303–338.
[24] A. Vedaldi, K. Lenc, MatConvNet – convolutional neural networks for MATLAB, CoRR abs/1412.4564.