Learning Second-order Statistics for Place Recognition based on Robust Covariance Estimation of CNN Features

Weiqi Zhang^a, Zifei Yan^a, Qilong Wang^b, Xiaohe Wu^a, Wangmeng Zuo^a

^a Harbin Institute of Technology, Harbin, China
^b College of Intelligence and Computing, Tianjin University, Tianjin, China
Abstract

Appearance-based loop closure detection plays an important role in visual simultaneous localization and mapping (vSLAM) systems by measuring the similarity of places and checking loops to reduce the accumulated error. Traditional loop closure methods perform place recognition by image retrieval with the Bag-of-Words model, which forms an orderless representation of local feature descriptors. Convolutional neural network (CNN) based features have been investigated for place recognition, where the final descriptors are usually generated by first-order pooling, limiting the representation ability in challenging scenarios. To handle the above issue, we introduce high-order statistics into place recognition by developing a novel adaptively normalized covariance pooling method for learning place representations in an end-to-end manner. The proposed method provides robust covariance matrix estimation of high-dimensional, small-sample-size deep features by adaptive covariance normalization (AdaCN). Experimental results on place recognition in urban environments and on image retrieval tasks show that the second-order representation is effective, especially for discriminating places with confusing objects and changes in viewpoint and illumination. Besides, the proposed adaptive normalization performs favorably against its counterparts based on the Log-Euclidean Riemannian metric and the Power-Euclidean metric, while our method is superior to state-of-the-art place recognition approaches.

Keywords: Place recognition; Covariance estimation; Convolutional neural network; Second-order statistics
1. Introduction

Visual place recognition has gained significant popularity in the past years. It can be applied in simultaneous localization and mapping (SLAM) systems [1–5] to supply candidates for matching a previously visited location (namely loop closure), reducing accumulated errors and pose drift. It is usually cast as an image retrieval task at city scale, where image representations are computed from global [6, 7] or local features [8–10] for matching. Previous place recognition methods are usually built on hand-crafted features, either running as a pure image retrieval process [11] with the Bag-of-Words (BoW) model, where a tree structure is used to reduce the memory requirement, or formulated as an optimization problem [12, 13] to explore similarity learning with feature fusion.

Benefiting from the great advances of feature representation learning with convolutional neural networks (CNNs) in a wide range of vision tasks such as image retrieval [14, 15], object detection [16, 17] and scene recognition [18, 19], which show large improvements in discrimination while concentrating on the regions of interest, CNN-based descriptors have been investigated to overcome the bottleneck of hand-crafted representations [20–22] in the place recognition task. In [23], an end-to-end trainable CNN architecture is proposed to extract a proper representation for place recognition by integrating a generalized vector of locally aggregated descriptors (VLAD) layer, which captures the first-order statistics of the deep features. Kim et al. suggest a contextual reweighting network (CRN) [24] based on NetVLAD to predict the importance of each region in the feature map, but this method yields little improvement in standard urban environments with confusing objects, possibly caused by the lack of variability in the training images, as shown in Fig. 1. Different from learning better representations using deeper and wider architectures, our work focuses on extracting powerful high-order representations of images for place recognition using CNNs.

Second-order statistics in deep architectures have been exploited for visual recognition in both small-scale and large-scale scenarios; they capture richer statistical information and thus improve the discriminative and generalization abilities of deep CNNs. In [25], a method motivated by the Fisher vector (FV) [26] is developed for the image classification task, capturing both first- and second-order information. DeepO2P [27] first derives and instantiates the methodology to learn a structured layer for second-order pooling, based on theorems on variations of singular
Figure 1: Examples from Google Street View Pittsburgh taken at nearby locations, without much variability.
value decomposition (SVD) and eigenvalue decomposition (EIG). G2DeNet [28] plugs a Gaussian distribution into deep CNNs by developing a trainable Gaussian embedding layer, which shows the effectiveness of incorporating a probability distribution (i.e., a Gaussian with first- and second-order statistics) into deep CNNs. MPN-COV [29] presents a matrix power normalized covariance method, which shows large improvements on large-scale visual recognition by considering robust covariance estimation and the geometry of covariances.

Although second-order statistics with proper matrix normalizations have achieved considerable advances in image classification, they have not been fully explored in place recognition. Therefore, we make an attempt to explore second-order statistics for effective place representation. Besides, existing matrix normalizations are all based on pre-defined non-linear mappings that are fixed for all covariance representations. In this paper, we develop a parametric covariance pooling method with Adaptive Covariance Normalization (AdaCN) in deep CNN architectures. Compared with hand-crafted normalizations, our AdaCN shows better adaptability for place recognition, where the same place usually suffers from changes of viewpoint, illumination and seasonal variation, as shown in Fig. 2. We evaluate the proposed network on both street view datasets and image retrieval datasets, and the results show that our method achieves favorable performance against state-of-the-art approaches on the place recognition task under challenging scenarios. Moreover, the learned representation obtains better performance on the image retrieval task without further training, showing the generalization ability of our method.

The contributions of this paper are three-fold:

• We make an attempt to explore a second-order statistics embedding method based on robust covariance estimation on high-dimensional
Figure 2: Google Street View Time Machine examples taken at nearby locations, including changes in viewpoint, occlusion and illumination.
and small-size deep features to exploit the representation for the place recognition task, which can be integrated into any CNN architecture in an end-to-end manner.

• We develop an Adaptive Covariance Normalization (AdaCN) to adaptively learn normalizations for different covariance representations, which shows superiority over existing normalizations based on the Log-Euclidean (Log-E) Riemannian metric and the Power-Euclidean (Pow-E) metric for place recognition, where images usually suffer from changes of viewpoint, illumination and seasonal variation.

• Extensive experiments on place recognition and image retrieval show that our adaptively normalized covariance pooling network is able to accurately locate places with significant changes in viewpoint and illumination, while performing favorably against its counterparts and state-of-the-art place recognition approaches.

The paper is organized as follows. In Section 2, we briefly review traditional place recognition methods with hand-crafted features used for localization, and introduce deep learning based place representations and deep architectures for extracting second-order statistics. Section 3 presents the proposed AdaCN for second-order pooling in deep CNN architectures. In Section 4, we evaluate the proposed method on street view datasets and image retrieval datasets, comparing with state-of-the-art methods. Finally, Section 5 concludes the paper and discusses our method based on the experimental results.
2. Related Work

Visual place recognition and appearance-based loop detection in SLAM are often cast as image retrieval, in which descriptors play an important role in characterizing the similarity between frames.

2.1. Visual Place Recognition for Localization

Visual place descriptions used for localization can be roughly divided into two categories: local features such as SIFT [30] and SURF [8], and global features such as GIST [6] and BRIEF-GIST [31]. Local features require a detection phase and focus on the interesting parts of the image, and are invariant to rotation and scale. To exploit the characteristics of local features and ensure search efficiency, the BoW model [32] is often used to quantize the descriptor space. Compared with local features, global features are pose dependent but robust to changing conditions. The BoW model with local descriptors (e.g., SURF) is used in FAB-MAP [11], a probabilistic approach for recognizing places that is successfully used in LSD-SLAM [5]. For other SLAM methods, the appearance-based loop closure associates the metric information with image features extracted for visual odometry, e.g., the CenSurE feature [33] in FrameSLAM [34] and the ORB feature [9] in ORB-SLAM [4]. Considering the merits and limitations of global and local descriptors, raw images and different descriptors (e.g., GIST and HOG [7]) are combined as the place description in [12]: instead of searching and matching, the most similar pair of images is found by solving an ℓ1-norm based sparse optimization problem. Deep features extracted using Faster R-CNN [17] are used in [13] together with the LDB [10], GIST and ORB features, where a robust multimodal sequence-based method (ROMS) is proposed to optimize an ℓ2,1-norm problem with structured sparsity. The improvement made by ROMS verifies that deep features can be useful for the place recognition task.

2.2. Deep Learning based Visual Place Recognition

CNNs have been utilized to extract deep features as image representations for image retrieval [14, 15], object detection [16, 17] and scene recognition [18, 19, 35]. Motivated by these methods, Sünderhauf et al. [20] have shown better performance on place recognition by using representations generated by deep CNNs trained on object recognition datasets without fine-tuning. Chen et al. [22] train two CNN architectures and adopt a multi-scale
feature encoder to learn condition-invariant and viewpoint-invariant features. On the other hand, several studies exploit representations for place recognition by training with a triplet ranking loss on street view datasets with GPS tags [23, 24, 36–38]. Arandjelović et al. propose NetVLAD [23] to train a network directly for place recognition by inserting a trainable generalized VLAD [39] layer, where the features of the last convolutional layer are viewed as dense descriptors and the generalized VLAD layer pools the descriptors into a fixed representation. A weakly supervised triplet ranking loss is applied for training on the Google Street View dataset, and the results show large improvements over off-the-shelf CNNs. CRN [24] integrates context-aware feature reweighting to focus on discriminative regions in large-scale image geo-localization, extending NetVLAD by generating a contextual reweighting mask using convolution layers. Similarly, the attention block following a spatial pyramid pooling in APANet [38] weights the regional features according to their distinctiveness, which compresses the representation generated by NetVLAD.

2.3. Second-order Statistics in Deep Visual Recognition

High-order pooling has demonstrated effectiveness in improving the performance of image classification [25, 27–29, 40]. B-CNN [40] aggregates the outer products of deep features with sum-pooling when the two CNN-based extractors are the same, resulting in second-order non-central statistics. In contrast to B-CNN, DeepO2P [27] presents the second-order statistics of CNN descriptors by leveraging the covariance matrix with a tangent space mapping on the basis of the Riemannian manifold, and establishes the matrix backpropagation for integrating structured matrix layers in an end-to-end manner. Wang et al. [28] introduce a Gaussian embedding strategy, making the first attempt to integrate a parametric probability distribution into deep CNNs; it is decomposed into two sub-layers, i.e., one decouples the covariance matrix and the mean vector while the other computes the square root of the symmetric positive definite (SPD) matrix. It outperforms deep second-order networks (e.g., DeepO2P and B-CNN), demonstrating the effectiveness of plugging a parametric probability distribution into CNNs by capturing first- and second-order statistics. Unlike DeepO2P, which adopts the Log-E metric for exploiting the geometry of the space of covariance matrices, Li et al. [29] propose a matrix power normalized covariance (MPN-COV) method that considers both robust covariance estimation and the Riemannian geometry of covariances, showing promising performance on large-scale classification tasks.
Figure 3: Illustration of our proposed adaptively normalized covariance pooling method. The core of our method is to introduce an Adaptively Normalized Covariance Pooling method to effectively capture the second-order statistics of convolution activations as global place representation. Specifically, we embed the covariance pooling after the last convolution layer, and perform an adaptive normalization on the eigenvalues of estimated covariance matrix.
3. Proposed Method

In this section, we introduce our adaptively normalized covariance pooling method for better exploiting second-order statistics in place recognition. We first present the difference between our covariance pooling and other pooling methods, then introduce our method, which achieves robust covariance estimation with an adaptive matrix normalization. Finally, we show how to integrate our method with deep CNNs.

3.1. Efficient Pooling with High-order Statistics

When describing a place, BoW encodes the zero-order statistics of the distribution of descriptors, which discards much information. As a descriptor pooling method that forms an orderless representation, VLAD [41] accumulates the residual of each descriptor with respect to its assigned cluster, given a codebook C = {c_1, ..., c_K} obtained with k-means:

v_{i,j} = \sum_{x:\,NN(x)=c_i} (x_j - c_{i,j}),    (1)
where NN(x) = c_i denotes the set of descriptors assigned to the nearest visual word c_i.
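For concreteness, a minimal NumPy sketch of the hard-assignment VLAD aggregation in Eqn. (1) is given below (our own illustration with toy sizes; NetVLAD replaces the hard assignment with a trainable soft assignment):

```python
import numpy as np

def vlad_encode(descs, codebook):
    """descs: (N, D) local descriptors; codebook: (K, D) k-means centroids.
       Returns the K x D matrix of accumulated residuals v_{i,j} of Eqn. (1)."""
    K, D = codebook.shape
    assign = np.argmin(((descs[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    V = np.zeros((K, D))
    for k in range(K):
        members = descs[assign == k]
        if len(members):
            V[k] = (members - codebook[k]).sum(axis=0)   # residual sum for word c_k
    return V

# Toy usage: 500 descriptors of dimension 128, a 16-word codebook.
V = vlad_encode(np.random.randn(500, 128), np.random.randn(16, 128))
print(V.shape)   # (16, 128), flattened to a 2048-D representation in practice
```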
VLAD considers only first-order statistics, though such a trainable layer can be easily inserted into a deep architecture. The FV [26] aggregates local descriptors with a universal generative Gaussian mixture model (GMM), including first- and second-order statistics with respect to the variance:

\mathcal{G}_{\mu_k} = \frac{1}{\sqrt{w_k}} \sum_{t=1}^{T} \gamma_t(k)\, \frac{x_t - \mu_k}{\sigma_k},    (2)

\mathcal{G}_{\sigma_k} = \frac{1}{\sqrt{w_k}} \sum_{t=1}^{T} \gamma_t(k)\, \frac{1}{\sqrt{2}}\left(\frac{(x_t - \mu_k)^2}{\sigma_k^2} - 1\right),    (3)
where w_k, μ_k and σ_k denote the weight, mean vector and covariance matrix of the k-th GMM component, respectively. Here, the covariance matrix is restricted to be a diagonal matrix to decrease the number of parameters in the deep network [25].

Following the pipeline of generating efficient descriptors in image retrieval, we express images of places by covariance representations, inserting a global parametric second-order pooling layer after the last convolution layer of deep CNNs without any coding preprocessing. It is well known that second-order pooling (O2P) [42] is achieved by second-order analogues of average and max pooling over regions together with non-linearities, operating over each spatial position of the symmetric matrices. In contrast to the above method, our pooling is performed by a covariance matrix estimator on the descriptors of the last convolutional activations, taking the correlation between different dimensions (i.e., channels) into account and learning global statistics as the representation. Besides, normalization plays an important role in global covariance pooling, as illustrated in MPN-COV [29]. We introduce an adaptive covariance normalization to learn different normalization functions for different covariance representations. Note that our AdaCN can also be regarded as robust covariance estimation with an approximate use of the Riemannian geometry of covariances.

3.2. Robust Covariance Estimation

Given a set of N D-dimensional features X = [x_1, ..., x_N], its Gaussian distribution can be expressed as

p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right),    (4)
where the mean vector and covariance matrix can be estimated by the classical MLE:

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i,    (5)

\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T.    (6)
However, high-dimensional features with a small number of samples cannot be well handled by the classical MLE, especially for deep features. Besides, the sample covariance matrix computed by Eqn. (6) is typically not well-conditioned, for which robust estimators of the covariance matrix have been developed [43–45]. Among them, the shrinkage principle for eigenvalues is first given by Stein [43]; then Ledoit and Wolf [44] propose a simple and efficient formula, i.e.,

\min \|\hat{\Sigma} - \Sigma\|^2, \quad \text{s.t.}\ \hat{\Sigma} = \rho_1 \mathbf{I} + \rho_2 \mathbf{S}.    (7)

The estimator is a convex linear combination of the sample covariance matrix S with the identity matrix I. Recently, Wang et al. [46] impose a constraint that encourages the estimated covariance towards the structure of the identity matrix through the von Neumann (vN) divergence regularizer:

\min_{\hat{\Sigma}} \log|\hat{\Sigma}| + \mathrm{tr}(\hat{\Sigma}^{-1}\mathbf{S}) + \beta D_{vN}(\mathbf{I}, \hat{\Sigma}),    (8)

where tr(·) indicates the trace of a matrix. Note that the EIG of the sample covariance matrix is S = U diag(δ_k) U^T, where diag(δ_k) is the diagonal matrix of the eigenvalues δ_k, k = 1, ..., D, in decreasing order. The optimal solution of Eqn. (8) can be computed as

\hat{\Sigma} = \mathbf{U}\,\mathrm{diag}(\lambda_k)\,\mathbf{U}^T,    (9)

\lambda_k = \sqrt{\left(\frac{1-\beta}{2\beta}\right)^2 + \frac{\delta_k}{\beta}} - \frac{1-\beta}{2\beta},    (10)

where 0 < β < 1. Generally, the robust estimation is given by a transformation of the empirical sample covariance matrix, which reduces the ratio between the smallest and the largest eigenvalues. The shrinkage is done by shifting every eigenvalue by a given offset [44], which is determined by the learned parameter or the selected normalization function, e.g., the Log-E metric and the Pow-E metric.
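As a rough illustration of the eigenvalue shrinkage in Eqns. (9)–(10), the following NumPy sketch (our own code, not the authors' MatConvNet implementation) estimates a shrunken covariance from a small set of high-dimensional features:

```python
import numpy as np

def vn_shrinkage_covariance(X, beta=0.5):
    """Robust covariance estimate via eigenvalue shrinkage, Eqns. (9)-(10).

    X    : (N, D) array of N D-dimensional features.
    beta : shrinkage strength, 0 < beta < 1.
    """
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / X.shape[0]            # sample covariance, Eqn. (6)
    delta, U = np.linalg.eigh(S)                      # EIG: S = U diag(delta) U^T
    offset = (1.0 - beta) / (2.0 * beta)
    lam = np.sqrt(offset**2 + np.maximum(delta, 0.0) / beta) - offset  # Eqn. (10)
    return U @ np.diag(lam) @ U.T                     # shrunken estimate, Eqn. (9)

# Toy usage: 196 spatial locations of 256-dimensional deep features (N < D),
# for which the plain sample covariance is rank-deficient.
Sigma_hat = vn_shrinkage_covariance(np.random.randn(196, 256), beta=0.5)
print(Sigma_hat.shape)   # (256, 256)
```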
Different from previous works, the shrinkage offset in our method is learned through AdaCN, which is regularized by matrix power normalization due to its effectiveness:

\min_{f_{AdaCN}} L_{AdaCN} = \|f_{AdaCN}(\Lambda) - \Lambda^{\alpha}\|^2,    (11)

where S is the sample covariance matrix with EIG S = UΛU^T and 0 < α < 1. Given f_AdaCN(Λ), the estimated covariance can be obtained as Σ̂ = U f_AdaCN(Λ) U^T. The core of the adaptive normalization is to learn normalization functions on the eigenvalues for different covariance representations, whereas previous works use fixed non-linear mappings for all covariance representations. Besides, the covariance normalization is required to be a differentiable non-linear operator with positive output. To this end, we build the adaptive normalization model on a multi-layer perceptron (MLP), of which one AdaCN layer is given by

f^{l}_{AdaCN}(\Lambda) = f^{l}_{AdaCN}(\mathrm{diag}(\delta)) = \mathrm{diag}(\max(\mathbf{W}\delta + \mathbf{b}, 0)),    (12)

where δ = [δ_1, ..., δ_D] is the vector of eigenvalues, and W and b are the learned weights and bias, respectively. The proposed covariance pooling method is shown in Fig. 3, where the sample covariance computation and the adaptive normalization are integrated into one block. Note that our AdaCN learns shrinkage functions under the constraint of matrix power normalization, so it shares a similar philosophy with MPN-COV. Therefore, our proposed method can be regarded as a robust covariance estimator.

3.3. Deep Architecture for Covariance Pooling

In our architecture, the outputs of the convolutional layers are treated as features X = [x_1, ..., x_N], a set of D-dimensional descriptors at W × H = N spatial locations. Based on CNNs cropped at the last convolutional layer, the first sub-layer in our method computes Y, the sample covariance matrix, which can be explicitly written as a function of X, i.e.,

Y = X \bar{I} X^T,    (13)

where \bar{I} = \frac{1}{N}\left(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\right). Y is a symmetric positive semi-definite matrix and has the eigenvalue decomposition

Y = U \Lambda U^T,    (14)
Figure 4: Triplet examples of Tokyo Time Machine with hard negatives, including (a) query images, (b) positive images, and (c) negative images.
where Λ is a diagonal matrix of eigenvalues in decreasing order, and U is an orthogonal matrix whose columns are the eigenvectors. These eigenvalues are the inputs of the AdaCN layers, and we compute the normalized eigenvalues through three 1 × 1 convolutional layers with 256 channels followed by ReLU, i.e.,

f_{AdaCN}(\Lambda) = \mathrm{diag}(f_{AdaCN}(\lambda_1), \ldots, f_{AdaCN}(\lambda_D)).    (15)

The adaptively normalized covariance matrix is computed by normalizing the eigenvalues:

Z = f_{AdaCN}(Y) = U f_{AdaCN}(\Lambda) U^T.    (16)

As a pooling layer, we concatenate the upper triangular part of the estimated covariance matrix, producing a D(D+1)/2-dimensional vector as the final representation.

To compute the partial derivative of the loss function l with respect to the input matrix of the proposed layers, given ∂l/∂Z propagated from the top layer, and dZ = dU f_AdaCN(Λ) U^T + U df_AdaCN(Λ) U^T + U f_AdaCN(Λ) dU^T according to Eqn. (16), the variation of Z is given by

dZ = 2\left(dU\, f_{AdaCN}(\Lambda)\, U^T\right)_{sym} + U f'_{AdaCN}(\Lambda)\, d\Lambda\, U^T,    (17)
Figure 5: Place recognition recalls on TokyoTM-val, Tokyo 24/7 and Pitts250k-test using representations generated by our network, MPN-COV and NetVLAD, based on (a)(b)(c) the VGG-16 (V) and (d)(e)(f) the AlexNet (A) architectures.
where Q_{sym} = \frac{1}{2}(Q + Q^T). Thus the partial derivative with respect to the normalized eigenvalues is

\frac{\partial l}{\partial f_{AdaCN}(\Lambda)} = U^T \frac{\partial l}{\partial Z} U.    (18)

The backpropagation for the covariance pooling layer is illustrated in [27]; for Y = UΛU^T, the partial derivatives can be obtained by

\frac{\partial l}{\partial Y} = U\left(K^T \circ \left(U^T \frac{\partial l}{\partial U}\right) + \left(\frac{\partial l}{\partial \Lambda}\right)_{diag}\right)U^T,    (19)

where ◦ denotes the matrix Hadamard product and K = [K_{ij}] with K_{ij} = 1/(λ_i − λ_j) if i ≠ j, and K_{ij} = 0 if i = j. Given the chain rule \frac{\partial l}{\partial Z} : dZ = \frac{\partial l}{\partial U} : dU + \frac{\partial l}{\partial \Lambda} : d\Lambda, we have

\frac{\partial l}{\partial U} = 2\left(\frac{\partial l}{\partial Z}\right)_{sym} U f_{AdaCN}(\Lambda),    (20)
Figure 6: Examples of retrieval results of place recognition for (a) queries on Tokyo 24/7 using representation generated by (b) ours, (c) MPN-COV and (d) NetVLAD (based on VGG16 with PCA whitening).
and

\frac{\partial l}{\partial \Lambda} = f'_{AdaCN}(\Lambda)\, U^T \frac{\partial l}{\partial Z} U.    (21)

Finally, the gradient of the loss function with respect to the input matrix X can be computed as

\frac{\partial l}{\partial X} = 2\,\bar{I} X \left(\frac{\partial l}{\partial Y}\right)_{sym}.    (22)
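Putting the forward computation of Section 3.3 (Eqns. (13)–(16)) together, a rough NumPy sketch of the pooling layer is given below (our own illustration, not the authors' MatConvNet code; `adacn_mlp` is a hypothetical stand-in for the three learned 1 × 1 convolutional layers):

```python
import numpy as np

def adacn_pooling(feat_map, adacn_mlp):
    """feat_map : (W, H, D) convolutional activations.
       adacn_mlp: callable mapping a length-D eigenvalue vector to
                  normalized eigenvalues (the learned AdaCN layers)."""
    W, H, D = feat_map.shape
    X = feat_map.reshape(W * H, D).T                   # D x N descriptors, N = W*H
    N = X.shape[1]
    I_bar = (np.eye(N) - np.ones((N, N)) / N) / N      # centering matrix, Eqn. (13)
    Y = X @ I_bar @ X.T                                # sample covariance, D x D
    lam, U = np.linalg.eigh(Y)                         # EIG: Y = U diag(lam) U^T, Eqn. (14)
    lam_n = adacn_mlp(np.maximum(lam, 0.0))            # adaptive normalization, Eqn. (15)
    Z = U @ np.diag(lam_n) @ U.T                       # normalized covariance, Eqn. (16)
    iu = np.triu_indices(D)
    return Z[iu]                                       # upper-triangular part, D(D+1)/2 values

# Toy stand-in for the trained AdaCN layers: a plain square root, i.e.
# matrix power normalization with alpha = 0.5.
sqrt_norm = lambda lam: np.sqrt(lam)
vec = adacn_pooling(np.random.rand(7, 7, 256), sqrt_norm)
print(vec.shape)   # (32896,) = 256*257/2
```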
3.4. Discussion

The comparison results in RAID-G [46] have shown that the F-norm metric outperforms the Log-E metric. MPN-COV and G2DeNet with the power normalization significantly outperform DeepO2P, where the solution of the former has

\lambda_k = \sqrt{\delta_k},    (23)

which is the unique solution to vN-MLE and corresponds to β = 1 in Eqns. (9)–(10). As shown in MPN-COV [29], the Pow-E metric approaches the Log-E metric as α approaches 0, approximately exploiting the Riemannian geometry. In our method, f_AdaCN(δ_k) is regularized by matrix power normalization, i.e., λ_k ≈ δ_k^{α_k}. Moreover, we can briefly prove that for any two covariance matrices Σ_1 and Σ_2, the limit of the AdaCN metric d_AdaCN(Σ_1, Σ_2) = ‖A_1 f_AdaCN(Σ_1) − A_2 f_AdaCN(Σ_2)‖_F equals the Log-E metric as α_k approaches 0, i.e.,

\lim_{\alpha_k \to 0} d_{AdaCN}(\Sigma_1, \Sigma_2) = \|\log(\Sigma_1) - \log(\Sigma_2)\|_F,    (24)

where A = diag(1/α_1, ..., 1/α_D). Note that d_AdaCN(Σ_1, Σ_2) = ‖A_1(f_AdaCN(Σ_1) − I) − A_2(f_AdaCN(Σ_2) − I)‖_F, and based on the EIG of the covariance matrix we have A(f_AdaCN(Σ) − I) ≈ U diag((λ_1^{α_1} − 1)/α_1, ..., (λ_D^{α_D} − 1)/α_D) U^T. The above claim follows easily because for any eigenvalue λ_k, lim_{α_k → 0} (λ_k^{α_k} − 1)/α_k = log(λ_k). Thus our proposed normalization approximately exploits the Riemannian geometry of covariances.
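The limit used in this argument can be checked numerically; a small sketch (our own illustration) compares (λ^α − 1)/α with log λ for shrinking α:

```python
import numpy as np

lam = np.array([0.05, 0.5, 2.0, 40.0])        # example eigenvalues
for alpha in (0.5, 0.1, 0.01, 0.001):
    power_term = (lam**alpha - 1.0) / alpha   # eigenvalues of A (f_AdaCN(Sigma) - I)
    print(alpha, np.max(np.abs(power_term - np.log(lam))))
# The gap shrinks towards 0: the Pow-E metric approaches the Log-E metric.
```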
3.5. Selection of Training Triplets

Given image triplets {I_q, I_{p_i}, I_{n_j}} generated with GPS tags, the triplet ranking loss [47, 48] is given as

L_f = \sum_{j} \max\left(0,\ \min_{i} \|f(I_q) - f(I_{p_i})\|^2 + \delta - \|f(I_q) - f(I_{n_j})\|^2\right),    (25)

where δ is a constant margin parameter. This loss ensures that the representation of a positive image I_{p_i} (the same place) is closer to the query image than the negative images I_{n_j} (different places) by at least the margin. The hinge function in Eqn. (25) avoids correcting triplets that are already correct; instead of ignoring the correct ones, the soft-margin formulation replaces the hinge function with the softplus function, i.e.,

L_f = \sum_{j} \log\left(1 + \exp\left(\min_{i} \|f(I_q) - f(I_{p_i})\|^2 - \|f(I_q) - f(I_{n_j})\|^2\right)\right).    (26)
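A minimal NumPy sketch of the two losses in Eqns. (25)–(26) is given below (our own illustration; f_q, f_pos and f_neg stand for pooled representations of the query, its potential positives and its negatives):

```python
import numpy as np

def triplet_losses(f_q, f_pos, f_neg, margin=0.3):
    """f_q: (D,) query descriptor; f_pos: (P, D) positives; f_neg: (M, D) negatives."""
    d_pos = np.sum((f_pos - f_q) ** 2, axis=1).min()    # closest positive, min over i
    d_neg = np.sum((f_neg - f_q) ** 2, axis=1)          # all negatives, indexed by j
    hinge = np.maximum(0.0, d_pos + margin - d_neg).sum()   # Eqn. (25)
    soft = np.log1p(np.exp(d_pos - d_neg)).sum()            # Eqn. (26)
    return hinge, soft

rng = np.random.default_rng(0)
print(triplet_losses(rng.standard_normal(128),
                     rng.standard_normal((5, 128)),
                     rng.standard_normal((10, 128))))
```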
Figure 7: Examples of retrieval results of place recognition for queries on Pittsburgh using representations generated by NetVLAD, MPN-COV and ours (based on VGG16 with PCA whitening), from top to bottom.
It has similar behavior but decays exponentially instead of having a hard cut-off. As the representations quickly learn to map most triplets correctly, attention is paid to mining the hard triplets. In NetVLAD, the image representations of the entire query and database sets are computed and cached, which provides candidates for the nearest positive image and the negative images but costs large amounts of time and memory. Besides, learning only from the hardest positive and negatives can lead to a bad local minimum, making the network unable to work with normal associations [47]. Instead of generating a pool of 1,000 randomly sampled negatives and recomputing the cached representations every 500 to 1,000 training queries, we compute the hardest negatives during each epoch and only keep a candidate list. For each query image, the positive images are defined within 10 m to 25 m of the given GPS location. We compute the image representations of the positive set during training, and the nearest positive image is selected based on the Euclidean distance of the features. The number of images in the positive set is no less than k_p, and a larger distance is used if no positive image can be chosen. The negative images are chosen from positions that are at least 225 m away from the
given GPS location; during training we randomly select a negative set of k_n images, compute their representations, and choose the nearest 10 as hard negatives. We keep a small index list of negative candidates, of size k_n^hard < k_n, and the hardest ones remain in the list for the next epoch. We show examples of training triplets in the TokyoTM dataset in Fig. 4.
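A rough sketch of this mining scheme follows (our own illustration; function and variable names are hypothetical, and the distance thresholds follow the description above):

```python
import numpy as np

def mine_triplet(q_idx, descs, gps, kn=120, n_hard=10, rng=None):
    """Pick the best positive and the hardest negatives for one query.

    descs: (M, D) cached descriptors of the database images,
    gps  : (M, 2) positions in metres, q_idx: index of the query image."""
    rng = rng or np.random.default_rng()
    geo = np.linalg.norm(gps - gps[q_idx], axis=1)         # metres to the query
    pos = np.where((geo > 0) & (geo <= 25.0))[0]           # GPS positives (10-25 m rule, simplified)
    neg = np.where(geo >= 225.0)[0]                        # negatives: at least 225 m away
    neg = rng.choice(neg, size=min(kn, len(neg)), replace=False)
    d = lambda idx: np.sum((descs[idx] - descs[q_idx]) ** 2, axis=1)
    best_pos = pos[np.argmin(d(pos))]                      # nearest positive in feature space
    hard_neg = neg[np.argsort(d(neg))[:n_hard]]            # the 10 hardest negatives
    return best_pos, hard_neg

# Toy usage with random descriptors and positions (hypothetical data).
rng = np.random.default_rng(0)
descs = rng.standard_normal((1000, 128))
gps = rng.uniform(0, 400, size=(1000, 2))
gps[1] = gps[0] + np.array([0.0, 15.0])                    # guarantee one positive exists
print(mine_triplet(0, descs, gps, rng=rng))
```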
4. Experiments and Discussion
In this section, we describe the datasets and implementation details, and give place recognition results to validate our method. The programs run on a PC equipped with an Intel(R) Xeon E3-1231 @3.4GHz CPU, 32 GB RAM, and an NVIDIA GTX1080Ti GPU with 11 GB memory, and are implemented using the MatConvNet package and Matlab 2017a on 64-bit Windows 10.

4.1. Datasets

We use publicly available datasets for training and testing: Pittsburgh 250k (Pitts250k) [49], Tokyo Time Machine (TokyoTM) and Tokyo 24/7 [50]. The Pitts250k dataset contains 250k database images generated from Google Street View panoramas, which are equally divided into training, validation and test sets. For the test query set, 24,000 perspective images are generated from the Google Pittsburgh Research Dataset. A smaller subset, Pittsburgh 30k (Pitts30k), is used for faster training, including 10k database images in each of the train/validation/test sets. The TokyoTM dataset is also generated from Google Street View panoramas, and each panorama is represented by 12 perspective images sampled in different orientations. It is split into training and validation sets, each of which contains 50k database images and 7k query images. Tokyo 24/7 is a challenging dataset which contains 76k database images and 315 query images captured with iPhone 5s and Sony Xperia smartphones; at each location the images are captured from 3 different viewpoints and at 3 different times of day. We train the networks with TokyoTM and test with Tokyo 24/7, which are geographically disjoint.

4.2. Implementation Details

Two base architectures, AlexNet [51] and VGG-16 [52], trained on the ImageNet 2012 classification dataset are used, of which the number of channels in the last convolutional layer is reduced to 256 as suggested in [29]; our proposed pooling layer then follows. We implement the EIG algorithm on the CPU, while the forward and backward propagations of the remaining operations
Figure 8: Place recognition recalls (a) on TokyoTM-val using MPN-COV trained with fixed triplets and randomly selected triplets with and without hard negatives, and (b) on Tokyo 24/7 using MPN-COV trained with the hinge loss and the soft-margin loss.
are performed on the GPU. The training (training/validation sets) and evaluation (test sets) follow the settings of NetVLAD, and we use the UTM ground-truth in the datasets to draw the recall curves. Following the protocol of [49], the evaluation metric is the same as NetVLAD, i.e., d = 25 meters: a retrieved image is correct if it is within 25 m of the ground-truth position of the query. The positive and negative sets are sampled from the database and have time stamps at least one month away from the time stamp of the query, for both training and evaluation in TokyoTM. For Tokyo 24/7, spatial non-maximal suppression is performed before evaluation, and we do not use the full-resolution images, which cause out-of-memory errors on our PC. For training, we use stochastic gradient descent (SGD). We set the margin parameter to 0.3, and use a learning rate of 0.001 with exponential decay, momentum of 0.9, and weight decay of 0.0001 for 30 epochs. When choosing triplets, the parameters are set as k_n = 120 and k_p = 36. We also perform principal component analysis (PCA) whitening on the final representations for all methods. We compare with other networks on the TokyoTM-validation (TokyoTM-val), Pitts250k-test and Tokyo 24/7 datasets, where performance is measured by the recall given the top N candidates in the shortlist.

4.3. Results and Discussion

4.3.1. Place Recognition

To evaluate our method, we first compare with representations generated by off-the-shelf networks pretrained on the classification task.
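To make the evaluation protocol of Section 4.2 concrete, a small sketch of recall@N with the d = 25 m criterion is given below (our own illustration with hypothetical inputs):

```python
import numpy as np

def recall_at_n(ranked_db_idx, query_pos, db_pos, n_values=(1, 5, 10), d=25.0):
    """ranked_db_idx: (Q, K) database indices sorted by descriptor distance,
       query_pos/db_pos: UTM coordinates in metres."""
    recalls = {}
    for n in n_values:
        hits = 0
        for q, ranked in enumerate(ranked_db_idx):
            dists = np.linalg.norm(db_pos[ranked[:n]] - query_pos[q], axis=1)
            hits += np.any(dists <= d)                 # correct if any top-n image is within d
        recalls[n] = hits / len(ranked_db_idx)
    return recalls

# Hypothetical example: 2 queries, top-5 retrieved candidates each.
db_pos = np.array([[0, 0], [100, 0], [0, 30], [490, 490], [20, 10]], dtype=float)
query_pos = np.array([[5, 5], [480, 480]], dtype=float)
ranked = np.array([[1, 4, 3, 0, 2], [0, 1, 2, 4, 3]])
print(recall_at_n(ranked, query_pos, db_pos))   # {1: 0.0, 5: 1.0, 10: 1.0}
```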
Lowest trained layer    Pitts30k-test (r@1 / r@5 / r@10)    TokyoTM-val (r@1 / r@5 / r@10)
conv5_1                 84.75 / 92.59 / 94.45               95.63 / 97.31 / 97.63
conv4_1                 85.05 / 92.56 / 94.47               95.77 / 97.28 / 97.61
conv3_1                 84.92 / 92.77 / 94.87               95.82 / 97.27 / 97.51
conv2_1                 84.84 / 92.50 / 94.51               95.81 / 97.25 / 97.51
conv1_1                 80.80 / 90.70 / 93.47               95.94 / 97.33 / 97.68
off-the-shelf           81.18 / 90.58 / 93.32               92.17 / 95.56 / 96.59

Table 1: Results of performing backpropagation down to a certain layer of MPN-COV based on VGG-16 on Pitts30k-test and TokyoTM-val. r@N denotes recall@N.
The baselines are networks cropped at the MPN-COV layer, and descriptors are obtained without further training. We then fine-tune the MPN-COV network on the TokyoTM and Pittsburgh datasets for the place recognition task, and the learned representations are used for comparison. For the proposed AdaCN layers, we find that initializing the three convolutional layers randomly easily changes the properties of the eigenvalues, resulting in longer training time and unstable performance. Thus we first train these layers to approximate the square root of the eigenvalues, and initialize with the pretrained parameters. Then the covariance estimation and adaptive normalization are fine-tuned together with the preceding layers. We compare with the state-of-the-art NetVLAD trained on the same datasets for place recognition, using the publicly available source code (https://github.com/Relja/netvlad), on the street view datasets and the image retrieval datasets. We utilize the standard procedure to perform dimensionality reduction, i.e., PCA with whitening followed by L2-normalization, for all the obtained descriptors. Note that the PCA is learned on the training set for TokyoTM and Pittsburgh, respectively. In the following we discuss the details of our experiments.

Triplets and margin. When choosing training triplets, different strategies are used, including fixed triplets and randomly selected triplets with and without hard negatives. Fixed triplets (f) denote performing offline hard mining on a subset of the data, as it is infeasible to search the whole database. Besides, we also use randomly selected triplets without any strategy (r-w/o), and with hard mining in each epoch (r-w) following NetVLAD. The very hard negatives are images whose distance to the query is smaller than that of the positive even without the margin; we either keep or ignore them to examine whether these outliers affect the training process.
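For the dimensionality reduction mentioned above, a minimal NumPy sketch of PCA whitening followed by L2-normalization is given below (our own illustration with toy sizes; the dimensions used in the experiments are far larger):

```python
import numpy as np

def learn_whitening(train_descs, dim):
    """PCA whitening learned on training-set descriptors (one row per image)."""
    mean = train_descs.mean(axis=0)
    U, s, Vt = np.linalg.svd(train_descs - mean, full_matrices=False)
    # Projection that decorrelates components and scales them to unit variance.
    proj = Vt[:dim] / (s[:dim, None] / np.sqrt(train_descs.shape[0] - 1))
    return mean, proj

def reduce_descriptor(mean, proj, desc):
    z = proj @ (desc - mean)
    return z / np.linalg.norm(z)          # followed by L2-normalization

mean, proj = learn_whitening(np.random.randn(2000, 512), dim=128)
q = reduce_descriptor(mean, proj, np.random.randn(512))
print(q.shape, round(float(np.linalg.norm(q)), 3))   # (128,) 1.0
```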
Lowest trained layer    Pitts30k-test (r@1 / r@5 / r@10)    TokyoTM-val (r@1 / r@5 / r@10)
conv5_1                 84.52 / 92.08 / 94.03               95.96 / 97.47 / 97.91
conv4_1                 84.57 / 92.08 / 94.03               95.95 / 97.69 / 98.14
conv3_1                 84.27 / 91.95 / 93.93               95.91 / 97.44 / 98.00
conv2_1                 83.89 / 91.70 / 93.74               95.98 / 97.51 / 98.02
conv1_1                 83.54 / 91.33 / 93.68               95.99 / 97.57 / 98.00
AdaCN                   84.49 / 92.05 / 93.96               95.99 / 97.58 / 97.70

Table 2: Results of performing backpropagation down to a certain layer of our network based on VGG-16 on Pitts30k-test and TokyoTM-val. r@N denotes recall@N.
Figure 9: Place recognition recalls on Tokyo 24/7 using representations generated by MLN-COV, MPN-COV and ours, trained on TokyoTM.
In Fig. 8(a), the results on TokyoTM-val using MPN-COV trained on TokyoTM with VGG16 show that for fixed triplets, where the same triplets are utilized in every epoch and training converges quickly (in fewer than 10 epochs), the learned representations perform better than off-the-shelf MPN-COV, which reveals the benefit of end-to-end training. Training with randomly selected triplets outperforms fixed triplets, owing to the variability between query images and reference images in each epoch. The selected triplets with hard negatives perform best among these strategies, which demonstrates the importance of (semi-)hard mining. Besides, the very hard negatives are kept, as they benefit training. Fig. 8(b) shows the results of MPN-COV trained with the hinge loss and the soft-margin loss introduced in Section 3.5. We can see that using the soft-margin loss improves robustness on Tokyo 24/7, though the recall decreases on the TokyoTM-val set. For a fair comparison with NetVLAD, we train with the hinge loss rather than the soft-margin triplet loss.
Layer-by-layer studies. Considering the question of which layers should
Figure 10: Eigenvalue histograms for (a) the original covariance matrix, (b) the power normalized covariance matrix and (c) our adaptively normalized covariance matrix of VGG16-extracted descriptors on TokyoTM.
Figure 11: Place recognition accuracy versus dimensionality of the proposed method, MPN-COV and NetVLAD on (a) Tokyo 24/7 and (b) Pitts250k-test.
be trained, we show the place recognition results of training down to different layers on Pitts30k-test and TokyoTM-val using MPN-COV based on VGG16 in Table 1. From the results on Pitts30k-test, the best performance is achieved when training down to conv3_1, which may indicate that training the mid-level and high-level features is more important in a second-order statistics based deep network. Backpropagation down to conv1_1 brings poor performance with overfitting, almost the same as the original network without training. However, the best recall on TokyoTM-val reported in Table 1 is achieved by backpropagation down to conv1_1, different from the Pittsburgh dataset, due to the large variability between the query images and the reference images. For our proposed covariance pooling method, training the three AdaCN layers brings the largest improvement; training other layers leads to small improvements, or to overfitting on Pitts30k-test, as reported in Table 2.
Dataset                    Method          r@1      r@5      r@10
Tokyo 24/7 all             NetVLAD [23]    70.79    80.32    83.81
                           MPN-COV [29]    72.92    81.49    86.35
                           Ours            75.56    83.81    86.08
Tokyo 24/7 sunset/night    NetVLAD [23]    60.48    70.48    74.76
                           MPN-COV [29]    63.33    72.86    79.52
                           Ours            65.71    76.67    80.00

Table 3: Recalls on Tokyo 24/7 all test images and sunset/night-time images using different networks based on VGG16 with PCA whitening. r@N denotes recall@N.
Matrix normalization. To evaluate our AdaCN, we compare our network with networks based on the Pow-E and Log-E normalizations (without PCA whitening). We can see from Fig. 9 that our AdaCN is effective in learning discriminative representations, performing better on the Tokyo 24/7 dataset than the matrix power normalized and matrix logarithm normalized covariances, namely MPN-COV and MLN-COV, respectively. Under the logarithm function and its derivative, the smallest eigenvalues play more crucial roles than the largest ones; thus the shrinkage process of power normalization, which does not change the order of eigenvalue significance, is more suitable. For our method, the normalization adaptively shrinks larger eigenvalues and stretches smaller eigenvalues compared with power normalization, reducing the ratio between the smallest and the largest eigenvalues of the original sample covariance matrix. Moreover, it varies the eigenvalues around λ_i = 1, i.e., eigenvalues less than one can be stretched to values larger than one. The histograms of eigenvalues of the original sample covariance matrix, the power normalized covariance matrix and our adaptively normalized covariance matrix are shown in Fig. 10. Besides, it is observed that the adaptively normalized eigenvalues are not given in strictly descending order for the smaller ones, but the order of the significant eigenvalues remains unchanged. The comparison of recalls on TokyoTM-val, Pitts250k-test and Tokyo 24/7 based on both VGG16 and AlexNet is shown in Fig. 5, in which our proposed network outperforms MPN-COV, revealing the effectiveness of the adaptive normalization strategy in the place recognition task.

PCA whitening. As suggested in NetVLAD, dimensionality reduction of the representations after second-order pooling is performed using PCA with whitening followed by L2 normalization. To evaluate the representation ability of the proposed network after reduction, Fig. 5 reports the recalls
with reduced representations (+white) on the test datasets. It shows that the lower-dimensional representations (4096-D) perform even better than the full-size vectors for MPN-COV and our method, especially on the Pittsburgh dataset, while NetVLAD with PCA whitening performs similarly to its full-size vector. Moreover, our proposed method and MPN-COV based on VGG16 outperform NetVLAD when the representations are reduced to the same dimensionality on Tokyo 24/7 and Pitts250k-test after PCA whitening, as shown in Fig. 11, where the recalls at N = 5 are reported; the performance decreases gracefully with dimensionality. Table 3 displays the results of NetVLAD, MPN-COV and our proposed network after PCA whitening based on VGG16, evaluated on all Tokyo 24/7 test images and on the challenging sunset/night-time images, including the recalls at N = 1, 5, 10.

The place recognition performance reported in Fig. 5 shows that the trained robust covariance pooling network greatly outperforms NetVLAD, which reveals that second-order statistics are effective for place representation. Besides, our representation based on VGG16 achieves the best performance, with recall at N = 1 exceeding NetVLAD by 5% on Tokyo 24/7 in Fig. 5(b) and by 3% on Pitts250k in Fig. 5(c). Our representation based on AlexNet also achieves higher recall, and the margin at N = 1 over NetVLAD is 4% on Tokyo 24/7 in Fig. 5(e) and 1% on Pitts250k-test in Fig. 5(f). Moreover, the covariance estimated with AdaCN is more robust than MPN-COV, with better recognition performance especially on the Tokyo 24/7 dataset, where the recalls at N = 5 exceed MPN-COV by almost 2% based on both VGG16 and AlexNet. For Pitts250k-test, the results show smaller improvements over MPN-COV, as there is not enough variability between the query and the reference images in the training set.

We show examples of place recognition results on Tokyo 24/7 in Fig. 6, where each row corresponds to one test case and the top retrieved images for each query image are displayed for these networks. The second-order statistics can help recognize the same place despite changes of illumination, and eliminate the effects of occlusion by cars, trees and people. The query images at the same location captured in daytime but from different viewpoints can be correctly recognized using either the first- or the second-order information, but for photographs captured at night, a notable improvement is made by our method, as shown in Table 3. Moreover, our learned representations correctly recognize the place in some common failure cases, including scenarios with very dark lighting conditions and reflective materials. The place recognition examples on the Pittsburgh dataset are shown in Fig.
Dimension    Method              r@1      r@5      r@10
4096-D       NetVLAD [23]        85.58    93.34    95.51
             CRN+NetVLAD [24]    85.50    93.50    95.50
             MPN-COV [29]        88.76    94.77    96.27
             Ours                88.97    95.02    96.30
512-D        NetVLAD [23]        81.13    90.87    92.61
             APANet [38]         83.65    92.56    94.70
             MPN-COV [29]        85.22    93.00    94.83
             Ours                85.24    93.33    95.06

Table 4: Recalls on the Pitts250k-test dataset using different networks based on VGG16 with PCA whitening. r@N denotes recall@N. Note that we use the recalls reported by the authors in [24, 38] for CRN+NetVLAD and APANet, respectively.
7, where each column corresponds to one test case. The queries are captured in a different session and from randomly sampled viewpoints compared with the database images. Our method achieves competitive performance under all these challenging circumstances, with large changes in viewpoint and partial occlusions, compared with NetVLAD and MPN-COV. We also report the recalls on Pitts250k-test for our method compared with MPN-COV, NetVLAD, CRN-based NetVLAD and APANet in Table 4, exceeding the others by margins of 1.6–4%, which also demonstrates that the second-order statistics outperform the context-aware and attention-based methods trained by reweighting regions in the feature map.

4.3.2. Image Retrieval

To verify the generalization ability of our method, the best-performing network trained on Pittsburgh with PCA whitening based on VGG16 (without any fine-tuning) is used to extract image representations for the object and image retrieval benchmarks Oxford 5k [53] and Paris 6k [54]. We compare the representations generated by our proposed network with MPN-COV and NetVLAD across dimensionalities in Table 5, which reports the accuracy measured by mean average precision (mAP) for full and cropped images. The image is cropped according to the testing procedure when the query region of interest (ROI) is respected, where the crop is made by extending the ROI by half of the receptive field size. No spatial re-ranking or query expansion is adopted. From Table 5, we can see that the second-order statistics are superior to the first-order ones on Oxford and Paris with 4096-D, 2048-D and 1024-D representations for the cropped query images. For 512-D representations, the methods based on second-order statistics obtain little
Oxford (full / crop)
Dimension    NetVLAD          MPN-COV          Ours
4096-D       0.691 / 0.716    0.689 / 0.748    0.712 / 0.762
2048-D       0.677 / 0.708    0.679 / 0.731    0.698 / 0.748
1024-D       0.669 / 0.692    0.667 / 0.700    0.680 / 0.725
512-D        0.656 / 0.676    0.657 / 0.653    0.645 / 0.683

Paris (full / crop)
Dimension    NetVLAD          MPN-COV          Ours
4096-D       0.785 / 0.797    0.763 / 0.797    0.774 / 0.812
2048-D       0.770 / 0.783    0.756 / 0.785    0.769 / 0.800
1024-D       0.757 / 0.765    0.730 / 0.756    0.746 / 0.776
512-D        0.734 / 0.749    0.712 / 0.726    0.722 / 0.752

Table 5: Image retrieval results for varying dimensionality of our proposed representation compared with MPN-COV and NetVLAD.
improvement, which may be because excessive compression loses too much of the effective information lying in the second-order statistics. Our method achieves the best mAP on Oxford and performs better than MPN-COV, showing the good generalization ability of our AdaCN method.
5. Conclusion
In this paper, we proposed an adaptively normalized covariance pooling method which can be flexibly integrated into deep CNNs for effective place recognition. The proposed adaptive covariance normalization inherits the merits of existing matrix normalization methods (i.e., robust covariance estimation and Riemannian geometry), while adaptively performing normalization on different covariance representations. Extensive experimental results demonstrate the effectiveness of our proposed method on place recognition and image retrieval tasks. In particular, our method is able to correctly recognize places with confusing objects, different illumination and changing viewpoints. Meanwhile, the proposed method achieves better performance than the state-of-the-art NetVLAD and other context-aware and attention-based methods. These results clearly show that high-order statistics offer great potential for improving the discriminative ability of deep CNNs, and our proposed AdaCN performs flexible shrinkage on eigenvalues, which is superior to MPN-COV and more suitable for the place recognition task, especially in complex scenarios.
Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work is supported by the National Natural Science Foundation of China (Grant Nos. U19A2073, 61671182, 61871381, 61806140). The authors would like to thank Relja Arandjelović and Akihiko Torii for providing data and code.
References
[1] W. Maddern, M. Milford, G. Wyeth, Cat-slam: probabilistic localisation and mapping using a continuous appearance-based trajectory, The International Journal of Robotics Research 31 (4) (2012) 429–451. [2] R. Sim, P. Elinas, M. Griffin, J. J. Little, et al., Vision-based slam using the rao-blackwellised particle filter, in: IJCAI Workshop on Reasoning with Uncertainty in Robotics, Vol. 14, 2005, pp. 9–16. [3] A. J. Davison, I. D. Reid, N. D. Molton, O. Stasse, Monoslam: Real-time single camera slam, IEEE Transactions on Pattern Analysis & Machine Intelligence (6) (2007) 1052–1067. [4] R. Mur-Artal, J. M. M. Montiel, J. D. Tardos, Orb-slam: a versatile and accurate monocular slam system, IEEE transactions on robotics 31 (5) (2015) 1147–1163. [5] J. Engel, T. Sch¨ops, D. Cremers, Lsd-slam: Large-scale direct monocular slam, in: European Conference on Computer Vision, Springer, 2014, pp. 834–849. [6] A. Oliva, A. Torralba, Building the gist of a scene: The role of global image features in recognition, Progress in brain research 155 (2006) 23– 36.
[7] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2005, pp. 886–893. [8] H. Bay, T. Tuytelaars, L. Van Gool, Surf: Speeded up robust features, in: European conference on computer vision, Springer, 2006, pp. 404– 417. [9] E. Rublee, V. Rabaud, K. Konolige, G. R. Bradski, Orb: An efficient alternative to sift or surf., in: ICCV, Vol. 11, Citeseer, 2011, p. 2. [10] X. Yang, K.-T. T. Cheng, Local difference binary for ultrafast and distinctive feature description, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (1) (2013) 188–194. [11] M. Cummins, P. Newman, Fab-map: Probabilistic localization and mapping in the space of appearance, The International Journal of Robotics Research 27 (6) (2008) 647–665. [12] F. Han, X. Yang, Y. Deng, M. Rentschler, D. Yang, H. Zhang, Sral: Shared representative appearance learning for long-term visual place recognition, IEEE Robotics and Automation Letters 2 (2) (2017) 1172– 1179. [13] H. Zhang, F. Han, H. Wang, Robust multimodal sequence-based loop closure detection via structured sparsity., in: Robotics: Science and systems, 2016. [14] G. Tolias, R. Sicre, H. J´egou, Particular object retrieval with integral max-pooling of cnn activations, in: International Conference on Learning Representations, 2016. [15] A. Gordo, J. Almaz´an, J. Revaud, D. Larlus, Deep image retrieval: Learning global representations for image search, in: European conference on computer vision, Springer, 2016, pp. 241–257. [16] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587. 26
[17] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in neural information processing systems, 2015, pp. 91–99. [18] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 million image database for scene recognition, IEEE transactions on pattern analysis and machine intelligence 40 (6) (2018) 1452–1464. [19] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva, Learning deep features for scene recognition using places database, in: Advances in neural information processing systems, 2014, pp. 487–495. [20] N. S¨ underhauf, S. Shirazi, F. Dayoub, B. Upcroft, M. Milford, On the performance of convnet features for place recognition, in: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2015, pp. 4297–4304. [21] Q. Li, K. Li, X. You, S. Bu, Z. Liu, Place recognition based on deep feature and adaptive weighting of similarity matrix, Neurocomputing 199 (2016) 114–127. [22] Z. Chen, A. Jacobson, N. S¨ underhauf, B. Upcroft, L. Liu, C. Shen, I. Reid, M. Milford, Deep learning features at scale for visual place recognition, in: 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017, pp. 3223–3230. [23] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, J. Sivic, Netvlad: Cnn architecture for weakly supervised place recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307. [24] H. J. Kim, E. Dunn, J.-M. Frahm, Learned contextual feature reweighting for image geo-localization, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 3251–3260. [25] T. Peng, X. Wang, B. Shi, B. Xiang, W. Liu, Z. Tu, Deep fishernet for object classification, IEEE Transactions on Neural Networks and Learning Systems PP (99) (2016) 1–7.
657 658 659
660 661 662
663 664 665
[26] J. Sánchez, F. Perronnin, T. Mensink, J. Verbeek, Image classification with the Fisher vector: Theory and practice, International Journal of Computer Vision 105 (3) (2013) 222–245.
[27] C. Ionescu, O. Vantzos, C. Sminchisescu, Matrix backpropagation for deep networks with structured layers, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2965–2973.
[28] Q. Wang, P. Li, L. Zhang, G2DeNet: Global Gaussian distribution embedding network and its application to visual recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2730–2739.
[29] P. Li, J. Xie, Q. Wang, W. Zuo, Is second-order information helpful for large-scale visual recognition?, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2070–2078.
[30] D. G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the IEEE International Conference on Computer Vision, IEEE, 1999, p. 1150.
[31] N. Sünderhauf, P. Protzel, BRIEF-Gist: Closing the loop by simple means, in: 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2011, pp. 1234–1241.
[32] A. Angeli, D. Filliat, S. Doncieux, J.-A. Meyer, A fast and incremental method for loop-closure detection using bags of visual words, IEEE Transactions on Robotics (2008) 1027–1037.
[33] M. Agrawal, K. Konolige, M. R. Blas, CenSurE: Center surround extremas for realtime feature detection and matching, in: European Conference on Computer Vision, Springer, 2008, pp. 102–115.
[34] K. Konolige, M. Agrawal, FrameSLAM: From bundle adjustment to real-time visual mapping, IEEE Transactions on Robotics 24 (5) (2008) 1066–1077.
[35] P. Tang, H. Wang, S. Kwong, G-MS2F: GoogLeNet based multi-stage feature fusion of deep CNN for scene recognition, Neurocomputing 225 (2017) 188–197.
[36] J. Yu, C. Zhu, J. Zhang, Q. Huang, D. Tao, Spatial pyramid-enhanced NetVLAD with weighted triplet loss for place recognition, IEEE Transactions on Neural Networks and Learning Systems (2019) 1–14.
[37] Y. Zhu, J. Wang, L. Xie, L. Zheng, Attention-based pyramid aggregation network for visual place recognition, in: 2018 ACM Multimedia Conference, ACM, 2018, pp. 99–107.
[38] M. Lopez-Antequera, R. Gomez-Ojeda, N. Petkov, J. Gonzalez-Jimenez, Appearance-invariant place recognition by discriminatively training a convolutional neural network, Pattern Recognition Letters 92 (2017) 89–95.
[39] R. Arandjelovic, A. Zisserman, All about VLAD, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1578–1585.
[40] T.-Y. Lin, A. RoyChowdhury, S. Maji, Bilinear CNN models for fine-grained visual recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1449–1457.
[41] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, C. Schmid, Aggregating local image descriptors into compact codes, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (9) (2012) 1704–1716.
[42] J. Carreira, R. Caseiro, J. Batista, C. Sminchisescu, Semantic segmentation with second-order pooling, in: European Conference on Computer Vision, Springer, 2012, pp. 430–443.
[43] C. Stein, Lectures on the theory of estimation of many parameters, Journal of Soviet Mathematics 34 (1) (1986) 1373–1403.
[44] O. Ledoit, M. Wolf, A well-conditioned estimator for large-dimensional covariance matrices, Journal of Multivariate Analysis 88 (2) (2004) 365–411.
[45] E. Yang, A. Lozano, P. Ravikumar, Elementary estimators for sparse covariance matrices and other structured moments, in: Proceedings of the 31st International Conference on Machine Learning, Vol. 32, PMLR, 2014, pp. 397–405.
[46] Q. Wang, P. Li, W. Zuo, L. Zhang, RAID-G: Robust estimation of approximate infinite dimensional Gaussian with application to material recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4433–4441.
[47] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[48] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, Y. Wu, Learning fine-grained image similarity with deep ranking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1386–1393.
[49] A. Torii, J. Sivic, T. Pajdla, M. Okutomi, Visual place recognition with repetitive structures, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 883–890.
[50] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, T. Pajdla, 24/7 place recognition by view synthesis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1808–1817.
[51] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[52] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.
[53] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial matching, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2007, pp. 1–8.
[54] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Lost in quantization: Improving particular object retrieval in large scale image databases, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2008, pp. 1–8.
Biography of Authors
Weiqi Zhang received the B.S. degree from the Harbin Institute of Technology, Harbin, China, in July 2014, and is currently a Ph.D. candidate in computer science at the Harbin Institute of Technology. She has published two papers in academic journals. Her research interests focus on machine learning and computer vision.
Zifei Yan received the Ph.D. degree in computer application technology from the Harbin Institute of Technology, Harbin, China, in 2010. From May 2007 to August 2007 and from December 2008 to January 2009, she was a Research Assistant at the Department of Computing, The Hong Kong Polytechnic University, Hong Kong. From September 2014 to September 2015, she was a Visiting Scholar at the University of Pittsburgh. She is currently an associate professor in the School of Architecture, Harbin Institute of Technology. She has published more than 20 papers in academic journals and conferences. Her current research interests include machine learning and computer vision.
Qilong Wang received the Ph.D. degree from the School of Information and Communication Engineering, Dalian University of Technology, China, in 2018. He is currently an assistant professor at the College of Intelligence and Computing, Tianjin University. His research interests include computer vision and pattern recognition, particularly visual classification and deep probability distribution modeling. He has published more than thirty academic papers in top conferences and refereed journals, including ICCV, CVPR, NIPS, ECCV, IJCAI, IEEE TPAMI, TIP, and TCSVT.
Xiaohe Wu received the Ph.D. degree from the Harbin Institute of Technology, Harbin, China, in 2019. She is currently a postdoctoral researcher in the School of Computer Science and Technology, Harbin Institute of Technology. From October 2016 to October 2017, she visited the University of California, Merced. She has published several papers in top-tier academic journals and conferences. Her research interests include machine learning, image restoration and enhancement, and single-object visual tracking.
Wangmeng Zuo received the Ph.D. degree in computer application technology from the Harbin Institute of Technology, Harbin, China, in 2007. From July 2004 to December 2004, from November 2005 to August 2006, and from July 2007 to February 2008, he was a Research Assistant at the Department of Computing, The Hong Kong Polytechnic University, Hong Kong. From August 2009 to February 2010, he was a Visiting Professor at Microsoft Research Asia. He is currently an Associate Professor in the School of Computer Science and Technology, Harbin Institute of Technology. He has published more than 50 papers in top-tier academic journals and conferences, including IJCV, IEEE T-IP, T-NNLS, T-IFS, Pattern Recognition, CVPR, ICCV, ICML, NIPS, ECCV, and ACM MM. His current research interests include image modeling and blind restoration, discriminative learning, biometrics, and computer vision.