Accepted Manuscript

Deep learning features exception for cross-season visual place recognition

Chingiz Kenshimov, Loukas Bampis, Beibut Amirgaliyev, Marat Arslanov, Antonios Gasteratos

PII: S0167-8655(17)30396-3
DOI: 10.1016/j.patrec.2017.10.028
Reference: PATREC 6982
To appear in: Pattern Recognition Letters
Received date: 1 February 2017
Revised date: 6 October 2017
Accepted date: 18 October 2017

Please cite this article as: Chingiz Kenshimov, Loukas Bampis, Beibut Amirgaliyev, Marat Arslanov, Antonios Gasteratos, Deep learning features exception for cross-season visual place recognition, Pattern Recognition Letters (2017), doi: 10.1016/j.patrec.2017.10.028
Research Highlights

• A new approach for building a robust season-invariant feature descriptor for vPR
• Feature maps from a CNN are considered as the smallest units for analysis
• Feature descriptors free from season-specific information are built
• The baseline performance of CNN feature descriptors is significantly increased
Pattern Recognition Letters journal homepage: www.elsevier.com
Deep learning features exception for cross-season visual place recognition

Chingiz Kenshimov a,c,∗∗, Loukas Bampis b, Beibut Amirgaliyev a, Marat Arslanov a, Antonios Gasteratos b

a Pattern Recognition and Decision Making Laboratory, The Institute of Information and Computational Technologies, 125 Pushkin St., 050010, Almaty, Kazakhstan
b Department of Production and Management Engineering, Democritus University of Thrace, 12 Vas. Sophias, GR-671 32, Xanthi, Greece
c Department of Mechanics and Mathematics, al-Farabi Kazakh National University, 71 al-Farabi Ave., 050040, Almaty, Kazakhstan
ABSTRACT
The use of Convolutional Neural Networks (CNNs) in image analysis and recognition paved the way for long-term visual place recognition. The transferable power of generic descriptors extracted at different layers of off-the-shelf CNNs has been successfully exploited in various visual place recognition scenarios. In this paper we tackle this problem by extracting the full output of an intermediate layer and building an image descriptor of lower dimensionality by omitting the activations of filters that correspond to environmental changes. Thus, we are able to increase the robustness of cross-season visual place recognition. We test our approach on the Nordland dataset, the largest and most challenging dataset to date, in which the four included seasons induce great illumination and appearance changes. The experiments show that our new approach can significantly increase, by up to 14%, the baseline (single-image search) performance of deep features.

© 2017 Elsevier Ltd. All rights reserved.
1. Introduction
Autonomous systems operating in challenging, unstructured and dynamic environments require strong localization abilities that remain robust to accumulated odometry errors. A common way to increase such robustness is by means of a place recognition engine, i.e. a system which detects revisited scenes in order to rectify the estimated odometry or to recover the robot’s position in localization failure scenarios. A place recognition engine relies on visual sensing algorithms, generally categorized as visual Place Recognition (vPR). With a view to achieving increased robustness, vPR methods need to show invariance to lighting, viewpoint and environmental differences. In particular, long trajectory scenarios, dealing with extreme appearance changes due to different periods of the day (day and night) or seasons of the year (summer and winter), have promoted vPR into one of the most challenging tasks in robotic vision. Traditionally, vPR techniques (Cummins and Newman, 2008; Angeli et al., 2008; Schindler et al., 2007; Gálvez-López and Tardós, 2012; Kostavelis and Gasteratos, 2013) are based
∗∗ Corresponding author. Tel.: +7 707 702 1119; e-mail: [email protected] (Chingiz Kenshimov).
on describing the visual content of a given image by using local features, such as SIFT (Lowe, 2004) or SURF (Bay et al., 2006), through the “Bag of Visual Words” (BoVW) model. According to these methods, a visual vocabulary needs to be trained in order to quantize the respective local descriptor space. Given a query image, a set of features is extracted and quantized into the respective nearest visual words. Then, an image description vector is produced, of size equal to the visual vocabulary, corresponding to a histogram that weighs the presence of each visual word in the given image, usually through the “Term Frequency - Inverse Document Frequency” (TF-IDF) model (Sivic et al., 2003). Thus, a place is recognized by comparing distance metrics between the query image descriptor and all the ones contained in the database. Even though the approach described above offers high invariance to the viewpoint changes induced by a freely moving camera, it is also highly sensitive to lighting and environmental differences. Recently, a great breakthrough has been achieved in the computer vision community through the application of Convolutional Neural Networks (CNNs) to object recognition and scene classification tasks (Krizhevsky et al., 2012; Zhou et al., 2014). Over the past several years, models trained using CNNs have consistently dominated visual recognition challenges such as ImageNet, Places, COCO, etc. (Russakovsky et al., 2015;
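To make the BoVW pipeline above concrete, the following minimal sketch builds a TF-IDF weighted visual-word histogram and compares two such descriptors with the cosine similarity. The function names and the NumPy-only implementation are our own illustration, not code from the cited systems; the visual vocabulary and the IDF weights are assumed to have been trained offline.

```python
import numpy as np

def bovw_tfidf(word_ids, vocab_size, idf):
    """TF-IDF weighted Bag-of-Visual-Words histogram for one image.

    word_ids : index of the nearest visual word for every local feature
               (e.g. a quantized SIFT/SURF descriptor) extracted from the image.
    idf      : inverse document frequency of each visual word, learned offline.
    """
    tf = np.bincount(word_ids, minlength=vocab_size).astype(float)
    tf /= max(tf.sum(), 1.0)            # term frequency
    return tf * idf                     # TF-IDF weighting

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# A query is assigned to the database image whose descriptor is most similar:
# best = max(range(len(db)), key=lambda i: cosine_similarity(query, db[i]))
```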
Fig. 1. Simplified principal scheme of a Convolutional Layer (input image, filters, convolution operation and the resulting feature maps).

Fig. 2. Histogram showing how frequently each feature map of layer conv3 (horizontal axis: feature map index; vertical axis: frequency in % of being in the top 20) appears among the top k differing feature maps. For this example the comparisons were made between corresponding images from the entire training set of the winter-summer season pair, with k = 20.
2. Related work
One of the most acknowledged approaches in vPR is the work of Cummins and Newman (2008). According to their system, called FAB-MAP, the image description is achieved by a SURF-based visual vocabulary and a Chow Liu tree trained offline to describe the words’ co-visibility probability. More recent techniques (Gálvez-López and Tardós, 2012; Mur-Artal and Tardós, 2014) take advantage of the low computational complexity offered by bags of binary visual words, obtained from features like BRIEF (Calonder et al., 2010) or ORB (Rublee et al., 2011), and treat the vPR problem as a procedure of calculating similarity metrics between visual word vectors. Another interesting direction introduced by a variety of vPR systems is the description of image sequences instead of single instances. Algorithms like the ones presented in (MacTavish and Barfoot, 2014; Bampis et al., 2016) are based on combining visual words obtained from multiple frames in order to describe full scenes/places as a whole. Due to the use of local image features, the aforementioned techniques offer high invariance to possible viewpoint changes. Yet, they are also particularly sensitive to changes caused by different lighting and environmental conditions (Valgren and Lilienthal, 2010). To that end, Milford and Wyeth (2012) proposed the SeqSLAM system, which addressed the vPR task by combining global image similarity metrics obtained from sequences of frames. Their main contribution was a dynamic sequence matching approach that identifies the vehicle’s velocity through the individual image similarity scores. Moreover, Arroyo et al. (2015) proposed a global and binary description of image sequences based on the “Local Difference Binary” (LDB) algorithm (Yang and Cheng, 2014). According to their method, each image is downscaled, converted into an illumination-invariant representation and described using the LDB method. These description vectors are then combined into groups of fixed size and concatenated in order to produce global sequence descriptors. Since CNNs were established as scalable across many computer vision techniques, a variety of approaches have been introduced that address the vPR problem through CNN-derived description vectors. Sünderhauf et al. (2015a) comprehensively investigated the functionality of CNN features for different aspects of the place recognition problem (viewpoint and condition invariance). The authors extracted the out-
Zhou et al., 2014; Lin et al., 2014). Most recently, it has also been shown that CNN models can be successfully applied to a variety of recognition tasks, without being explicitly trained for them. As an example, intermediate outputs of CNNs, treated as generic image representations, demonstrate superior results on attribute detection, fine-grained recognition, image retrieval and other tasks, as compared to hand-crafted state-of-the-art features (Sharif Razavian et al., 2014).
Inspired by these achievements, a number of studies have been conducted to investigate the performance of CNN features applied to the vPR problem (Sünderhauf et al., 2015a,b; Panphattarasap and Calway, 2016). Sünderhauf et al. (2015a) obtained features from different layers of a particular CNN model and compared them against many state-of-the-art sequence-based place recognition algorithms. They showed that a middle layer of the CNN outperforms any other layer or algorithm, even in cases of single-image matching.
To further improve the performance of CNNs, the computer vision community is making efforts not only towards building deeper and more sophisticated network architectures (Szegedy et al., 2015; He et al., 2015), but also towards gaining a better insight into how their Daedalian structure performs under different conditions and stimuli. The latter is sought by firstly visualizing feature maps and filters, and then by understanding the information that each carries (Zeiler and Fergus, 2014; Mahendran and Vedaldi, 2015; Yosinski et al., 2015). Yosinski et al. (2015) revealed that some activations in intermediate layers can correspond to specific classes. For instance, such activations can function as text, face, or shoulder detectors even though the network was not explicitly trained to learn such classes.
Motivated by these findings, in this paper we mainly aim to investigate some of the available “off-the-shelf” networks and determine which activations (if any) correspond to information that can be considered a disturbance for vPR applications. Such activation maps may have learned to describe illumination differences due to a sunny or cloudy day, the snow covering part of the environment’s surface, the color of the trees’ leaves that changes between seasons, etc. By eliminating these activation maps from the feature vector, we end up with a more robust vPR system, which is able to fairly describe the scene while omitting any overhead accumulated from seasonal changes of the environment.
Table 1. Different values of k together with the corresponding number of feature maps for exclusion and the final feature vector length.

k     # of feature maps     Feature length
10    25                    60671
20    41                    57967
30    53                    55939
40    66                    53742
50    83                    50869
60    97                    48503
70    107                   46813
80    121                   44447
90    137                   41743
activation map). For each filter, we get a distinct feature map, which means that the number of feature maps at a given Convolutional Layer is determined by the number of filters of that layer. Finally, the output volume is formed by accumulating all feature maps along the depth dimension (Fig. 1). The notion behind the output of a single Convolutional Layer is that every feature map embodies the response (activation) of a filter to a particular visual stimulus. Typically, the first layers respond to edges of particular orientations or to colors, while the deeper the layer is the more complex the visual cues are, which might correspond to parts of an object or even entire objects or backgrounds.
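To make the layer terminology concrete, the snippet below extracts the post-ReLU conv3 output volume from an off-the-shelf AlexNet. It is only an illustrative sketch: it relies on torchvision (≥ 0.13) and its ImageNet weights as a stand-in, whereas the experiments in this paper use an AlexNet trained on the Places/Hybrid data.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Stand-in model: torchvision's ImageNet-trained AlexNet
# (the paper uses an AlexNet trained on the Places/Hybrid dataset).
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

# In torchvision's layout, features[6] is conv3 and features[7] is its ReLU.
conv3_relu = torch.nn.Sequential(*list(alexnet.features.children())[:8])

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def conv3_feature_maps(image_path):
    """Return the post-ReLU conv3 volume of shape (M, N, P) = (384, 13, 13)."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return conv3_relu(x).squeeze(0).numpy()   # one N x P feature map per filter
```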
3. Proposed approach
put of several layers from the AlexNet (Krizhevsky et al., 2012) network and compared their performance over four real-life challenging datasets. They concluded that the mid-level Convolutional Layer conv3 outperforms all other layers in cases of great appearance changes, and that its features can be compressed without significantly affecting the performance. In addition, they showed that the AlexNet network trained on a scene categorization dataset (Places205 and Hybrid) (Zhou et al., 2014) performs slightly better in appearance-changing scenarios. Another method that utilizes CNN features was proposed by Arroyo et al. (2016). In their approach, instead of utilizing features from one single layer, features from several CNN layers were concatenated and treated as an individual description vector. As a result, they ended up with a vast feature vector of higher appearance and viewpoint robustness, which was subsequently compressed while notably preserving most of the achieved performance. To overcome the sensitivity to viewpoint changes of CNN features extracted from the full image, Sünderhauf et al. (2015b) suggested a region-based feature extraction scheme, indicated by object proposal algorithms. The shape of the regions was also accounted for during the comparison, demonstrating that their method greatly improves the performance on datasets exhibiting significant changes in viewpoint. Panphattarasap and Calway (2016) further improved those results by building the so-called Landmark Distribution Descriptors in order to consider the spatial location of landmark regions. The goal of the paper at hand is to investigate the baseline performance of a subset of CNN features that does not embrace activation maps reacting to disturbances from environmental changes. Therefore, our approach is not directly comparable to methods that utilize composite pipelines for building CNN features, such as the ones described by Sünderhauf et al. (2015b); Panphattarasap and Calway (2016). We believe that the technique we are proposing here is not an alternative to, or competing with, those above; rather, it can be successfully integrated into any algorithm that extracts CNN features in order to achieve better performance in target applications.
3.1. CNN architecture
Prior to explaining our methodology, we shall briefly describe the basic structure of a typical CNN. Like a regular neural network, a CNN is a sequence of layers, each of which transforms the given input data, using some differentiable function, into another representation. Unlike ordinary networks, CNNs may contain different types of layers. A simple CNN consists of Convolutional Layers, Pooling Layers and Fully-Connected Layers. One might combine and parameterize those types of layers in order to form a full CNN architecture. As the name suggests, the workhorse of CNNs is the Convolutional Layer. This layer has a set of learnable filters (or kernels) that are small in width and height, but of depth equal to that of the input volume. By sliding each filter over the input volume and performing the convolution operation at every position, we obtain a two-dimensional array called a feature map (also referred to as
3.2. The new approach

A common characteristic of previous studies has been the construction of one vector obtained by stretching the output volume of a Convolutional Layer. Then, this vector is treated as a single holistic image or patch descriptor. On the contrary, our approach differs in that we aim to understand
which of the feature maps in a particular layer are important and which of them can be omitted. In order to highlight the effect of such activation maps under extreme season changes, we select the Nordland dataset (Sünderhauf et al., 2013). This dataset is ideally suited for such a task since it exhibits almost no viewpoint change between corresponding images, thus providing a unique opportunity to target those feature maps which are susceptible to seasonal or illumination changes of the environment. We tackle the problem by comparing images of places in different seasons. More specifically, we perform a statistical analysis of filter activations when a CNN is exposed to images of different seasons. To explain in detail how we detect the filters activated by such disturbances, let us first consider two images (I1 and I2) of a season pair (e.g. winter-summer) depicting the same place. Suppose that the output of layer L of a particular CNN architecture is of dimensions M × N × P. That is, the layer has M filters and every filter produces a feature map of dimensions N × P. By passing images I1 and I2 forward through the network we obtain two feature arrays (F1, F2 ∈ R^{M×N×P}) at the selected layer L. We then compare the feature maps of each corresponding pair of filters and obtain a vector v_i, i ∈ {1, . . . , S}, of M elements containing numerical comparison values, where S is the number of training image pairs. We sort this vector to get the top k filters whose activations differ the most, and we repeat this procedure for every image pair of the current season pair in the training data. By building a histogram from all sorted vectors v_i (like the one in Fig. 2), the most frequent k feature maps are obtained (Algorithm 1, function CmprFeatMaps). Our hypothesis is as follows: “Since we deal with images of the same place, the obtained top k differing feature maps should convey the seasonal information and, thus, should be avoided if we seek to achieve a cross-season vPR system based on a CNN”. The aforementioned comparisons of activation maps are achieved by treating each of them as a gray-scale image and comparing their histograms. We used several histogram comparison metrics, namely Chi-Square, Correlation, Intersection and Bhattacharyya. We iterate the main procedure over the set of every histogram comparison method and every possible pair of the seasons. In every iteration, we get a slightly different set of feature maps to be excluded. The final set is chosen using the following rule:

R′ = ⋃_{i=1}^{c} ⋂_{j=1}^{d} R_{ij},        (1)

where R_{ij} is the set of k differing feature maps obtained for season pair i and histogram comparison method j, c is the number of season pairs and d is the number of histogram comparison methods (Algorithm 1, function DetectFeatures). Interpreting the above equation, only the feature maps identified jointly by all histogram comparison methods are used (intersection operation), and these are then combined over all tested season pairs (union operation). The reduced feature vector F′ ∈ R^{(M−|R′|)×N×P}, resulting from screening out these feature maps, is free from the information that corresponds to environmental changes, as per our hypothesis; the experimental results in the next section support this statement. Finally, it is worth noting that, although we use k as a running parameter in our experiments, after the intersection and union operations of equation (1), the number of elements in the final set of feature maps to be eliminated is not necessarily equal to k.
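The listing below is a compact sketch of this selection procedure. The function and variable names are ours (the paper refers to Algorithm 1 with CmprFeatMaps and DetectFeatures, which is not reproduced here); it assumes the feature volumes are NumPy arrays of shape (M, N, P) as defined above and uses OpenCV's histogram comparison metrics.

```python
import numpy as np
import cv2

# OpenCV histogram comparison methods; the flag tells whether a larger
# value means a larger difference between the two histograms.
METHODS = {
    cv2.HISTCMP_CHISQR: True,
    cv2.HISTCMP_BHATTACHARYYA: True,
    cv2.HISTCMP_CORREL: False,
    cv2.HISTCMP_INTERSECT: False,
}

def _hist(fmap, bins=32):
    """Histogram of one N x P feature map, treated as a gray-scale image."""
    img = cv2.normalize(fmap, None, 0, 255, cv2.NORM_MINMAX).astype(np.float32)
    return cv2.calcHist([img], [0], None, [bins], [0, 256])

def top_k_differing(F1, F2, method, larger_is_different, k=20):
    """Indices of the k feature maps whose histograms differ the most."""
    scores = np.array([cv2.compareHist(_hist(a), _hist(b), method)
                       for a, b in zip(F1, F2)])
    order = np.argsort(scores)
    return order[-k:] if larger_is_different else order[:k]

def detect_feature_maps(pairs_by_season, k=20):
    """Eq. (1): union over season pairs of the intersection over comparison methods."""
    R = set()
    for pairs in pairs_by_season:                 # one list of (F1, F2) volumes per season pair
        per_method = []
        for method, flag in METHODS.items():
            counts = np.zeros(pairs[0][0].shape[0], dtype=int)
            for F1, F2 in pairs:                  # accumulate frequencies over training pairs
                counts[top_k_differing(F1, F2, method, flag, k)] += 1
            per_method.append(set(np.argsort(counts)[-k:]))  # k most frequent maps
        R |= set.intersection(*per_method)
    return R                                      # indices of the feature maps to exclude
```

The returned indices point to the feature maps that are screened out before the remaining conv3 volume is flattened into the reduced descriptor F′.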
Fig. 3. Left: images of the same place during (top) winter and (bottom) summer. Next to them: the corresponding feature maps and histograms at index 12 of the conv3 layer. This particular feature map (second column) changes greatly between seasons; as a consequence, we consider it to convey seasonal information and do not include its output in the feature vector.

Fig. 4. Examples of all four seasons for the synthetic viewpoint variation experiment.

Fig. 5. Example images of the SYNTHIA dataset. The same place is depicted in four different seasons.

Fig. 6. Precision-recall curves for the four season pairs of the Nordland dataset: (a) winter vs. summer, (b) winter vs. spring, (c) winter vs. fall, (d) spring vs. fall. Our reduced features clearly outperform the full-length features (blue dashed lines) in all cases; the legends list the corresponding Area Under the Curve (AUC) values for the full 64896-dimensional descriptor and for the reduced descriptors at k = 40, 50, 60 and 90. The performances of the features that correspond to different values of k are relatively close; nevertheless, the feature vectors for k = 60 and k = 50 show the best performance in most cases.

4. Experiments and results

4.1. Experimental setup

The Nordland dataset consists of four video recordings, one per season, of a 729 kilometer long train ride on Norway’s northernmost railway link. During the ride, the train
passes through different scenery, which exhibits extreme variation as it includes urban scenes, valleys, fjords, mountains and coastal areas. Every video was recorded at a frame rate of 25 fps. Furthermore, all videos were synchronized using GPS data, so at any given time the corresponding frames of all videos depict the same place.
We extracted frames from the Nordland videos at a rate of 1 frame per second. Next, we removed all frames where the train was in a tunnel or stationary. Frames forming the training set were randomly selected using a uniform distribution over the image indexes. The total number of training images was 8000 (2000 for each season), which is roughly 6% of the full dataset. For the experiments, we made use of AlexNet, a CNN trained on a hybrid dataset drawn from both the Places and ImageNet databases. As mentioned before, this choice was dictated by the fact that previous studies (Sünderhauf et al., 2015a) revealed the stronger performance of this network compared to the original one trained on ImageNet only. The same study also concluded that mid-level features, especially conv3 level features, perform well under severe appearance changes. In general, the information encoded in the layers of the network gradually becomes more and more complex, i.e. the first layers describe rather primitive shapes, which are fairly generic, whilst the last ones contain semantically meaningful details, which fail to dis-
criminate places that are semantically close to each other. Thus, one may intuitively understand why the features located at the mid-layers should contain adequate information to perform robust place recognition. In light of the foregoing, we have chosen the conv3 layer for our experiments on a dataset with severe appearance changes. We should clarify here that our experiments considered the conv3 layer output after the non-linearity (ReLU) operation. In order to demonstrate the usefulness of the proposed approach for place recognition with viewpoint changes, we exploited an additional dataset with synthetic viewpoint variations. We constructed this dataset from the first half of Nordland by cropping the original images. The width of the cropped images was chosen to be 70% of the original, while the height remained the same. Starting positions of the cropped sub-images varied from 0% to 30% of the original width and were chosen randomly over the entire dataset. Example images from this synthetic dataset are presented in Fig. 4. SYNTHIA (Ros et al., 2016) is another dataset that we used in our experiments. It consists of many sequences of synthetic photo-realistic frames under different lighting conditions, weather and seasons. The purpose of this dataset is to provide operational conditions for semantic segmentation and scene understanding in the context of driving scenarios. For our work, we exploited a sequence featuring an old European town, called
Fig. 8. Results of the synthetic viewpoint variation experiment: (a) winter vs. summer and winter vs. spring pairs, (b) winter vs. fall and spring vs. fall pairs. Dashed lines correspond to the full-length features, whereas solid lines of the same color correspond to the features with eliminated feature maps.
SEQS-04. The sequence is divided into different sub-sequences, where each sub-sequence consists of images of the same path but under different seasonal conditions. After synchronization of the sub-sequences and the exclusion of frames where the vehicle remained still (waiting at a traffic light or allowing pedestrians to cross the road), the final sequence consisted of 2368 images, with 592 in each sub-sequence. Sample frames are presented in Fig. 5.
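A minimal sketch of the synthetic viewpoint variation described above (a crop of 70% of the original width at a random horizontal offset between 0% and 30%); the function name and the use of a seeded NumPy generator are our own choices.

```python
import numpy as np

def random_viewpoint_crop(image, crop_ratio=0.7, rng=np.random.default_rng(0)):
    """Keep the full height, crop 70% of the width at a random horizontal offset."""
    h, w = image.shape[:2]
    crop_w = int(round(crop_ratio * w))
    x0 = int(rng.integers(0, w - crop_w + 1))   # offset varies from 0% to 30% of the width
    return image[:, x0:x0 + crop_w]
```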
‘winter-summer’ season pair (Fig. 6a). This result is intuitively expected since the biggest changes in appearance appear exactly in this season pair. In addition, the reduced features noticeably outperform the full features on the pairs ‘winter-spring’ and ‘winter-fall’ (Fig. 6b and Fig. 6c, respectively). In Fig. 6d, which represents the ‘spring-fall’ pair, only a slight improvement can be observed when using the reduced features. In Fig. 7 we show a close-up view of the graphs related to the pairs ‘summer-spring’ (Fig. 7a) and ‘summer-fall’ (Fig. 7b), since the shapes of the precision-recall curves are almost identical at full scale and one can hardly see any difference in the overlapping lines. In fact, spring, summer and fall images exhibit small appearance changes and, thus, this behavior is to be expected. Nevertheless, we are still able to reduce the features’ dimensionality (by about 25%) in those cases as well, without losing any of the achieved performance. These are notable results, since in previous studies dimensionality reduction (e.g. binary locality-sensitive hashing, random elimination of features’ dimensions (Arroyo et al., 2016)) usually led to some loss in terms of performance. This essentially means that we successfully identified and eliminated mainly the redundant seasonal information from the feature vectors, achieving a more robust vPR system. As can be seen in many of the season pair experiments, the features of length 48503 (corresponding to k = 60) showed the best performance. Overall, the features of length 50869 (k = 50) also demonstrate very good and almost identical results, slightly outperforming the features of k = 60 in two of the evaluated cases (‘summer vs. spring’ and ‘spring vs. fall’). To better discriminate the presented precision-recall results, we additionally report the Area Under the Curve (AUC) metrics in the legends of the corresponding figures. From these AUC val-
Fig. 7. Zoomed-in view of the precision-recall curves for the (a) summer-spring and (b) summer-fall season pairs. Note that precision values range from 0.8 to 1 and recall values range from 0.5 to 1.
4.2. Performance measures
With our core objective being the baseline performance augmentation of the vPR problem using disturbance-free CNN features, we chose a simple single-image nearest neighbor search approach. Thus, in the conducted experiments no sequence-based matching or any other technique for increasing the similarity scores was applied. The comparison of two feature vectors was performed by means of the cosine distance metric. Lastly, in order to analyze the performance of the proposed technique, we made use of precision-recall curves and the areas under them.
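A sketch of this single-image evaluation protocol is given below: nearest-neighbor matching with the cosine metric over L2-normalized descriptors, and a precision-recall curve obtained by sweeping a similarity threshold. The thresholding details are our assumption, as the paper does not spell them out.

```python
import numpy as np

def match_single_image(query_descs, db_descs):
    """Nearest-neighbor search with the cosine similarity (1 - cosine distance)."""
    q = query_descs / np.linalg.norm(query_descs, axis=1, keepdims=True)
    d = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sims = q @ d.T                          # pairwise cosine similarities
    return sims.argmax(axis=1), sims.max(axis=1)

def precision_recall(matches, scores, ground_truth, thresholds):
    """Sweep a similarity threshold; a match is correct if it hits the ground-truth frame."""
    correct = matches == ground_truth
    curve = []
    for t in thresholds:
        accepted = scores >= t
        tp = np.sum(correct & accepted)
        precision = tp / max(np.sum(accepted), 1)
        recall = tp / len(ground_truth)
        curve.append((precision, recall))
    return curve
```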
4.3. Results

This section describes the results obtained by comparing the full-length features with the reduced ones. During the training phase, we varied the parameter k, producing corresponding sets of feature maps for exclusion from the feature vector. Table 1 presents the most representative values of k along with the corresponding numbers of feature maps to be excluded, as well as the lengths of the final reduced description vectors. As can be seen in Fig. 6, removing the feature maps that represent seasonal disturbances has the biggest impact on the
not close to those obtained from the Nordland dataset. However, considering that the closer the results are to the ideal, the more difficult it becomes to improve them, our proposed method is still able to unambiguously increase the baseline performance of the CNN features. The above experiments confirm that the proposed reduced feature vectors provide more tolerance to the possible seasonal changes that a dynamic environment may induce. Those feature vectors can be efficiently integrated into the majority of available place recognition systems, e.g. (Milford and Wyeth, 2012; Naseer et al., 2017; Vysotska and Stachniss, 2016), increasing their performance.
Fig. 9. Results of the experiments conducted on the SYNTHIA dataset: (a) winter vs. summer (AUC 0.99248 for the full 64896-dimensional features against 0.99481 for k = 60) and (b) winter vs. spring (AUC 0.99051 against 0.99217, respectively).
ues, we can see that the performance for the ‘winter-summer’ season pair, which benefits the most from reducing the features, increased by more than 14%. Figure 8 presents the results of the experiment on the Nordland dataset with synthetic viewpoint variations. The best performing value k = 60 from our previous evaluation was tested against the 4 most informative season pairings (viz. ‘winter vs. summer’, ‘winter vs. spring’, ‘winter vs. fall’, and ‘spring vs. fall’). The precision-recall curves in the figure clearly show that the reduced features lead to a noticeable performance gain in all the considered cases. In Fig. 8a, the results for the ‘winter vs. summer’ and ‘winter vs. spring’ pairs are shown, revealing a performance increment of 4.5% and 7% in terms of the measured AUC, respectively. For the ‘winter vs. fall’ and ‘spring vs. fall’ pairs (Fig. 8b), the AUC metrics were increased by 5% and 3%, respectively. It is noteworthy that the ‘spring vs. fall’ and ‘winter vs. spring’ pairs produced approximately the same relative results as in the experiment with no viewpoint variations. The experiments conducted on the SYNTHIA dataset show the same trends in terms of the best performing k value and the season pairings that are affected the most by the reduced features. Figure 9 compares the precision-recall curves of the full and the reduced (k = 60) features for the ‘winter vs. summer’ (a) and ‘winter vs. spring’ (b) pairs. In the case of the SYNTHIA dataset, these two season pairs showed a slight but clear performance increment of the reduced over the full-length features. For the rest of the season pairs, the performance gain was insignificant but still positive. From Fig. 9 one can see that even the baseline performance of the CNN features can be considered especially high (presenting an AUC value of more than 0.99). Thus, the relative improvement obtained on the SYNTHIA dataset is, of course,
5. Conclusions and future work
In this paper a novel vPR approach was proposed, capable of increasing the baseline performance of feature descriptors extracted from a mid-level CNN layer. We evaluated the features by considering individual feature maps as the smallest indivisible units of analysis, and we were able to identify maps that encode season-specific information. The presented experimental results show that by eliminating these feature maps, a more robust vPR system can be achieved. In the best case scenario, we were able to improve the vPR performance, measured in terms of AUC, by 14%. Additionally, by removing these maps, we reduced the dimensionality of the final features by a substantial amount (about 25%), which is a useful secondary effect of our approach. It is also worth noting that even in the cases of small seasonal disturbances (like the comparisons between summer and autumn) we were able to reduce the features’ dimensionality by 25% to 30% while still preserving the same accuracy levels. One can experiment with increasing the value of the k factor to further decrease the dimensionality of the feature vectors, even at the cost of some of the system’s performance, for the purpose of addressing real-time, large-scale place recognition tasks. The authors’ guidelines for future work include the extensive evaluation of the presented system on more challenging datasets that present, in addition to seasonal changes, day-night and weather condition changes. We are also planning to experiment with other CNN architectures, in order to identify whether the same pattern can be observed, as well as whether it can be generalized. Another possible direction would be the evaluation of the proposed procedure on feature maps from several CNN layers. The concatenation of the reduced layers into a single description vector can lead to improved results in cases where both appearance and viewpoint variations are present.
Acknowledgments This work was supported by the Institute of Information and Computational Technologies (Almaty, Kazakhstan) and the Department of Production and Management Engineering at DUTH (Xanthi, Greece).
References
Angeli, A., Filliat, D., Doncieux, S., Meyer, J.A., 2008. Fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics 24, 1027–1037.
Arroyo, R., Alcantarilla, P.F., Bergasa, L.M., Romera, E., 2015. Towards lifelong visual localization using an efficient matching of binary sequences from images, in: Proc. IEEE International Conference on Robotics and Automation, pp. 6328–6335.
Arroyo, R., Alcantarilla, P.F., Bergasa, L.M., Romera, E., 2016. Fusion and binarization of CNN features for robust topological localization across seasons, in: Proc. IEEE International Conference on Intelligent Robots and Systems, pp. 4656–4663.
Bampis, L., Amanatiadis, A., Gasteratos, A., 2016. Encoding the Description of Image Sequences: A Two-Layered Pipeline for Loop Closure Detection, in: Proc. IEEE International Conference on Intelligent Robots and Systems.
Bay, H., Tuytelaars, T., Van Gool, L., 2006. SURF: Speeded Up Robust Features, in: Proc. European Conference on Computer Vision, pp. 404–417.
Calonder, M., Lepetit, V., Strecha, C., Fua, P., 2010. BRIEF: Binary Robust Independent Elementary Features, in: Proc. European Conference on Computer Vision, pp. 778–792.
Cummins, M., Newman, P., 2008. FAB-MAP: Probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research 27, 647–665.
Gálvez-López, D., Tardós, J.D., 2012. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics 28, 1188–1197.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
Kostavelis, I., Gasteratos, A., 2013. Learning spatially semantic representations for cognitive robot navigation. Robotics and Autonomous Systems 61, 1460–1475.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet Classification with Deep Convolutional Neural Networks, in: Proc. Advances in Neural Information Processing Systems, pp. 1097–1105.
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common Objects in Context, in: Proc. European Conference on Computer Vision, pp. 740–755.
Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110.
MacTavish, K., Barfoot, T.D., 2014. Towards hierarchical place recognition for long-term autonomy, in: ICRA Workshop on Visual Place Recognition in Changing Environments, pp. 1–6.
Mahendran, A., Vedaldi, A., 2015. Understanding deep image representations by inverting them, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 5188–5196.
Milford, M.J., Wyeth, G.F., 2012. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights, in: Proc. IEEE International Conference on Robotics and Automation, pp. 1643–1649.
Mur-Artal, R., Tardós, J.D., 2014. Fast relocalisation and loop closing in keyframe-based SLAM, in: Proc. IEEE International Conference on Robotics and Automation, pp. 846–853.
Naseer, T., Suger, B., Ruhnke, M., Burgard, W., 2017. Vision-based Markov localization for long-term autonomy. Robotics and Autonomous Systems 89, 147–157.
Panphattarasap, P., Calway, A., 2016. Visual place recognition using landmark distribution descriptors. CoRR abs/1608.04274.
Ros, G., Sellart, L., Materzynska, J., Vazquez, D., Lopez, A.M., 2016.
The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 3234–3243. Rublee, E., Rabaud, V., Konolige, K., Bradski, G., 2011. ORB: an efficient alternative to SIFT or SURF, in: Proc. IEEE International Conference on Computer Vision, pp. 2564–2571. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 211–252. Schindler, G., Brown, M., Szeliski, R., 2007. City-scale location recognition, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7. Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S., 2014. CNN
Features Off-the-Shelf: An Astounding Baseline for Recognition, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops.
Sivic, J., Zisserman, A., et al., 2003. Video Google: A text retrieval approach to object matching in videos, in: Proc. IEEE International Conference on Computer Vision, pp. 1470–1477.
Sünderhauf, N., Neubert, P., Protzel, P., 2013. Are we there yet? Challenging SeqSLAM on a 3000 km journey across all four seasons, in: Proc. of Workshop on Long-Term Autonomy, IEEE International Conference on Robotics and Automation.
Sünderhauf, N., Shirazi, S., Dayoub, F., Upcroft, B., Milford, M., 2015a. On the performance of ConvNet features for place recognition, in: Proc. International Conference on Intelligent Robots and Systems, pp. 4297–4304.
Sünderhauf, N., Shirazi, S., Jacobson, A., Dayoub, F., Pepperell, E., Upcroft, B., Milford, M., 2015b. Place recognition with ConvNet landmarks: Viewpoint-robust, condition-robust, training-free, in: Proc. Robotics: Science and Systems.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
Valgren, C., Lilienthal, A.J., 2010. SIFT, SURF & seasons: Appearance-based long-term localization in outdoor environments. Robotics and Autonomous Systems 58, 149–156.
Vysotska, O., Stachniss, C., 2016. Lazy Data Association For Image Sequences Matching Under Substantial Appearance Changes. IEEE Robotics and Automation Letters 1, 213–220.
Yang, X., Cheng, K.T.T., 2014. Local difference binary for ultrafast and distinctive feature description. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 188–194.
Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., Lipson, H., 2015. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579.
Zeiler, M.D., Fergus, R., 2014. Visualizing and understanding convolutional networks, in: Proc. European Conference on Computer Vision, pp. 818–833.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A., 2014. Learning deep features for scene recognition using places database, in: Proc. Advances in Neural Information Processing Systems, pp. 487–495.