Robust Visual Place Recognition Based on Context Information


Available online at www.sciencedirect.com

ScienceDirect IFAC PapersOnLine 52-22 (2019) 49–54


Deyun Dai*, Zonghai Chen**, Jikai Wang***, Peng Bao****, Hao Zhao*****

Department of Automation, University of Science and Technology of China, Hefei, PR China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected])

Abstract: In large-scale and long-term visual SLAM, robust place recognition is essential for building a globally consistent map. However, changes in sensor viewpoint and environmental conditions, including lighting, weather, and seasons, bring a huge challenge to place recognition. We propose a place recognition algorithm based on CNN features and a graph model. Firstly, CNN features of images are extracted through an AlexNet network with migration (transfer) characteristics, and the N-nearest neighbor image descriptors of the current image descriptor are found by approximate nearest neighbor searching. Then, according to the difference between descriptors, a weighted directed acyclic graph (weighted DAG) model which describes the cost of context matching between images is established. Finally, a candidate matching sequence with minimum cost on this model is obtained using the Dijkstra algorithm. Compared with SeqCNNSLAM and Fast-SeqSLAM, the experimental results demonstrate the higher recognition accuracy and robustness of our algorithm.

Copyright © 2019. The Authors. Published by Elsevier Ltd. All rights reserved.

Keywords: place recognition; CNN features; weighted DAG; Dijkstra algorithm

1. INTRODUCTION

With the development of computer technology and artificial intelligence technology, intelligent mobile robots have become an important research direction. When a robot explores an unknown environment, the task of incrementally building a map and locating the robot using sensor observations is commonly referred to as simultaneous localization and mapping (SLAM) (Sünderhauf and Protzel, 2012). In a SLAM system, the robot needs to identify visited positions to increase the accuracy of the map, which requires recognizing scenes that have been reached before.

Fabmap (Cummins and Newman, 2008) is a representative place recognition algorithm which combines the local feature descriptor SURF and the BoW model. Since context information is not applied, this method is not robust enough to changes of environmental conditions. Fabmap2.0 (Cummins and Newman, 2011) combines Fabmap and a Bayesian filter to achieve place recognition on tracks over 1000 km. However, due to changes of illumination, weather, and seasons, the appearance of scenes changes apparently, and it is difficult for place recognition methods to correctly match two images obtained at the same scene.

Instead of using a single frame to implement place recognition, Milford and Wyeth (Milford and Wyeth, 2012) propose the SeqSLAM algorithm, which aims to obtain the optimal matches between two image sequences. Though this method shows satisfying performance under extreme appearance changes such as illumination and season, the time complexity of searching increases exponentially with the number of images. To tackle this problem, Siam and Zhang (Siam and Zhang, 2017) propose the Fast-SeqSLAM algorithm based on SeqSLAM for large-scale scenarios. They extract global HOG descriptors for each image and save these descriptors in a tree structure. Then, for the current image, the N most similar images are computed through an approximate nearest neighbour (ANN) algorithm. However, the state transition is not considered, and the greedy strategy may generate mismatches, which degrades the accuracy of this method.

In this paper, an enhanced version of Fast-SeqSLAM is proposed: CNN features are applied to replace HOG features, and a modified Dijkstra method is presented to compute the best matches between two image sequences, in which the sequence matching problem is regarded as a minimum cost optimization problem. Furthermore, we conduct comparative experiments on several public datasets to verify the effectiveness of our method.

2. RELATED WORK

Loop closure detection based on the bag-of-words (BoW) model (Sivic and Zisserman, 2003) expresses an image as text composed of visual words, and applies text detection technology to detect loops. The introduction of this technology has aroused widespread attention among researchers (Milford et al., 2014, Oh et al., 2015, Kejriwal et al., 2016). However, the BoW-based approaches increase the burden of search matching as the scale increases. Zhang et al. (Zhang et al., 2016) propose an online visual word generation method, in which only the features that are robust to viewpoint changes are stored as words.

2405-8963 Copyright © 2019. The Authors. Published by Elsevier Ltd. All rights reserved.
Peer review under responsibility of International Federation of Automatic Control.
10.1016/j.ifacol.2019.11.046


In order to reduce temporal complexity, Liu and Zhang (Liu and Zhang, 2013) introduce a particle filtering technique to estimate the candidates of the sequence matches. They further evaluate the probability of these candidates and discard the candidates with small probability. However, the particle filter is sensitive to system parameters. Bampis et al. (Bampis et al., 2018) constitute sequence-visual-word vectors from each sequence and discard scenes with considerable visual deviation. They further introduce a temporal consistency filter to increase the number of correct matches. Methods based on traditional handcrafted features are sensitive to various changes in the environment and are not robust enough. In recent years, CNN-based image processing methods have become a research hotspot and have been widely used in place recognition (Sünderhauf et al., 2015, Kumar et al., 2017, Yin et al., 2019). Bai et al. (Dongdong et al., 2018) propose a method named SeqCNNSLAM that is based on CNN features. They reduce the number of matchings by applying spatial relationships and make the method more practical by adjusting parameters in an online style. Our approach also exploits CNN features but uses an improved Dijkstra search to find matches, which reduces the number of missed correct matches.

3. THE FRAMEWORK OF THE ALGORITHM

Fig. 1. The framework of the algorithm.

Let $X = (X_1, \ldots, X_n)$ be the set of images obtained in a trajectory and $C = (C_1, \ldots, C_m)$ refer to a query image sequence. Our purpose is to calculate the sequence in the map with the highest similarity to the query image sequence.

4. PLACE RECOGNITION ALGORITHM

4.1 Feature descriptors

Place recognition is to calculate the best matching of a query image sequence against the map database. Essentially, it is a global matching cost minimization problem, in which the sequence with the highest similarity to the query sequence is calculated. To solve the problem, we propose an approach based on Fast-SeqSLAM, where an AlexNet network is used to extract CNN descriptors for describing and matching scenes, so that appearance and perspective changes can be handled simultaneously. To further improve the accuracy of the obtained matches, context information is used to add constraints to image matching, and the difference matrix is converted into a weighted directed acyclic graph model in which the state transition information is also encoded. A route planning method is employed to find a minimum cost sequence from the beginning node to the ending node on the graph. Finally, the matching pairs are further optimized according to context information. The overall framework of the algorithm is shown in Fig. 1.

We extract CNN features using an AlexNet network (Krizhevsky et al., 2012) that has been pre-trained on the ImageNet database. The network is implemented with the open source software Caffe. The dimensions of the outputs of each layer are shown in Table 1.

4.2 Place recognition based on graph

4.2.1 Similarity model

The purpose in a place recognition task is to determine, based on the difference matrix, the best matching sequence from the map database that matches the query images. We solve the best matching problem by establishing a cost model according to the difference matrix $D$ and calculating the minimum cost sequence. The model is a weighted directed acyclic graph $G = (V, w, E)$, where $V$ represents nodes, $w$ represents weights of nodes, and $E$ represents edges.

Table 1. The dimension of the output vector for each layer of AlexNet

Layer     Dimension
(convolutional part)
CONV1     290400
POOL1     69984
CONV2     186624
POOL2     43264
CONV3     64896
CONV4     64896
CONV5     43264
POOL5     9216
(fully connected part)
FC6       4096
FC7       4096
FC8       1000
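Concretely, the descriptor comparison behind the difference matrix $D$ can be sketched as follows. This is only an illustration: the paper does not specify the exact distance measure, so cosine distance on L2-normalized descriptors is an assumption here, and the brute-force `argsort` stands in for the ANN tree search; the names `difference_matrix` and `n_nearest` are ours.

```python
import numpy as np

def difference_matrix(query_desc, map_desc):
    """Pairwise difference d_{i,j} between CNN descriptors.

    Cosine distance on L2-normalized vectors is an assumption;
    the paper only speaks of a 'difference' between descriptors."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    m = map_desc / np.linalg.norm(map_desc, axis=1, keepdims=True)
    return 1.0 - q @ m.T  # shape (num_query, num_map)

def n_nearest(D, i, N):
    """Indices of the N map images most similar to query frame i.

    Brute force here; Fast-SeqSLAM uses an ANN tree instead."""
    return np.argsort(D[i])[:N]
```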




In the model, the beginning node is defined as the pair of indices corresponding to the first match of the loop. The ending node is defined as the pair of indices of the last match of the loop. The weights indicate the similarity between the matched images in one node, and the edges indicate the spatial connectivity between nodes. The graph model is shown in Fig. 2.

1) Weighted nodes $V$: the set of model nodes, which contains three kinds of nodes: the beginning node $v^s$, the ending node $v^e$, and process nodes $v_{i,j}$. We define the node $v^s$ as the image pair with the least difference among the first $\tau_1$ frames:

$v^s = (i_s, j_s) = \arg\min_{i,j} d_{i,j}, \quad i \in (1, \tau_1),\ j \in (1, \tau_1)$.  (1)

The node $v^e$ denotes the pair with the least difference among the last $\tau_2$ images:

$v^e = (i_e, j_e) = \arg\min_{i,j} d_{i,j}, \quad i \in (m-\tau_2, m),\ j \in (n-\tau_2, n)$.  (2)

The node $v_{i,j} = (i, j)$ represents the matching pair between the $i$-th frame of the query images and the $j$-th image in the map database, whose weight is the difference:

$w_{(i,j)} = d_{i,j}$.  (3)
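A minimal sketch of Eqs. (1)-(2), assuming the difference matrix is available as a NumPy array and that $\tau_1$ and $\tau_2$ index corner sub-blocks of $D$; the function name `endpoint_nodes` is ours:

```python
import numpy as np

def endpoint_nodes(D, tau1, tau2):
    """Pick beginning/ending nodes per Eqs. (1)-(2):
    v_s = argmin d_{i,j} over the first tau1 rows/cols,
    v_e = argmin d_{i,j} over the last tau2 rows/cols."""
    m, n = D.shape
    head = D[:tau1, :tau1]
    i_s, j_s = np.unravel_index(np.argmin(head), head.shape)
    tail = D[m - tau2:, n - tau2:]
    i_e, j_e = np.unravel_index(np.argmin(tail), tail.shape)
    return (i_s, j_s), (m - tau2 + i_e, n - tau2 + j_e)
```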

2) Edges $E$: traversing all possible nodes between the beginning node and the ending node, as shown in Fig. 2. The edge from node $v_{i,j}$ to $v_{i',j'}$ is defined as $E_{i,j}^{i',j'} = \{v_{i,j}, v_{i',j'}\}$. The edges of the model starting from node $v = (i, j)$ are

$E_{(i,j)} = \{E_{i,j}^{i,\,j+j_1} \mid j_1 = 1, \ldots, k\} \cup \{E_{i,j}^{i+1,\,j+j_2} \mid j_2 = 0, \ldots, k\}$,  (4)

where $k$ denotes the traversing range after node $v = (i, j)$, indicating the state transition between images. So all edges in the model can be represented as

$E = \bigcup_{V - v^e - v^s} E_{(i,j)}$.  (5)
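Eq. (4) can be sketched as a small successor-enumeration routine. The bounds checks against the sequence lengths `m` and `n` are our assumption, since the paper does not discuss border handling:

```python
def edges_from(i, j, k, m, n):
    """Successor edges of node (i, j) per Eq. (4): stay in query row i
    and advance the map index by 1..k, or advance to row i+1 with map
    offset 0..k. m and n bound the query/map indices (our assumption)."""
    succ = []
    for j1 in range(1, k + 1):          # same query frame, later map frames
        if j + j1 < n:
            succ.append((i, j + j1))
    if i + 1 < m:
        for j2 in range(0, k + 1):      # next query frame
            if j + j2 < n:
                succ.append((i + 1, j + j2))
    return succ
```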

4.2.2 Dijkstra-based image sequence matching

The place recognition problem is transformed into obtaining the minimum cost sequence $Q_{min}$ on the similarity graph model:

$Q_{min} = \{v^s, v_{i_1,j_1}, \ldots, v_{i_q,j_q}, v^e\}^T = \arg\min \sum_{s}^{e} w_{(i,j)}$,  (6)

where $q = |Q_{min}| - 2$ and $|Q_{min}|$ represents the number of nodes in the sequence. This means that we need to find all nodes from the beginning node to the ending node in order to find the most similar image pairs. Therefore, considering local and global cost, we search the minimum cost sequence $Q_{min}$ from the beginning to the ending node using the Dijkstra algorithm. Then, the sequence is optimized by combining temporal construction, since it may contain matching pairs between one query image and many images in the map database. Finally, we obtain the best matching sequence $M = \{M_1, \ldots, M_T, \ldots, M_m\}^T$ by combining global optimality with local optimality. The algorithm is as follows.

1. Searching the minimum cost sequence $Q_{min}$ on the similarity model by Dijkstra. We adjust the searching range according to the speed of the camera corresponding to the query sequence.

1) Initialization: we build two sets: the nodes that have been traversed are in a set $U_1$, $U_1 = \{v^s\}$, and those that have not been traversed constitute a set $U_2$, $U_2 = V - v^s$. At the same time, we also build a cost array $\hat{D}$ whose length equals the number of all nodes, that is $|\hat{D}| = |V|$. The value $\hat{D}_{(i,j)}$ of the array is the cost between the beginning node and node $v_{i,j}$, which is initially set to infinity.

2) Searching the minimum cost node $v' = (i', j')$ after node $v_{i,j}$ as follows:

v  arg min( Dˆ (i , j )  w(i, j ) ) v  {vi , j | Eii,,jj  E(i , j ) } {viVel , jVel }

Fig. 2. The cost model

,

(7)

where $(i+Vel, j+Vel)$ is a predicted node following the node $(i, j)$, and $Vel$ is the speed of the camera. In order to make full use of the context cost and increase the probability of matching pairs, we add the node $v_{i+Vel,\,j+Vel}$ into the searchable node set of the node $v_{i,j}$ in addition to the nodes $\{v_{i',j'} \mid E_{i,j}^{i',j'} \in E_{(i,j)}\}$.

3) Updating the sets and the cost array as follows:

$U_1 = U_1 + v', \quad U_2 = U_2 - v', \quad \hat{D}_{(i',j')} = \hat{D}_{(i,j)} + w_{(i',j')}$.  (8)
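Steps 1)-3) can be sketched as a standard heap-based Dijkstra over the weighted DAG. This is a simplified illustration rather than the authors' implementation: `vel` is assumed to be a positive integer step, node weights $w_{(i,j)} = d_{i,j}$ are accumulated along the path, and the predicted node of Eq. (7) is simply appended to the successor set.

```python
import heapq

def dijkstra_path(D, v_s, v_e, k=2, vel=1):
    """Minimum-cost node sequence from v_s to v_e on the weighted DAG.

    D is the difference matrix (node weight per Eq. 3); successors
    follow Eq. (4) plus the predicted node (i+vel, j+vel) of Eq. (7).
    vel is assumed to be a positive integer step."""
    m, n = len(D), len(D[0])

    def successors(i, j):
        succ = [(i, j + j1) for j1 in range(1, k + 1) if j + j1 < n]
        if i + 1 < m:
            succ += [(i + 1, j + j2) for j2 in range(0, k + 1) if j + j2 < n]
        if i + vel < m and j + vel < n:
            succ.append((i + vel, j + vel))   # context/velocity prediction
        return succ

    dist = {v_s: D[v_s[0]][v_s[1]]}
    prev = {}
    heap = [(dist[v_s], v_s)]
    while heap:
        c, v = heapq.heappop(heap)
        if v == v_e:
            break
        if c > dist.get(v, float("inf")):
            continue                          # stale heap entry
        for u in successors(*v):
            nc = c + D[u[0]][u[1]]            # accumulate node weights
            if nc < dist.get(u, float("inf")):
                dist[u], prev[u] = nc, v
                heapq.heappush(heap, (nc, u))
    # reconstruct Q_min from the predecessor map
    path, v = [v_e], v_e
    while v != v_s:
        v = prev[v]
        path.append(v)
    return path[::-1]
```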

2. Optimizing the minimum cost sequence $Q_{min}$. We remove those matching pairs in which one query image is matched with many images in the map database. Firstly, we search the candidate matching pair sets $Q'$ from $Q_{min}$. We denote $Q'_t$ as the candidate set of the $t$-th frame query image, which contains all possible matching nodes of the $t$-th node. Following the score presented in Fast-SeqSLAM, which represents the sum of differences over the nodes of the trajectory with velocity $Vel$, we define the score of $Q'_t$ as

$Score_{Q'_t} = \sum_{(i,j) \in Q'_t} Score_{(i,j)}$,  (9)

where $Score_{(i,j)}$ represents the score between the $i$-th frame query image and the $j$-th frame image of the map database. The optimal matching pair and score are defined as

$M_t = \arg\min_{(i,j) \in Q'} Score_{Q'_t}, \quad Score_t = \min Score_{Q'_t}$.  (10)
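The disambiguation of Eqs. (9)-(10) amounts to keeping, for each query frame, the candidate node with the lowest trajectory score. A sketch under the assumption that the per-node scores are precomputed (the name `best_match_per_frame` is ours):

```python
def best_match_per_frame(Q_min, score):
    """Resolve query frames matched to several map images (Eqs. 9-10).

    Q_min: list of (i, j) nodes; score: dict mapping (i, j) to the
    Fast-SeqSLAM-style trajectory score (assumed precomputed).
    Keeps, per query frame i, the node with the minimum score."""
    best = {}
    for (i, j) in Q_min:
        node = (i, j)
        if i not in best or score[node] < score[best[i]]:
            best[i] = node
    return best
```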

Then, we further constrain the searching results by a score threshold $\theta$ as follows:

$M_t = \begin{cases} M_t, & Score_t \le \theta \\ (t, Nan), & \text{others} \end{cases}$  (11)

where $(t, Nan)$ indicates that there is no image matching the $t$-th frame query image. This decreases the number of false matching pairs and thus increases accuracy.

5. EXPERIMENT ANALYSIS

5.1 Datasets

To verify the performance of the proposed place recognition algorithm thoroughly, three public datasets are considered: the Nordland dataset, the UA dataset, and the Garden Points Walking dataset. The Nordland dataset contains four videos recorded by a train on the same trajectory in four different seasons. We capture image sequences at 1 fps from the videos, and remove the parking, tunnelling, etc. The image sequences of spring and winter are selected for experiments. The UA dataset is made up of two image sequences, one during day and the other at night, which are collected by a Husky robot using an RGB camera at the University of Alberta, Edmonton, Canada. The Garden Points Walking dataset consists of three different sequences with the same trajectory, where two are in daylight with different views and the other at night.

The accuracy of place recognition is measured by computing precision and recall values. The numbers of correct matches and false matches among all detected results are recorded as $TP$ (true positives) and $FP$ (false positives), respectively, and the number of correct matches among the undetected results is recorded as $FN$ (false negatives). Let $P$ denote the precision and $R$ the recall; they are computed as

$$P = \frac{TP}{TP+FP}, \qquad R = \frac{TP}{TP+FN}. \tag{12}$$
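Equation (12) is straightforward to compute; a minimal sketch:

```python
def precision_recall(TP, FP, FN):
    """Precision and recall as in Eq. (12)."""
    P = TP / (TP + FP)
    R = TP / (TP + FN)
    return P, R

# hypothetical counts: 80 correct detections, 20 false, 20 missed
P, R = precision_recall(TP=80, FP=20, FN=20)   # P = 0.8, R = 0.8
```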

Since accuracy affects the quality of mapping, we also define a further criterion, the recall at 100% precision, denoted $RP100$.

5.3 Experimental results and analysis

We extract CNN features from several convolutional layers to evaluate the robustness of the different convolutional layers of AlexNet to scene changes. We also compare the results of our proposed algorithm with Fast-SeqSLAM on the three datasets. We set the length of the temporal construction to $ds = 20$, the velocity of the trajectories to $0.8 \le Vel \le 1.2$, the two searching parameters both to 5, and $k = 2$. The score threshold is defined as $\tau = 0.9 \cdot ds \cdot \max D$. When comparing precision-recall curves, a matched image is counted as correct if the frame difference between the images is below a tolerance: 2 frames for the Nordland dataset and 4 frames for the UA and Garden Points Walking datasets.

To evaluate the effectiveness of the proposed method, the place recognition performance of CNN feature descriptors from different layers is compared using precision-recall curves. Furthermore, we also compare the performance with SeqCNNSLAM and with Fast-SeqSLAM using HOG features to evaluate the modified Dijkstra method. Considering the different numbers of images in the datasets, different sampling intervals are set when comparing curves: 10, 50, and 100 for the Garden Points Walking, UA, and Nordland datasets, respectively. Figs. 3-5 illustrate the comparison results on the three datasets. Finally, we also compare the $N$-$RP100$ curves of our method with those of Fast-SeqSLAM to evaluate the impact of the ANN algorithm on the matching results, as shown in Fig. 6.

The experimental results on the Nordland dataset, which exhibits appearance changes caused by different seasons, are shown in Fig. 3. Compared with Fast-SeqSLAM, we improve precision and recall with HOG descriptors: $RP100$ is increased from 3% to 12%. The results also show that the recognition precision with features of the conv3 layer is the highest, which indicates that these features are robust to appearance changes. Fig. 4 illustrates the results on the UA dataset with illumination changes during a day. We improve $RP100$ from




21% to 29%. From the figure, it can be seen that the performance with features of pool2 is the best, which indicates that pool2 features are robust to daylight changes. To illustrate the generalization ability of our proposed method, we conduct experiments on the Garden Points Walking dataset, which exhibits both condition and viewpoint changes, as shown in Fig. 5(a) and Fig. 5(b). The comparison results demonstrate that our proposed method combined with pool2 features clearly improves recognition accuracy when lighting and viewpoint change simultaneously.
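The $RP100$ values reported above can be read off a precision-recall curve; a minimal sketch, assuming the curve is given as a list of (precision, recall) pairs (an assumed input format, not from the paper):

```python
def rp100(curve):
    """Recall at 100% precision (RP100): the largest recall among
    operating points whose precision is 1.0; returns 0.0 if the
    curve never reaches full precision.  `curve` is a list of
    (precision, recall) pairs -- an assumed input format."""
    recalls = [r for p, r in curve if p >= 1.0]
    return max(recalls, default=0.0)

# hypothetical curve points
print(rp100([(1.0, 0.03), (1.0, 0.12), (0.98, 0.30)]))  # -> 0.12
```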

Fig. 3. Nordland dataset and precision-recall curve.

Fig. 5. Garden Points Walking dataset and precision-recall curve: (a) different time of a day; (b) simultaneous condition and viewpoint changes.

Fig. 4. UA dataset and precision-recall curve.

Fig. 6. $N$-$RP100$ relation curve.


From Figs. 3-5, the comparison results show that our method with CNN feature descriptors from certain layers outperforms SeqCNNSLAM, which indicates that our proposed method of finding matching pairs is effective. In addition, the $N$ nearest neighbours of the current image descriptor are retrieved by the ANN algorithm, so the recognition accuracy is influenced by $N$: $RP100$ varies with the value of $N$. We conduct experiments with different $N$ and plot the $N$-$RP100$ curves, which show that the recall at 100% precision first increases and then decreases as $N$ grows, as shown in Fig. 6. In particular, the accuracy is highest when $N$ is set to 5.

6. CONCLUSION

The experimental results suggest that, compared with the Fast-SeqSLAM algorithm, our proposed method improves robustness to the changes caused by seasons, daylight, and viewpoints. To decrease the spatial searching scale, a weighted directed acyclic graph model is established. Considering context information, a minimum cost sequence is calculated and optimized by an improved searching strategy combined with temporal continuity. Although the cost of image-sequence matching increases, our recognition accuracy is superior to that of Fast-SeqSLAM. We will further improve the efficiency of the proposed method in future work.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 91848111), the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 61703387), and the Science Foundation for Youths of Anhui Province (Grant No. 1708085QF159).

REFERENCES

Bampis, L., Amanatiadis, A. and Gasteratos, A. (2018). Fast loop-closure detection using visual-word-vectors from image sequences. The International Journal of Robotics Research, 37, 62-82.

Cummins, M. and Newman, P. (2008). FAB-MAP: Probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research, 27, 647-665.

Cummins, M. and Newman, P. (2011). Appearance-only SLAM at large scale with FAB-MAP 2.0. The International Journal of Robotics Research, 30, 1100-1123.

Dongdong, B., Chaoqun, W., Zhang, B., Xiaodong, Y. and Xuejun, Y. (2018). CNN feature boosted SeqSLAM for real-time loop closure detection. Chinese Journal of Electronics, 27, 488-499.

Fuhao, Z. and Jiping, L. (2009). An algorithm of shortest path based on Dijkstra for huge data. 2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 244-247.

Kejriwal, N., Kumar, S. and Shibata, T. (2016). High performance loop closure detection using bag of word pairs. Robotics and Autonomous Systems, 77, 55-65.

Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1097-1105.

Kumar, D., Neher, H., Das, A., Clausi, D. A. and Waslander, S. L. (2017). Condition and viewpoint invariant omni-directional place recognition using CNN. 2017 14th Conference on Computer and Robot Vision (CRV), 32-39.

Liu, Y. and Zhang, H. (2013). Towards improving the efficiency of sequence-based SLAM. 2013 IEEE International Conference on Mechatronics and Automation, 1261-1266.

Milford, M., Scheirer, W., Vig, E., Glover, A., Baumann, O., Mattingley, J. and Cox, D. (2014). Condition-invariant, top-down visual place recognition. 2014 IEEE International Conference on Robotics and Automation (ICRA), 5571-5577.

Milford, M. J. and Wyeth, G. F. (2012). SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. 2012 IEEE International Conference on Robotics and Automation, 1643-1649.

Oh, J. H., Lee, S.-H. and Lee, B. H. (2015). Accurate visual loop-closure detection using bag-of-words for multiple robots. Journal of Automation and Control Engineering, 3.

Sünderhauf, N. and Protzel, P. (2012). Switchable constraints for robust pose graph SLAM. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 1879-1884.

Sünderhauf, N., Dayoub, F., Shirazi, S., Upcroft, B. and Milford, M. (2015). On the performance of convnet features for place recognition. arXiv preprint arXiv:1501.04158.

Siam, S. M. and Zhang, H. (2017). Fast-SeqSLAM: A fast appearance based place recognition algorithm. 2017 IEEE International Conference on Robotics and Automation (ICRA), 5702-5708.

Sivic, J. and Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. Proceedings of the Ninth IEEE International Conference on Computer Vision, 1470-1477.
Yin, P., Xu, L., Li, X., Yin, C., Li, Y., Srivatsan, R. A., Li, L., Ji, J. and He, Y. (2019). A multi-domain feature learning method for visual place recognition. arXiv preprint arXiv:1902.10058.

Zhang, G., Lilly, M. J. and Vela, P. A. (2016). Learning binary features online from motion dynamics for incremental loop-closure detection and place recognition. 2016 IEEE International Conference on Robotics and Automation (ICRA), 765-772.