Robotics and Autonomous Systems (2017). DOI: https://doi.org/10.1016/j.robot.2017.10.019
Received 25 March 2017; revised 26 August 2017; accepted 27 October 2017.

Embedding SLAM Algorithms: Has it Come of Age?

Mohamed Abouzahir (a,*), Abdelhafid Elouardi (a,*), Rachid Latif (b), Samir Bouaziz (a), Abdelouahed Tajer (c)

(a) SATIE - CNRS UMR 8029, University of Paris-Sud - University of Paris-Saclay, Digiteo Labs, BAT 660, 91405 Orsay Cedex, France
(b) LISTI Laboratory, ENSA, Agadir, Morocco
(c) Laboratory of Electrical Engineering and Systems Control, ENSA, Marrakech, Morocco

(*) Corresponding authors. Email addresses: [email protected] (Mohamed Abouzahir), [email protected] (Abdelhafid Elouardi)

Abstract

The development of Simultaneous Localization and Mapping (SLAM) systems in the era of autonomous navigation, and the growing demand for autonomous robots, have raised the question of how to reduce the computational complexity of SLAM algorithms so that they can operate in real time. Our work aims to take advantage of low-power embedded architectures to implement SLAM algorithms. Precisely, we evaluate the promise held by modern low-power architectures in accelerating the execution time of SLAM algorithms. To this end, we map and implement four well-known SLAM algorithms, which find utility in very different robotic applications and autonomous navigation, on embedded systems based on different architectures. We first present a processing-time evaluation of these algorithms on different CPU-based architectures. Results demonstrate that FastSLAM2.0 offers a compromise between computation time and consistency of the localization results. The algorithm has been modified to be adapted to large environments. It is then optimized for parallel implementations on GPU and FPGA. A comparative study of the resulting implementations is given. Our results show that an embedded FPGA-based SoC architecture is an interesting alternative for a SLAM algorithm implementation using the hardware-software co-design approach. Hence, the system meets the performance requirements of a robot operating under real-time constraints.

Keywords: SLAM Algorithms, Heterogeneous Architectures, GPU, CPU, FPGA, Hardware Software Codesign

1 Introduction

SLAM algorithms allow autonomous navigation of robots in an unknown environment. Localization and mapping represent a concurrent problem that cannot be solved independently. Indeed, if a mobile robot follows an unknown trajectory in an unknown environment, the estimation of the robot's pose and of the explored map becomes more complicated. In such a situation, no information is previously known by the mobile robot, which is supposed to create a map and to localize itself according to the created map. Before the robot can estimate the position of a given landmark, it needs to know from which location this landmark was observed.


At the same time, it is difficult to estimate the actual position of the robot without a map. A good map is necessary for localization, while an accurate pose estimate is needed for map reconstruction; hence the name Simultaneous Localization And Mapping. Over the span of the last decade, SLAM algorithms have been executed on high-performance machines because of their computational complexity, in order to ensure real-time performance and accurate map construction. In practical situations, SLAM algorithms are an interesting topic, especially when it comes to exploring extremely difficult environments, where the dimensions of the robot may not allow using a high-performance machine to execute SLAM algorithms. Therefore, there is a general need, in the case of embedded systems, for an architecture that allows an efficient implementation meeting real-time constraints.

Embedded SLAM algorithms are an ongoing research topic. There have been several attempts to implement SLAM algorithms on embedded low-power platforms, where researchers re-engineered these algorithms to map them onto embedded platforms. Yet the embedded SLAM implementation is still limited and depends heavily on the nature of the algorithm and the target embedded architecture. Of late, many initial attempts have been made to bridge this gap by taking into account both the algorithm and the architecture. The authors in Nardi et al. (2015) provided a SLAM benchmark that helps make a fair comparison of accuracy, computational performance and energy consumption between different architectures. The algorithm implemented is a dense SLAM system for 3D scene understanding. It uses a depth camera to perform real-time localization and dense mapping. They used a set of devices and different programming languages for their experiments, including ARM and GPU processors from popular embedded platforms. Using SLAMBench, KFusion could only run in real time on high-end GPUs and almost in real time on a low-end GPU. Recently, Bodin et al. (2016) proposed an incremental co-design exploration approach that performs optimizations to allow real-time execution of a dense SLAM on a low-power architecture. It consists of performing a vertical co-design space exploration, taking into account algorithmic parameters, the compiler, and the hardware layers, in order to optimally map the executed algorithm onto the target embedded architecture. The proposed approach was tested on two different SLAM-based applications Nardi et al. (2017). Saeedi et al. (2017) included motion and structure (MS) as a fourth design space for the conception of real-time SLAM-based applications. The MS parameters are calculated using information theory and used alongside images to control the motion of the camera to meet the desired performance. Such a methodology demonstrated good results in accelerating the KFusion algorithm on a low-power embedded device. However, several challenges remain to be tackled with dense SLAM for an embedded SLAM system. Furthermore, the FPGA is missing from the works mentioned above and must also be considered for evaluation, as it represents a massively parallel architecture and can greatly accelerate the execution of a SLAM algorithm. Recently, other major strides have been made in SLAM algorithm computation.

Almost all of them focused solely on improving SLAM performance, without discussing other issues that arise, especially on embedded low-power architectures. In a recent work, Tertei et al. (2016) implemented a hardware accelerator block for EKF-SLAM in a 3D visual SLAM context. They implemented the block on a low-power embedded architecture integrating a processor with reconfigurable hardware. The sole overarching aim of their work was to optimize the algorithm on a dedicated architecture; they did not take into account the SLAM system in itself. In essence, the EKF algorithm is not a suitable choice for the SLAM problem, since its complexity scales quadratically with the number of landmarks, and hence the EKF is not suitable for use in large-scale environments. Furthermore, EKF-SLAM is not well suited to take advantage of the massively parallel architecture of FPGAs, as most parts of the algorithm are sequential. The authors in Sileshi et al. (2016a) implemented a particle-filter-based SLAM on an FPGA: a robust laser FastSLAM1.0 with odometry sensors. They used a real dataset to evaluate the FPGA implementation. Most parts of the FastSLAM1.0 are implemented on the MicroBlaze soft-core processor, including the sampling, importance weight and resampling functional blocks; other parts are accelerated on the FPGA. Their work suffers from a set of different issues. The implementation is not fully hardware-accelerated, as some parts of the algorithm run on the soft core, and the highly parallel architecture implemented on the FPGA is not fully exploited. Moreover, they implemented the FastSLAM1.0 algorithm, which suffers from sample impoverishment and particle depletion issues Montemerlo (2003). Thus, such an algorithm needs to be enhanced when dealing with real outdoor/indoor large-scale environments. The authors in Ma et al. (2016) proposed a large-scale dense visual inertial SLAM system by combining inertial tracking with a dense SLAM framework for large-scale outdoor/indoor SLAM. They implemented the resulting algorithm on a high-end NVidia TITAN GPU and an Intel i7 quad-core CPU desktop. Their intention of proposing a new algorithm capable of mapping a large-scale outdoor/indoor environment is good. However, most SLAM applications are deployed in extremely difficult situations; hence, using embedded SLAM systems is mandatory. The evaluation of the algorithm on a low-power embedded platform is necessary to ensure that such an algorithm can operate under real-time constraints.

The authors in Schulz et al. (2015) implemented a stereo-vision pre-processing module for visual SLAM, used to retrieve depth information from features to compose 3D landmark points. They used an embedded ARM-based FPGA SoC to evaluate their implementation. Usually, in a typical visual SLAM system, other tasks (data association, bundle adjustment, camera pose estimation and map correction) are also time consuming. Therefore, a full parallel optimization of the SLAM algorithm is mandatory, particularly if the target embedded platform is an FPGA, to ensure real-time performance. Recently, an alternative approach for optimizing SLAM algorithm computations has been opened. It consists of moving some parts of the SLAM estimation from limited low-cost computers to external servers with computational and power resources. This domain is known as Cloud Robotics, and it has acquired great importance in recent years. This has raised the question of how the SLAM system must be partitioned for better distributed computing between cloud servers and on-board embedded computers. This is a whole new line of research, which has shown its applicability and its efficiency for dense visual SLAM computation Riazuelo et al. (2014), Mohanarajah et al. (2015).

Our contribution: The emergence of embedded low-power architectures gives us an opportunity to explore a new trend of embedded computation for SLAM algorithms as well. However, the question remains open whether, and with which challenges, an embedded SLAM algorithm can be made to operate in real time. In practice, designing a SLAM system consists of using a recent embedded architecture that is as powerful as possible together with a suitable SLAM algorithm, and then optimizing the algorithm as much as possible. Towards this, we mapped and implemented four SLAM algorithms: Monocular FastSLAM2.0, ORB SLAM, RatSLAM and Linear SLAM. These algorithms are deployed in numerous potential applications for autonomous robot navigation. Our choice was also driven by the need to investigate which SLAM algorithm can be implemented on a specific embedded architecture. In addition, our work aims to highlight the advantage of parallel embedded architectures in optimizing a SLAM algorithm. The main contributions of our work compared to previous works are summarized below:

• We evaluate the embeddability of a set of popular SLAM algorithms on different low-power embedded architectures, and we then propose a SLAM system that can run under real-time constraints (Monocular FastSLAM2.0 on FPGA). We then compare our study and the proposed system with a modern dense SLAM framework implemented on popular embedded devices and high-end machines Nardi et al. (2015).

• We propose a fully parallel implementation of a potential SLAM algorithm on both a modern GPU and an FPGA. A comparative study is given, based on a real indoor dataset. We compare the obtained results with the partial implementation of FastSLAM1.0 on FPGA presented in Sileshi et al. (2016b).

• The authors in Ma et al. (2016) used a high-end NVIDIA GPU of a desktop machine. Our work focuses on embedding a SLAM algorithm; a comparative study with high-end machines is given. This raises a new challenge: investigating whether adopting the same optimization strategies as those used for a high-end GPU is suitable when designing an embedded SLAM-based system. This required special attention and imposed new demands on our work.

• This work also conducts experiments and comparative studies between recent powerful embedded parallel architectures, namely FPGAs and GPUs. This is done to investigate existing potential embedded architectures and to test their capabilities in embedding a particular SLAM system.

The paper is organized as follows. In Section II, we give a description of our evaluation methodology and the reasons behind the choice of the SLAM algorithms benchmarked in our study. In Section III, we give a brief description of the SLAM algorithms; the description is given as a functional-block partitioning of each algorithm. In Section IV, we describe the algorithm implementations as well as the experimental results. Section V provides a holistic viewpoint of the results and concludes the paper.

2 Evaluation Methodology and Hardware Specifications

In this section, we first give a description of the evaluation methodology adopted in this work to evaluate the SLAM algorithms. We also give an overview of the chosen algorithms and the reasons behind their selection. Next, we describe the hardware used for evaluation and its specifications.

2.1 Evaluation Methodology

A SLAM algorithm does not have a fixed processing time which can be expected beforehand. In order to evaluate the processing time, the algorithms were analyzed in terms of instructions and operation order. Our evaluation methodology is based on the identification of the processing tasks requiring a significant computing time. It consists of several steps: we first analyze the execution times of tasks and their dependencies on algorithm parameters. A threshold is fixed for each parameter. The algorithm is then partitioned into functional blocks (FBs) performing defined calculations. Processing times of FBs depend on many parameters (number of odometric data acquired at each timestamp, number of integrated observations per timestamp, number of landmarks in the map, number of matched landmarks, and number of new landmarks to add to the map). Each block is then evaluated to determine its processing time. Processing times reported in this paper, for the SLAM algorithms, are the mean of 10 runs of each algorithm over 100 timestamps. The global processing time of each algorithm is given for each timestamp when exploring an environment.

2.1.1 Choice of SLAM algorithms

SLAM algorithms can be classified into two categories: online SLAM algorithms and full SLAM algorithms. An online SLAM algorithm consists of calculating, at each time, the actual position of a robot according to the sensor measurements. A full SLAM algorithm consists of reconstructing, at each time, the entire robot trajectory. Within these two categories, SLAM algorithms are based either on a probabilistic approach or on a bio-inspired approach. The probabilistic approach represents the uncertainty of the robot pose using a probability distribution over a whole space of possible hypotheses. The bio-inspired approach is based on learning concepts from nature and applying them to design real-time SLAM systems. Regardless of their category, two kinds of sensors are usually used in SLAM algorithms: proprioceptive and exteroceptive sensors. Expensive sensors like a laser range finder or low-cost sensors like a camera can be used as exteroceptive sensors to observe the environment. Odometers are proprioceptive sensors used to estimate the robot's pose. Our primary consideration in choosing the SLAM algorithms was to ensure that they cover a wide range of applications. Therefore, the SLAM algorithms were chosen such that: they address the online SLAM problem or the full SLAM problem; they are based on a probabilistic approach or a bio-inspired approach; and they use either a high-cost sensor or a low-cost sensor. With these choices in mind, we chose four SLAM algorithms to evaluate in our study. The characteristics of each SLAM algorithm are individually discussed below.

We chose the Monocular FastSLAM2.0 algorithm Montemerlo et al. (2003) as a probabilistic approach addressing the full SLAM problem in an online fashion. This algorithm is widely used in many robotics applications and autonomous navigation. It is an alternative approach based on the particle filter. The uncertainty of the robot's pose is modeled by a number of different particles, and each particle has its own map. It has been proved that this algorithm runs successfully in very large environments and can surpass many problems that limit the localization accuracy. Such an algorithm is parallel in its nature and uses image processing for feature extraction and matching, but it needs heavy computing workloads, since many particles must be processed. Thus, it is interesting to study its embeddability by exploring software optimizations on an embedded architecture.

Second, we selected the ORB SLAM algorithm Mur-Artal et al. (2015) as a pure visual SLAM algorithm. It is a feature-based monocular SLAM system that operates in small and large environments, indoors and outdoors. The algorithm uses monocular images for environment perception and visual odometry to provide accurate estimates of the camera localization. The algorithm implements place recognition techniques for loop detection and closing. Moreover, the nature of this algorithm inherently makes it a suitable candidate to evaluate for embeddability, because it uses heavy image processing tasks.

Third, we selected the RatSLAM algorithm Milford et al. (2004) as a bio-inspired, topological SLAM system. RatSLAM is a robot navigation and SLAM algorithm based on computational models of the neural processes underlying navigation in the rodent brain. The approach uses a combination of appearance-based visual scene matching, attractor networks, and a semi-metric topological map representation. In particular, RatSLAM performs well with low-resolution monocular image data. The RatSLAM system contrasts with many of the other SLAM approaches that use expensive, robust laser sensors. It is interesting to study the embeddability of such a bio-inspired approach on a low-power architecture.

Finally, we selected the Linear SLAM algorithm Zhao et al. (2013) as an approach addressing the full SLAM problem through solving a sequence of linear least-squares problems. It is used for large-scale SLAM and is based on sub-map joining. The algorithm uses local maps to construct the full robot trajectory. The local maps are built using observations of features from one pose, together with odometer data, to the next pose. This makes another interesting case study of a full SLAM algorithm on a low-power architecture.

Figure 1: HIL platform

2.1.2 Real dataset based evaluation

SLAM applications are an interesting topic when it comes to exploring a real indoor/outdoor environment. The evaluation of SLAM algorithms requires a set of different sensor data, which are necessary for benchmarking. Sensor data can be obtained either by using a real instrumented robot or from an available dataset. Furthermore, the consistency of SLAM algorithms depends on many parameters of the environment, such as the size of the environment and the number of visible landmarks. Therefore, in our evaluation all SLAM algorithms were evaluated using the same real indoor dataset Andrea Bonarini and Tardos (2006). The dataset provides a set of different sensor data. In our evaluation we used encoder data as proprioceptive measurements, and monocular camera data and a laser stream as exteroceptive measurements. This allows us to test all the SLAM algorithms we selected in our benchmarking.

2.1.3 Hardware-in-the-Loop

The evaluation of a SLAM algorithm is not limited to the algorithm on its own. It requires an entire complex real-time system, including sensors and computing units. We have developed a HIL (hardware-in-the-loop) validation method to evaluate the SLAM system. This method is used to test the real embedded system using a virtual environment. It consists of including the embedded system in a software processing loop. In our case, this method is used to evaluate the behavior of the SLAM algorithms using different exteroceptive sensors. A low-cost embedded architecture is used for the SLAM algorithm processing, while the result visualization is handled by a host machine. Fig.1 shows a description of the developed HIL platform used to evaluate our SLAM algorithms. The host machine is used as a source of the real dataset (images, laser stream, encoder data). Sensor data are time-stamped and synchronized on the host machine, then transferred to the test-bed architecture via a serial link, Ethernet or ROS message communication. The SLAM algorithms are implemented on the test-bed architecture and process the input data. Mapping and localization results are then transmitted to the host machine at every timestamp for visualization and analysis.

2.2 Hardware Specification

For our experiments we used a set of different architectures. Table 1 summarizes their characteristics. We used four embedded platforms; for a comparative study, we also implemented our SLAM algorithms on high-end machines. The Pandaboard ES (OMAP4460) is designed by Texas Instruments and dedicated to open low-cost multimedia applications. This architecture includes a Dual-Core ARM Cortex-A9 processor running at 1 GHz, with a NEON co-processor engine allowing SIMD optimizations and a significant performance advantage.

The i.MX6 Sabre Lite development board is targeted at portable computing on battery-powered devices. The board includes four ARM Cortex-A9 cores running at 1.2 GHz. The ODROID-XU4 is based on a Samsung Exynos 5422, which implements a quad-core ARM Cortex-A15 (alongside a quad-core Cortex-A7 in a big.LITTLE configuration). The Tegra X1 is a recent system on a chip (SoC) developed by Nvidia for mobile devices and multimedia applications. The X1 processor integrates 4x ARM Cortex-A57 and 4x ARM Cortex-A53 CPU cores running at 1.9 GHz. For a comparative study we also implemented our SLAM algorithms on high-performance machines (a desktop and a laptop).

3 SLAM Algorithms Description and Running Times Evaluation

In this section the chosen algorithms are described as functional blocks (FBs), to help in understanding the relationships among all major functions and to provide an overview of the dataflow management between FBs. We then give a first running-time evaluation of the algorithms on a single core of the selected hardware.

3.1 SLAM Algorithms Description

3.1.1 FastSLAM2.0

The FBs partitioning is described in Fig.2-a. FB1 calculates the new particle pose by incorporating the odometer data. FB2 consists of extracting features from an image using the FAST detector. Once the features are extracted, a matching task is performed between observations and landmarks using a similarity-based correlation metric. FB3 refines the particle poses and their uncertainty to maintain diversity among the particles and to prevent sample impoverishment. FB4 updates the matched landmarks and computes the weight of each particle according to the current observation. FB5 adds and initializes newly observed landmarks in the map incrementally. FB6 resamples the particles to take the observation into account. A minimal sketch of the per-particle state implied by this partitioning is given below.
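The following C sketch illustrates the data each particle carries across these FBs. The structure layout and field names are our own illustration, not taken from the paper's code.

```c
/* Hypothetical per-particle state implied by the FB partitioning. */
typedef struct {
    float x0, y0;           /* camera position at first observation (FB5)     */
    float rho, theta, phi;  /* inverse depth, azimuth, elevation              */
    float cov[5][5];        /* landmark covariance, corrected by the EKF (FB4) */
} Landmark;

typedef struct {
    float pose[3];          /* robot pose (x, y, heading): FB1 predicts, FB3 refines */
    float weight;           /* importance weight computed in FB4                     */
    int n_landmarks;
    Landmark *map;          /* per-particle map, extended by FB5, copied by FB6      */
} Particle;
```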

3.1.2 ORB SLAM

The FBs partitioning is described in Fig.2-b. The ORB Extraction block extracts corners using the FAST detector and computes the orientation and the ORB descriptor of each detected corner. The Initial Pose Estimation/Relocalization block uses a constant-velocity motion model to predict the camera pose; if the frame tracking is lost, a relocalization process is run in order to recover a camera pose. The Track Local Map/Key Frame Decision FB projects the local map into the frame in order to find more corresponding points and optimize the camera pose. A new frame is inserted into the system based on defined conditions, in order to keep the tracking process robust to camera rotation. In the KeyFrame Insertion block, the covisibility graph is updated by adding a new node for the new key frame. The Map Points Culling block applies a strict policy to the new landmark candidates. The New Point Creation block adds new landmarks to the map using triangulation. Local BA performs an optimization of the covisibility graph. The Local Keyframe Culling block detects and deletes redundant frames to maintain a dense and compact reconstruction. The similarity between the bag of words of the last processed frame and all its neighbors in the covisibility graph is computed to perform loop detection. Once a candidate frame is selected to close the loop, a similarity transformation is computed. The duplicated map points in the two key frames that close the loop are fused, and the graph is updated to perform the loop closing correction.

3.1.3 RatSLAM

The FBs partitioning is described in Fig.3-(a). In the Local View and Match block, when a novel visual scene is seen, a new local view cell is created and associated with the raw pixel data of that scene. When this view is re-observed by the robot, the local view cell is activated and injects activity into the pose cells. The pose cells form a continuous attractor network of units connected by inhibitory connections, similar to a navigation neuron found in some mammals, called a grid cell. The dimensions of the cell array correspond to a mobile robot with a three-dimensional pose (x, y, θ). The pose cell network dynamics are such that the stable state is a single cluster of activated units, referred to as an activity packet or energy packet. The centroid of this packet encodes the best guess of the robot pose. The encoder odometry data shifts activity in the pose cells to represent the robot's movement. A new experience is created when the current activity state in the pose cells and local view cells is not closely matched by the state associated with any existing experiences.

| | Desktop | Laptop | Tegra X1 | ODROID XU4 | i.MX6 Sabre Lite | Pandaboard ES |
| CPU | Core 2 Quad Q6600 | Dual-Core T4300 | 4x ARM Cortex A57 + 4x ARM Cortex A53 | Exynos 5422 | Quad-Core ARM Cortex A9 | Dual-Core ARM Cortex A9 |
| CPU cores | 4 | 2 | 4 | 8 | 4 | 2 |
| CPU GHz | 2.40 | 2.10 | 1.9 | 1.8 | 1.2 | 1.2 |
| Ubuntu OS | Precise, 12.04 LTS | Precise, 12.04 LTS | Linux for Tegra 14.04 | Precise, 12.04 LTS | Linaro, 13.04 | Precise, 12.04 LTS |
| Power min-max (W) | - | - | 6-15 | 4-4.5 | 1.6-7 | 3.5-6.6 |

Table 1: Architectures used for experiments

Figure 2: Monocular FastSLAM2.0/Monocular ORB SLAM FBs flow chart


As the robot transitions between experiences, a link is formed from the previously active experience to the new experience.

3.1.4 Linear SLAM

The FBs partitioning is described in Fig.3-(b). In the Data Association block, a set of potentially overlapping local maps is determined first; then a search over the potentially matched features is done. Finally, a statistical data association is performed to find the match. Features that are not matched during data association are initialized and added to the map. The local map provides a consistent estimate of the relative positions from the robot start pose to the local features and the robot end pose. The observation map is used to update the information vector and the information matrix of the global map. A Cholesky factorization is performed to construct a new factorization, and a sparse linear equation is solved using this Cholesky factorization to recover the global state vector, as sketched below.
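For concreteness, here is a minimal dense C sketch of that last step: solving the information-form system I·x = b by Cholesky factorization. The real implementation operates on sparse matrices; the function and variable names are ours, not from the paper.

```c
#include <math.h>

/* Solve I * x = b, where I is a symmetric positive definite information
 * matrix (row-major, n x n) and b the information vector, via the
 * factorization I = L * L^T. I is overwritten by L. */
static int cholesky_solve(int n, double *I, const double *b, double *x)
{
    /* In-place lower-triangular Cholesky factorization. */
    for (int j = 0; j < n; j++) {
        double d = I[j*n + j];
        for (int k = 0; k < j; k++) d -= I[j*n + k] * I[j*n + k];
        if (d <= 0.0) return -1;              /* not positive definite */
        I[j*n + j] = sqrt(d);
        for (int i = j + 1; i < n; i++) {
            double s = I[i*n + j];
            for (int k = 0; k < j; k++) s -= I[i*n + k] * I[j*n + k];
            I[i*n + j] = s / I[j*n + j];
        }
    }
    /* Forward substitution: L * y = b (y stored in x). */
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int k = 0; k < i; k++) s -= I[i*n + k] * x[k];
        x[i] = s / I[i*n + i];
    }
    /* Back substitution: L^T * x = y. */
    for (int i = n - 1; i >= 0; i--) {
        double s = x[i];
        for (int k = i + 1; k < n; k++) s -= I[k*n + i] * x[k];
        x[i] = s / I[i*n + i];
    }
    return 0;
}
```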

3.2 Running Times Evaluation

The SLAM algorithms discussed above were run and time-evaluated using the real dataset discussed in section 2.1.2. All these algorithms were run for 100 time-steps using the same workloads, to maintain a coherent comparative study.

Table 2 shows the results of running the Monocular FastSLAM2.0 on six different architectures. The results obtained show that, with 100 particles, the monocular FastSLAM2.0 can run under real-time constraints on the desktop machine, but not on the low-power embedded architectures. Furthermore, particle generation is done via a randomization process (following formulas (1) and (2)) to maintain the diversity of the particles:

n_l ← n_l + 4 · sqrt(−2.0 · log(ε_1)) · cos(2π · ε_2) · g        (1)

n_r ← n_r + 4 · sqrt(−2.0 · log(ε_3)) · cos(2π · ε_4) · g        (2)

where n_l, n_r are the left and right wheel encoder data, (ε_1, ε_2, ε_3, ε_4) are random numbers in [0, 1], and g is the slipping noise.
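Formulas (1) and (2) are a Box-Muller Gaussian draw scaled by the slipping noise. A minimal C sketch of this perturbation follows; the uniform generator is a placeholder and the function names are ours.

```c
#include <math.h>
#include <stdlib.h>

/* Placeholder uniform generator on (0, 1); avoids log(0). */
static double rand_uniform(void)
{
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

/* Perturb the raw left/right encoder counts per formulas (1) and (2). */
static void perturb_encoders(double *nl, double *nr, double g)
{
    double e1 = rand_uniform(), e2 = rand_uniform();
    double e3 = rand_uniform(), e4 = rand_uniform();
    *nl += 4.0 * sqrt(-2.0 * log(e1)) * cos(2.0 * M_PI * e2) * g;  /* (1) */
    *nr += 4.0 * sqrt(-2.0 * log(e3)) * cos(2.0 * M_PI * e4) * g;  /* (2) */
}
```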

The number of particles in such a system is necessary to maintain reasonable estimates of pose and landmark uncertainty, as stated in Eade and Drummond (2006). Hence the number of particles must be increased, and may grow with the environment complexity. Running the algorithm with a large number of particles will without doubt increase the execution time, as the FBs of this algorithm depend heavily on the number of particles. Therefore, the embedded architectures may not be suitable platforms for a robot intended to explore a large environment using the monocular FastSLAM2.0. In this case, a hardware optimization seems necessary to allow real-time performance.

ORB SLAM is a keyframe-based monocular SLAM based on PTAM. The tracking and local mapping processes can be run in parallel. Table 3 shows the results of running the three threads of ORB SLAM (tracking, local mapping and loop closing). The tracking is achieved in 27.09 ms on the desktop machine and 50.06 ms on the laptop machine. On the embedded architectures, the best tracking time is 174.34 ms, on the TX1. ORB SLAM takes monocular images as input to map the environment and reconstruct the robot trajectory. Such an algorithm needs to process images exactly at the frame rate at which they were acquired, in order to provide more stable and accurate results. Some FBs of this algorithm need a considerable amount of time, in particular the Local Bundle Adjustment and the Map Point Creation in the Local Mapping task. Running this algorithm on a powerful computer ensures real-time performance Mur-Artal et al. (2015). Real-time constraints are achieved on the desktop machine and almost on the laptop machine, but not on the embedded architectures. The tracking thread takes approximately more than 170 ms at each timestamp when executed on the embedded architectures, which means that the system is not able to process images in real time. In our tests we used a monocular camera at 30 fps, which means that many images are skipped and not processed. This induces information loss; thus the system provides unstable and inaccurate map and localization results. A hardware optimization seems necessary in this case, but it requires rethinking the algorithm implementation, since ORB SLAM is not parallel in its nature. Some time-consuming tasks are sequential and require transformations to map the algorithm onto an embedded system.

Figure 3: (a) RatSLAM FBs flow chart, (b) Linear SLAM FBs flow chart

Table 4 shows the processing-time results of RatSLAM on the different architectures. The RatSLAM algorithm cannot be processed at real-time speed or faster on the embedded architectures, since the images cannot be processed at the frame rate; therefore a hardware optimization is needed. However, in such a bio-inspired approach, some optimizations may increase the code complexity, especially when they require changes to the algorithm. RatSLAM real-time performance depends heavily on three factors. The first one is the matching process: the size of the detected visual templates increases the time needed to compute the similarity between the current and previously detected/stored visual templates. Second, in the pose cell FB, the computational complexity increases linearly with the size of the pose cell environment. Third, in the experience map computation FB, the computation is also linearly related to the number of correction loops per timestamp.

Table 5 shows the processing-time results of the Linear SLAM. All FBs described above in section 3.1.4 were overlapped, and their processing times are reported together. The total time given in Table 5 is the global time needed to construct the global map after 100 timestamps. In essence, Linear SLAM is a solution to large-scale SLAM that requires joining a number of local submaps in one of two ways: either sequentially (SEQ) or, more efficiently, in a Divide and Conquer manner (DC). In our evaluation we report processing times for both submap-joining methods. In our experiment, 42.34, 58.21, 38 and 23 s are the times needed to construct the global map on the embedded architectures (iMX6, Panda ES, XU4 and TX1) when using the SEQ method. This method is therefore time consuming and not suitable for embedded computation. The DC manner is suggested because of its low computational cost. Using the DC method, the Linear SLAM needs 2.77, 4.24, 2.25 and 1.5 s to construct the global map on the iMX6, Panda ES, XU4 and TX1 respectively, 1.12 s on the laptop machine and 0.74 s on the desktop machine. The time for building the local maps depends on the complexity of the technique used; using an EKF, it takes around 1.5 s on the embedded architectures. Yet, Linear SLAM has not been used in active SLAM. Therefore, the algorithm implementation must be changed to allow hardware optimizations, ensure real-time performance and reduce the computational complexity.

Table 6 shows the FPS achieved across different platforms for a set of monocular SLAM algorithms, including a comparison with the KinectFusion algorithm from SLAMBench Nardi et al. (2015). Such a comparison may be useful for the community and allows a holistic overview of the time spent by different monocular SLAM algorithms on popular embedded devices. Using only one CPU core on the high-end machines, ORB SLAM achieves real-time performance with 36 FPS and monocular FastSLAM2.0 almost runs in real time with 28 FPS, while neither RatSLAM nor KFusion could run in real time. On the embedded architectures, all monocular SLAM algorithms perform worse and real-time constraints are not achieved.

| FBs | Desktop | Laptop | TX1 | XU4 | i.MX6 Sabre | Pandaboard ES |
| Prediction | 0.101 | 0.212 | 5.32 | 4.22 | 0.30 | 0.46 |
| FAST Detector/Matching | 2.39 | 5.01 | 0.313 | 0.33 | 4.24 | 6.49 |
| Update particle state | 8.83 | 18.49 | 19.50 | 23.56 | 28.61 | 43.5 |
| Map Estimation | 22.56 | 34.59 | 35.53 | 40.34 | 55.03 | 75.4 |
| Map Initialization | 1.53 | 2.35 | 3.16 | 3.22 | 3.17 | 4.35 |
| Particle Resampling | 0.04 | 0.06 | 0.052 | 0.61 | 0.05 | 0.065 |
| Total (ms) | 35.45 | 60.71 | 63.88 | 72.28 | 91.4 | 130.26 |

Table 2: FastSLAM2.0 processing time in ms (running with 100 particles)

| Thread | FBs | Desktop | Laptop | TX1 | XU4 | i.MX6 Sabre Lite | Pandaboard ES |
| Tracking | ORB Extraction | 12.539 | 19.28 | 123.16 | 190.94 | 224.64 | 343.25 |
| | Initial Pose Estimation | 5.57 | 11.17 | 33.58 | 42.16 | 61.26 | 93.61 |
| | Track Local Map / New Key Frame Decision | 8.97 | 19.64 | 17.60 | 20.43 | 32.11 | 48.81 |
| | Total (ms) | 27.09 | 50.06 | 174.34 | 253.53 | 318.01 | 484.86 |
| Mapping | Key Frame Insertion | 12.38 | 22.30 | 33.37 | 36.45 | 60.88 | 92.55 |
| | Map Point Culling/Creation | 96 | 214.53 | 251 | 284.68 | 292.04 | 400.24 |
| | Local BA | 200.12 | 415.75 | 368.44 | 521.41 | 672.82 | 921.77 |
| | Key Frame Culling | 4.53 | 7.92 | 4.44 | 5.32 | 8.11 | 10.06 |
| | Total (ms) | 313.03 | 660.5 | 657.25 | 847.86 | 1033.85 | 1424.62 |
| Loop closing | Loop Detection/Correction (ms) | 6.66 | 17.84 | 18.68 | 20.64 | 22.34 | 29.28 |

Table 3: ORB SLAM processing time

| Threads | Desktop | Laptop | TX1 | XU4 | i.MX6 Sabre | Panda ES |
| Local View/Match | 97.80 | 136.92 | 140.05 | 166.3 | 179.01 | 279.26 |
| Pose Cell Network | 18.87 | 30.16 | 120.8 | 135.04 | 166.10 | 227.56 |
| Experience Map | 2.39 | 3.67 | 9.35 | 10.96 | 11.21 | 13.91 |
| Total (ms) | 119.06 | 170.75 | 270.2 | 312.3 | 356.32 | 520.73 |

Table 4: RatSLAM processing time

| Step | Desktop (DC / SEQ) | Laptop (DC / SEQ) | TX1 (DC / SEQ) | XU4 (DC / SEQ) | i.MX6 (DC / SEQ) | Panda ES (DC / SEQ) |
| Building Local Maps | 0.42 / 0.42 | 0.63 / 0.63 | 0.56 / 0.56 | 0.72 / 0.72 | 1.03 / 1.03 | 1.58 / 1.58 |
| Linear SLAM | 0.32 / 5.83 | 0.49 / 8.93 | 0.954 / 22.64 | 1.53 / 37.6 | 1.74 / 41.31 | 2.66 / 56.63 |
| Total (s) | 0.74 / 6.25 | 1.12 / 9.56 | 1.514 / 23.2 | 2.25 / 38.32 | 2.77 / 42.34 | 4.24 / 58.21 |

Table 5: Linear SLAM processing time (for 100 steps)

FPS (one-core CPU implementations):

| Machine type | FastSLAM2.0 | ORB SLAM | RatSLAM |
| Desktop | 28.2 | 36.94 | 8.4 |
| Laptop | 16.47 | 20 | 5.8 |
| X1 | 15.65 | 5.73 | 3.7 |
| XU4 | 13.83 | 3.9 | 3.2 |
| iMX6 | 10.94 | 3.14 | 2.8 |
| Panda ES | 7.67 | 2.05 | 1.9 |

SLAMBench (KinectFusion), CPU implementations:

| Machine type | TITAN | GTX870M | TK1 | XU3 | Arndale |
| FPS | 1.0 | 0.9 | 4.4 | 3.5 | 0.7 |

Table 6: FPS achieved with the monocular SLAM algorithms (FastSLAM2.0, ORB SLAM, RatSLAM) and with KinectFusion using SLAMBench Nardi et al. (2015)

4 Towards Parallel Architectures: GPU vs FPGA

The absolute performance of an embedded SLAM system relies on the nature of the algorithm itself and also on the architecture of the target embedded platform. Hence, a software-hardware matching methodology is mandatory to reach real-time performance for embedded SLAM applications. The computing power of embedded systems keeps increasing thanks to multi-core and heterogeneous architectures. We envision a new wave of embedded SLAM algorithms that was not possible so far. In our view, the design of a system dedicated to SLAM applications requires several steps. The first step is to review all the available algorithms and all the embeddable architectures which can handle a SLAM algorithm. The second step consists of choosing both an architecture and an algorithm which lead to an efficient system in terms of processing time and accuracy of results.

In terms of accuracy of results, the selected algorithms have been shown to solve the SLAM problem. Mur-Artal et al. (2015) provided an exhaustive evaluation of ORB SLAM on a set of different popular datasets; the algorithm is able to solve the SLAM problem in large-scale environments. In our previous work Abouzahir et al. (2016), we proved the consistency of monocular FastSLAM2.0 using the Rawseeds dataset, and we introduced some algorithmic changes to adapt it to large-scale environments. Zhao et al. (2013) proved the consistency of Linear SLAM, also using a set of popular datasets; the algorithm is shown to provide a proper global map based on a set of several sub-maps used as inputs. Finally, Milford et al. (2004) tested the consistency of RatSLAM on an Oxford dataset; the algorithm is able to provide a good localization of the robot and can create a topological map that can be used for path planning.

In terms of processing times, Fig.4 shows the global processing times of the studied algorithms on the different embedded architectures.

Figure 4: Global processing times of the SLAM algorithms on the different architectures

Our main purpose in making such a comparison is to identify the SLAM algorithm that has a low processing time without any hardware/software optimization. Our primary consideration for the design of a SLAM system is to achieve a compromise between accuracy of the localization results and processing time. Since all the selected SLAM algorithms are able to provide consistent localization results, we chose FastSLAM2.0, for its lower processing time (though not real time) and for its parallel nature, as the candidate for a parallel embedded implementation. The monocular FastSLAM2.0 algorithm seems to be a candidate that meets the compromise between consistency of localization results and processing times.

In this section we investigate further optimization of the FastSLAM2.0 algorithm on different parallel architectures using a software/hardware matching methodology. First, we give an overview of parallel optimization on embedded hardware; then we describe the implementation of the different FBs of FastSLAM2.0 on the parallel architectures (FPGA and GPU). Experimental results and a performance evaluation in terms of processing times are then discussed.

4.1 Parallel Embedded Optimization

In general, parallel optimization is done by taking some of the load off the CPU and letting a dedicated piece of hardware perform the intensive tasks. GPUs are commonly used for this purpose and have shown themselves to be a good target for certain types of high-performance computations Owens et al. (2007). Because of their good performance rates, GPUs have interested the signal processing community as suitable devices for general-purpose computing, which has become a hot topic in high-performance computing. However, GPUs impose strict limitations on the accelerated algorithm that must be met in order to achieve an effective speedup. On the other hand, Field Programmable Gate Arrays (FPGAs) present an attractive alternative in many cases, as the accelerating hardware does not have a hardwired architecture of its own. Rather, the algorithm under test shapes the architecture: logic building blocks are placed in parallel or pipelined to achieve a high utilization of the device capacity, depending on the expected data flow.

4.2 FPGA Programming

FPGA programming can be done using hardware description languages such as VHDL or Verilog. However, when employing traditional logic design techniques, the human effort necessary to harness FPGA capabilities for complex algorithms often makes the FPGA an unattractive choice. In order to achieve an efficient and fast parallel implementation of the algorithm on the FPGA, we used OpenCL-based high-level synthesis. High-Level Synthesis (HLS) Cong et al. (2011) is a recent technique for utilizing programmable logic without using the traditional hardware definition languages (Verilog/VHDL). HLS converts a high-level language into a hardware description, aiming to utilize the programmable device efficiently for accelerated operations and economic resource usage. This opens an opportunity for regular programmers to design custom co-processing architectures in an efficient way. The algorithm is implemented on the Arria 10 FPGA (section 2.2).

4.3 GPU Programming

Embedded GPU resources can be accessed by a programmer using various programming languages. CUDA presents the user with a C language for direct application development on NVIDIA GPUs, which restricts code portability specifically to NVIDIA hardware.

Embedded GPUs can also be programmed using OpenCL. However, OpenCL drivers for most embedded GPUs are not available in the public realm. Embedded boards that have the software support to use OpenCL for GPGPU programming contain GPUs with only low numbers of cores, which are not really dedicated to high-performance computing.

OpenGL can also be used for embedded GPU programming. The majority of current embedded GPUs support OpenGL, which is independent of a specific architecture. Adopting OpenGL for general-purpose computing requires prior experience in graphics programming. However, it provides code portability between different embedded GPUs, which makes the resulting implementation independent and allows conducting a comparative study between different embedded GPU architectures.

4.4 FastSLAM2.0 Parallel Implementation

4.4.1 Image Processing (FB2)

The image processing task uses the FAST detector. It is a sequential algorithm, already optimized using machine learning, so it is implemented on one core of the CPU. The matching between landmarks and observations is performed according to the one particle that has the highest weight, so it is parallelized on the host multi-core CPU according to the observed landmarks.

4.4.2 FB1 implementation

The prediction task propagates the current state of the particles in the filter via a probabilistic motion model Seignez et al. (2009). It calculates the future state by incorporating the odometric data. The operations are independent for each particle, and hence the FPGA/GPU can process each particle independently. To parallelize this block, we transfer to the GPU or FPGA both the current state of the particles and the encoder data u (n_r, n_l). The random numbers are generated by the host and then transferred to the GPU or FPGA. The particle state is initialized and saved in a buffer in global memory. To implement the randomization procedure, we note that each particle is processed by one thread and each thread has its individual identification. Therefore, instead of generating the random numbers (ε_1, ε_2, ε_3, ε_4) for each particle, we just need to build a set of random numbers of limited quantity on the CPU (60 values in our implementation) and transfer them to the FPGA. Then, we combine this set using the thread identification to obtain a private random value for each particle. Once the noise is computed within the kernel, the particle pose is computed and stored in the same memory buffer. A kernel sketch of this scheme follows.
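A minimal OpenCL C sketch of such a prediction kernel is given below. The id-based mixing of the pool and the differential-drive update are illustrative assumptions (the paper uses the motion model of Seignez et al. (2009)); all names are hypothetical.

```c
#define POOL 60  /* size of the host-generated pool of uniform draws */

__kernel void fb1_predict(__global float4 *particles,      /* x, y, theta, weight */
                          __global const float *pool,      /* 60 host-drawn uniforms */
                          const float nl, const float nr,  /* encoder data u */
                          const float g,                   /* slipping noise */
                          const float wheel_base)
{
    int id = get_global_id(0);

    /* Derive per-particle uniforms from the shared pool via the work-item
     * id, then apply the Box-Muller noise of formulas (1) and (2). */
    float dl = nl + 4.0f * sqrt(-2.0f * log(pool[id % POOL]))
                         * cos(2.0f * M_PI_F * pool[(7 * id + 13) % POOL]) * g;
    float dr = nr + 4.0f * sqrt(-2.0f * log(pool[(3 * id + 5) % POOL]))
                         * cos(2.0f * M_PI_F * pool[(11 * id + 1) % POOL]) * g;

    /* Illustrative differential-drive motion update. */
    float4 p  = particles[id];
    float dc  = 0.5f * (dl + dr);         /* displacement of the robot center */
    float dth = (dr - dl) / wheel_base;   /* heading change */
    p.x += dc * cos(p.z + 0.5f * dth);
    p.y += dc * sin(p.z + 0.5f * dth);
    p.z += dth;
    particles[id] = p;                    /* pose stays in the same buffer */
}
```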

4.4.3 FB3 implementation

FastSLAM2.0 is a modified version of the particle filter designed to deal with the sample impoverishment issue. In essence, to prevent the resampling step from being wasteful with the particles, we compute a new Gaussian mean and covariance based on the most recent observation; a new particle pose is then sampled from this new proposal distribution. This functional block can be parallelized on the FPGA/GPU, since each Gaussian can be constructed independently. We transfer to the GPU or FPGA, for each particle: the initial covariance matrix P_t^m, the particle pose s_t^m, the parameters of the landmarks in the map that were matched (x_0, y_0, ρ, θ, φ, C_t) and their corresponding observations in the current frame (u, v). The new proposal distribution is constructed incrementally for each particle according to each matched landmark. We transfer the matched landmark parameters and the state of all particles at once; this allows the Gaussian construction to be done within the kernel. The resulting Gaussian is saved in global memory, since it is a read/write memory. In essence, one kernel execution completely constructs the Gaussian, because all the matched landmarks are transferred to global memory before the kernel execution. Therefore, the implementation is efficient: it allows the Gaussian construction for all particles to be done in one kernel execution. This reduces the data transfer at each iteration, implying a significant improvement.

4.4.4 FB4 implementation

During the estimation step, if a landmark in the map is matched with a current observation, its position in the image (û, v̂) is predicted using the pinhole model Davison (2003). The innovation between the predicted position and the observation (u, v) is computed. The extended Kalman filter (EKF) corrects the inverse depth parametrization (ρ, θ, φ) of the matched landmark and its covariance matrix. The likelihood of getting the sensor readings from the particle hypotheses is computed according to the observation. Since each particle has its own map, the operations mentioned above can be done independently, and hence the estimation functional block can be parallelized on the FPGA/GPU. For each particle, we transfer to the device: the particle pose s_t^m and the landmark parameters (x_0, y_0, ρ, θ, φ, C_t) to predict the landmark position in the image, the observation (u, v) for the innovation computation, and P_t^m to compute the new importance weight. The parallel implementation of the estimation functional block is quite fast. We recall that in the FB3 implementation we already transferred all the matched landmark parameters to global memory; therefore no transfer occurs before or after the kernel execution. Thus, we just need to directly launch the kernel that implements the general EKF equations to update all the matched landmark parameters within the kernel in one execution. As a result, the parallel implementation of the estimation functional block allows a significant improvement. The host-side launch pattern is sketched below.
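As a host-side illustration of this zero-transfer launch, the following OpenCL sketch (buffer, kernel and function names are ours) shows FB4 reduced to a single clEnqueueNDRangeKernel call over data already resident in device global memory.

```c
#include <CL/cl.h>

/* Assumed to be created during initialization (names are ours). */
extern cl_command_queue queue;
extern cl_kernel fb4_kernel;
extern cl_mem d_particles, d_matched, d_observations, d_weights;

void run_fb4(size_t num_particles, size_t num_matched)
{
    /* One work-item per (particle, matched landmark) pair. */
    size_t global = num_particles * num_matched;

    clSetKernelArg(fb4_kernel, 0, sizeof(cl_mem), &d_particles);
    clSetKernelArg(fb4_kernel, 1, sizeof(cl_mem), &d_matched);
    clSetKernelArg(fb4_kernel, 2, sizeof(cl_mem), &d_observations);
    clSetKernelArg(fb4_kernel, 3, sizeof(cl_mem), &d_weights);

    /* No clEnqueueWriteBuffer before and no clEnqueueReadBuffer after:
     * FB3 left all inputs in device global memory, and the updated
     * landmarks stay there until FB5 reads the whole map back. */
    clEnqueueNDRangeKernel(queue, fb4_kernel, 1, NULL, &global, NULL,
                           0, NULL, NULL);
}
```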


4.4.5 FB5 implementation

In our implementation we used the inverse depth parametrization for landmark initialization Montiel et al. (2006). The initial coordinates of a landmark are (x_i, y_i, θ_i, φ_i, ρ_i), where x_i and y_i correspond to the first-view camera position, θ_i is the azimuth, φ_i is the elevation and ρ_i is the inverse depth. The parallel implementation of this block is quite different. When transferring the matched landmark parameters in the particle update (FB3), we also transfer their corresponding observation parameters, so the kernel is executed without any previous transfer. It computes the inverse depth parametrization for all the new landmarks within the kernel, and this is done for all particles. Note that we use a binary tree for map management, as mentioned earlier. Therefore, all the matched landmarks that have been updated (FB4) and the landmarks initialized (FB5) are transferred back in this step to the host CPU.

4.4.6 FB6 implementation

To prevent particle depletion, the resampling functional block replaces unlikely particles with likelier ones. A new set of trajectories is sampled proportionally to the corresponding weights. The most time-consuming parts of the resampling step are accelerated on the GPU/FPGA. We note that this block has two main parts: resampling the particle states (processed on the device) and resampling the particle maps (processed on the host CPU, since the maps are arranged in binary trees on the CPU). One of the bottlenecks in this block is the calculation of the total weight and of the resampling threshold N_eff:

N_eff = 1 / Σ_{i=1}^{M} ω_i²

where ω_i is the normalized weight of particle i and M is the number of particles.

Figure 5: Sub-tasks grouped by the remainder (id % K) compute the weight summations of chunks of particles. The results are transferred back to the CPU to compute the final total weight.

To implement the weight summation in parallel, we divide the sum into K sub-tasks. Each sub-task computes the sum of the weights of all the particles whose identification has the same remainder when divided by K. The total weight is then the sum of all the sub-task results. Each sub-task is processed by one kernel work-item, and the tasks are executed in parallel. The results of the sub-tasks are transferred to the host CPU to calculate the total weight. The number of particles M is chosen to be a multiple of the number of sub-tasks K. This is illustrated in Fig.5 and sketched below. The computation of the minimum number of effective particles before resampling, N_eff, is implemented in the same way: instead of computing the sum, we compute the square sum. The sub-tasks compute the square sums of chunks of particle weights, and the final value is then computed by the CPU.
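A minimal OpenCL C sketch of this chunked summation follows (names are ours). Work-item k sums the weights of the particles whose index i satisfies i % K == k; the host then adds the K partial results. With square = 1 the kernel returns square sums, from which the host computes N_eff.

```c
__kernel void partial_weight_sum(__global const float *weights,
                                 __global float *partial,  /* K entries */
                                 const int M,              /* particle count */
                                 const int square)         /* 0: sum, 1: square sum */
{
    int k = get_global_id(0);               /* sub-task id, 0 <= k < K */
    int K = (int)get_global_size(0);
    float s = 0.0f;
    for (int i = k; i < M; i += K)          /* particles with i % K == k */
        s += square ? weights[i] * weights[i] : weights[i];
    partial[k] = s;                         /* host adds the K partials */
}
```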

4.5 Hardware Specification

As a dedicated architecture, we used the Arria 10 FPGA as the target platform for our SLAM implementation (Fig.6). The Arria 10 is one of the latest chips produced by Altera, delivering the highest performance at 20 nm. The Arria 10 FPGA-based SoC is a low-power embedded architecture, with up to 40% lower power than the previous FPGA generation. It allows up to 1500 GFLOPS of floating-point performance with its DSP blocks. The system clock is 100 MHz. The chip also includes a Dual-Core ARM processor operating at 1.5 GHz. Tab.7 shows the available resources of the Arria 10 FPGA in terms of logic elements, DSP and memory blocks. To evaluate the OpenCL FPGA implementation, we used a host computer operating at 2.5 GHz under Red Hat Enterprise Linux 7.2 with 64 GB of RAM, with the FPGA board mounted on a PCIe slot. We used the Alaric board embedding the Altera Arria 10 SoC (660 KLEs). The development in this paper was done using specific tools dedicated to Altera device design, which are not available in the public realm. We also used a recent BSP (Board Support Package) for the OpenCL implementation on the Arria 10 SX660, provided by the board vendor. The SLAM algorithm was developed in OpenCL using the Altera SDK (Software Development Kit) for OpenCL. The Altera SDK for OpenCL allows avoiding the traditional hardware FPGA development flow, in order to achieve a much faster and higher-level software development flow. It includes multiple optimizations and can produce detailed reports on the compilation and the code optimization. Furthermore, we used the Quartus Prime Pro edition for synthesis, as it is the only software that supports the Arria 10 device.

The architectures described in Tab.1 contain different embedded GPUs, whose main characteristics are detailed in Tab.8.

| Resource | Arria 10 device SX 660 |
| Logic Elements (LE) | 660K |
| ALM | 251,680 |
| Registers | 1,006,720 |
| Memory: M20K | 42,620 |
| Memory: MLAB | 5,788 |
| DSP Blocks | 1,687 |
| 18 x 19 Multipliers | 3,374 |
| 17.4 Gbps Transceivers | 48 |
| PCIe Hard IP Blocks | 2 |
| Hard Memory Controllers | 16 |

Table 7: Resources of the Arria 10

The Pandaboard ES implements a PowerVR SGX544 graphics processor for image processing, running at 200 MHz. The i.MX6 chip includes a Vivante GC2000 GPU with 4 SIMD cores, each running at 500 MHz. The ODROID XU4 chip includes a Mali-T628 MP6 GPU, a device with 8 embedded SIMD cores. The chip of the Tegra X1 includes a powerful NVIDIA Maxwell GPU with 256 NVIDIA CUDA cores running at 950 MHz. The high-end machine contains an NVIDIA GeForce GTX 560 GPU; it has 14 streaming processors and 336 cores running at 950 MHz. In our evaluation we selected the ODROID XU4 and the Tegra X1, to evaluate the algorithm on GPUs with both a limited and a high number of cores. For a comparative study, we also implemented the algorithm on the high-end GPU of the desktop machine. The development on the embedded GPUs is done using OpenCL and OpenGL, as most of the tested embedded GPUs support these two languages, unlike CUDA, which is specific to NVIDIA hardware. The description of the OpenGL implementation is given in our previous work Abouzahir et al. (2016).

Figure 6: Arria 10 FPGA

4.6 Experimental Results

4.6.1 Estimated Resource Usage Summary

Tab.9 shows the estimated resource usage of the FPGA when implementing the SLAM algorithm in OpenCL without optimization. The FBs use 121% of the logic elements, 65% of the logic registers, 87% of the memory blocks and 49% of the DSP blocks. A kernel optimization is therefore mandatory in order to fit and accelerate all FBs on the FPGA. It is achieved by implementing optimization strategies such as unrolling loops, setting work-group sizes, specifying the number of compute units and work-items, and setting optimization options during compilation (an example is sketched after Table 9). This optimizes the data processing efficiency of the OpenCL kernels, and all FBs can then be accelerated on the hardware.

4.6.2 Performance Evaluation

Each kernel execution consists of three phases: data transfer to the device, kernel execution, and data transfer back from the device to the host CPU after the kernel execution. We note that we deliberately included the data transfer phases together with the kernel execution in the time measurements, in order to take the overhead of data transfer into account, as sketched below.
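A minimal host-side sketch of such a measurement with OpenCL profiling events follows (names are ours; the command queue must be created with CL_QUEUE_PROFILING_ENABLE).

```c
#include <CL/cl.h>

/* Assumed created elsewhere (names are ours). */
extern cl_command_queue queue;   /* created with CL_QUEUE_PROFILING_ENABLE */
extern cl_kernel kernel;
extern cl_mem d_in, d_out;

double measure_fb_ms(void *h_in, void *h_out, size_t in_bytes,
                     size_t out_bytes, size_t global)
{
    cl_event ev_in, ev_out;
    cl_ulong t0, t1;

    /* Transfer in, execute, transfer back: all three phases are covered. */
    clEnqueueWriteBuffer(queue, d_in, CL_FALSE, 0, in_bytes, h_in, 0, NULL, &ev_in);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 1, &ev_in, NULL);
    clEnqueueReadBuffer(queue, d_out, CL_FALSE, 0, out_bytes, h_out, 0, NULL, &ev_out);
    clFinish(queue);

    /* Span from the start of the input transfer to the end of the readback. */
    clGetEventProfilingInfo(ev_in,  CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(ev_out, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
    return (t1 - t0) * 1e-6;   /* profiling counters are in nanoseconds */
}
```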

| | Desktop | Laptop | Tegra X1 | ODROID XU4 | i.MX6 Sabre Lite | Pandaboard ES |
| GPU | Nvidia GTX 540M | Intel Graphics Accelerator | NVIDIA Maxwell | ARM Mali-T628 MP6 | Vivante GC2000 | PowerVR SGX540 |
| GPU MHz | 950 | 533 | 900 | 600 | 500 | 200 |
| Languages | CUDA/OpenCL | OpenCL/OpenGL | CUDA/OpenGL | OpenCL/OpenGL | OpenCL/OpenGL | OpenGL ES |

Table 8: High-end and embedded GPU specifications

| FBs | Logic utilization (before/after) | Dedicated logic registers (before/after) | Memory blocks (before/after) | DSP blocks (before/after) |
| FB1 | 35 / 34 | 19 / 19 | 22 / 20 | 17 / 17 |
| FB3 | 40 / 33 | 20 / 20 | 26 / 26 | 19 / 19 |
| FB4 | 29 / 16 | 18 / 18 | 20 / 20 | 13 / 13 |
| FB5 | 6 / 6 | 3 / 3 | 6 / 6 | 0 / 0 |
| FB6 | 11 / 8 | 5 / 5 | 13 / 13 | 0 / 0 |
| Total (%) | 121 / 97 | 65 / 65 | 87 / 85 | 49 / 49 |

Table 9: Estimated resource usage summary (%), before / after kernel optimization
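For reference, optimizations of the kind listed in section 4.6.1 (loop unrolling, fixed work-group sizes, replicated compute units) are expressed in the Altera SDK for OpenCL as kernel attributes and pragmas, as in the illustrative fragment below. The attribute values and the kernel body are placeholders, not the settings that produced Table 9.

```c
__attribute__((num_compute_units(2)))           /* replicate the kernel pipeline */
__attribute__((num_simd_work_items(4)))         /* vectorize the work-items      */
__attribute__((reqd_work_group_size(64, 1, 1))) /* fix the work-group size       */
__kernel void fb_example(__global const float *in, __global float *out)
{
    int id = get_global_id(0);
    float acc = 0.0f;
    #pragma unroll                              /* fully unroll this short loop  */
    for (int j = 0; j < 8; j++)
        acc += in[8 * id + j];
    out[id] = acc;
}
```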


To evaluate the parallel implementations discussed above, each implementation was run and timed using the Rawseeds datasets. The evaluation uses wheel encoders and monocular camera data. Processing times of the FBs depend on many parameters; their dependencies determine the number of occurrences of a functional block per iteration. The processing times reported in this paper were computed relative to the Mean of Occurrences per Time Stamps (MOTS), as detailed in our previous work Abouzahir et al. (2016). Table 10 synthesizes a comparison of the processing times of each major kernel function (GPU and FPGA implementations) obtained after parallelization, when running the algorithm with 4096 particles for 500 iterations.

Recall that in the prediction functional block (FB1), we generate the random numbers on the host CPU and transfer only 60 random numbers from the host CPU to the device. This block is executed in 38.33 ms on the 256 CUDA cores (TX1), 90.06 ms on the 8 SIMD cores (XU4) and 3.48 ms on the FPGA. In the particle update functional block (FB3), we transfer the matched landmarks for all particles, the matching index and the observations from the host CPU to the device. The Gaussian construction is done within the compute kernel, and all the matched landmarks are held in global memory; this increases performance and reduces data transfer at each iteration. This block is executed in 23.86 ms on the TX1 GPU, 56.08 ms on the XU4 GPU and 2.16 ms on the FPGA. The results for the estimation task (FB4) are in line with expectations: as noted above, there is no transfer either before or after the kernel execution, since the matched landmark parameters have already been transferred in FB3. This block is executed in 14.43 ms on the TX1 GPU, 33.90 ms on the XU4 GPU and 1.31 ms on the FPGA. In contrast, in the implementation of the inverse depth initialization (FB5), there is no data transfer to the kernel before execution; however, we transfer back, from device to host CPU, the entire map of particles (the updated and the initialized landmarks) in order to arrange them in the binary tree. This block is executed in 7.15 ms on the TX1 GPU, 17.30 ms on the XU4 GPU and 0.65 ms on the FPGA. In the last functional block (FB6), the execution times are 2.033 ms, 4.84 ms and 0.18 ms on the TX1 GPU, the XU4 GPU and the FPGA respectively.

Overall, the processing time of the FPGA implementation is lower by a factor of x12 compared to the GPGPU implementation on the 256 CUDA cores of the TX1, by a factor of x17 compared to the 8 SIMD cores of the XU4, and by a factor of x7 compared to the 520 CUDA cores of the high-end GPU. The FastSLAM1.0 implementation proposed in Sileshi et al. (2016a) is able to process a particle every 142.62 ns, but that algorithm is not fully hardware-accelerated. In our implementation, only the image processing block is parallelized on the host CPU, while the other blocks of FastSLAM2.0 are fully accelerated on the FPGA. Our FastSLAM2.0 FPGA implementation is able to process a particle in 138.92 ns for a single detected landmark. Such an acceleration result is promising and can ensure real-time execution.
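To make the measurement methodology explicit, the following host-side sketch (standard OpenCL C API; the helper function, buffer names and sizes are ours, for illustration only) times one functional block so that both transfer phases are included, as stated in Section 4.6.2:

    #include <CL/cl.h>
    #include <time.h>

    /* Times one functional block end-to-end: host-to-device transfer,
       kernel execution and device-to-host transfer, in milliseconds.
       Error checking is omitted for brevity. */
    static double time_fb_ms(cl_command_queue q, cl_kernel k,
                             cl_mem d_in, const void *h_in, size_t in_bytes,
                             cl_mem d_out, void *h_out, size_t out_bytes,
                             size_t global_size, size_t local_size)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* Phase 1: host-to-device transfer. */
        clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, in_bytes, h_in, 0, NULL, NULL);

        /* Phase 2: kernel execution. */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global_size, &local_size,
                               0, NULL, NULL);

        /* Phase 3: device-to-host transfer. The blocking read also waits
           for the kernel, so t1 - t0 covers all three phases. */
        clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, out_bytes, h_out, 0, NULL, NULL);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }

When per-phase times are needed, the same measurement can be refined with OpenCL event profiling (a queue created with CL_QUEUE_PROFILING_ENABLE and queried with clGetEventProfilingInfo).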

Figure 7: FPGA, GPU performance comparison.

The SLAM algorithm implemented in our work uses a monocular camera (a bearing-only sensor) to observe the environment. In such a system, a sufficient number of particles is necessary to maintain reasonable estimates of the pose and landmark uncertainty, as seen in Section 3.2. The required number of particles is not defined a priori and may increase with the environment complexity; a large number of particles is needed to achieve accurate localization results. Therefore, a naive implementation of FastSLAM2.0 with a high number of particles would increase the localization accuracy, but at the expense of the robot's ability to operate in real time. To achieve a compromise between accurate localization and real-time performance, we implemented the algorithm for different numbers of particles ranging from 2^2 to 2^16. Note that a low number of particles (2^4) is not enough for accurate localization, yet not as many as 2^14 or 2^16 are needed either; this range, however, allows a better analysis of complexity.


Table 10: Mean processing times (ms) of the functional blocks on the Rawseeds dataset

FBs          | ARM GPU SoC XU4 | ARM GPU SoC X1 | Desktop high-end GPU | Arria 10 FPGA
-------------|-----------------|----------------|----------------------|--------------
FB2 (host)   | 4.01            | 4.62           | 2.33                 | 2.01
FB1          | 90.06           | 38.33          | 31.33                | 3.48
FB3          | 56.08           | 23.86          | 19.50                | 2.16
FB4          | 33.90           | 14.43          | 11.79                | 1.31
FB5          | 17.30           | 7.15           | 7.153                | 0.65
FB6          | 4.84            | 2.033          | 1.743                | 0.18
Total (ms)   | 172.28          | 90.42          | 73.84                | 9.79

FB2 (the image processing block) runs on the host CPU of each platform (octa-core ARM on the XU4, quad-core ARM on the X1, and the high-end host CPU for the desktop GPU and FPGA configurations); all other FBs run on the GPU or FPGA device.

Fig.7 explores the parallel computing power of the FPGA. For larger numbers of particles, the speed-up achieved by the FPGA implementation meets the real-time constraints. As a result, even if a large number of particles is needed in a complex environment to achieve accurate mapping and localization results, the FPGA implementation of the monocular FastSLAM2.0 system remains efficient and can operate under real-time constraints. Furthermore, the results show a significant improvement of the algorithm processing time on an FPGA device over a modern powerful embedded GPGPU (TX1) and over a high-end GPU (desktop machine). Using a real dataset, the parallel implementation on the FPGA is shown to outperform the embedded GPGPU implementation.

Fig.8 shows the evolution of the CPP (cycles per particle) for the different implementations, computed with the following equation:

CPP = (f x t) / M

where M is the number of particles, f is the clock frequency expressed in Hz, and t is the processing time in seconds. The CPP serves as a baseline for performance comparison of implementations on different processing units Njiki et al. (January 2016). For small numbers of particles, the parallel architecture needs more cycles to execute the FastSLAM2.0 algorithm. For larger numbers of particles, the CPP of the FPGA implementation keeps a lower value than those of the high-end and embedded GPGPU implementations as the number of particles increases.

Figure 8: FPGA, GPU performance comparison.

Fig.9 shows the speed-up achieved by the FPGA implementation compared to the GPU implementations on the different architectures. For small numbers of particles, the FPGA implementation performs better than the GPU and we obtain the best global acceleration. This is because the FPGA implementation is sufficiently fast and not affected by the transfer time, whereas for the GPU the transfer time for a small number of particles is more important than the processing time within the kernel. Tab.11 shows the average transfer rates between CPU-GPU and CPU-FPGA.
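As an illustrative numerical application of this equation (using the nominal TX1 GPU clock from Tab.8 and the FB1 time from Tab.10, so the figure is indicative only): with f = 900 MHz, t = 38.33 ms and M = 4096 particles,

CPP = (900 x 10^6 x 0.03833) / 4096 ≈ 8.4 x 10^3 cycles per particle.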


Figure 9: FPGA, GPU global acceleration.

The transfer between the CPU and the FPGA is faster than the CPU-GPU transfers on the other architectures.

Table 11: Host-device data transfer rates on the different architectures

Architecture     | Link       | Transfer rate (MB/s)
-----------------|------------|---------------------
XU4              | ARM-GPU    | 540.74
TX1              | ARM-GPU    | 652.41
High-end desktop | Intel-GPU  | 872.32
Arria 10         | Intel-FPGA | 1723.51
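To put these rates in perspective (illustrative arithmetic only, since the payload per functional block depends on the numbers of particles and landmarks): transferring 1 MB takes about 1000/540.74 ≈ 1.85 ms over the XU4 ARM-GPU link, but only about 1000/1723.51 ≈ 0.58 ms over the Arria 10 PCIe link. For small particle counts, such transfer costs are of the same order as the kernel execution times themselves, which is why the GPU implementations are transfer-bound in that regime.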

As the number of particles increases, the global acceleration of the FPGA compared to the GPU decreases. This is because the degree of parallelization on the GPU becomes more important (more processing elements are engaged), so the GPU approaches its best acceleration results. However, for even larger numbers of particles, there are not enough processing elements available on the GPU. In contrast, the degree of parallelization keeps increasing on the FPGA, which lets us exploit the maximum performance of the device. This widens the gap between the FPGA and the GPU, so a high speed-up is achieved with the FPGA implementation compared to the GPU one.

Table 12 shows the FPS achieved by monocular FastSLAM2.0 and by KinectFusion across different high-end and low-end parallel architectures. The proposed monocular SLAM system on FPGA achieves super real-time performance with 102.14 FPS, which is very impressive for an embedded device. Such a configuration could only be achieved on high-end computers for the KinectFusion algorithm.

The encouraging results we obtained on the FPGA in terms of acceleration performance demonstrate that a dedicated architecture can be integrated in a SLAM system and operate under real-time constraints. However, our evaluation is still limited, for several reasons. First, a full deployment of the algorithm on the FPGA is not yet achieved: in our tests we used a host CPU from a high-end machine as a co-processor, with the FPGA mounted on the PCIe bus. The image processing part of the monocular FastSLAM2.0 still has to be implemented on the ARM SoC of the Arria 10 in order to achieve a fully embedded implementation. This raises a new challenge: to investigate whether the optimization strategies used for the Intel-FPGA implementation are sufficient to meet real-time constraints on an ARM-FPGA SoC. Indeed, the very high frame rate achieved by the Intel-FPGA implementation depends on the fast data transfer between the Intel CPU and the FPGA via the PCIe bus; such a transfer rate will undoubtedly be lower when using the integrated ARM processor. In real-time computing, data transfer is a critical operation that increases the processing time, so we have to make sure that the transfer time between the ARM and the FPGA does not overpower the performance gain. Second, this paper adopted high-level synthesis for FPGA programming. Despite the advantages of OpenCL-based high-level programming, its use is still limited: the reported results are not easily reproducible, as they rely on commercial systems and tools dedicated to Altera devices. In addition, the Altera Offline Compiler (AOC) takes a tremendous amount of time to generate the hardware configuration files (compilation lasts hours, and even days, for the more complex functional blocks or when more optimizations are requested from the compiler). The FPGA has long been considered a generation of integrated circuits promising higher performance and reliability, but more effort is needed to make it easily accessible to the community.
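For reference, the hardware configuration files discussed above are produced by an offline compilation of the following form; the exact flags and board name depend on the SDK version and board support package, so this invocation is illustrative only:

    aoc fastslam_kernels.cl -o fastslam_kernels.aocx --board a10gx -v

It is this full hardware compilation, not the host build, that takes hours to days when complex functional blocks or aggressive optimizations are involved.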

Table 12: FPS achieved by monocular FastSLAM2.0 (2^12 particles) and by KinectFusion using SLAMBench (Nardi et al. (2015))

Monocular FastSLAM2.0 (2^12 particles):

Machine type | Desktop (CPU-GPU) | XU4 (ARM-GPU) | X1 (ARM-GPU) | Arria 10 (Host-FPGA)
FPS          | 13.54             | 5.8           | 11.05        | 102.14

SLAMBench (KinectFusion):

Machine type | TITAN (GPU) | GTX870M (GPU) | TK1 (GPU) | XU3 (GPU) | Arndale (GPU)
FPS          | 135         | 96            | 22        | 5.5       | 4.2

5 Conclusion

This paper quantitatively evaluated the execution times of widely used SLAM algorithms on low-power embedded architectures. We presented performance results on state-of-the-art desktop, laptop, embedded and dedicated architectures. Our work demonstrated that the multi-core and heterogeneous architectures of embedded platforms can be exploited to achieve a real-time SLAM application, although this still depends on the nature of the algorithm under test. Our work also demonstrated that FastSLAM2.0 is an attractive solution to the SLAM problem in terms of execution times. The choice of using both FastSLAM2.0 and a dedicated architecture was made principally because FastSLAM2.0 can be massively parallelized. This property is exploited by using a dedicated FPGA-based architecture as a target platform for an efficient embedded SLAM system. In essence, the absolute performance of FastSLAM2.0 on a parallel embedded architecture relies on the number of processing elements. As the number of processing elements steadily increases and can be expected to match the number of particles needed for an accurate monocular FastSLAM2.0 system intended to operate in a large-scale environment, an embedded FPGA-based SoC architecture is an interesting alternative for a SLAM implementation using the hardware-software co-design approach. One potential application of our study is the use of SLAM algorithms on UAVs Schmuck and Chli (2017). Our optimized monocular SLAM system on FPGA shows the possibility of executing the SLAM algorithm on the embedded computer of a robot instead of relying on external computation and power resources. As future work, we intend to achieve a fully embedded implementation of monocular FastSLAM2.0 on the FPGA using the integrated ARM processor of the Arria 10 SoC.

6 Acknowledgment

The authors gratefully acknowledge the financial support of the International Relations Department of Paris-Sud University. The authors thank Professor O. Hammami for his support, allowing us to benchmark the SLAM system on the Alaric platform and to use the associated tools at ENSTA ParisTech.

References

Abouzahir, M., Elouardi, A., Bouaziz, S., Latif, R., Tajer, A., 2016. Large-scale monocular FastSLAM2.0 acceleration on an embedded heterogeneous architecture. EURASIP Journal on Advances in Signal Processing 2016 (1), 88.

Bonarini, A., Burgard, W., Fontana, G., Matteucci, M., Sorrenti, D. G., Tardos, J. D., 2006. RAWSEEDS: Robotics advancement through web-publishing of sensorial and elaborated extensive data sets. In: Proceedings of the IROS'06 Workshop on Benchmarks in Robotics Research.

Bodin, B., Nardi, L., Zia, M. Z., Wagstaff, H., Sreekar Shenoy, G., Emani, M., Mawer, J., Kotselidis, C., Nisbet, A., Lujan, M., et al., 2016. Integrating algorithmic parameters into benchmarking and design space exploration in 3D scene understanding. In: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. ACM, pp. 57-69.

Botero, D., Gonzalez, A., Devy, M., et al., 2012. Architecture embarquée pour le SLAM monoculaire. In: Actes de la conférence RFIA 2012.

Cong, J., Liu, B., Neuendorffer, S., Noguera, J., Vissers, K., Zhang, Z., April 2011. High-level synthesis for FPGAs: From prototyping to deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30 (4), 473-491.

Davison, A., 2003. Real-time simultaneous localisation and mapping with a single camera. In: IEEE Int. Conf. on Computer Vision. pp. 1403-1410.

Eade, E., Drummond, T., June 2006. Scalable monocular SLAM. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Vol. 1. pp. 469-476.

Idris, M. Y. I., Noor, N. M., Tamil, E. M., Razak, Z., Arof, H., May 2010. Parallel matrix multiplication design for monocular SLAM. In: 2010 Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation. pp. 492-497.

Ma, L., Falquez, J. M., McGuire, S., Sibley, G., 2016. Large scale dense visual inertial SLAM. In: Field and Service Robotics. Springer, pp. 141-155.

Milford, M. J., Wyeth, G. F., Rasser, D., 2004. RatSLAM: a hippocampal model for simultaneous localization and mapping. In: Robotics and Automation, 2004. Proceedings. ICRA'04. 2004 IEEE International Conference on. Vol. 1. IEEE, pp. 403-408.

Mohanarajah, G., Usenko, V., Singh, M., D'Andrea, R., Waibel, M., 2015. Cloud-based collaborative 3D mapping in real-time with low-cost robots. IEEE Transactions on Automation Science and Engineering 12 (2), 423-431.

Montemerlo, M., 2003. FastSLAM: A factored solution to the simultaneous localization and mapping problem with unknown data association. Ph.D. thesis.

Montemerlo, M., Thrun, S., Koller, D., Wegbreit, B., 2003. FastSLAM2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI). IJCAI, Acapulco, Mexico.

Montiel, J., Civera, J., Davison, A., 2006. Unified inverse depth parametrization for monocular SLAM. In: Int. Conf. on Robotics: Science and Systems. pp. 16-19.

Mur-Artal, R., Montiel, J., Tardos, J. D., 2015. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31 (5), 1147-1163.

Nardi, L., Bodin, B., Saeedi, S., Vespa, E., Davison, A. J., Kelly, P. H., 2017. Algorithmic performance-accuracy trade-off in 3D vision applications using HyperMapper. arXiv preprint arXiv:1702.00505.

Nardi, L., Bodin, B., Zia, M. Z., Mawer, J., Nisbet, A., Kelly, P. H., Davison, A. J., Luján, M., O'Boyle, M. F., Riley, G., et al., 2015. Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM. In: 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 5783-5790.

Njiki, M., Elouardi, A., Bouaziz, S., Casula, O., Roy, O., January 2016. A multi-FPGA architecture-based real-time TFM ultrasound imaging. Journal of Real-Time Image Processing.

Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A. E., Purcell, T. J., 2007. A survey of general-purpose computation on graphics hardware. In: Computer Graphics Forum. Vol. 26. Wiley Online Library, pp. 80-113.

Riazuelo, L., Civera, J., Montiel, J., 2014. C2TAM: A cloud framework for cooperative tracking and mapping. Robotics and Autonomous Systems 62 (4), 401-413.

Saeedi, S., Nardi, L., Johns, E., Bodin, B., Kelly, P. H., Davison, A. J., 2017. Application-oriented design space exploration for SLAM algorithms. In: ICRA.

Schmuck, P., Chli, M., 2017. Multi-UAV collaborative monocular SLAM.

Schulz, V. H., Bombardelli, F. G., Todt, E., Oct 2015. A SoC with FPGA landmark acquisition system for binocular visual SLAM. In: 2015 12th Latin American Robotics Symposium and 2015 3rd Brazilian Symposium on Robotics (LARS-SBR). pp. 336-341.

Seignez, E., Kieffer, M., Lambert, A., Walter, E., Maurin, T., 2009. Real-time bounded-error state estimation for vehicle tracking. International Journal of Robotics Research 28, 34-48.

Sileshi, B., Oliver, J., Toledo, R., Gonçalves, J., Costa, P., 2016a. On the behaviour of low cost laser scanners in HW/SW particle filter SLAM applications. Robotics and Autonomous Systems 80, 11-23.

Sileshi, B. G., Oliver, J., Toledo, R., Gonçalves, J., Costa, P., 2016b. Particle filter SLAM on FPGA: A case study on Robot@Factory competition. In: Robot 2015: Second Iberian Robotics Conference. Springer, pp. 411-423.

Tertei, D. T., Piat, J., Devy, M., 2016. FPGA design of EKF block accelerator for 3D visual SLAM. Computers & Electrical Engineering.

Zhao, L., Huang, S., Dissanayake, G., 2013. Linear SLAM: A linear solution to the feature-based and pose graph SLAM based on submap joining. In: Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, pp. 24-30.


Author biographies

Mohamed ABOUZAHIR received the M.S. degree in Electronics Engineering from Cadi Ayyad University, Marrakech, in 2012, and a doctorate in embedded systems and robotics from Paris-Saclay and Ibn Zohr Universities in 2017. Within the Methods and Tools for Signals and Systems group in Paris, his research concerns the hardware-software matching of simultaneous localization and mapping algorithms for autonomous robots.

Abdelhafid ELOUARDI received the M.S. degree from Pierre & Marie Curie University, Paris, in 2001 and the Ph.D. degree in Electronics from Paris-Sud University, Orsay, in 2005. He worked at Henri Poincaré University, Nancy, as a researcher in 2005-2006. He is currently an Associate Professor at the Fundamental Electronics Institute, Paris-Sud University, Orsay, France. Within the Embedded Systems team of the Autonomous Systems Department, his research interests include hardware-software co-design, the evaluation and instrumentation of embedded systems, the design of smart architectures for image and signal processing, and simultaneous localization and mapping applications.

Samir BOUAZIZ received the Ph.D. degree in Electronics from Paris-Sud University, Orsay, France, in 1992. He is a Professor at the Fundamental Electronics Institute (IEF) of the same university, where he leads the Embedded Systems research team. His research focuses on the hardware/software design of embedded systems for autonomous vehicles and robots, driven by time constraints and complexity considerations to achieve a good fit between hardware and algorithms. Embedded electronics, instrumentation and benchmark systems for understanding human-vehicle interactions are also among his scientific interests.

Rachid LATIF received the Ph.D. degree in Signal Processing in 2001 from Ibn Zohr University, in collaboration with Le Havre University, France, and the Habilitation degree in 2005 from Ibn Zohr University, Morocco. Currently, he is a Professor with the Department of Electrical Engineering, National School of Applied Sciences, Ibn Zohr University, Agadir, Morocco. Prof. Latif is the head of the Signals, Systems and Computer Science Laboratory (ESSI) at the National School of Applied Sciences and is a member and General Secretary of the Moroccan Acoustical Society (SMA). His research interests include fuzzy logic, time-frequency signal processing, monitoring and embedded systems.

Abdelouahed TAJER received a Ph.D. in Control of Discrete-Event Systems from the University of Reims Champagne-Ardenne, France, in 2005, after a Master degree in Systems Optimization and Safety from the University of Reims Champagne-Ardenne / University of Technology of Troyes. His research interests include discrete-event systems, fault diagnosis, modeling, supervisory control theory, optimal control, complex systems and embedded systems. Currently, he is an HDR professor at the National School of Applied Sciences (ENSA), Cadi Ayyad University, Marrakech, Morocco, and conducts his research at the LGECOS Laboratory.


Highlights

- A complete study of the processing times of four different SLAM algorithms on popular embedded devices.
- A parallel implementation and comparison of FastSLAM2.0 on a GPGPU and an FPGA.
- High-level synthesis on FPGA using OpenCL.
- The global processing time of the FastSLAM2.0 Intel-FPGA implementation is accelerated 7.5x compared to a high-end GPU.
- Super real-time FPS achieved by the Intel-FPGA implementation of FastSLAM2.0.