Algorithm and hardware implementation for visual perception system in autonomous vehicle: A survey
Weijing Shi a, Mohamed Baker Alawieh a, Xin Li b,c,⁎, Huafeng Yu d
a Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
b Electrical and Computer Engineering Department, Duke University, Durham, NC 27708, USA
c Institute of Applied Physical Sciences and Engineering, Duke Kunshan University, Kunshan, Jiangsu 215316, China
d Boeing Research and Technology, Huntsville, AL 35758, USA
⁎ Corresponding author at: Electrical and Computer Engineering Department, Duke University, Durham, NC 27708, USA. E-mail addresses: [email protected] (W. Shi), [email protected] (M.B. Alawieh), [email protected] (X. Li), [email protected] (H. Yu).
http://dx.doi.org/10.1016/j.vlsi.2017.07.007. Received 20 May 2017; received in revised form 24 July 2017; accepted 26 July 2017. © 2017 Elsevier B.V. All rights reserved.
Keywords: Algorithm; Hardware; Autonomous vehicle; Visual perception

Abstract
This paper briefly surveys the recent progress on visual perception algorithms and their corresponding hardware implementations for the emerging application of autonomous driving. In particular, vehicle and pedestrian detection, lane detection and drivable surface detection are presented as three important applications for visual perception. On the other hand, CPU, GPU, FPGA and ASIC are discussed as the major components to form an efficient hardware platform for real-time operation. Finally, several technical challenges are presented to motivate future research and development in the field.
1. Introduction

The last decade has witnessed tremendous development of autonomous and intelligent systems: satellites in space, drones in the air, autonomous vehicles on the road, and autonomous vessels in the water. These autonomous systems aim at progressively taking over repetitive, tedious and dangerous operations from humans, especially in extreme environments. With the Grand Challenge and Urban Challenge for autonomous vehicles, organized by the Defense Advanced Research Projects Agency (DARPA) in 2004 and 2007 respectively [1], autonomous vehicles and their enabling technologies have received broad interest as well as investment from both academia and industry. After these challenges, the focus of major developments quickly shifted from academic research to industrial commercialization. Automotive Original Equipment Manufacturers (OEMs) such as GM, BMW, Tesla, Daimler and Nissan, tier-one suppliers such as Bosch, Denso and Delphi, as well as software companies such as Google, Uber and Baidu, have progressively joined the global competition for self-driving technology. Many have already noticed Google self-driving cars in Mountain View, CA and Austin, TX, and Uber cars in Pittsburgh, PA for road testing. The revolution of autonomous driving raises many discussions on issues related to society, policy, legislation, insurance, etc. For instance, how would society accept autonomous vehicles when their behavior is still unknown and/or unpredictable? How should policy be made to accelerate the development and deployment of
autonomous vehicles? What laws should be enacted to regulate autonomous vehicles and their integration into our society? How do we handle accidents and insurance involving autonomous vehicles? The recent reports on artificial intelligence [2,3] may serve as good references for thinking about and addressing these questions.
Different from conventional vehicles, autonomous vehicles are equipped with new electrical and mechanical devices for environment perception, communication, localization and computing. These devices include radar, LIDAR, ultrasonic sensors, GPS, cameras, GPUs, FPGAs, etc. They also integrate new information processing algorithms for machine learning, signal processing, encryption/decryption and decision making. The autonomy level [4] of these vehicles is ultimately determined by the combination of all these devices and algorithms at their different levels of maturity. A number of these new devices and technologies were first integrated into vehicles as enablers for Advanced Driver Assistance Systems (ADAS). ADAS only provides simple and partial autonomous features at low levels of autonomy, yet it has proven valuable in improving vehicle safety. Examples of ADAS include lane departure warning, adaptive cruise control, blind spot monitoring, automatic parking, etc. These systems generally work with the conventional vehicle E/E (Electrical/Electronic) architecture and do not require any major modification of the vehicle architecture. ADAS has therefore been extensively adopted for today's commercial vehicles at low cost.
On the other hand, an increasing number of companies are extremely interested in the research and development of high-level autonomy, where an autonomous car can drive itself instead of only assisting the driver. At this level of autonomy, vehicles are required to sense the surrounding environment like humans, including obstacle distances, signalization, location and moving pedestrians, and to make decisions like humans. These requirements lead to the adoption and integration of a large set of new sensing devices, information processing algorithms and hardware computing units, and in turn to new automotive E/E architecture designs where safety, security, performance, power consumption, cost, etc., must be carefully considered.
In spite of all the technical and social challenges of adopting autonomous vehicles, autonomy technologies are being developed with significant investment and at a fast pace [5,6]. Among them, visual perception is one of the most critical technologies, as all important decisions made by an autonomous vehicle rely on visual perception of the surrounding environment. Without correct perception, any decision made to control the vehicle is unsafe.
In this paper, we present a brief survey of various perception algorithms and the underlying hardware platforms that execute these algorithms for real-time operation. In particular, machine learning and computer vision algorithms are often used to process the sensing data and derive an accurate understanding of the surrounding environment, including vehicle and pedestrian detection, lane detection, drivable surface detection, etc. Based upon the perception outcome, an intelligent system can further make decisions to control and maneuver the vehicle. To meet the demanding computing requirements of real-time operation, special hardware platforms have been designed and implemented. Note that machine learning and computer vision algorithms are often computationally expensive and, therefore, require a powerful computing platform to process the data in a timely manner. On the other hand, a commercially competitive system must be energy-efficient and low-cost. In this paper, a number of different possible choices for hardware implementation are briefly reviewed, including CPU, GPU, FPGA, ASIC, etc.
The remainder of this paper is organized as follows. Section 2 overviews autonomous vehicles, and several visual perception algorithms are summarized in Section 3. Important hardware platforms for implementing perception algorithms are discussed in Section 4. Finally, we conclude in Section 5.
2. Autonomous vehicles

As an intelligent system, an autonomous car must automatically sense the surrounding environment and make correct driving decisions by itself. In general, the functional components of an autonomous driving system can be classified into three categories: (i) perception, (ii) decision and control, and (iii) vehicle platform manipulation [7].
The perception system of an autonomous vehicle perceives the environment and its interaction with the vehicle. Usually, it covers sensing, sensor fusion, localization, etc. By integrating all these tasks, it generates an understanding of the external world based on sensor data. Given the perception information, the driving system must make appropriate decisions to control the vehicle. The objective is to navigate the vehicle along a planned route to the destination while avoiding collisions with any static or dynamic obstacle. To achieve this goal, the decision and control functions compute the global route based on a map in the database, constantly plan the appropriate motion, and generate local trajectories to avoid obstacles. Once a driving decision is made, the components for vehicle platform manipulation execute the decision and ensure that the vehicle acts in an appropriate manner. They generate control signals for propulsion, steering and braking. Since most traditional vehicles have already adopted an electrical controlling architecture, the manipulation units usually do not require any major modification of that architecture. Additionally, vehicle platform manipulation may cover emergency safety operations in case of system failure.
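To make this three-stage organization concrete, the minimal Python sketch below wires stub perception, planning and actuation functions into the kind of sense-plan-act loop described above. The function names, message fields and the fixed 10 Hz cycle are illustrative assumptions only and do not correspond to any specific system discussed in this survey.

```python
import time

def perceive(sensor_frame):
    # Stub perception: a real system would run detection, sensor fusion
    # and localization on camera/LIDAR/radar data here.
    return {"obstacles": [], "lane_offset_m": 0.0, "pose": (0.0, 0.0, 0.0)}

def plan(world_model, route):
    # Stub decision and control: follow the global route while reacting
    # to the perceived world model (here, a toy lane-keeping rule).
    return {"target_speed_mps": 10.0,
            "steering_rad": -0.1 * world_model["lane_offset_m"]}

def actuate(command):
    # Stub vehicle platform manipulation: translate the planned command
    # into propulsion, steering and braking signals (here, just print).
    print("throttle/steer command:", command)

def driving_loop(route, cycles=3, cycle_s=0.1):
    for _ in range(cycles):
        sensor_frame = None              # placeholder for a synchronized sensor snapshot
        world_model = perceive(sensor_frame)
        command = plan(world_model, route)
        actuate(command)
        time.sleep(cycle_s)              # fixed-rate loop, 10 Hz in this sketch

if __name__ == "__main__":
    driving_loop(route=["start", "goal"])
```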
As the interface between the real world and the vehicle, an accurate perception system is extremely critical. If inaccurate perception information is used to guide the decision and control system, an autonomous car may make incorrect decisions, resulting in poor driving efficiency or, worse, an accident. For example, if the traffic sign detection system misses a STOP sign, the vehicle may not make the correct decision to stop, thereby leading to an accident. Among all perception functions, visual perception is one of the most important components. It interprets visual data from multiple cameras and performs critical tasks such as vehicle and pedestrian detection. Although an autonomous driving system usually has other, non-visual sensors, cameras are essential because they mimic human eyes and most traffic rules are designed assuming the ability of visual perception. For example, many traffic signs share similar physical shapes and are differentiated only by colored patterns that can be captured solely by a visual perception system. In the next section, we review several important applications for visual perception and highlight the corresponding algorithms.
3. Visual perception algorithms

Visual perception is mainly used for detecting obstacles that can be either dynamic (e.g., vehicles and pedestrians) or static (e.g., road curbs and lane markers). Different obstacles may have dramatically different behaviors or represent different driving rules. For example, a road curb defines the strict boundary of the road, and crossing that boundary must be avoided. A lane marker, however, defines the “soft” boundary of a driving lane, which a vehicle may cross if necessary. Therefore, it is not sufficient to detect obstacles only; a visual perception algorithm must accurately recognize the obstacles of interest. In addition to obstacle detection, visual perception is also used for drivable surface detection, where an autonomous vehicle needs to detect the drivable space even when it is off-road (e.g., in a parking lot) or when the road is not clearly defined by road markers (e.g., on a forest road). Over the past several decades, a large body of perception algorithms has been developed. Due to the page limit, we review only a small number of the most representative algorithms in this paper.

3.1. Vehicle and pedestrian detection
Detecting vehicles and pedestrians lies at the center of driving safety, and tremendous research effort has been devoted to developing accurate, robust and fast detection algorithms. Most traditional detection methods are composed of two steps. First, important features are extracted from a raw image. A feature is an abstraction of image pixels, such as the gradient of pixels or the similarity between a local image patch and a designed pattern. Features can be considered a low-level understanding of a raw image. A good feature efficiently represents the information required for detection while robustly tolerating distortions such as image rotation, variation in illumination, scaling of the object, etc. Next, once the features are available, a learning algorithm is applied to further inspect the feature values and recognize the scene represented by the image. By adopting an appropriate algorithm for feature selection (e.g., AdaBoost [8,9]), a small number of important features are often chosen from a large set of candidates to build an efficient classifier.
Histogram of oriented gradients (HoG) [10] is one of the most widely adopted features for object detection. When calculating the HoG feature, an image is divided into a grid of cells and carefully normalized over local areas. The histogram of the image gradients in a local area forms a feature vector. The HoG feature is carefully hand-crafted and can achieve high accuracy in pedestrian and vehicle detection. It also carries relatively low computational cost, which makes it popular in real-time applications such as autonomous driving. However, the design of hand-crafted features such as HoG requires extensive domain-specific knowledge, thereby limiting the successful development of new features.
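As a concrete illustration of this classical feature-plus-classifier pipeline, the short Python sketch below uses OpenCV's built-in HoG descriptor together with its pre-trained pedestrian SVM. The image path and detection parameters are placeholders, and production detectors in autonomous vehicles are typically trained on domain-specific data rather than this default model.

```python
import cv2

# HoG feature extractor combined with a linear SVM that was pre-trained
# for pedestrian detection (a Dalal-Triggs style pipeline).
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("street_scene.jpg")  # placeholder input frame

# Slide the detection window over an image pyramid: winStride is the window
# step, scale is the pyramid downsampling factor between levels.
boxes, scores = hog.detectMultiScale(image, winStride=(8, 8),
                                     padding=(8, 8), scale=1.05)

for (x, y, w, h), score in zip(boxes, scores):
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detections.jpg", image)
```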
Alternatively, new features may be generated based on existing hand-crafted features such as HoG. For instance, the authors of [11] propose to add an extra middle layer for feature extraction after computing the low-level features. The proposed middle layer combines different types of low-level features by processing them with a variety of filter patterns. Learning methods such as RealBoost are then applied to select the best combinations of low-level features, and these combinations become the new features. Although the computational cost of generating new features increases, these approaches can achieve higher detection accuracy than conventional methods relying on low-level features only.
More recently, the breakthrough of the convolutional neural network (CNN) poses a radically new approach where feature extraction is fully integrated into the learning process and all features are automatically learned from the training data [12]. A CNN is often composed of multiple layers. In a single convolutional layer, the input image is processed by a set of filters and the output can be further passed to the following convolutional layers. The filters at all convolutional layers are learned from the training data, and such a learning process can be conceptually viewed as automatic feature extraction. CNNs have demonstrated state-of-the-art accuracy for pedestrian detection [12]. However, they are computationally expensive: billions of floating-point operations are often required to process a single image. To address this complexity issue, Faster R-CNN [13] and YOLO [14] have been proposed in the literature to reduce the computational cost and, consequently, achieve real-time operation.
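To make the notion of a convolutional layer as a learned feature extractor concrete, the following NumPy sketch applies a bank of filters to a single-channel image. The filter values here are random stand-ins for weights that a real CNN would learn from training data.

```python
import numpy as np

def conv_layer(image, filters, stride=1):
    """Valid 2-D convolution of one grayscale image with a bank of filters.

    image:   (H, W) array
    filters: (K, fh, fw) array, one fh-by-fw kernel per output feature map
    returns: (K, H_out, W_out) array of feature maps after a ReLU
    """
    K, fh, fw = filters.shape
    H, W = image.shape
    H_out = (H - fh) // stride + 1
    W_out = (W - fw) // stride + 1
    out = np.zeros((K, H_out, W_out))
    for k in range(K):
        for i in range(H_out):
            for j in range(W_out):
                patch = image[i * stride:i * stride + fh,
                              j * stride:j * stride + fw]
                out[k, i, j] = np.sum(patch * filters[k])
    return np.maximum(out, 0.0)  # ReLU non-linearity

# Toy example: a 32x32 "image" filtered by four random 3x3 kernels.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
filters = rng.standard_normal((4, 3, 3))   # placeholders for learned weights
feature_maps = conv_layer(image, filters)
print(feature_maps.shape)                  # (4, 30, 30)
```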
3.2. Lane detection

Lane detection is an essential component for autonomous vehicles driving on both highways and urban streets. Failure to correctly detect a lane may break traffic rules and endanger the safety of not only the autonomous vehicle itself but also other vehicles on the road. Today, lanes are mostly defined by lane markings that can only be detected by visual sensors. Therefore, real-time vision algorithms play an irreplaceable role in reliable lane detection.
To facilitate safe and reliable driving, lane detection must be robust under non-ideal illumination and lane marking conditions. In [15], a lane detection algorithm is developed that is able to deal with challenging scenarios such as curved lanes, worn lane markings, and lane changes including emerging and splitting lanes. The proposed approach adopts a probabilistic framework to combine object recognition and tracking, achieving robust, real-time detection. However, the approach in [15] relies on motion models of the vehicle and requires information from inertial sensors to track lane markings. It may break down when the motion of the vehicle shows random patterns. To address this issue, the authors of [16] propose a new approach that characterizes the tracking model by assuming static lane markings, without relying on knowledge about the vehicle motion. It has demonstrated superior performance in extremely challenging scenarios during both daytime and nighttime.
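A minimal illustration of the vision side of lane detection is shown below: a classical Canny-plus-Hough pipeline in OpenCV that extracts straight lane-marking candidates from a single frame. The thresholds and the fixed trapezoidal region of interest are arbitrary assumptions for illustration; the tracking and probabilistic filtering described in [15,16] are not included.

```python
import cv2
import numpy as np

def detect_lane_segments(frame):
    """Return candidate lane-marking line segments for one road image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    edges = cv2.Canny(blurred, 50, 150)            # edge map of the scene

    # Keep only a trapezoidal region in front of the vehicle (assumed geometry).
    h, w = edges.shape
    mask = np.zeros_like(edges)
    roi = np.array([[(0, h), (int(0.45 * w), int(0.6 * h)),
                     (int(0.55 * w), int(0.6 * h)), (w, h)]], dtype=np.int32)
    cv2.fillPoly(mask, roi, 255)
    masked = cv2.bitwise_and(edges, mask)

    # Probabilistic Hough transform: straight segments are lane-marking candidates.
    segments = cv2.HoughLinesP(masked, rho=2, theta=np.pi / 180, threshold=50,
                               minLineLength=40, maxLineGap=20)
    return [] if segments is None else segments.reshape(-1, 4)

frame = cv2.imread("road.jpg")                      # placeholder input frame
for x1, y1, x2, y2 in detect_lane_segments(frame):
    cv2.line(frame, (x1, y1), (x2, y2), (0, 0, 255), 3)
cv2.imwrite("lanes.jpg", frame)
```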
3.3. Drivable surface detection

One of the fundamental problems in autonomous driving is to identify the collision-free surface on which a vehicle can safely drive. Although obstacle detection plays an important role in constraining the surface and defining the un-drivable space, it is not sufficient to fully determine the drivable space for two reasons. First, it is extremely difficult, if not impossible, to detect all possible physical objects in real life. Various objects may act as obstacles, and not all of them can be precisely recognized by a detection algorithm. Second, a number of obstacles may not be described by a physical, well-characterized form. For example, a bridge edge and a water surface are both obstacles over which a vehicle cannot drive. Therefore, obstacle detection alone does not provide a complete solution. For many autonomous vehicles, additional sensors such as LIDAR are deployed to accurately detect the drivable surface. These approaches, however, are expensive, and there is a strong interest in developing alternative, cost-effective approaches.
Semantic segmentation is a promising technique to address this problem. It labels every pixel in an image with the object it belongs to; the object may be a car, a building or the road itself. By using semantic segmentation, an autonomous vehicle can directly locate the drivable space. Conventional algorithms for semantic segmentation adopt random field labeling, and the dependencies among labels are modeled by combining features such as color and texture. However, these conventional algorithms rely on hand-crafted features that are not trivial to identify. In [17], a CNN is trained to extract local features automatically. These features are computed at multiple resolutions and are thus robust to scaling. It has been demonstrated that the CNN approach outperforms other state-of-the-art methods in the literature. However, adopting a CNN results in expensive computation. For this reason, the authors of [18] reduce the estimation of drivable space to an inference problem on a 1-D graph and use simple, light-weight techniques for real-time feature computation and inference. Experimental results demonstrate its superior performance even on challenging datasets.
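The column-wise flavor of light-weight drivable-space estimation can be illustrated with the toy NumPy sketch below, which scans each image column of a binary obstacle mask upward from the vehicle and records the first obstructed row. This is only a simplified stand-in for the 1-D graph inference of [18], and the obstacle mask is assumed to come from a separate segmentation or stereo module.

```python
import numpy as np

def free_space_per_column(obstacle_mask):
    """For each image column, return the row index of the closest obstacle
    when scanning upward from the bottom of the image (the vehicle side).

    obstacle_mask: (H, W) boolean array, True where a pixel is an obstacle.
    returns:       (W,) array; rows below the returned index are free space.
    """
    H, W = obstacle_mask.shape
    boundary = np.zeros(W, dtype=int)
    for col in range(W):
        rows = np.flatnonzero(obstacle_mask[:, col])
        # If the column contains no obstacle, the whole column is drivable.
        boundary[col] = rows.max() if rows.size else 0
    return boundary

# Toy example: a 6x8 mask with an obstacle block in the upper-left corner.
mask = np.zeros((6, 8), dtype=bool)
mask[0:3, 1:4] = True
print(free_space_per_column(mask))   # [0 2 2 2 0 0 0 0]
```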
4. Hardware platforms

In the last decade, autonomous vehicles have attracted worldwide attention. In addition to algorithm research, hardware development is extremely important for operating an autonomous vehicle in real time. The Urban Challenge organized by DARPA in 2007 required each team to demonstrate an autonomous vehicle navigating a given environment where complex maneuvers such as merging, passing, parking and negotiating intersections were tested [1]. After its great success, autonomous driving has been considered technically feasible and has, consequently, moved toward the commercialization phase. For this reason, various autonomous driving systems are being developed by industry. Academic researchers are also actively involved in this area, developing novel ideas and methodologies to further improve performance, enhance reliability and reduce cost.
The hardware system of an autonomous vehicle is composed of sensors (e.g., camera, LIDAR, radar, ultrasonic sensor, etc.), computing devices and a drive-by-wire vehicle platform [19]. In this section, we first briefly summarize the sensing and computing systems for autonomous driving demonstrated by several major industrial and academic players. Next, we describe the most recent progress in high-performance computing devices that facilitate autonomous driving.
4.1. Sensing systems

The camera is one of the most critical components for visual perception. Typically, the spatial resolution of a camera in an autonomous vehicle ranges from 0.3 megapixel to 2 megapixels [20,21]. A camera generates a video stream at 10–30 fps and captures important objects such as traffic lights, traffic signs, obstacles, etc., in real time.
In addition to the camera, LIDAR is another important sensor. It measures the distance between the vehicle and obstacles by actively illuminating the obstacles with laser beams [22]. Typically, a LIDAR system scans the surrounding environment periodically and generates a large number of measurement points. This “cloud” of points can be further processed to compute a 3D map of the surrounding environment [23]. LIDAR is known to be relatively robust and accurate [24], but it is also expensive.
Alternatively, a stereo camera can be used to interpret the 3D environment [25]. It is composed of two or more individual cameras. Knowing the relative spatial locations of all individual cameras, a depth map can be computed by comparing the differences between the images from the different cameras, and the distance of an object in the scene can then be estimated. Generally, a stereo camera is less expensive than a LIDAR. However, a stereo camera is a passive sensor and is sensitive to environmental artifacts and/or noise caused by bad weather and poor illumination.
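The depth-from-disparity computation behind a stereo camera can be sketched in a few lines of OpenCV, shown below. The focal length and baseline values are made-up placeholders for a calibrated rig, and a rectified image pair is assumed.

```python
import cv2
import numpy as np

# Rectified left/right images from a calibrated stereo rig (placeholders).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching produces a disparity map in 1/16-pixel units.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=9)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# Triangulation: depth = focal_length * baseline / disparity.
focal_length_px = 700.0   # assumed camera focal length in pixels
baseline_m = 0.12         # assumed distance between the two cameras in meters
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = focal_length_px * baseline_m / disparity[valid]

print("median depth of valid pixels (m):", np.median(depth_m[valid]))
```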
Besides cameras and LIDAR, radar and ultrasonic sensors are also widely used to detect obstacles. Their detection areas can be short-range and wide-angle, mid-range and wide-angle, or long-range and narrow-angle [22]. For applications such as crash detection and blind spot detection [26], a short detection range of 20–30 m is commonly used [27]. For other applications such as cruise control, a long detection range of 200 m is required [27]. Ultrasonic sensors are similar to radars, but they use high-frequency sound waves instead of radio waves to detect objects. Neither radars nor ultrasonic sensors capture the detailed information of an obstacle (e.g., color, texture, etc.), and they cannot classify obstacles into different categories (e.g., vehicle, pedestrian, etc.).
Table 1 summarizes the sensors adopted by today's autonomous vehicles. Note that most autonomous vehicles integrate multiple types of sensors for two important reasons. First, fusing the data from multiple sensors improves the overall perception accuracy. For example, a LIDAR system can quickly detect the regions of interest, and a camera system can apply highly accurate object detection algorithms to further analyze these important regions. Second, different layers of sensors with overlapping sensing areas provide additional redundancy and robustness to ensure high accuracy and reliability. For instance, when the camera system fails to detect an incoming vehicle, the radar system can act as a fail-safe and prevent an accident from happening.

Table 1. Sensors for autonomous vehicles: camera, LIDAR, radar and ultrasonic configurations of VIAC [28] (2011), Junior [24] (2011), CMU [20] (2013), V-Charge [21] (2013), A1 [29] (2014), the Bertha Benz drive [30] (2014) and BMW [31] (2015).
4.2. Computing systems

For autonomous driving, a powerful computing system is required to interpret a large amount of sensing data and perform complex perception functions in real time. To achieve this goal, a variety of computing architectures have been proposed, such as multicore CPU systems [24], heterogeneous systems [20] and distributed systems [29]. Table 2 summarizes the major computing systems adopted by several state-of-the-art autonomous vehicles.
As shown in Table 2, most systems are composed of more than one computing device. For instance, BMW adopts a standard personal computer (PC) and a real-time embedded computer (RTEPC) [31], connected by a direct Ethernet link. The PC is connected to multiple sensors and vehicle bus signals through Ethernet and CAN buses. It fuses the data from all sensors to fully understand the external environment, and it also stores a database of high-precision maps. Meanwhile, the RTEPC is connected to the actuators by CAN buses for steering, braking and throttle control, and performs a variety of important functions such as localization, trajectory planning and control. A similar system with separate computing devices can also be found in the autonomous vehicle designed by Stanford [24]. Its computing system is composed of two multicore CPU servers: a 12-core server runs the vision and LIDAR algorithms, while the other, 6-core server performs planning, control and low-level communication tasks.
The autonomous vehicle designed by Carnegie Mellon deploys four computing devices, each equipped with one CPU and one GPU [20]. All computing devices are interconnected by Ethernet to provide high computing power for complicated algorithms as well as to tolerate possible failure events. In addition, a separate interface computing device runs the user application that controls the vehicle via a touch-screen interface. This idea of using a cluster of computing devices is shared by the European V-Charge project led by ETH Zurich [21], where the computing system is composed of a cluster of six personal computers.
Table 2. Computing systems for autonomous vehicles.
Junior [24] (2011): one Intel Xeon 12-core server and one Intel Xeon 6-core server.
CMU [20] (2013): one computing device equipped with an Intel Atom D525 processor, and four mini-ITX motherboards equipped with NVIDIA GT 530 GPUs and Intel Core 2 Extreme QX9300 processors.
V-Charge [21] (2013): six personal computers.
A1 [29] (2014): two embedded industrial computers, a rapid-controller-prototyping electronic computing unit and 13 32-bit microcontroller-based electronic computing units.
BMW [31] (2015): one standard personal computer and one real-time embedded prototyping computer.
The A1 car designed by Hanyang University distributes its computing functions over even more devices [29]. It adopts a distributed computing system consisting of two embedded industrial computers, a rapid-controller-prototyping electronic computing unit and 13 microcontroller-based electronic computing units. The two high-performance embedded industrial computers provide the computing power to run sensor fusion, planning and vision algorithms. The rapid-controller-prototyping electronic computing unit is specifically designed for real-time operation and is therefore used for time-critical tasks such as vehicle control. The 13 electronic computing units are used for braking, steering, acceleration, etc. To achieve real-time response, these computing devices are placed next to the actuators in order to reduce communication latency.
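The division of labor in such distributed systems, where a perception computer feeds a separate real-time control computer, can be sketched in a few lines of Python. The UDP transport, port number and JSON message format below are purely illustrative assumptions standing in for the Ethernet and CAN links used in the actual vehicles.

```python
import json
import socket

PERCEPTION_PORT = 50007  # arbitrary port chosen for this sketch

def publish_obstacles(obstacles):
    """Perception side: send the latest obstacle list as one UDP datagram."""
    message = json.dumps({"obstacles": obstacles}).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, ("127.0.0.1", PERCEPTION_PORT))

def receive_obstacles(timeout_s=0.1):
    """Control side: block briefly for the next perception message."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("127.0.0.1", PERCEPTION_PORT))
        sock.settimeout(timeout_s)
        try:
            data, _ = sock.recvfrom(65536)
            return json.loads(data.decode("utf-8"))["obstacles"]
        except socket.timeout:
            return []  # fall back to an empty world model if perception is late

if __name__ == "__main__":
    # In a real vehicle the two sides run on different machines in parallel.
    publish_obstacles([{"type": "vehicle", "distance_m": 23.5}])
```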
While the aforementioned hardware systems have been successfully designed and adopted for real-time operation of autonomous driving, their performance (measured by accuracy, throughput, latency, power, etc.) and cost remain noncompetitive for high-volume commercial deployment. Hence, radically new hardware implementations must be developed to address both the technical challenges and the market needs in this field, as discussed in the next sub-section.

4.3. Computing devices

The aforementioned autonomous vehicles have successfully demonstrated their self-driving capabilities using conventional computing systems. However, their performance and cost are still noncompetitive for commercial deployment, and new computing devices must be developed to improve performance, enhance reliability and reduce cost. In this sub-section, we review the recent advances in the field driven by major industrial and academic players.

4.3.1. Graphics processing units

The graphics processing unit (GPU) was originally designed and used for graphics processing tasks. Over the past decades, the advance of GPUs has been driven by the real-time performance requirements of complex, high-resolution 3D scenes in computer games, where tremendous parallelism is inherent [32]. Today, the general-purpose graphics processing unit (GPGPU) is also widely used for high-performance computing (HPC). It has demonstrated promising performance for scientific applications such as cardiac bidomain simulation [33], biomolecular modeling [34], quantum Monte Carlo [35], etc. A GPU contains hundreds or even thousands of parallel processors and can achieve substantially higher throughput than a CPU when running massively parallel algorithms. To reduce the complexity of GPU programming, parallel programming frameworks such as CUDA [36] and OpenCL [37] have been developed.
Many computer vision and machine learning algorithms used for automotive perception are inherently parallel and, therefore, fit the aforementioned GPU architecture. For example, the convolutional neural network (CNN) is a promising technique for autonomous perception [38,39]. The computational cost of evaluating a CNN is dominated by the convolution operations between the neuron layers and a number of spatial filters, which can be substantially accelerated by a GPU. Therefore, a large number of computer vision and deep learning tools such as OpenCV [40] and Caffe [41] have taken advantage of GPUs to improve throughput. For this reason, the GPU has been considered a promising computing device for autonomous driving.
However, GPUs often consume a large amount of energy. For instance, the NVIDIA Tesla K40 used in [38] has a power consumption of around 235 W. Such high power consumption poses two critical issues. First, it increases the load on the power generation system inside a vehicle. Second, and more importantly, it makes heat dissipation extremely challenging because the environmental temperature inside a vehicle is often significantly higher than normal room temperature. To address these issues, various efforts have been made to design and implement mobile GPUs with reduced power consumption. For instance, NVIDIA has released its mobile GPU Tegra X1 implemented in a TSMC 20 nm technology [42]. It is composed of a 256-CUDA-core GPU and two quad-core ARM CPU clusters, as shown in Fig. 1. It also contains an end-to-end 4K 60 fps pipeline that supports high-performance video encoding, decoding and display. In addition, it offers a number of I/O interfaces such as USB 3.0, HDMI, serial peripheral interface, etc. The two ARM CPU clusters are implemented with different options: (i) a high-performance quad-core ARM A57, and (ii) a power-efficient quad-core ARM A53. When running a given set of applications, the system can switch between the high-performance and low-power cores to achieve maximum power efficiency as well as optimal performance. Tegra X1 is one of the key chips in the DRIVE PX Auto-Pilot platform marketed by NVIDIA for autonomous driving [42]. At its peak performance, Tegra X1 offers over 1 TFLOPs for 16-bit operations and over 500 GFLOPs for 32-bit operations. It is designed to improve power efficiency by optimizing its computing cores, reorganizing its GPU architecture, improving memory compression, and adopting a 20 nm technology.
Fig. 1. Simplified architecture of the NVIDIA Tegra X1 [42].
While a conventional GPU consumes hundreds of watts, Tegra X1 consumes only a few watts. For example, when running the CNN model GoogLeNet, Tegra X1 achieves a throughput of 33 images per second while consuming only 5.0 W [43]. Its energy efficiency is 5.4× better than that of the conventional desktop GPU Titan X.
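As a small illustration of why convolutions map well onto a GPU, the PyTorch snippet below runs the same convolutional layer on the CPU and, when available, on a CUDA device and reports the average wall-clock times. PyTorch itself is not discussed in this survey, and the layer and batch sizes are arbitrary.

```python
import time
import torch

def time_conv(device, iters=20):
    """Time a batch of 2-D convolutions on the given device."""
    conv = torch.nn.Conv2d(in_channels=3, out_channels=64,
                           kernel_size=7, padding=3).to(device)
    frames = torch.randn(8, 3, 480, 640, device=device)  # batch of camera frames
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(iters):
            conv(frames)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU kernels to finish
    return (time.time() - start) / iters

print("CPU  s/iter:", time_conv(torch.device("cpu")))
if torch.cuda.is_available():
    print("CUDA s/iter:", time_conv(torch.device("cuda")))
```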
4.3.2. Field-programmable gate arrays

A field-programmable gate array (FPGA) is an integrated circuit that can be configured to implement different digital logic functions. Conventionally, FPGAs have been used as emulation platforms for early-stage validation of application-specific integrated circuits (ASICs). They are now extensively used for HPC for two reasons. First, an FPGA is reconfigurable: the same FPGA fabric can be programmed to implement different logic functions. Compared to a conventional ASIC design, an FPGA design reduces the non-recurring engineering (NRE) cost by reducing the required design and validation time. Second, an FPGA is programmed for a given application with its own specific computing architecture. Hence, it improves computing efficiency and reduces energy consumption compared to CPUs and/or GPUs, whose architectures are designed for general-purpose computing.
Conventionally, an FPGA-based design is described in a hardware description language (HDL) such as Verilog or VHDL. The design is specified at the register-transfer level (RTL) by registers and the combinational logic between them. This is a low-level abstraction, and designers must decide the detailed hardware architecture and carefully handle the massive concurrency between different hardware modules. Once the RTL description is available, it is synthesized by an EDA tool to generate the netlist mapped onto the FPGA. Such a conventional design methodology is time-consuming and requires FPGA designers to fully understand all low-level circuit details. Recently, with the advance of high-level synthesis (HLS), FPGA designers can write high-level specifications in C, C++ or SystemC. An HLS tool, such as Altera OpenCL or Xilinx HLS, compiles the high-level description into HDL. Furthermore, designers can control the synthesis process by directly incorporating different “hints” into the high-level description. The advance of HLS has made a broad and significant impact on the community, as it greatly reduces the overall design cost and shortens the time-to-market.
An appropriately optimized FPGA design has been demonstrated to be more energy-efficient than a CPU or GPU for a variety of computer vision algorithms such as optical flow, stereo vision and local image feature extraction [44]. In the literature, numerous research efforts have been made by both academic and industrial researchers to implement computer vision algorithms for the perception system required by autonomous driving. Among them, the CNN is one of the most promising solutions developed in recent years. For instance, Altera has released a CNN accelerator for its Stratix 10 and Arria 10 FPGA devices manufactured in 20 nm technology. Both devices have built-in DSP units to efficiently perform floating-point operations. At peak performance, Arria 10 can deliver hundreds of GFLOPs and Stratix 10 several TFLOPs. The CNN accelerator is implemented with Altera's OpenCL programming framework [45]. Fig. 2 shows the architecture of the accelerator. It is composed of several computing kernels, each of which implements one CNN layer. Different kernels are connected by OpenCL channels or pipes for data transmission without access to external memory, thereby reducing power consumption. The same OpenCL channels or pipes can be used to transmit data between the FPGA and other external devices such as cameras.
Fig. 2. Simplified architecture of the CNN accelerator implemented by Altera [45].
As an alternative example, Microsoft has developed a high-throughput CNN accelerator built upon the Altera Stratix V and has mapped the design to Arria 10 [46]. Fig. 3 shows the simplified architecture of the accelerator. It can be configured by a software engine at run time. Inside the accelerator, the input buffers and weight buffers store the image pixels and filter kernels, respectively. A large array of processing elements (PEs) (e.g., thousands of PEs) efficiently computes the dot-product values for convolution. A network-on-chip passes the outputs of the PEs back to the input buffers. When running the accelerator, image pixels are read from DRAM into the input buffers. Next, the PEs compute the convolution between the image pixels and the filter kernels for one convolutional layer of the CNN and store the results back into the input buffers. These results are then circulated as the input data for the next convolutional layer. In this way, the intermediate results of the CNN are not written back to DRAM, and the amount of data communication between DRAM and FPGA is substantially reduced.
Fig. 3. Simplified architecture of the CNN accelerator implemented by Microsoft [46].
Table 3 compares the energy efficiency of the CPU, the GPU and the two aforementioned FPGA accelerators when running the AlexNet CNN. The FPGA accelerators clearly improve the energy efficiency, measured as throughput over power, compared to the CPU and GPU.
Table 3. Performance comparison of CPU, GPU and FPGA accelerators for CNN (AlexNet).
CPU E5-2699 Dual Xeon [45]: throughput 1320 frames/s, power 321 W, efficiency 4.11 frames/s/W.
GPU Tesla K40 [46]: throughput 824 frames/s, power 235 W, efficiency 3.5 frames/s/W.
FPGA Arria 10 GX 1150 [45]: throughput 1200 frames/s, power 130 W, efficiency 9.27 frames/s/W.
FPGA Arria 10 GX 1150 [46]: throughput 233 frames/s, power 25 W, efficiency 9.32 frames/s/W.
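The efficiency column of Table 3 is simply throughput divided by power, which the few lines of Python below approximately reproduce from the throughput and power figures above; small differences from the quoted values come from rounding in the original sources.

```python
# Energy efficiency = throughput / power, as used in Table 3.
platforms = {
    "CPU E5-2699 Dual Xeon [45]": (1320.0, 321.0),
    "GPU Tesla K40 [46]":         (824.0, 235.0),
    "FPGA Arria 10 GX 1150 [45]": (1200.0, 130.0),
    "FPGA Arria 10 GX 1150 [46]": (233.0, 25.0),
}
for name, (frames_per_s, watts) in platforms.items():
    print(f"{name}: {frames_per_s / watts:.2f} frames/s/W")
```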
4.3.3. Application-specific integrated circuits

Although the FPGA offers a generic reconfigurable solution in which an application-specific design can be implemented to reduce the overhead posed by CPUs and/or GPUs, it often suffers from slow operation speed and large chip area. These disadvantages are inherent in its reconfigurability, since logic functions and interconnect wires are programmed using lookup tables and switches. Compared to an FPGA, an ASIC is able to achieve superior performance by sacrificing low-level reconfigurability. However, ASIC implementation poses a significant NRE cost for both design and validation, especially for today's large-scale systems.
The NRE cost is high because a manufactured chip may fail to work and, hence, several design iterations are often required. Once the ASIC design is validated in silicon, it can be manufactured in high volume and the average cost per chip can be greatly reduced. For autonomous vehicles, the market size is tremendous and therefore justifies the high NRE cost of an ASIC design. In addition, the visual perception algorithms for autonomous driving have become relatively mature, reducing the risk that an ASIC implementation will be outdated after its long design cycle.
For instance, Mobileye has launched the EyeQ SoC to implement computationally intensive real-time algorithms for ADAS [47,48]. It covers a number of important visual perception functions, including lane departure detection, vehicle detection, traffic sign recognition, etc. As shown in Fig. 4, the EyeQ SoC contains two ARM processors and four vision computing engines (i.e., a classifier engine, a tracker engine, a lane detection engine, and a window, pre-processing and filter engine). In this architecture, one of the ARM processors manages the vision computing engines as well as the other ARM processor, while the other ARM processor is used for intensive computing tasks. The classifier engine is designed for image scaling, preprocessing and pattern classification. The tracker engine is used for image warping and tracking. The lane detection engine identifies lane markers as well as road geometry. The window, preprocessing and filter engine is designed to convolve images, create image pyramids, detect edges and filter images. Furthermore, a direct memory access (DMA) component handles both on-chip and off-chip data transmission under the control of an ARM processor.
Fig. 4. Simplified architecture of the EyeQ SoC implemented by Mobileye [47].
More recently, Mobileye has implemented the EyeQ2 SoC, an upgraded version of the EyeQ SoC, as shown in Fig. 5. It covers several additional applications including pedestrian protection, head lamp control, adaptive cruise control, headway monitoring and warning, etc. Different from the EyeQ SoC, the ARM processors are replaced by MIPS processors. Furthermore, three vector microcode processors with single-instruction multiple-data (SIMD) and very-long-instruction-word (VLIW) capabilities are added. In addition, the lane detection engine is removed, while two other vision computing engines are added for a feature-based classifier and stereo vision. Similar to the EyeQ SoC, one of the MIPS processors controls the vision computing engines, the vector microcode processors, the DMA and the other MIPS processor, while the other MIPS processor, together with the vision computing engines, performs the computationally intensive tasks.
Besides Mobileye, Texas Instruments has developed the TDA3x SoC for ADAS [49]. It supports a variety of functions such as autonomous emergency braking, lane keep assist, advanced cruise control, traffic sign recognition, pedestrian and object detection, forward collision warning, etc. [50].
Fig. 5. Simplified architecture of the EyeQ2 SoC implemented by Mobileye [48].
Fig. 6. Simplified architecture for the TDA3x SoC implemented by Texas Instruments [49].
The simplified architecture of the TDA3x is shown in Fig. 6. It uses a heterogeneous architecture composed of a DSP, an embedded vision engine, an ARM core and an image signal processor. The DSP unit operates at 750 MHz and contains two floating-point multipliers and six arithmetic units. The embedded vision engine is a vector processor operating at 650 MHz that is optimized for computer vision algorithms. This heterogeneous architecture allows the TDA3x to run multiple ADAS functions in real time.
More recently, a number of advanced system architectures have been proposed to facilitate efficient implementation of deep learning algorithms. The tensor processing unit (TPU) by Google [51] and the dataflow processing unit (DPU) by Wave Computing are two such examples.

5. Conclusions

In this paper, we have briefly summarized the recent progress on visual perception algorithms and the corresponding hardware implementations that facilitate autonomous driving. In particular, a variety of algorithms were discussed for vehicle and pedestrian detection, lane detection and drivable surface detection. On the other hand, CPU, GPU, FPGA and ASIC were presented as the major components of an efficient hardware platform for real-time computing and operation.
While significant technical advances have been accomplished in this area, there remains a strong need to further improve both algorithm and hardware designs in order to make autonomous vehicles safe, reliable and comfortable. The technical challenges can be broadly classified into three categories:
• Algorithm design: Accurate and robust algorithms are needed to handle all corner cases so that an autonomous vehicle can operate appropriately in these scenarios. Such robustness is particularly important in order to ensure safety.
• Hardware design: Adopting increasingly accurate and robust algorithms often increases computational complexity and, hence, requires a powerful hardware platform to implement these algorithms. This, in turn, requires us to further improve both system architecture and circuit implementation in order to boost the computing power available for real-time operation.
• System validation: Accurately and efficiently validating a complex autonomous system is non-trivial. Any visual perception system based on machine learning cannot be 100% accurate. Hence, the system may fail for a specific input pattern, and accurately estimating its rare failure rate can be extremely time-consuming [52], as illustrated by the sketch after this list.
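To give a feel for why estimating rare failure rates is expensive, the toy Monte Carlo sketch below estimates the failure probability of a hypothetical perception component whose true failure rate is one in a million. The simulated failure model is entirely artificial and is not the statistical validation method of [52], which is designed precisely to reduce this sample count.

```python
import random

TRUE_FAILURE_RATE = 1e-6  # hypothetical rare failure probability

def perception_fails():
    """Artificial stand-in for running the perception system on one random scene."""
    return random.random() < TRUE_FAILURE_RATE

def estimate_failure_rate(num_samples):
    failures = sum(perception_fails() for _ in range(num_samples))
    return failures / num_samples

random.seed(0)
for n in (10_000, 1_000_000, 10_000_000):
    print(f"{n:>10} simulated scenes -> estimated rate {estimate_failure_rate(n):.2e}")
# With 10,000 samples the estimate is almost always 0; only millions of
# simulated scenes begin to resolve a one-in-a-million failure rate.
```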
To address the aforementioned challenges, academic researchers and industrial engineers from interdisciplinary fields such as artificial intelligence, hardware system, automotive design, etc. must closely collaborate in order to achieve fundamental breakthroughs in the area.
References [1] DARPA Urban Challenge, 2007. Online Available: 〈http://archive.darpa.mil/ grandchallenge/〉. [2] National Science and Technology Council, Networking and information technology research and development subcommittee, The National Artificial Intelligence Research and Development Strategic Plan, Oct, 2016. Online Available: 〈https:// www.nitrd.gov/news/national_ai_rd_strategic_plan.aspx〉. [3] Executive Office of the President, National Science and Technology Council Committee on technology, Preparing for the Future of Artificial intelligence, Oct. 2016. OnlineAvailable: 〈https://www.whitehouse.gov/blog/2016/10/12/ administrations-report-future-artificial-intelligence/〉. [4] SAE J3016, Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-road Motor Vehicles, 2016. Online Available: 〈http://standards. sae.org/j3016_201609/〉. [5] J. Markoff, Toyota Invests $1 Billion in Artificial Intelligence in U.S., New York Times, Nov. 2015. Online Available: 〈http://www.nytimes.com/2015/11/06/ technology/toyota-silicon-valley-artificial-intelligence-research-center.html〉. [6] D. Primack, K. Korosec, GM Buying Self-driving Tech Startup for More Than $1 Billion, Fortune, Mar. 2016. Online Available: 〈http://fortune.com/2016/03/11/ gm-buying-self-driving-tech-startup-for-more-than-1-billion/〉. [7] S. Behere, M. Törngren, A Functional Architecture for Autonomous Driving, ACM, WASA, Montréal, QC, Canada, 2015. [8] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, IEEE CVPR 1 (2001) 511–518. [9] D. Jeon, Q. Dong, Y. Kim, X. Wang, S. Chen, H. Yu, D. Blaauw, D. Sylvester, A 23mW face recognition processor with mostly-read 5 T memory in 40-nm CMOS, IEEE J. Solid-State Circuits 52 (6) (2017) 1628–1642. [10] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, IEEE CVPR 1 (2005) 886–893. [11] S. Zhang, R. Benenson, B. Schiele, Filtered channel features for pedestrian detection, IEEE CVPR (2015) 1751–1760. [12] L. Zhang, L. Lin, X. Liang, K. He, Is faster R-CNN doing well for pedestrian detection?, IEEE ECCV (2016) 443–457. [13] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, NIPS (2015). [14] J. Redmon, A. Farhadi, YOLO9000: Better, Faster, Stronger, Dec. 2016. Online Available: 〈https://arxiv.org/abs/1612.08242〉. [15] Z. Kim, Robust lane detection and tracking in challenging scenarios, IEEE Trans. Intell. Transp. Syst. 9 (1) (2008) 16–26. [16] R. Gopalan, T. Hong, M. Shneier, R. Chellappa, A learning approach towards detection and tracking of lane markings, IEEE Trans. Intell. Transp. Syst. 13 (3) (2012) 1088–1098. [17] J. Alvarez, Y. LeCun, T. Gevers, A. Lopez, Semantic road segmentation via multiscale ensembles of learned features, IEEE ECCV 2 (2012) 586–595. [18] J. Yao, S. Ramalingam, Y. Taguchi, Y. Miki, R. Urtasun, Estimating drivable collision-free space from monocular video, IEEE WACV (2015) 420–427. [19] T. Drage, J. Kalinowski, T. Braunl, Integration of drive-by-wire with navigation control for a driverless electric race car, IEEE Intell. Transp. Syst. Mag. 6 (4) (2014) 23–33. [20] J. Wei, J. Snider, J. Kim, J. Dolan, R. Rajkumar, B. Litkouhi, Towards a viable autonomous driving research platform, IEEE IV (2013) 763–770. [21] P. Furgale, U. Schwesinger, M. Rufli, W. Derendarz, H. Grimmett, P. Mühlfellner, S. Wonneberger, J. Timpner, S. Rottmann, B. Li, B. Schmidt, T.N. Nguyen, E. Cardarelli, S. Cattani, S. Brüning, S. Horstmann, M. 
Stellmacher, H. Mielenz,
K. Köser, M. Beermann, C. Häne, L. Heng, G.H. Lee, F. Fraundorfer, R. Iser, R. Triebel, I. Posner, P. Newman, L. Wolf, M. Pollefeys, S. Brosig, J. Effertz, C. Pradalier, R. Siegwart, Toward automated driving in cities using close-to-market sensors: an overview of the V-charge project, IEEE IV (2013) 809–816.
[22] J. Zolock, C. Senatore, R. Yee, R. Larson, B. Curry, The use of stationary object radar sensor data from advanced driver assistance systems (ADAS) in accident reconstruction, SAE Technical Paper, no. 2016-01-1465, 2016.
[23] B. Douillard, J. Underwood, N. Kuntz, V. Vlaskine, A. Quadros, P. Morton, A. Frenkel, On the segmentation of 3D LIDAR point clouds, IEEE ICRA (2011) 2798–2805.
[24] J. Levinson, J. Askeland, J. Becker, J. Dolson, D. Held, S. Kammel, J. Kolter, D. Langer, O. Pink, V. Pratt, M. Sokolsky, G. Stanek, D. Stavens, A. Teichman, M. Werling, S. Thrun, Towards fully autonomous driving: systems and algorithms, IEEE IV (2011) 163–168.
[25] D. Forsyth, J. Ponce, Computer Vision: A Modern Approach, Pearson, 2002.
[26] R. Mobus, U. Kolbe, Multi-target multi-object tracking, sensor fusion of radar and infrared, IEEE IV (2004) 732–737.
[27] NXP, Automotive radar millimeter-wave technology. Online Available: 〈http://www.nxp.com/pages/automotive-radar-millimeter-wave-technology:AUTRMWT〉.
[28] M. Bertozzi, L. Bombini, A. Broggi, M. Buzzoni, E. Cardarelli, S. Cattani, P. Cerri, A. Coati, S. Debattisti, A. Falzoni, R. Fedriga, M. Felisa, L. Gatti, A. Giacomazzo, P. Grisleri, M. Laghi, L. Mazzei, P. Medici, M. Panciroli, P. Porta, P. Zani, P. Versari, VIAC: an out of ordinary experiment, IEEE IV (2011) 175–180.
[29] K. Jo, J. Kim, D. Kim, C. Jang, M. Sunwoo, Development of autonomous car—part II: a case study on the implementation of an autonomous driving system based on distributed architecture, IEEE Trans. Ind. Electron. 62 (8) (2015) 5119–5132.
[30] J. Ziegler, P. Bender, M. Schreiber, H. Lategahn, T. Strauss, C. Stiller, T. Dang, U. Franke, N. Appenrodt, C. Keller, E. Kaus, R. Herrtwich, C. Rabe, D. Pfeiffer, F. Lindner, F. Stein, F. Erbs, M. Enzweiler, C. Knöppel, J. Hipp, M. Haueis, M. Trepte, C. Brenk, A. Tamke, M. Ghanaat, M. Braun, A. Joos, H. Fritz, H. Mock, M. Hein, E. Zeeb, Making bertha drive—an autonomous journey on a historic route, IEEE Intell. Transp. Syst. Mag. 6 (2) (2014) 8–20.
[31] M. Aeberhard, S. Rauch, M. Bahram, G. Tanzmeister, J. Thomas, Y. Pilat, F. Homm, W. Huber, N. Kaempchen, Experience, results and lessons learned from automated driving on Germany's highways, IEEE Intell. Transp. Syst. Mag. (2015) 42–57.
[32] J. Nickolls, W. Dally, The GPU computing era, IEEE Micro 30 (2) (2010) 56–69.
[33] A. Neic, M. Liebmann, E. Hoetzl, L. Mitchell, E. Vigmond, G. Haase, G. Plank, Accelerating cardiac bidomain simulations using graphics processing units, IEEE Trans. Biomed. Eng. 59 (8) (2012) 2281–2290.
[34] J. Vetter, R. Glassbrook, J. Dongarra, K. Schwan, B. Loftis, S. McNally, J. Meredith, J. Rogers, P. Roth, K. Spafford, S. Yalamanchili, Keeneland: bringing heterogeneous GPU computing to the computational science community, Comput. Sci. Eng. 13 (2011) 90–95.
[35] R. Weber, A. Gothandaraman, R. Hinde, G. Peterson, Comparing hardware accelerators in scientific applications: a case study, IEEE Trans. Parallel Distrib. Syst. 22 (1) (2011) 58–68.
[36] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, V. Volkov, Parallel computing experiences with CUDA, IEEE Micro 28 (4) (2008) 13–27.
[37] J. Stone, D. Gohara, G. Shi, OpenCL: a parallel programming standard for heterogeneous computing systems, Comput. Sci. Eng. 12 (3) (2010) 66–73.
[38] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, NIPS (2015) 91–99.
[39] C. Chen, A. Seff, A. Kornhauser, J. Xiao, DeepDriving: learning affordance for direct perception in autonomous driving, IEEE ICCV (2015).
[40] Open Computer Vision Library (OpenCV). Online Available: 〈http://opencvlibrary.sourceforge.net〉.
[41] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, ACM MM (2014).
[42] NVIDIA, NVIDIA Tegra X1. Online Available: 〈http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf〉.
[43] NVIDIA, GPU-Based Deep Learning Inference: A Performance and Power Analysis. Online Available: 〈https://www.nvidia.com/content/tegra/embedded-systems/pdf/jetson_tx1_whitepaper.pdf〉.
[44] K. Pauwels, M. Tomasi, J. Diaz Alonso, E. Ros, M. van Hulle, A comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features, IEEE Trans. Comput. 61 (7) (2012) 999–1012.
[45] Intel, Efficient Implementation of Neural Network Systems Built on FPGAs and Programmed With OpenCL. Online Available: 〈https://www.altera.com/en_US/pdfs/literature/solution-sheets/efficient_neural_networks.pdf〉.
[46] K. Ovtcharov, O. Ruwase, J. Kim, J. Fowers, K. Strauss, E. Chung, Accelerating deep convolutional neural networks using specialized hardware, Microsoft Research, Feb. 2015. Online Available: 〈https://www.microsoft.com/en-us/research/publication/accelerating-deep-convolutional-neural-networks-using-specialized-hardware/〉.
[47] Mobileye, EyeQ. Online Available: 〈http://www.mobileye.com/technology/processing-platforms/eyeq/〉.
[48] Mobileye, EyeQ2. Online Available: 〈http://www.mobileye.com/technology/processing-platforms/eyeq2/〉.
[49] TI, New TDA3x SoC for ADAS Solutions in Entry- to Mid-level Automobiles. Online Available: 〈http://www.ti.com/lit/ml/sprt708a/sprt708a.pdf〉.
[50] M. Mody, P. Swami, K. Chitnis, S. Jagannathan, K. Desappan, A. Jain, D. Poddar, Z. Nikolic, P. Viswanath, M. Mathew, S. Nagori, H. Garud, High performance front camera ADAS applications on TI's TDA3X platform, High Perform. Comput. (2015) 456–463.
[51] N. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, D. Yoon, In-datacenter performance analysis of a tensor processing unit, Int. Symp. Comput. Archit. (ISCA) (2017).
[52] W. Shi, M. Alawieh, X. Li, H. Yu, N. Arechiga, N. Tomatsu, Efficient statistical validation of machine learning systems for autonomous driving, IEEE/ACM ICCAD (2016).