Deep learning-based method for vision-guided robotic grasping of unknown objects


Full length article


Luca Bergamini a, Mario Sposato a, Marcello Pellicciari b, Margherita Peruzzini a,⁎, Simone Calderara a, Juliana Schmidt a

a Department of Engineering Enzo Ferrari, University of Modena and Reggio Emilia, Modena, MO, Italy
b Department of Sciences and Methods for Engineering, University of Modena and Reggio Emilia, Reggio Emilia, RE, Italy

ARTICLE INFO

Keywords: Collaborative robotics, Deep learning, Vision-guided robotic grasping, Industry 4.0

ABSTRACT

Nowadays, robots are heavily used in factories for different tasks, most of them including grasping and manipulation of generic objects in unstructured scenarios. In order to better mimic a human operator involved in a grasping action, who needs to identify the object and detect an optimal grasp by means of visual information, a widely adopted sensing solution is Artificial Vision. Nonetheless, state-of-the-art applications need long training and fine-tuning to manually build the object model that is used at run-time during normal operations, which reduces the overall operational throughput of the robotic system. To overcome such limits, this paper presents a framework based on Deep Convolutional Neural Networks (DCNN) to predict both single and multiple grasp poses for multiple objects at once, using a single RGB image as input. Thanks to a novel loss function, the framework is trained in an end-to-end fashion and matches state-of-the-art accuracy with a substantially smaller architecture, which yields unprecedented real-time performance during experimental tests and makes the application reliable for working on real robots. The system has been implemented using the ROS framework and tested on a Baxter collaborative robot.

1. Introduction

Collaborative robots (or co-bots) have been conceived to safely and symbiotically work with human operators in a shared workplace, aiding and supporting them whenever close collaboration is needed. For this purpose, co-bots must interact with human operators within unstructured environments, continuously adapting their behaviour to satisfy ever-changing needs. Moreover, the extensive integration with sensors and software solutions and the availability of simple programming interfaces are today significantly increasing the outreach of these technologies in the industrial field. At the same time, the research community is pushing to build collaborative robots able to take decisions and autonomously solve tasks. In this context, the European H2020 Research Project Colrobot [1] is developing a collaborative mobile robotic platform, conceived to solve assembly operations in the automotive and aerospace industries and to create a symbiotic cooperation between humans and robots. Robots are employed both in autonomous kitting tasks, where they need to identify and collect parts in a working space shared with humans, and in collaborative assembly operations, where direct interaction and cooperation with the operator is required. Hence, one of the main project challenges is to bring these different operational

modes (i.e., fully autonomous and interactive) into a single platform, also designing a flawless, unconstrained context switch between the two functionalities. The Colrobot platform addresses this challenge by designing and developing novel vision-guided grasping and dexterous manipulation techniques. Artificial vision enables robots to adapt to unpredictable situations, but the actual time and effort spent in programming and training drastically reduce the overall co-bot performance and its potential applications. As far as the industrial state-of-the-art is concerned, vision-guided grasping applications are mainly based on geometric visual techniques [2–4], in which a vision sensor is mounted on the robot, and the objects and their grasping features are identified according to reliable yet closed procedures such as Template Matching (Fig. 1). Vision-guided grasping methods are mainly divided into two steps. The first is the definition of a model from a sample image, in which an operator has to carefully select the features of the object to look for and mask everything unnecessary, also taking into account the use of structured lights (e.g. infrared illuminators, filtered camera lenses) in the scene, for further robustness of the model. The second step uses the model to manually select a viable grasp pose on the object. These methods clearly rely on human expertise in building the models.

Corresponding author at: Department of Engineering Enzo Ferrari, University of Modena and Reggio Emilia, via Vivarelli 10, 41125 Modena, MO, Italy. E-mail address: [email protected] (M. Peruzzini).

https://doi.org/10.1016/j.aei.2020.101052 Received 30 November 2018; Received in revised form 22 July 2019; Accepted 7 January 2020 1474-0346/ © 2020 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/BY-NC-ND/4.0/).


Fig. 1. Industrial software example for Template Matching.

As a consequence, every time a new object is added, a human-based system calibration is needed, especially if a major change in the environment happens. Thus, at the current state-of-the-art, industrial solutions are unfeasible when a higher degree of flexibility is required, as in the case of symbiotic human-robot collaboration [5]. Recently, researchers have made huge steps forward in applying Artificial Vision and Deep Learning techniques to solve different tasks without the need for human intervention [6–8]; therefore, such techniques represent a perfect fit for a collaborative robot, because they inherently deal with loosely-constrained tasks, as a collaborative robot would face, and find solutions by autonomously abstracting knowledge from raw data.

One of the most tackled, and yet still one of the hardest, tasks in robotics is the grasping of generic objects. Indeed, it comprehends a broad range of different applications and loosely constrained specifications. As an example, the e-commerce company Amazon hosts every year the so-called Amazon Robotic Picking Challenge [9], in which teams from universities and robotics companies have to develop a robotic solution able to recognize, grasp, move and pack different items from the shelves, autonomously, by exploiting Computer Vision, Deep Learning and clever gripper designs [10]. In this context, replicating human behaviours is useful not only to replace human actions in specific conditions, but also to create collaborative systems and to make robots naturally interact with humans during task execution [11]. A human operator who needs to pick up a tool, a part, or any unknown object would decide how to grasp it firmly and safely, using visual clues and knowledge from previously seen examples. Therefore, one way to mimic human knowledge in this case is to combine Artificial Vision sensors and Neural Networks.

The present research work describes an application for robotic grasping that uses a Convolutional Neural Network (CNN) to predict a set of possible grasp poses on single and multiple unseen objects. Differently from the industrial state-of-the-art, this system does not need previous knowledge or human-selected visual features of the object to grasp. The prediction is performed in real time, regardless of the number of objects in the scene, using only a 2D image captured from a single camera sensor. The main contributions of this work are: (i) the definition of a novel Loss Function, which does not suffer from common problems of the existing solutions found in literature; (ii) the use of 2D low-resolution camera images, which makes the integration with an industrial setup easier and cheaper, while nonetheless reaching state-of-the-art results; (iii) the definition of a novel network architecture, smaller than those proposed by existing solutions, which is trained in an end-to-end fashion and achieves real-time speed during inference. These results encourage further research and improvement of the system.

The paper is organized as follows: Section 2 presents the state-of-the-art, for both industrial and research applications. Section 3 describes the dataset used to train the proposed solutions and the developed solutions for Single-Grasp and Multi-Grasp Detection. Section 4 gives results on the application of the novel approach, in terms of valuable metrics, both in simulation and in real-world experiments, also comparing it with the state-of-the-art. Section 5 presents an application to a real-world scenario, where the proposed solutions are used in an end-to-end framework to solve a real-world case study. Finally, Section 6 sums up the scientific contributions of this work and discusses future work.

2. Related works

Current approaches to robotic grasping can be grouped into two main families, namely Computer Vision-based approaches and Deep Learning-based approaches. The first family includes several techniques [12–15], but since the aim of this work is the development of an industrial application for vision-guided robotic grasping, this section limits the focus to a set of techniques well established in industrial robotics applications. Deep Learning techniques, on the other hand, have been raising interest in the latest years, but they still lack significant industrial applications. The following sections provide an overview of the two families.

2.1. Industrial computer vision approaches

The majority of the vision systems currently used in industrial robotics are based on the adoption of standardized, well-established computer vision techniques. In this case, image analysis is made of four steps: pre-processing, feature definition, reasoning, and reaction [16]. Each step is usually designed precisely to solve a specific task. Concerning object detection, one of the pillars of computer vision is presented by Roth, who designed a vision module to extract blobs from


binary images, using a connected component analysis [17]. Through the calculation of various moments on the extracted blob, the system is able to determine position, size, and orientation of the object. Although fast and accurate, this method is not robust enough, especially when complex situations occur, such as in the case of overlapping objects or noisy images. Another method that implements simple feature detection has been proposed by [18], overcoming the limit of blob analysis with touching parts by using object boundaries such as lines, corners and holes. Pattern recognition and the so-called Affine Searching procedures were used in [19], where a first geometric description in terms of position and orientation of the features is produced at the first stage, and then 6 D.O.F. affine transformations are applied to the descriptor to detect any possible variation in the aspect of the object. Moreover, Sanz [20] implemented a system first using global thresholding of the image for fast segmentation, and then computing the principal moments of the object (i.e., centroid, orientation and inertia axis) from the boundary points. Finally, candidate grasp points are extracted and the most stable is chosen, based on off-line stability properties. Other commercial software systems use a set of well-known computer vision algorithms, such as Canny [21], the Scale-Invariant Feature Transform (S.I.F.T.) [22] and the Hough Transform [23], with parameters that are manually selected for each situation. In Fig. 2 an example of industrial software for vision-guided identification and grasping is shown, where a set of pre-trained computer vision algorithms is available to build the model. However, while each of the mentioned solutions shows some degree of robustness and predictability of the outputs, they lack the flexibility to easily adapt to new requests, and any modification of the task comes at the cost of major changes in the pipeline of operations, if not the rebuilding of the entire application.

2.2. Deep learning approaches

Due to the limitations of the current approaches, the robotic grasping problem has been widely studied by the research community in recent years [14,24–26]. Most of the modern approaches leverage the properties of Computer Vision and Artificial Neural Networks (ANN) to extract knowledge from raw data and generalize the solution of complex tasks in real-world, unconstrained scenarios. In this direction, the most common choice is to treat data with a Convolutional Neural Network (CNN) [27]. A CNN is a computational structure, made of interconnected layers of learnable filters, that is especially suited for extracting information from visual data, such as images or video streams. Throughout the learning process, in which a set of labelled data is fed as input to the network, each filter moves on the image and captures visual features (e.g., lines, edges, colour gradients), and the combination of those filters in the layered structure allows the learning of more complex and thus semantically higher features [28]. These features are treated in relation to the information associated with the data (i.e., the label), and an abstract representation of the data is learned. It is important to remark that, differently from traditional systems where each feature has to be hand-designed in the model, with an ANN the feature selection is done autonomously during the learning process itself. Hence, the system is able to build an abstract model from raw data and progressively improve it by testing on new images. As convolution enables both local and global descriptors for the image, various solutions have been proposed for both single and multiple robot grasping detection. The first is considered easier, as only one object is depicted in the image, while the latter poses additional challenges, including false positive and false negative detections.

Single Grasp. Mahler et al. [29] produced a synthetic dataset of 6.7 million point clouds, made of several thousand 3D models of objects, to train a Deep CNN to predict grasp poses on RGB-D images, along with a grasp metric measuring the probability of success of the grasp. Apart from the necessity to extensively generate a huge amount of data, needed to provide a wide enough train set, using synthetic images could lead to a loss of generalization in real-world applications. Moreover, the grasp prediction is performed on objects in an uncluttered environment (i.e., the system only works on isolated objects), thus the authors used a separate procedure to unclutter the scene before running the prediction. Another study [30] used the Cornell Dataset [31] with two deep networks to evaluate all the candidate grasps. The first network has few features and is able to analyse several different grasps, discarding the unlikely ones and giving an imperfect grasp but at a fast rate, while the second is bigger and slower, but it only needs to refine the top predictions of the first network. Despite the complex architecture and the results, this solution does not deal with multiple objects in the same prediction cycle. A Deep CNN architecture is used in [32], starting from an AlexNet [33] pre-trained on the famous ImageNet object classification dataset [34]. Using low-level features learned on a different task and with a bigger dataset is the so-called transfer learning approach [35], which has been shown to speed up the learning process on similar tasks. After the pre-training phase, the network is trained on the Cornell Dataset and predicts a set of viable grasps, along with

Fig. 2. Vision Pro Software, courtesy of Cognex.


Fig. 3. Cornell Dataset example images.

classification of the object, so as to force the prediction of similar grasping poses on objects of the same class. Kumra et al. [36] proposed the use of a ResNet50 [37] as a feature extractor, to predict multiple grasp poses on single objects. Two architectures were developed in parallel, one using only RGB images and the other adding the depth channel. The first part of the networks was pre-trained on the ImageNet dataset, and only the last fully connected part of both networks was trained from scratch, reducing training time considerably, even though the networks are very deep. The authors also show that adding the depth channel as input does not boost performance, mainly because the feature extractor network is usually pre-trained with only RGB images. All these methods significantly improve the state-of-the-art results on real data; nonetheless, they are not robust enough to unpredictable situations. In particular, if systems are trained on single-object images with uniform background, the results can be strongly affected by sudden changes in light conditions, or by partial object representation in the scene [10,11]. The main cause of such a behaviour may be the Loss Function (i.e., the mean square of the error between the ground-truth and the prediction). Due to the Mean Squared Error (M.S.E.), the overall grasping prediction may converge to the average of two valid grasping poses if two separated areas of the image respond to the filters (as in the case of two objects in the image). This problem is further investigated in Section 3.2.2.

Multi Grasp. Some works have also tackled multi-object, multi-grasp detection. Chu et al. [38] used two stages of a ResNet50 architecture, pre-trained on the COCO 2014 dataset [39]. The first was trained with non-oriented ground-truth bounding boxes, in order to generate grasp proposals over the whole image, with elements in the feature map working as anchors with different scales and aspect ratios. The regression problem over the orientation of the grasp rectangle was transformed into a classification of the angle over twenty quantized values, and the second part of the network classifies the region proposals into different classes, according to the discretized angle of the bounding box. Finally, two parallel fully connected layers output the parameters of the rectangle. In [40], the image is divided into grid cells, and for each cell a set of fixed-dimension anchor boxes is generated, centred in the centre of the cell and with six different orientations. The oriented anchor that best matches the ground-truth label in each grid cell is used to calculate the offset with respect to the predicted grasping rectangle, and likewise with respect to the ground-truth, and the difference between the two offsets is used in the loss function for optimizing the ResNet50-based network.

As seen, classic industrial vision methods for grasping prediction are mainly based on geometric feature modelling and on expert, hand-engineered approaches, and are generally fast, efficient and reliable. Major drawbacks of those techniques are the time-consuming fine-tuning, needed to cope with all the task-dependent constraints, and the scarce capability of the final application to adapt to new tasks. The class of Deep Learning-based solutions to the grasping problem, conversely, shows great robustness to variations in the task settings, and the task modelling does not require human intervention, as the features are automatically learned. These methods, however, rely on massive amounts of labelled data to build a model and acquire good generalization skills, and the collection and processing of such data is not always straightforward. In the next section, a solution that tackles some of the limits presented in literature is introduced.

3. The deep learning-based method

As stated in the previous section, a robotic grasping solution based on classic visual methods, such as geometric shapes or other Computer Vision analyses, may be too complex to use in a collaborative scenario, as it would be prone to errors due to the loose constraints of the task and the complexity of the environment. On the other hand, most of the Deep Learning-based approaches presented in Section 2 rely on 3D input data, not easily available in an industrial setup, and on big architectures and common Loss Functions to tackle the grasping problem. These choices, although effective for the accuracy of the simulation results, affect the training times of the system (because a bigger network takes longer to reach acceptable accuracy values) and, more importantly,


weighs on the real-time performance of the whole grasping application. The presented work tackles those limits by introducing new architectures and a novel Loss Function, which enable faster training and unprecedented real-time performance at test time.

3.1. Dataset

In order to compare the results of the present work with the state-of-the-art [30,32], the present research used the well-known Cornell Grasping Dataset. The dataset provides 885 RGB-D images of 280 different objects, where each picture contains a single object on a uniform background, represented with different orientations, along with the associated point cloud. In Fig. 3, some examples of images from the dataset are shown. Each image comes with a set of labels, where a label is an oriented rectangle depicting either a valid or an invalid grasp on the object. The choice of a rectangle label for representing a grasp pose comes from taking into account the simplest robotic gripper design, namely a parallel-plate two-finger gripper, where the fingers lie on the short sides of the rectangle. A 2D rectangle can be represented either with the (x, y) coordinates of its vertices, or with a set of 5 parameters: {(xc, yc), w, h, α} (i.e., centre coordinates, width, height and rotation angle, respectively). Fig. 4a shows the two sets of parameters, while Fig. 4b displays examples of labelled images from the dataset. In order to enhance the robustness of the solution and to better generalize the knowledge gained from the raw images, heavy data augmentation is performed, by randomly changing the background colour and by scaling, translating and rotating the objects in the images. Finally, Gaussian noise and illumination variations were added, to make the system robust to real sensor noise and differences in lighting conditions. It has been found that, in accordance with the literature [41–43], such data augmentation consistently avoids network over-fitting.
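As a rough sketch of the augmentation pipeline just described (the value ranges, the OpenCV dependency and the angle sign convention are assumptions made here for illustration, not values taken from the paper), the image and its grasp rectangle can be transformed together as follows:

```python
import numpy as np
import cv2  # OpenCV is assumed available; any library offering affine warps would do


def augment(image, rect, rng=np.random):
    """Randomly rotate/scale the image, jitter brightness and add Gaussian noise,
    keeping the grasp rectangle (xc, yc, w, h, alpha in degrees) consistent."""
    rows, cols = image.shape[:2]
    angle = rng.uniform(-30, 30)                 # illustrative ranges
    scale = rng.uniform(0.8, 1.2)

    # Affine warp around the image centre; the random border colour only loosely
    # mimics the background-colour change described above (which needs a mask).
    M = cv2.getRotationMatrix2D((cols / 2, rows / 2), angle, scale)
    border = tuple(int(v) for v in rng.randint(0, 256, 3))
    image = cv2.warpAffine(image, M, (cols, rows), borderValue=border)

    # Apply the same geometric transform to the label; the angle update sign
    # depends on the image coordinate convention and may need adjusting.
    xc, yc, w, h, alpha = rect
    xc, yc = M @ np.array([xc, yc, 1.0])
    rect = (xc, yc, w * scale, h * scale, (alpha - angle) % 180)

    # Photometric augmentation: brightness jitter plus Gaussian sensor noise.
    image = image.astype(np.float32) * rng.uniform(0.7, 1.3)
    image = image + rng.normal(0.0, 5.0, image.shape)
    return np.clip(image, 0, 255).astype(np.uint8), rect
```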

3.2. The Single-Grasp Detection approach

For the Single-Grasp network, only RGB images are used in the proposed work, discarding the depth information, in order to develop a low-cost system where 2D images can be easily taken by a low-resolution camera, as generally used in industrial robotics.

3.2.1. Network architecture

The Cornell Dataset is a relatively small dataset, not wide enough to train a deep neural network from scratch in an end-to-end fashion using only a single target. Thus, the proposed Single-Grasp network uses, for its first layers, the weights of the VGG-19 [44] architecture, pre-trained on the ImageNet dataset. Although the latter dataset was conceived for object classification, a subset of the objects present in the Cornell Dataset can be found in ImageNet (such as household items, tools, food), so it can be safely assumed that the low-level features

learned on the classification task are useful for the grasping problem as well. As such, the training phase was initially performed by locking the weights of the first layers (freezing the weights), so as to avoid changes in the low-level features, while the higher part of the network was trained from scratch, accounting for the learning of the high-level, task-dependent features. Then, the architecture is split into two branches: the top branch is trained to output the orientation of the grasping rectangle, while the bottom one outputs width, height and centre coordinates of the rectangle. Such a division is paramount due to the difference in range of the parameters: the angle α is bounded in the range [0°, 90°], while the others can take any value within the dimensions of the image. The two branches are then joined, and a regression layer computes the grasping box corner coordinates from the five predicted parameters. The whole architecture is schematically shown in Fig. 5.
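A minimal PyTorch sketch of this kind of architecture; the layer sizes, the number of frozen parameter tensors and the sigmoid-based angle normalization are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision import models


class SingleGraspNet(nn.Module):
    """Sketch of a two-branch single-grasp regressor on a pre-trained VGG-19 backbone."""

    def __init__(self, n_frozen=10):
        super().__init__()
        vgg = models.vgg19(pretrained=True)        # ImageNet weights (older torchvision API)
        self.features = vgg.features               # convolutional backbone
        for p in list(self.features.parameters())[:n_frozen]:
            p.requires_grad = False                # freeze the low-level filters

        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        # Branch 1: grasp angle, squashed into [0, 90] degrees.
        self.angle_head = nn.Sequential(
            nn.Flatten(), nn.Linear(512 * 7 * 7, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid())
        # Branch 2: centre coordinates, width and height.
        self.box_head = nn.Sequential(
            nn.Flatten(), nn.Linear(512 * 7 * 7, 256), nn.ReLU(), nn.Linear(256, 4))

    def forward(self, x):
        f = self.pool(self.features(x))
        angle = self.angle_head(f) * 90.0          # bounded angle branch
        box = self.box_head(f)                     # (xc, yc, w, h) branch
        return torch.cat([box, angle], dim=1)      # joined 5-parameter output
```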

3.2.2. Loss function

As stated in Section 2, most state-of-the-art solutions for single-object grasping use the M.S.E. loss to calculate the error between the prediction of the network and the ground-truth labels during the training process. However, employing this loss function may lead to a few drawbacks:

• average predictions;
• different magnitudes of the parameters.

The first one is related to the averaging of multiple ground-truth labels: when the predictions converge to an average solution during training, the network may find a feasible solution in the arithmetic average of the ground-truth labels. Differently from other tasks, where this solution may indeed be a valid one, for oriented grasping rectangles the average result often leads to an invalid solution, as shown in Fig. 6. The second issue may arise when optimizing values with substantially different magnitudes, as often happens when dealing with bounding box locations and dimensions. Different normalization approaches have been employed to force the various loss components to contribute equally to the final score. As an example, Zhou et al. [40] employed logarithmic values for width and height, while also normalizing angles in relation to the number of anchor boxes. Here, a loss function based on the Intersection over Union (I.o.U.) between oriented rectangles is presented, which has proved superior in terms of convergence when using a network trained from scratch, without requiring any additional normalization step or hyper-parameters. The proposed loss is also much more related to the task, as it matches the scoring function used in the previous literature [30,32,36]. The loss is computed between two grasping rectangles, parameterized using the 2D coordinates of their corners, and, contrary to the M.S.E. loss, it is bounded between 1 (perfect match between the two rectangles) and 0 (fully disjoint rectangles). A geometric illustration of the I.o.U. is depicted in Fig. 7.

Fig. 4. (a) Parameters of a grasping rectangle. (b) Example of Cornell Dataset labels.


Fig. 5. Proposed architecture for Single-Grasp prediction.

Our implementation deeply relies on the Separating Axis Theorem (SAT), which is used to compute the overlapping area of two convex polygons. SAT states that:

SAT 1. If two convex objects are not penetrating, there exists an axis for which the projections of the objects will not overlap.

The proposed loss is fully differentiable and can be optimized by any neural network framework using standard gradient descent techniques. A pseudo-algorithm of the loss computation is shown in Algorithm 1. As the output of the network consists of the rectangle parameters expressed in Fig. 4a, in order to compute the I.o.U. loss the two rectangles are first converted from those five parameters to the four (x, y) coordinates of their corners.
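The SAT-based, differentiable loss itself is the one given in Algorithm 1; purely as a plain, non-differentiable reference for the quantity involved, the parameter-to-corner conversion and the I.o.U. of two oriented rectangles can be sketched with shapely (assumed available):

```python
import numpy as np
from shapely.geometry import Polygon  # used here only as a reference check


def params_to_corners(xc, yc, w, h, alpha_deg):
    """Convert {(xc, yc), w, h, alpha} into the 4 (x, y) corners of the oriented rectangle."""
    a = np.deg2rad(alpha_deg)
    u = np.array([np.cos(a), np.sin(a)])        # unit vector along the width edge
    v = np.array([-np.sin(a), np.cos(a)])       # unit vector along the height edge
    c = np.array([xc, yc])
    return [tuple(c + sx * u * w / 2 + sy * v * h / 2)
            for sx, sy in ((-1, -1), (1, -1), (1, 1), (-1, 1))]


def oriented_iou(rect_a, rect_b):
    """I.o.U. between two grasping rectangles given as 5-parameter tuples."""
    pa, pb = Polygon(params_to_corners(*rect_a)), Polygon(params_to_corners(*rect_b))
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter + 1e-9)


# 1.0 for a perfect match, 0.0 for fully disjoint rectangles.
print(oriented_iou((100, 100, 60, 20, 30), (100, 100, 60, 20, 30)))  # -> ~1.0
```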

3.3. The Multi-Grasp Detection approach

Differing from Single-Grasp, Multi-Grasp detection aims at predicting more than one grasping pose for each object in the scene. Even if, on the one hand, the inference time may grow by a constant,

a linear or even an exponential factor, multiple labels per single image can be exploited to force the network to rely not on global but on localized features, thus enabling:

• a stronger gradient during back-propagation, as more information is extracted from a single image when trying to predict not one but a set of correct grasping poses;
• the possibility of employing also a set of incorrect grasping poses (when included in the annotation) and forcing the network to avoid them;
• the possibility of composing (both at train and test time, as in [38]) images with multiple objects, which should force the network to further exploit local patterns instead of global ones to produce the final set of predictions.

The first two points have been addressed by the present research on Multi-Grasp Detection, while the last point will be dealt with in future works. Relating to the stronger gradients, they allowed training a network directly from scratch instead of relying on a pre-trained architecture. Relating to the possibility of employing also a set of incorrect grasping poses, the network was forced to learn not only the correct grasping poses but also the incorrect ones, and this last solution showed better performance.

Fig. 6. Two examples where the average solution doesn't represent a valid one.


Fig. 7. Geometric meaning of the I.o.U. loss function.

3.3.1. Labels pre-processing

One of the main challenges of dealing with multiple label prediction is the high variance that the outputs cover due to the local position of the label in the image. A few techniques have been presented in the latest years, the majority of them stemming from works in the object detection field. The present work adopts a similar approach, where the target image is logically divided into a fixed number of rectangular patches, as shown in Fig. 8a. Indeed, predicting rectangles with quite different orientations and sizes is challenging, especially when they lie in the same patch. While normalization may help, it has been found that introducing a fixed set of different rectangle prototypes, named anchors, to initialize the predicted rectangles is in practice enough. These anchors are chosen directly from the train set in an offline fashion, using standard clustering techniques on three of the rectangle parameters (width, height and orientation), while the remaining two are set to the corresponding patch centre. An example of the anchors used in these experiments is shown in Fig. 8b. The single label {(xc, yc), w, h, α} is computed by summation between the anchor and the network's output:

x_c = x_a + x_p  (1)

y_c = y_a + y_p  (2)

w = w_a + w_p  (3)

h = h_a + h_p  (4)

α = α_a + α_p  (5)

where {x_a, y_a, w_a, h_a, α_a} are the anchor's parameters and {x_p, y_p, w_p, h_p, α_p} are the parameters predicted by the network, respectively. Of course, as each patch can potentially produce a full set of predictions, the network must also be trained to compute a scoring map for each anchor in each patch. To this end, the authors labelled a grid with the same number of patches with:

• a value of 1 for each anchor closest to a ground-truth positive grasping pose;
• a value of −1 for each anchor closest to a ground-truth negative grasping pose;
• a value of 0 for every other anchor in the patches.
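A sketch of the offline anchor clustering and of the decoding step of Eqs. (1)–(5); the use of scikit-learn's KMeans and the number of anchors are assumptions made here for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed available for the offline clustering


def build_anchors(train_rects, n_anchors=6):
    """Cluster (w, h, alpha) of the training rectangles to obtain anchor prototypes."""
    wha = np.array([[w, h, a] for (_, _, w, h, a) in train_rects])
    return KMeans(n_clusters=n_anchors, n_init=10).fit(wha).cluster_centers_


def decode(anchor, offsets, patch_centre):
    """Apply Eqs. (1)-(5): final rectangle = anchor (centred on its patch) + predicted offsets."""
    (wa, ha, aa), (xp, yp, wp, hp, ap) = anchor, offsets
    xa, ya = patch_centre                      # anchor centre is the patch centre
    return (xa + xp, ya + yp, wa + wp, ha + hp, aa + ap)
```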

Fig. 8. (a) Grid cell division of the input image. (b) Anchors.


Fig. 9. Building blocks of the network.

Fig. 10. Proposed Multi-Grasp architecture.


3.3.2. Network architecture

The structure of the proposed framework is shown in Fig. 10. The entire architecture is based on two fundamental building blocks, namely the Scale Block and the Residual Block. The first one performs a convolution to halve the spatial resolution, while the second one increases or decreases the number of filters while keeping the spatial resolution. Both blocks are shown in Fig. 9.
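A possible PyTorch rendering of the two building blocks (kernel sizes and activation choices are assumptions): the Scale Block halves the spatial resolution with a strided convolution, while the Residual Block changes the number of filters at constant resolution, using the Instance Normalization and skip connections mentioned below:

```python
import torch
import torch.nn as nn


class ScaleBlock(nn.Module):
    """Halves the spatial resolution with a strided convolution."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.norm = nn.InstanceNorm2d(channels)

    def forward(self, x):
        return torch.relu(self.norm(self.conv(x)))


class ResidualBlock(nn.Module):
    """Changes the number of filters at constant resolution, with a skip connection."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.norm = nn.InstanceNorm2d(out_ch)
        # 1x1 projection on the skip path when the channel count changes.
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return torch.relu(self.norm(self.conv(x)) + self.skip(x))
```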

Differing from previous works, here a Fully Convolutional network is employed, divided into three parts to predict multiple grasping poses. The first part learns low-level features from the RGB images and shares them with the other ones. Then, the output feature maps are forwarded to predict the presence of a possible grasping pose and its parameters. In doing that, a second network predicts the score for each of the available anchors for every patch of the image, while a third network outputs the differences for each of the grasping pose rectangle parameters. To keep a low number of trainable parameters, the following actions were taken:

• consecutive fully connected layers were avoided, as they require a number of parameters which is a function of both the input and output sizes of the layer;
• the first part of the network is shared with the others to reuse the previous parameters;
• the prediction of multiple grasps for multiple objects is based only on a single forward pass of the image through the network.

State-of-the-art techniques available in deep learning were also employed to further enhance the training speed; in particular:

• Instance Normalization [45] was used on every convolutional layer. This method was preferred over the more commonly used batch normalization, since it does not hold any additional parameters;
• skip connections and residual blocks were implemented (as in the ResNet architecture) to avoid gradient vanishing during back-propagation.

Note that, as a fully convolutional network is employed, the input image size may in fact be varied at inference time to better suit the camera distance of a particular setup. However, as the dimensions of the network's filters obviously do not change, it is important to provide the network with randomly scaled-up and scaled-down objects during training by means of data augmentation. Furthermore, almost all current state-of-the-art methods use a pre-trained Convolutional Network as feature extractor, chosen among some of the most famous classification architectures such as ResNet50 or VGG-19. This fact implies some limits in terms of delay time and makes it hard to achieve real-time performance. On the contrary, the proposed method is primarily conceived for real-time collaborative scenarios, where a high number of FPS is usually required. As an example, the convolutional layers from ResNet50 alone introduce almost 50 M parameters, causing low performance at test time, as the image is forwarded through the entire 50 layers. On the contrary, the three proposed networks combined hold only 10 M parameters and 45 layers, thus reaching almost 50 FPS at test time.

3.3.3. Loss function

A loss function for Multi-Grasp detection must necessarily include two different parts, namely a classification and a regression term. In the framework, the first one forces the network to predict a value for each anchor in a given patch, in accordance with the ground-truth notation reported in Section 3.3.1. Thus, instead of relying only on regression during training and inference, this method employs a joint classification and regression scheme, where the first part identifies the correct anchor to be associated and bootstraps the rectangle parameters to plausible values, while the second one predicts the difference for each of the five parameters to be applied to the identified anchor. Since those differences refer to the centre of each patch and to the width and height of an anchor, their values span the same interval over the entire image, easing the training phase. The final loss has the following form:

LOSS = α L_regression + β L_classification_1 + γ L_classification_0 + δ L_classification_−1  (6)

where α, β, γ and δ are used to balance the different terms of the loss. Eventually, the network produces two matrices; one is used to score the anchors (M_C), and the other one is used for parameter regression (M_R), with the following shapes:

M_C ∈ ℝ^(W × H × N_A)  (7)

M_R ∈ ℝ^(W × H × N_A × 5)  (8)

where W and H are the grid dimensions, N_A is the number of anchors used, and 5 is the number of the rectangle's parameters.
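A sketch of how the composite loss of Eq. (6) could be assembled from the two output maps M_C and M_R; the per-term criteria (M.S.E. for the scores, smooth L1 for the offsets), the handling of the 1/0/−1 targets and the default regression weight are assumptions, while the classification weights follow the values reported in Section 4.1:

```python
import torch
import torch.nn.functional as F


def multi_grasp_loss(scores, offsets, target_scores, target_offsets,
                     a=1.0, b=1.0, g=1.0, d=2.0):
    """scores: (B, W, H, N_A); offsets: (B, W, H, N_A, 5); targets use labels 1, 0, -1."""
    pos, neu, neg = target_scores == 1, target_scores == 0, target_scores == -1

    # One classification term per anchor label, as in Eq. (6).
    zero = scores.sum() * 0
    l_cls_1 = F.mse_loss(scores[pos], target_scores[pos].float()) if pos.any() else zero
    l_cls_0 = F.mse_loss(scores[neu], target_scores[neu].float()) if neu.any() else zero
    l_cls_m1 = F.mse_loss(scores[neg], target_scores[neg].float()) if neg.any() else zero

    # Regression term, computed only on anchors matched to a positive grasp.
    l_reg = F.smooth_l1_loss(offsets[pos], target_offsets[pos]) if pos.any() else offsets.sum() * 0

    return a * l_reg + b * l_cls_1 + g * l_cls_0 + d * l_cls_m1
```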

4. Experimental tests

The simulation results are shown in terms of accuracy of the prediction on the test set. Similarly to [30,32], the predicted rectangles are compared with the ground-truth in each image. A predicted grasp is considered good if it satisfies both of the following conditions:

• Intersection over Union between prediction and ground-truth greater than 25%;
• orientation error less than 30°.

The first condition is purposely loose enough to be easily matched, so as to balance the complexity of the task on such a small dataset. For the sake of comparison, results on the Pascal VOC Challenge [46] were considered good if the I.o.U. score was above 50%.
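These two conditions translate directly into a small check (a sketch; `oriented_iou` refers to the reference helper sketched in Section 3.2.2):

```python
def is_valid_grasp(pred, gt, iou_thr=0.25, angle_thr_deg=30.0):
    """A predicted grasp (xc, yc, w, h, alpha) is counted as correct if it overlaps the
    ground-truth rectangle by more than 25% I.o.U. and differs by less than 30 degrees."""
    angle_err = abs(pred[4] - gt[4]) % 180.0
    angle_err = min(angle_err, 180.0 - angle_err)   # rectangles are symmetric under 180 degrees
    return oriented_iou(pred, gt) > iou_thr and angle_err < angle_thr_deg
```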

Fig. 11. Metrics for a valid grasp.


The second condition, on the other hand, is needed to assure the correctness of the prediction. As shown in Fig. 11, a predicted rectangle with a different orientation with respect to the ground-truth will be considered wrong, while a rectangle with a compatible orientation will contribute to the final result of the algorithm. Experimental results obtained by adopting different algorithms are reported in Table 1.

Table 1
Accuracy scores for Single- and Multi-Grasp networks, on both train and test.

Algorithm             Mode     Single-Grasp accuracy (%)   Multi-Grasp accuracy (%)   Speed (fps)
Lenz et al. [30]      RGB-D    73.9                        N/A                        0.07
Redmon et al. [32]    RGB-D    84.4                        88.0                       13.15
Kumra et al. [36]     RGB      88.9                        N/A                        16.03
Proposed method       RGB      73.0                        87.1                       ~40

Bold values highlight the highest values known in literature.

4.1. Implementation details

For experimental testing, the PyTorch framework [47] under Ubuntu 16.04 was employed. The network was trained on an NVIDIA GTX 1080 Ti GPU, with 11 GB of dedicated memory, and with CUDA-9 and cuDNN-5 installed. For the Single-Grasp network, training started by initializing the first layers with the VGG-19 weights, as stated in Section 3.2, using the weights file available for the framework. The batch size was set to 128 and the network was trained for 250 epochs with random data augmentation. Experimental results demonstrated that the choice of the batch size is of paramount importance for training convergence, and a certain level of instability was verified when training is performed with smaller values. The research adopted the Adam optimizer [48] with learning rate 1e−4 and exponential decay. The Multi-Grasp network was trained for an average of 80 epochs, with batches of 64 elements. The loss terms were balanced as follows: the weights of L_classification_0 and L_classification_1 were set to 1, while the weight of L_classification_−1 was set to 2. Also for this architecture the Adam optimizer was chosen, with a learning rate of 4e−4 and weight decay of 0.995.
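A sketch of the corresponding Single-Grasp training configuration; `model`, `train_loader` and `grasp_loss` are placeholders for the components described above, and the decay factor is an assumption (only the optimizer, learning rate, batch size and epoch count are reported):

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import ExponentialLR

# `model`, `train_loader` (batches of 128 augmented samples) and `grasp_loss`
# (the I.o.U.-based loss of Section 3.2.2) are placeholders, not defined here.
optimizer = Adam(model.parameters(), lr=1e-4)        # reported learning rate
scheduler = ExponentialLR(optimizer, gamma=0.95)     # "exponential decay"; gamma is assumed

for epoch in range(250):                             # reported number of epochs
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = grasp_loss(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```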

Fig. 12. (a) Reflective object. (b) Bifurcate shape object. (c) Transparent object. (d) Object not completely in the picture.


Some examples of the Single-Grasp Detector are shown in Fig. 12, where the network is able to predict a valid grasp and shows robustness to different kinds of objects, such as:

• highly reflective objects, such as the wrench in Fig. 12a;
• objects with bifurcate shapes, such as the pliers in Fig. 12b, for which the network shows it can mimic a human grasp;
• objects that can easily be confused with the background (Fig. 12c);
• objects not completely in the picture, like the mouse cable in Fig. 12d.

It is worth emphasizing again how, for the mouse in Fig. 12d, the Loss Function presented in this work prevents the prediction from being affected by the cable in the picture. An example of the Multi-Grasp network output is shown in Fig. 13.

Fig. 13. Output of the Multi-Grasp Network.

5. Case study

To show the performance of the proposed solution in a real-world scenario, an end-to-end application for collaborative robotic grasping has been developed. For the task implementation, a robot has been employed on a shop floor, aiding a human operator during an assembly task. The operator may need to ask the robot for any hand tool or fixture, which lies in a casual position on a table facing the robot, among other tools and parts. Hence the robot has to be capable of:

1. Recognizing the operator's command, in the form of a voice or gesture;
2. Recognizing and localizing the object (i.e., hand tool or fixture) requested by the operator;
3. Predicting a safe and stable grasp on the requested object;
4. Executing the grasp in a safe way;
5. Delivering the object to the operator.

For the object recognition and localization at point 2, the system uses a Deep CNN solution based on the Y.O.L.O network [49], properly trained on a custom dataset. This dataset will be released in future publications. In this setup, the operator can assemble two parts together, using different tools to complete the task. The robot receives a voice command from the operator, containing the name of the tool necessary to complete the task. By parsing the voice command, the robot understands which tool to look for, recognizes it by object detection among the objects on the table, as explained above, predicts a set of grasps to properly handle the object, and executes the chosen grasp. A video demonstration of the task can be seen in the supplementary material. In Fig. 14 some outputs of the execution are shown.

5.1. Results discussion

As shown in Table 1, the accuracy results are comparable with the state-of-the-art. It is worth noticing, though, that the proposed Multi-Grasp solution is the only one, to the best of the authors' knowledge, to predict multiple grasp poses on multiple objects using only 2D, low-resolution RGB images. Moreover, thanks to the proposed architecture, the system reaches real-time performance with almost 40 frames per second. With the implementation of a collaborative assembly task, a set of design choices arises:

• What to do if more than one object of a chosen class is detected?
• What to do if the chosen object is not present or not detected?
• Which grasp coordinates to choose, if more than one grasp is detected on the object?

A solution for dealing with the first problem may be the development of a reasoning system [50]. With such a system, the robot could ask the operator for further details about the requested object (e.g., colour, position on the table with respect to other objects, dimensions), in order to select the correct one. This research path will be explored in future developments. Concerning the second problem, in the current implementation of the system, if the object is mis-detected or not present in the scene, a visual warning is raised by the system, and the operator responds accordingly. Regarding the last point, several solutions can be implemented. The simplest one considers all the grasps on the same object

Fig. 14. Output of task execution. (a) Object Detection and Localization. (b) Grasp Detection.


and selects the one with the highest score value. Another solution could take into account the object centroid, calculated in the image, and choose the grasp prediction closest to it. Other solutions may rely on additional object features or sensing technologies. For the sake of simplicity, in the present implementation, the grasp with the highest score is chosen.
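The selection policies discussed here can be expressed compactly (a sketch; candidates are assumed to carry a confidence score together with the five rectangle parameters):

```python
import numpy as np


def select_grasp(candidates, policy="highest_score", centroid=None):
    """candidates: list of (score, (xc, yc, w, h, alpha)) predicted on the chosen object."""
    if policy == "highest_score":
        return max(candidates, key=lambda c: c[0])[1]
    if policy == "closest_to_centroid":            # requires the object centroid in image coords
        return min(candidates, key=lambda c: np.hypot(c[1][0] - centroid[0],
                                                      c[1][1] - centroid[1]))[1]
    raise ValueError(policy)
```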

In future developments, effort will be focused on merging the object recognition and localization algorithm with the grasp prediction into a unique architecture, in order to reduce the context-switching overhead of the implementation, and on developing a grasp detector for overlapping objects. Moreover, the reasoning system described in Section 5.1, dealing with multiple objects of the same class present in the scene, will be an interesting topic for future research.

6. Conclusions

Declaration of Competing Interest

In this paper, a Deep Learning solution for the industrial robotic grasping problem has been presented. Two architectures, for Single-Grasp and Multi-Grasp detection, were developed to predict grasp poses on a single object and on multiple objects, respectively. A novel loss function was developed, which has been demonstrated to overcome some of the limits of the solutions already proposed in literature. Its structure also allows building a smaller and faster network (compared with the state-of-the-art), which can be trained from scratch on a small dataset, such as the Cornell one. Moreover, it has been found that system performance is not jeopardized by the number of objects in the scene, nor by the resolution of the input images.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Acknowledgment This paper is supported by European Union’s Horizon 2020 Research and Innovation Program under grant agreement No. 688807, project ColRobot (Collaborative Robotics for Assembly and Kitting in Smart manufacturing).

Appendix A Algorithm for calculating the I.o.U. Loss.

Rectangle conversion

In this section the authors give further details on some of the geometric transformations regarding the rectangle representation employed in this work. As already stated, an oriented rectangle can be represented either with:

• a set of 2D coordinates (x, y). Those coordinates may be ordered; if that is the case, the rectangle boundaries can be obtained by drawing a line between each pair of consecutive points in the set;
• 5 parameters: the two coordinates of the rectangle centre (x_c, y_c), the width and height (w, h), and the orientation angle α.



Coordinates to parameters conversion

Given the set of 4 ordered coordinates, the computation of the 5 parameters was implemented as follows:

• the two centre coordinates (x_c, y_c) as the mean of the 4 corner coordinates;
• the width w as the length, expressed as the L2 norm, of the less angled edge, the edge angle being expressed as the absolute value of the arctan in the interval [−π/2, π/2];
• the height h as the length of the most angled edge;
• the orientation angle δ as the angle of the less angled edge, computed using the signed arctan in the interval [−π/2, π/2].

Parameters to coordinates conversion

Given the 5 parameters {x_c, y_c, w, h, δ}, the computation of the 4 ordered coordinates works as follows:

• a unit-norm vector, oriented as the edge whose length equals the rectangle's width, is obtained by rotating the vector [1, 0] with a rotation matrix of angle δ;
• the obtained vector is further rotated by π/2, to obtain a unit-norm vector oriented as the edge whose length equals the rectangle's height.

From these, the 4 coordinates are easily computed by summing or subtracting the vectors, multiplied by half their respective dimensions w and h, to the rectangle centre. As this method requires ordered coordinates, these were sorted using a dedicated ordering procedure, sketched below.
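One possible ordering procedure consistent with this description (an illustrative sketch, not necessarily the authors' exact listing): sorting the four corners by angle around their centroid guarantees that consecutive corners share an edge, so opposite vertices never end up adjacent:

```python
import numpy as np


def order_corners(corners):
    """Order 4 unordered (x, y) corners counter-clockwise around their centroid."""
    pts = np.asarray(corners, dtype=float)
    centre = pts.mean(axis=0)
    angles = np.arctan2(pts[:, 1] - centre[1], pts[:, 0] - centre[0])
    return pts[np.argsort(angles)]
```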

As both rectangle dimensions (w, h) are shorter than the diagonal, the ordering procedure avoids having opposite vertices adjacent in the output vector.

Appendix B. Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.aei.2020.101052.

matching (2017). http://arxiv.org/abs/1710.01330. [11] M. Simao, P. Neto, O. Gibaru, Natural control of an industrial robot using hand gesture recognition with neural networks, in: IECON 2016 – 42nd Annu. Conf. IEEE Ind. Electron. Soc., IEEE, 2016, pp. 5322–5327. doi: 10.1109/IECON.2016. 7793333. [12] M. Popović, D. Kraft, L. Bodenhagen, E. Başeski, N. Pugeault, D. Kragic, T. Asfour, N. Krüger, A strategy for grasping unknown objects based on co-planarity and colour information, Rob. Auton. Syst. 58 (2010) 551–565, https://doi.org/10. 1016/j.robot.2010.01.003. [13] G. Taylor, L. Kleeman, Grasping unknown objects with a humanoid robot, Proc. Aust. Conf. Robot. Autom. (2002) 27–29. [14] A. Saxena, J. Driemeyer, A.Y. Ng, Robotic grasping of novel objects using vision, Int. J. Rob. Res. (2008), https://doi.org/10.1177/0278364907087172. [15] G. Du, K. Wang, S. Lian, Vision-based robotic grasping from object localization, pose estimation, grasp detection to motion planning: a review (2019). http://arxiv. org/abs/1905.06658. [16] K.S. Rattan, A.J. Scarpelli, R.E. Johnson, A computer vision and robotic system for an intelligent workstation, Comput. Ind. (1989), https://doi.org/10.1016/01663615(89)90116-4. [17] S.D. Roth, Vision system for distinguishing touching parts, 4,876,728, 1989. https://www.google.com/patents/US4876728. [18] G.J. Gleason, G.J. Agin, A modular vision system for sensor-controlled manipulation and inspection, in: Proc. 9th Int. Symp. Ind. Robot. Washingt. D.C., USA, Society of Manufacturing Engineers, 1979, pp. 57–70. [19] W. Silver, Geometric pattern matching for industrial robot guidance, Robot. Res. Springer, London, 2000, pp. 69–77. [20] P.J. Sanz, A. Requena, J.M.I. Quereda, A.P. Del Pobil, Grasping the not-so-obvious: vision-based object handling for industrial applications, IEEE Robot. Autom. Mag. 12 (2005) 44–52. [21] J. Canny, A Computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-8 (1986) 679–698, https://doi.org/10.1109/TPAMI.1986. 4767851. [22] D.G. Lowe, Object recognition from local scale-invariant features, in: Proc. Seventh

References [1] Colrobot Consortium, Colrobot, 2016. www.colrobot.eu (accessed February 10, 2018). [2] C.E. Smith, N.P. Papanikolopoulos, Vision-guided robotic grasping: issues and experiments, in: Proc. IEEE Int. Conf. Robot. Autom., 1996. doi: 10.1109/robot.1996. 509200. [3] M. Nieuwenhuisen, J. Stueckler, A. Berner, R. Klein, S. Behnke, Shape-primitive based object recognition and grasping, in: Robot. 2012; 7th Ger. Conf. Robot., 2012. doi: 10.1016/S1698-031X(07)74060-8. [4] K. Mikolajczyk, A. Zisserman, C. Schmid, Shape recognition with edge-based features, in: Procedings Br. Mach. Vis. Conf. 2003, British Machine Vision Association, 2003, pp. 79.1–79.10. doi: 10.5244/C.17.79. [5] F. Wallhoff, J. Blume, A. Bannat, W. Rösel, C. Lenz, A. Knoll, A skill-based approach towards hybrid assembly, Adv. Eng. Inf. (2010), https://doi.org/10.1016/j.aei. 2010.05.013. [6] Y. Yoon, H.G. Jeon, D. Yoo, J.Y. Lee, I.S. Kweon, Learning a deep convolutional network for light-field image super-resolution, in: Proc. IEEE Int. Conf. Comput. Vis., 2015. doi: 10.1109/ICCVW.2015.17. [7] J. Gong, C.H. Caldas, C. Gordon, Learning and classifying actions of construction workers and equipment using Bag-of-Video-Feature-Words and Bayesian network models, Adv. Eng. Inf. (2011), https://doi.org/10.1016/j.aei.2011.06.002. [8] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3156–3164. doi: 10.1109/CVPR.2015.7298935. [9] Amazon, Amazon Robotic Picking Challenge, 2017. https://blog.aboutamazon. com/amazon-robotics-challenge-winners-announced (accessed November 20, 2018). [10] A. Zeng, S. Song, K.-T. Yu, E. Donlon, F.R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, N. Fazeli, F. Alet, N.C. Dafle, R. Holladay, I. Morona, P.Q. Nair, D. Green, I. Taylor, W. Liu, T. Funkhouser, A. Rodriguez, Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image


L. Bergamini, et al. IEEE Int. Conf. Comput. Vis., vol. 2, 1999, pp. 1150–1157. [23] R.O. Duda, P.E. Hart, Use of the Hough transformation to detect lines and curves in pictures, Commun. ACM. 15 (1972) 11–15. [24] A. Bicchi, V. Kumar, Robotic grasping and contact: a review, Proc.-IEEE Int. Conf. Robot. Autom. (2000), https://doi.org/10.1109/ROBOT.2000.844081. [25] R. Pelossof, A. Miller, P. Allen, T. Jebara, An SVM learning approach to robotic grasping, in: IEEE Int. Conf. Robot. Autom. 2004. Proceedings. ICRA ’04. 2004, 2004. doi: 10.1109/ROBOT.2004.1308797. [26] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, D. Quillen, Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection, Int. J. Rob. Res. (2018), https://doi.org/10.1177/0278364917710318. [27] Y. LeCun, Y. Bengio, Convolutional networks for images, speech, and time series, in: Handb. Brain Theory Neural Networks, 1998. doi: 10.1109/IJCNN.2004.1381049. [28] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Comput. (1989), https://doi.org/10.1162/neco.1989.1.4.541. [29] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J.A. Ojea, K. Goldberg, Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics, in: CoRR, 2017. [30] I. Lenz, H. Lee, A. Saxena, Deep learning for detecting robotic grasps, Int. J. Rob. Res. 34 (2015) 705–724, https://doi.org/10.1177/0278364914549607. [31] I. Lenz, H. Lee, A. Saxena, Cornell grasping dataset (2013). http://pr.cs.cornell.edu/ grasping/rect_data/data.php (accessed January 31, 2018). [32] J. Redmon, A. Angelova, Real-time grasp detection using convolutional neural networks, in: Int. Conf. Robot. Autom., Seattle, Washington, USA, 2015. [33] A. Krizhevsky, I. Sulskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. 60 (2012) 84–90, https://doi.org/10.1145/3065386. [34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: a large-scale hierarchical image database, in: IEEE Conf. Comput. Vis. Pattern Recognition, CVPR, 2009. [35] L.Y. Pratt, Discriminability-Based Transfer Between Neural Networks, in: Adv. Neural Inf. Process. Syst. 5, [NIPS Conf., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993, pp. 204–211.

[36] S. Kumra, C. Kanan, Robotic grasp detection using deep convolutional neural networks, IEEE Int. Conf. Intell. Robot. Syst. 2017 (2017) 769–776, https://doi.org/10. 1109/IROS.2017.8202237. [37] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conf. Comput. Vis. Pattern Recognit., 2016. doi: 10.1109/CVPR. 2016.90. [38] F. Chu, R. Xu, P.A. Vela, Deep grasp: detection and localization of grasps with deep, Neural Netw. (2018). [39] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: Common objects in context, in: Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), 2014. [40] X. Zhou, X. Lan, H. Zhang, Z. Tian, Y. Zhang, N. Zheng, Fully convolutional grasp detection network with oriented anchor (2018). [41] L. Perez, J. Wang, The effectiveness of data augmentation in image classification using deep learning (2017). http://arxiv.org/abs/1712.04621. [42] S.C. Wong, A. Gatt, V. Stamatescu, M.D. McDonnell, Understanding data augmentation for classification: when to warp?, in: 2016 Int. Conf. Digit. Image Comput. Tech. Appl. DICTA 2016, 2016. doi: 10.1109/DICTA.2016.7797091. [43] C. Shorten, T.M. Khoshgoftaar, A survey on image data augmentation for deep learning, J. Big Data 6 (2019) 60, https://doi.org/10.1186/s40537-019-0197-0. [44] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, ArXiv Prepr. ArXiv1409.1556 (2014). [45] D. Ulyanov, A. Vedaldi, V.S. Lempitsky, Instance normalization: the missing ingredient for fast stylization, in: CoRR. abs/1607.0, 2016. [46] M. Everingham, L. Van~Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes challenge 2012 (VOC2012) results, n.d. [47] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch, in: NIPS-W, 2017. [48] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: CoRR. abs/ 1412.6, 2014. [49] J. Redmon, A. Farhadi, YOLOv3: an incremental improvement, ArXiv (2018). [50] D. Whitney, E. Rosen, J. MacGlashan, L.L.S. Wong, S. Tellex, Reducing errors in object-fetching interactions through social feedback, in: IEEE Int. Conf. Robot. Autom., 2017. doi: 10.1109/ICRA.2017.7989121.
