Towards robust intention estimation based on object affordance enabling natural human-robot collaboration in assembly tasks

Procedia CIRP 78 (2018) 255–260
www.elsevier.com/locate/procedia

6th CIRP Global Web Conference “Envisaging the future manufacturing, design, technologies and systems in innovation era”

Martijn Cramer*, Jeroen Cramer, Karel Kellens, Eric Demeester

KU Leuven, Dept. of Mechanical Engineering, ACRO Research Group, Wetenschapspark 27, 3590 Diepenbeek, Belgium

* Corresponding author. Tel.: +32-11-278-820. E-mail address: [email protected]

Abstract

In the manufacturing industry, a shift is observable towards high-mix, low-volume batches. The upcoming era demands both flexible and automated systems for which the required skills go far beyond those of humans or robots alone. The solution can be found in human-robot collaboration (HRC). HRC could be more natural and efficient if both actors recognise and anticipate each other’s activities and intentions. Current mainstream research focuses on recognising activities using a machine learning framework trained with labelled data of movements and gestures. This approach ignores the object affordances: the valuable relationships between the activities and the objects being manipulated. In this paper, first steps are made towards an alternative approach in which object affordances are adopted to recognise operator activities and intentions during an HRC assembly task.

© 2018 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Selection and peer-review under responsibility of the scientific committee of the 6th CIRP Global Web Conference “Envisaging the future manufacturing, design, technologies and systems in innovation era”.

Keywords: intention estimation; machine learning; human-robot collaboration

1. Introduction

In the manufacturing industry, a shift is observable towards high-mix, low-volume batches. The upcoming era demands both flexible and automated systems. Even if a human operator is affordable, many of the required assembly adaptability, precision, reliability and quality requirements go far beyond the skills of humans alone. On the other hand, fully automated systems such as robots lack flexibility, as they mainly perform pre-programmed tasks within unobstructed and known environments. The solution can be found in the middle, by combining the best of these two worlds: robot precision, repeatability and strength along with human intelligence and flexibility under variable conditions. This type of physical cooperation between humans and robots is called human-robot collaboration (HRC) or human-robot interaction (HRI) and is gaining interest from researchers worldwide, as shown by the huge increase of publications in this domain in the past decades [1].

When performing tasks together, humans often do not need explicit communication via words to understand each other. One of the reasons for this is that humans estimate the intentions of their partners and subsequently predict their future behaviour or next movements based on these intentions. Human-robot interaction could also be more natural if human and robot knew each other’s intentions and could anticipate based on these intentions. Subsequently, once the future behaviours of the partners are estimated, the tasks in the collaboration can be implicitly divided among the actors. For example, when partner B estimates that partner A starts with task X, partner B may either provide material for his partner’s task, or start preparing the next task, or work on another subtask, etc.


In this work, first steps are made towards transferring human collaborative behaviour to robots in order to achieve more flexible, efficient and overall safer human-robot collaboration in the manufacturing industry. Current mainstream research focuses on recognising activities using a machine learning framework trained with labelled data of movements and gestures. This approach ignores the valuable relationships between the activities and the objects being manipulated, referred to as ‘object affordances’. This work examines the feasibility of determining current and predicting future operator activities in an assembly task performed by humans, based on these object affordances.

This paper is organised as follows. In Section 2.1, human activity recognition methodologies are introduced, with an explanation of the different levels of complexity at which activities are considered and recognised. Section 2.2 continues with a description of the concept of ‘object affordances’ and its current applications in human activity recognition. In Section 2.3, a formal description is given of the applied Hidden Markov Model, a probabilistic model frequently adopted for activity recognition. Subsequently, Section 3 describes the implementation of our intention recognition framework, with Sections 3.1 and 3.2 covering in more detail the proposed approaches for object recognition, and for activity recognition and intention estimation, respectively. In Section 4, experimental results are discussed and possible future research tracks are pointed out. Finally, Section 5 concludes with a summary of this work.

2. Related work

2.1. Human activity recognition

Human activity recognition deals with the automatic analysis and recognition of ongoing human activities from video or sensor data. This ability has enabled several applications, ranging from the surveillance of public places to detect abnormal and suspicious activities, over the real-time monitoring of patients, children and elderly persons, to the supervision of surgeons or operators in the manufacturing industry in order to improve safety, quality and efficiency [2].

Human activities can be categorised into four levels of complexity: gestures, actions, interactions, and group activities. Gestures, such as stretching an arm or raising a leg, are defined as elementary movements of body parts. They are the atomic building blocks of an action, which is composed of temporally organised gestures carried out by a single person, for instance a person walking or waving. Interactions, on the other hand, are described as human activities involving multiple persons and/or objects. Two persons shaking hands is an example of a human-human interaction, whereas a person pouring coffee represents a human-object interaction. Finally, group activities are defined as activities performed by a group of persons, such as a team playing football.

Depending on the level of complexity of the activities to be recognised, different techniques exist in the domain of human activity recognition. A comprehensive review of human activity recognition methodologies is presented by Aggarwal and Ryoo [2]. Furthermore, a distinction can be made between video- and (binary) sensor-based human activity recognition [3]. In the former, activities are recognised from video data by detecting

patterns of features in the image sequences and classifying them according to their activity category. These features include 3D space-time volumes, human joint trajectories, vectors obtained from hand or skeleton tracking [2], etc. For sensor-based human activity recognition, every object in the environment is equipped with or monitored by (binary) sensors, which detect whether an object is used or not. Subsequently, activities are estimated based on these data [3]. Our work currently focuses on video-based recognition of human-object interactions.

2.2. Object affordances

One branch of activity recognition techniques focuses on recognising human activities based on human-object interactions, which requires the identification of objects and motions as well as their interplays. Although most activity recognition approaches based on human-object interaction ignore those interplays, both object recognition and activity recognition could be improved by including the dependencies among objects, motions and human activities. For example, even though a water bottle and a spray bottle look similar, the required type of interaction and the associated motion are different, namely a drinking action for the water bottle and a spraying action for the spray bottle. Therefore, analysing the activity performed on objects could benefit object recognition, and the opposite holds as well: recognising objects could provide clues about the activity performed [2].

The underlying thought behind this is the concept of ‘object affordances’, introduced by J. J. Gibson in 1979 as “properties of an object [...] that determine what actions a human can perform on them” [4]. A theory that shares common ground with this concept is the activity theory posited by the Russian psychologists Vygotsky, Leontjev, and Lurija, in which an activity is formulated as one or more subjects working on an object to obtain a desired outcome [5]. Activities are distinguished from each other according to the objects used [6, 7]. For both theories, the unique relationship between an activity and the objects one needs to interact with plays the key role. Consequently, observing the objects being manipulated provides knowledge about the activity the human is performing, and this has been deployed for human activity recognition.

In [3, 8, 9, 10, 11, 12], daily human activities are recognised in a domestic scenario based on the interaction between people and objects. An evaluation in a kitchen environment was carried out by Flores-Vázquez and Aranda [8], with real-time recognition of 4 activities involving 11 different objects related to the preparation of breakfast. Here, objects as well as their locations are recognised from colour camera images of the scene to classify the object interactions into 4 types of elementary actions: unchanged, added to, removed from, or moved around in the scene. Based on these sensed objects and elementary actions, the probabilities of human activity are computed. In [11], data on the used objects are combined with object location information to overcome the limitations of each approach. A 2-layer Hidden Markov Model (HMM) is employed to recognise 19 different activities of daily living (ADLs), grouped according to the locations of the objects required to perform these activities. In the HMM’s first layer, the group of activities is selected that satisfies the maximum joint probability using




object location information. Subsequently, in the second layer, the information about the used objects is considered to select the individual activity.

While in the previous works it was often assumed that humans can interact directly with their environment, so that information about the current activity can be gained by observing both human and environment, applications exist where direct physical interaction between human and environment is not possible, and the human needs to interact indirectly with his environment by controlling a robot. In [9], human intentions are estimated by observing the environment in order to assist persons with limited physical and communicative capabilities in a domestic setting. After processing RGB-D data, object information is extracted from the scene and an object-action intention network is constructed. This network contains all the possible actions associated with an object, i.e. the object affordances, and is used to propose a set of yes-or-no questions to the user. Over time, user preferences are learned, and intention predictions are improved through user interaction.

Finally, not only data on used objects can be useful to determine human activities; unused objects may also provide valuable information about which activities have not been performed. Experiments carried out by Abdullah-Al-Wadud [3] show improvements in terms of recognition accuracy when information about unused objects is taken into account.

This work aims at transferring the concept of object affordances to human activity recognition as well as to the estimation of future operator activities, i.e. operator intentions, during an assembly task in the manufacturing industry.

2.3. Hidden Markov Model

A Hidden Markov Model is a doubly embedded stochastic process with an underlying non-observable (hidden) stochastic process, which can only be observed through another set of stochastic processes that produce a sequence of observations [13]. Fig. 1 depicts a graphical representation of an HMM.

Fig. 1. Dynamic Bayesian network representation of the Hidden Markov Model employed in this work. The hidden layer consists of the different steps (tasks) of the assembly plan, while the observable layer represents the zones containing the objects grasped during the execution of these tasks.

This probabilistic model was first published in the late 1960s by Baum et al. [14, 15, 16, 17, 18] and from then on frequently used in speech processing applications. An extensive tutorial on HMMs was drawn up by Rabiner [13]. Important characteristics of an HMM are:

• $N$, the number of hidden states, with the individual states denoted as $S = \{S_1, S_2, S_3, \ldots, S_N\}$ and the hidden state at time $t$ being $q_t \in S$.


• $M$, the number of observation symbols, with the individual symbols denoted as $V = \{v_1, v_2, v_3, \ldots, v_M\}$.
• The state transition probability matrix $A = \{a_{ij}\}$ with $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$ for $1 \le i, j \le N$, representing the probability of being in state $S_j$ at time $t+1$ given that the process was in state $S_i$ in the previous time step.
• The emission probability matrix $B = \{b_j(k)\}$ with $b_j(k) = P(v_k \text{ at } t \mid q_t = S_j)$ for $1 \le j \le N$ and $1 \le k \le M$, representing the probability of observing symbol $v_k$ at time $t$ given that the process is currently in state $S_j$.
• The initial state distribution $\pi = \{\pi_i\}$ with $\pi_i = P(q_1 = S_i)$ for $1 \le i \le N$, representing the probability of the process being in state $S_i$ at the beginning.
• $T$, the length of the observation sequence $O = \{O_1, O_2, O_3, \ldots, O_T\}$ generated by the process, with $O_t$ one of the symbols from $V$.

This results in the compact notation $\lambda = (A, B, \pi)$ indicating the parameter set of the HMM. Typically, three assumptions are made with respect to HMMs [19]:

• Limited horizon or first-order Markov assumption: the probability of the process being in a particular state at time $t$ only depends on the previous state, $P(q_t \mid q_{t-1}, q_{t-2}, \ldots, q_1) = P(q_t \mid q_{t-1})$.
• Output independence assumption: the probability of an observation $O_i$ only depends on the state that generated the observation, not on any other state or observation, $P(O_i \mid q_1, \ldots, q_i, \ldots, q_T, O_1, \ldots, O_i, \ldots, O_T) = P(O_i \mid q_i)$.
• Stationary process assumption: the state transition probabilities do not change over time, $P(q_t \mid q_{t-1}) = P(q_2 \mid q_1)$.

Furthermore, three basic problems need to be solved for the model to be useful in real-world applications [20]:

• Problem 1 (likelihood computation): given the HMM $\lambda = (A, B, \pi)$ and the observation sequence $O = \{O_1, O_2, O_3, \ldots, O_T\}$, compute the probability this sequence was generated by the given model, i.e. $P(O \mid \lambda)$. This problem is solved by the forward algorithm [20].
• Problem 2 (decoding): given the HMM $\lambda = (A, B, \pi)$ and the observation sequence $O = \{O_1, O_2, O_3, \ldots, O_T\}$, determine the most likely sequence of hidden states that emitted this sequence of observations. This problem is solved by the Viterbi algorithm [20].
• Problem 3 (training): given a set of observations $O$, i.e. a training sequence, adjust the parameter set $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$. This problem is solved by the expectation-maximisation (EM) or Baum-Welch algorithm [20].
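To make this notation concrete, the following minimal sketch encodes a toy parameter set $\lambda = (A, B, \pi)$ with $N = 2$ hidden states and $M = 2$ observation symbols and solves Problem 1 with the forward algorithm. It is written in Python with NumPy purely for illustration (the implementation in this work is in C++), and all probability values are invented.

import numpy as np

# Toy HMM lambda = (A, B, pi): N = 2 hidden states, M = 2 symbols.
# All probabilities are illustrative placeholders, not trained values.
A  = np.array([[0.7, 0.3],          # a_ij = P(q_{t+1} = S_j | q_t = S_i)
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],          # b_j(k) = P(v_k at t | q_t = S_j)
               [0.2, 0.8]])
pi = np.array([0.6, 0.4])           # pi_i = P(q_1 = S_i)

def forward_likelihood(O, A, B, pi):
    """Problem 1: compute P(O | lambda) with the forward algorithm.
    O is a sequence of 0-based observation symbol indices."""
    N, T = A.shape[0], len(O)
    fwd = np.zeros((N, T))
    fwd[:, 0] = pi * B[:, O[0]]                   # initialisation
    for t in range(1, T):                         # recursion over time
        fwd[:, t] = (fwd[:, t - 1] @ A) * B[:, O[t]]
    return fwd[:, T - 1].sum()                    # termination: P(O | lambda)

print(forward_likelihood([0, 1, 0], A, B, pi))

Problems 2 and 3 replace the summation by a maximisation over predecessor states (Viterbi) and re-estimate $(A, B, \pi)$ from data (Baum-Welch), respectively.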

3. Implementation

This work targets the recognition of operator activities and the estimation of operator intentions for the illustrative task of a human operator assembling a toy cement mixer truck, see Fig. 2.


Here, activities and intentions are defined as, respectively, the current and the next step (task) of the sequential assembly plan that the operator is or will be performing. In order to recognise these activities and intentions, the concept of object affordances is applied to HMMs.

Fig. 2. Illustrative task for validating the proposed method: assembling a wooden toy cement mixer truck.

3.1. Object recognition

The first and main focus of this research is to demonstrate the proof of concept of our intention recognition framework. Therefore, at this stage, we have adopted a simplified but very robust object recognition approach, as shown in Fig. 3. The operator’s workspace is divided into 6 different zones, with one type or only a limited number of different types of objects from the assembly assigned to each zone. Recognition of these objects is subsequently limited to detecting and registering the zones that are accessed by the operator’s hand, i.e. we assume that if the operator’s hand stays for more than an empirically determined time of 300 ms in a specific zone, this corresponds to the operator picking the object lying in that zone.

For this, a Microsoft Kinect v1 RGB-D camera was mounted above the work table and OpenNI2 drivers were installed to obtain the Kinect’s 2D camera images. The images are processed using the open-source computer vision and machine learning software library OpenCV 3.4.0, whereby the programming is done in C++ in Visual Studio 2017. Since gloves are often used by operators in an industrial setting for protection against dirt or injuries, hand tracking based on skin colour detection is difficult. Therefore, brightly coloured blue latex gloves were used, and a colour filter was applied to the images. Afterwards, noise pixels caused by blue accents from the environment are removed by erosion and dilation with an elliptical-shaped kernel, resulting in one region of white pixels representing the operator’s hand. Finally, this region is tracked by means of blob detection, which identifies regions that differ in properties (colour) compared to surrounding regions. Subsequently, the sequence of zones accessed by the operator’s hand is registered and delivered as input for the HMMs in the activity recognition and intention estimation phase.
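The sketch below illustrates this pipeline. It uses OpenCV’s Python bindings instead of the C++/Visual Studio 2017 setup of the actual implementation, approximates the blob detection step by the centroid of the filtered mask, and the HSV colour bounds, kernel size and 3×2 zone grid are stand-in values rather than the calibrated ones.

import cv2
import numpy as np

# Assumed HSV bounds for the blue latex gloves; tune for actual lighting.
LOWER_BLUE = np.array([100, 120, 50])
UPPER_BLUE = np.array([130, 255, 255])
KERNEL = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))

def detect_hand(frame_bgr):
    """Return the (x, y) centroid of the gloved hand, or None."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, LOWER_BLUE, UPPER_BLUE)  # colour filter
    mask = cv2.erode(mask, KERNEL)                   # remove noise pixels
    mask = cv2.dilate(mask, KERNEL)                  # restore the hand region
    m = cv2.moments(mask)
    if m["m00"] == 0:                                # no blue region found
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])

def zone_of(hand_xy, frame_w, frame_h):
    """Map a hand position to one of 6 zones, here laid out as a 3x2 grid
    (a stand-in for the actual workspace layout)."""
    col = min(int(3 * hand_xy[0] / frame_w), 2)
    row = min(int(2 * hand_xy[1] / frame_h), 1)
    return row * 3 + col + 1                         # zone ids 1..6

A zone would then be registered as accessed once the detected hand dwells in it for more than the empirically determined 300 ms, e.g. by time-stamping consecutive frames that map to the same zone id.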

Fig. 3. Object recognition. (a) extracting the operator hand by applying a blue colour filter; some error pixels are left in the image; (b) removing the error pixels from image ‘a’ by applying erosion and dilation with an elliptical-shaped kernel; (c) detecting the operator hand by applying blob detection on image ‘b’; the red circle denotes the detected hand; (d) object recognition by detecting the zone accessed by the operator hand after dividing the workspace into 6 different zones containing specific objects.

3.2. Activity recognition and intention estimation

Fig. 1 shows the applied HMM in its dynamic Bayesian network representation, which consists of a set of hidden and observable variables connected by their conditional dependencies. The hidden layer corresponds to the different tasks of the assembly plan, for example mounting the bumper on the front of the truck, whereas the observable layer represents the zones containing the objects being grasped by the operator’s hand during the execution of these tasks, for instance zone 5 holding the bumper. This is where the concept of object affordance comes in. Given the observed sequence of accessed zones, i.e. the objects manipulated during assembly, it is possible by means of HMMs to infer the sequence of activities that required the manipulation of these objects. In other words, the unique relationship between objects and activities is employed to estimate current and future operator activities (intentions).

Fig. 4. Sections of the applied HMM’s initial, emission and state transition matrices after supervised training on a labelled dataset of 30 potential assembly sequences.

Four participants were asked to assemble the cement mixer truck shown in Fig. 2, resulting in 30 potential assembly sequences. The task consisted of 13 parts, divided over 6 zones and assembled in 10 steps (activities). During execution, the entered zones as well as the underlying activities were registered. Subsequently, the HMM’s parameters are learned by supervised training on this labelled dataset of 30 assembly sequences, resulting in the matrices depicted in Fig. 4.
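Since both the entered zones and the underlying activities were recorded, supervised training reduces to frequency counting. The following minimal sketch assumes each labelled sequence is a list of 0-based (activity, zone) index pairs; the light smoothing constant is our own addition to keep unseen events at a non-zero probability, not necessarily part of the original procedure.

import numpy as np

def train_supervised(sequences, N=10, M=6, eps=1e-6):
    """Estimate lambda = (A, B, pi) from labelled sequences, where each
    sequence is a list of (activity, zone) index pairs."""
    A  = np.full((N, N), eps)    # transition counts
    B  = np.full((N, M), eps)    # emission counts
    pi = np.full(N, eps)         # initial state counts
    for seq in sequences:
        pi[seq[0][0]] += 1
        for s, z in seq:
            B[s, z] += 1
        for (s, _), (s_next, _) in zip(seq, seq[1:]):
            A[s, s_next] += 1
    # Normalise the counts into probability distributions (row-stochastic).
    A  /= A.sum(axis=1, keepdims=True)
    B  /= B.sum(axis=1, keepdims=True)
    pi /= pi.sum()
    return A, B, pi

A procedure of this kind yields row-stochastic matrices such as the sections shown in Fig. 4.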




Finally, our framework is validated by having one of the participants assemble the cement mixer truck, whereby the sequence of accessed zones is recorded using the approach described in Section 3.1 and delivered as input for the HMM. The current operator activity is estimated by the Viterbi algorithm, resulting in the most likely sequence of activities (hidden states) given the observed sequence of accessed zones (observations) and the grasped objects related therewith. The future operator activity at time step $t+1$, referred to as the operator intention, is estimated by means of an extension of the forward algorithm, shown in the pseudocode below. For each hidden state at time step $T+1$, the joint probability $P(O_1, O_2, \ldots, O_T, S_i)$ of being in this particular state $S_i$ after seeing the $T$ observations is calculated, and the hidden state (operator activity) with the highest probability represents the operator intent.

Forward algorithm [20] extended with future state prediction

Input:
• observation sequence $O = \{O_1, O_2, \ldots, O_T\}$ with length $T$, generated by the process.
• HMM defined by $\lambda = (A, B, \pi)$ consisting of $N$ hidden states.

Output:
• pointer to the most likely future state at time $T+1$.

1. Initialisation:
    create a probability matrix $forward[N+2, T]$
    create a future state matrix $futurestates[N, 1]$
    for each state $s$ from 1 to $N$ do
        $forward[s, 1] \leftarrow a_{0,s} \cdot b_s(O_1)$

2. Recursion:
    for each time step $t$ from 2 to $T$ do
        for each state $s$ from 1 to $N$ do
            $forward[s, t] \leftarrow \sum_{s'=1}^{N} forward[s', t-1] \cdot a_{s',s} \cdot b_s(O_t)$

3. Future state prediction ($T+1$):
    for each state $s$ from 1 to $N$ do
        $futurestates[s] \leftarrow \sum_{s'=1}^{N} forward[s', T] \cdot a_{s',s}$
    $futurestate \leftarrow \operatorname{argmax}_{s'=1}^{N}\, futurestates[s']$
    return $futurestate$
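For reference, a runnable transcription of this pseudocode follows, in Python with NumPy and 0-based indexing; the initial state distribution $\pi$ plays the role of the start transitions $a_{0,s}$, and the matrices are assumed to come from a supervised training step such as the sketch above.

import numpy as np

def predict_next_activity(O, A, B, pi):
    """Forward algorithm extended with future state prediction: return the
    index of the most likely hidden state (operator intention) at time T+1,
    given the observed zone sequence O (0-based symbol indices)."""
    N, T = A.shape[0], len(O)
    forward = np.zeros((N, T))
    forward[:, 0] = pi * B[:, O[0]]              # initialisation
    for t in range(1, T):                        # recursion
        forward[:, t] = (forward[:, t - 1] @ A) * B[:, O[t]]
    future_states = forward[:, T - 1] @ A        # P(O_1..O_T, q_{T+1} = S_i)
    return int(np.argmax(future_states))         # operator intention

In practice, the forward variables are usually rescaled per time step or kept in log space to avoid numerical underflow for long observation sequences [13].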

4. Discussion and future work

The previous sections offered an insight into the proposed method of human activity recognition based on the concept of object affordances, followed by a description of the initial steps towards a more natural human-robot collaboration in which the current and future activities (intentions) of an operator are estimated during an assembly task. This section discusses the obtained results and directions for future research to overcome current shortcomings.

In this work, human activity recognition is carried out based on the concept of object affordances by using an HMM. The first results in Fig. 5 show that, with this method, the assembly sequence followed by the operator, as well as the current and future activity (intention), can be recognised for the specified assembly task, namely the assembly of a toy cement mixer truck.


Fig. 5. Activities predicted by our trained HMM, plotted as log-likelihoods, and the actual activities (red) of the human operator during the assembly of the toy cement mixer truck.

In order to demonstrate the proof of concept of the proposed intention recognition framework, it was assumed at this stage that all observations are noiseless. This resulted in a deterministic HMM, of which the initial state distribution, the state transition probability matrix and the emission probability matrix were trained in a supervised manner. The next step will consist of training the HMM beforehand based on potential assembly sequences generated from the product’s CAD data. Afterwards, the HMM will be improved and adapted to operator preferences by training it incrementally during task execution. This method has the advantage that robot assistance can be offered from the very beginning, without the need for extensive training on recorded data, and that this assistance will improve and become better attuned to the individual operator over time.

Furthermore, no fixtures or tools were involved. Future work will therefore consist of integrating essential assembly equipment such as fixtures and tools into the model, and valuable process knowledge will be extracted from the product’s CAD data, such as part geometries, geometric dimensions and tolerances (GD&T), spatial object relations, etc. From this process knowledge it will follow which parts of the assembly require the use of tools, and these tools will be linked to the relevant tasks in the HMM. The observation of these tools then provides information about the activities being performed, just as the other parts do. Besides, the presented model is appropriate for capturing correct, executable assembly sequences and predicting the next action to be executed. Future work will also focus on how an inappropriate action or assembly sequence could be deduced from observations.

One of the weaknesses of conventional HMMs is the inherent geometric state duration probability density (when time is discrete) caused by self-transitions. For most applications, explicitly modelling state duration via a certain duration density is more appropriate, in which state transitions are only made after staying a specific amount of time in a state, associated with the number of observations produced while being in this state [13, 21]. Hidden Semi-Markov Models (HSMMs) allow for this explicit modelling of state duration, whereby self-transitions are excluded and the hidden states can emit sequences of observations instead of the single observation of conventional HMMs [21]. Another extension of the widely used HMM is the Hierarchical Hidden Markov Model


(HHMM), which can handle a topology of hidden states based on parent-child relationships between multiple levels [22, 23]. HHMMs have already been applied in human activity recognition, where this model allows activities to be derived at different levels. For example, simpler sub-activities, such as hand gestures when reaching for an object, can be part of a large-scale activity, such as assembling a product [22]. A similar hierarchy can also be found in assembly tasks when taking the concept of object affordances into account. For example, the assembly of a product typically requires several subassemblies. By using an HHMM, it becomes possible to recognise activities and intentions on multiple levels given the observations of the parts required for these subassemblies. Future work could include combining these two models, HSMM and HHMM, in order to explicitly model state duration and estimate activities and intentions at multiple levels of the hierarchical assembly structure. Also, other probabilistic models such as Conditional Random Fields (CRF), Naive Bayes Classifiers (NBC), and other Dynamic Bayesian Networks (DBN) [7, 24], as well as their variants, will be considered as alternatives to our implementation of conventional HMMs.

It is the authors’ long-term vision to work towards a robust and intention-based collaboration between humans and robots. The first step is to recognise the intentions of the human, in order for the robot to anticipate appropriately and in time. The aim of this work is to introduce the principle of object affordances in this domain and to demonstrate its feasibility for estimating operator activities and intentions with a view to assembly tasks. Next steps will consist of generating anticipatory tasks and commands for the robot given these human intentions, and studying the effects of those interactions on the observation layer and the proposed model. Moreover, to gain acceptance by industry, the framework also needs to be robust and scalable. Future research will therefore comprise examining the scalability of the proposed HMM and its computational demand when evolving from this toy example to an industrial assembly task, as well as improving the robustness of the object recognition by using Point Pair Feature Matching (PPFM) together with the geometric information contained in the product’s CAD data [25].

5. Conclusion

In this work, first steps are made towards transferring human collaborative behaviour to robots to achieve more flexible, efficient and overall safer human-robot collaboration in the manufacturing industry. Current mainstream research focuses on recognising activities using a machine learning framework trained with labelled data of movements and gestures. This ignores the valuable relationship between the activity and the objects being manipulated, referred to as ‘object affordances’. In the literature, some applications of object affordances for human activity recognition in domestic environments exist. However, their application in the manufacturing industry, more specifically in assembly tasks, as well as their adoption for intention estimation, is rather limited. Therefore, a feasibility study is conducted that examines the use of object affordances to determine current and future operator activities in an assembly task. The feasibility of the proposed method has been shown by the experiments conducted in this research.

References

[1] P. Tsarouchi, S. Makris and G. Chryssolouris, “Human-robot interaction review and challenges on task planning and programming,” International Journal of Computer Integrated Manufacturing, vol. 29, no. 8, pp. 916-931, 2016.
[2] J. K. Aggarwal and M. S. Ryoo, “Human activity analysis: a review,” ACM Computing Surveys, vol. 43, no. 3, pp. 1-43, 2011.
[3] M. Abdullah-Al-Wadud, “A human activity recognition system based on sensory data related to object usage,” International Journal of Computer and Information Engineering, vol. 8, no. 1, pp. 34-36, 2014.
[4] J. J. Gibson, The Ecological Approach to Visual Perception, Boston, Massachusetts: Houghton Mifflin, 1979.
[5] J. M. Carroll, “Activity theory,” in HCI Models, Theories, and Frameworks: Toward a Multidisciplinary Science, San Francisco, California: Morgan Kaufmann Publishers Inc., 2003, pp. 291-324.
[6] B. Nardi, Context and Consciousness: Activity Theory and Human-Computer Interaction, Cambridge, Massachusetts: MIT Press, 1996.
[7] J. Yang, J. Lee and J. Choi, “Activity recognition based on RFID object usage for smart mobile devices,” Journal of Computer Science and Technology, vol. 26, no. 2, pp. 239-246, 2011.
[8] C. Flores-Vázquez and J. Aranda, “Human activity recognition from object interaction in domestic scenarios,” 2016 IEEE Ecuador Technical Chapters Meeting (ETCM), pp. 1-6, 2016.
[9] K. Duncan, S. Sarkar, R. Alqasemi and R. Dubey, “Scene-dependent intention recognition for task communication with reduced human-robot interaction,” Computer Vision - ECCV 2014 Workshops, Lecture Notes in Computer Science, vol. 8927, 2014.
[10] J. Wu, A. Osuntogun, T. Choudhury, M. Philipose and J. Rehg, “A scalable approach to activity recognition based on object use,” 2007 IEEE 11th International Conference on Computer Vision, pp. 1-8, 2007.
[11] M. Humayun Kabir, M. Robiul Hoque, K. Thapa and S.-H. Yang, “Two-layer hidden Markov model for human activity recognition in home environments,” International Journal of Distributed Sensor Networks, vol. 2016, pp. 1-12, 2016.
[12] H. Koppula and A. Saxena, “Anticipating human activities using object affordances for reactive robotic response,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 14-29, 2016.
[13] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, pp. 257-286, 1989.
[14] L. E. Baum and T. Petrie, “Statistical inference for probabilistic functions of finite state Markov chains,” Annals of Mathematical Statistics, vol. 37, pp. 1554-1563, 1966.
[15] L. E. Baum and J. A. Egon, “An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology,” Bulletin of the American Meteorological Society, vol. 73, pp. 360-363, 1967.
[16] L. E. Baum and G. R. Sell, “Growth functions for transformations on manifolds,” Pacific Journal of Mathematics, vol. 27, no. 2, pp. 211-227, 1968.
[17] L. E. Baum, T. Petrie, G. Soules and N. Weiss, “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,” Annals of Mathematical Statistics, vol. 41, no. 1, pp. 164-171, 1970.
[18] L. E. Baum, “An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes,” Inequalities, vol. 3, pp. 1-8, 1972.
[19] D. Ramage, “Hidden Markov models fundamentals,” CS229 Section Notes, Stanford University, 2007, pp. 1-13.
[20] D. Jurafsky and J. Martin, “Hidden Markov models,” in Speech and Language Processing, Pearson Education, 2017.
[21] S.-Z. Yu, “Hidden semi-Markov models,” Artificial Intelligence, vol. 174, pp. 215-243, 2010.
[22] A. Roitberg, N. Somani, A. Perzylo, M. Rickert and A. Knoll, “Multimodal human activity recognition for industrial manufacturing processes in robotic workcells,” Proceedings of the 2015 ACM International Conference on Multimodal Interaction, pp. 259-266, 2015.
[23] S. Fine, Y. Singer and N. Tishby, “The hierarchical hidden Markov model: analysis and applications,” Machine Learning, vol. 32, no. 1, pp. 41-62, 1998.
[24] G. Metta, L. Natale, F. Nori, G. Sandini, D. Vernon, L. Fadiga, C. von Hofsten, K. Rosander, M. Lopes, J. Santos-Victor, A. Bernardino and L. Montesano, “The iCub humanoid robot: an open-systems platform for research in cognitive development,” Neural Networks, vol. 23, no. 8-9, pp. 1125-1134, 2010.
[25] B. Drost, M. Ulrich, N. Navab and S. Ilic, “Model globally, match locally: efficient and robust 3D object recognition,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, California, 2010.