
Representation and Classification of Whole-body Motion Integrated with Finger Motion


Wataru Takano*
Center for Mathematical Modeling and Data Science, Osaka University, 1-3 Machikaneyamacho, Toyonaka, Osaka 560-8531, Japan

Yusuke Murakami

Department of Mechano-Informatics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

Yoshihiko Nakamura

Department of Mechano-Informatics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan


Abstract


This paper presents a novel approach to representing human whole-body motions together with finger motions and to classifying human motions performed during tasks that require delicate finger movements, such as holding or grasping. Human whole-body motions are recorded using an optical motion capture system that measures the positions of markers attached to a performer. Additionally, the performer wears data gloves with strain gauges fixed at the finger joints to measure flexion and extension. Combining the whole-body motion with the finger motions forms a representation of integrated motion, which is subsequently encoded into a probabilistic model whose parameters are optimized such that the model is most likely to generate the training data for the integrated motion. An observation of integrated motion is classified into the probabilistic model with the largest probability of generating that observation. Synchronous measurement of human whole-body and finger motions created a dataset of integrated human motions. We tested our proposed approach on this dataset, demonstrating that representations of whole-body motion integrated with finger motions improve the classification of human motions while manipulating objects.

Keywords: Motion primitive, Motion classification, Stochastic model

1. Introduction


Many advanced countries are facing serious labor shortages. One reason for this is an increasing elderly population amid a shrinking total population. Because it is difficult to replenish a dwindling workforce, the use of humanoid robots is expected to provide one solution to this critical situation.

* Corresponding author. Tel: +81-6-6850-8298, Fax: +81-6-6850-6092. Email addresses: [email protected] (Wataru Takano), [email protected] (Yusuke Murakami), [email protected] (Yoshihiko Nakamura).


Humanoid robots can perform human-like motions, and unlike humans they can work even in environments that are hostile to humans. This makes it possible to supplement the labor force with humanoid robots. Research on humanoid robots has been pursued from various perspectives in recent years. Intensive research has particularly focused on increasing integration density and improving the accuracy of hardware technology [1][2][3][4]. Software elements for perception and action are essential for developing artificial intelligence for humanoid robots, and these elements are expected to enable autonomy and allow robots to replace human workers. It is difficult to manually program whole-body motions for humanoid robots with many degrees of freedom. There has thus been considerable research on learning by imitation, or programming by demonstration, where human whole-body motions are encoded as mathematical models. Conventional frameworks allow humanoid robots not only to synthesize human-like whole-body motions but also to recognize human actions. However, the movements they handle are limited to whole-body movements and do not consider finger movements. Finger movements must be well controlled if humanoid robots are to perform complicated tasks such as grasping, holding, and manipulating objects, so information related to finger movements would be helpful for precisely and correctly recognizing human actions. A model of human and robotic motions thus needs to be established that represents the finger movements that often accompany whole-body movements.

In this study, we apply an imitative learning framework to integrated human motions comprising whole-body and finger movements. Human whole-body and finger movements are measured using an optical motion capture system along with data gloves equipped with strain gauges. The resultant integrated human motions are encoded as probabilistic models that serve as a motion classifier. To our knowledge, this is the first approach that records whole-body movements in synchronization with finger movements and that considers both whole-body and finger movements in the motion representation and classifier. We tested our approach to motion classification by comparing classifications of human whole-body motions with those of integrated human motions, thereby demonstrating our method's validity.

2. Related Work

There has been considerable research on mathematical modeling of human and robot movements, particularly through imitative learning or programming by demonstration. This provides a potential approach toward artificial intelligence allowing robots to synthesize motions similar to training data and to recognize various motions. There are two popular approaches to motion modeling: dynamical systems [5][6][7][8] and stochastic systems [9][10][11][12][13][14]. Dynamical systems represent motion data as differential equations in the state space. They can predict what movements will follow a currently observed motion according to these differential equations and can subsequently compute the error between predictions and actual observations. They consequently recognize motions by finding the motion model with the least error, implying that this motion model has high similarity to the observation. In contrast, stochastic systems represent motion data as a probabilistic graphical model with motion transitions and distributions. These models can compute the probability of the current motion observation being generated by each motion model, and subsequently classify the observation into the motion model category with the highest probability. Thus, both dynamical and stochastic systems encode motion data into model parameters. In other words, motion data are discretely represented as points in the parameter space. These points represent, or symbolize, the motion, so the model is referred to as a "motion symbol."


Motion symbols can be used as motion classifiers, but they do not provide an intuitive interface between robots and humans. Specifically, when we receive a response from the motion model regarding how an observation is classified, we cannot understand this from the model parameters alone as a motion category such as "squatting," "walking," or "running." Translation between motion symbols and human-readable representations is thus necessary to improve ease of use. Natural language is one typical representation. In the fields of robotics and computer graphics, intensive research has been conducted to connect motions with natural language. Rose et al. presented the novel concept of "verbs" and "adverbs" in the motion space [15]. Here, a "verb" denotes a group of similar motions, and an "adverb" denotes a difference between two motions within the same group. "Adverb" parameters control interpolation between two motions, thereby creating a new motion from "verb" and "adverb" inputs. Arikan et al. developed a method for creating character motions from words [16]. This framework establishes a database of motions with verb labels attached to their frames. The database allows finding continuous sequences of motions relevant to input verbs. More specifically, dynamic programming can efficiently search for sequences of motion frames to which the input verbs are attached while satisfying two constraints: continuity and attachment of the input verbs. Inspired by machine translation techniques, we applied a stochastic translation model to the relation between two sequences, namely human whole-body motions and the verb labels attached to them [17][18]. This machine translation technique creates bidirectional mappings between motions and verbs, allowing translation between them. These approaches have successfully linked motion data to their descriptive words, but have not reached the level of assembling words into sentences in accordance with grammatical rules. Sugita et al. proposed a novel framework for connecting motions with descriptive sentences by using two recurrent neural networks [19]. One neural network represents the dynamics of motions, and the other represents the dynamics of words in sentences. These neural networks share bias parameters that bridge motions and sentences. This framework realizes motion synthesis from sentences. Ogata et al. extended this framework to the generation of sentences from motions [20]. Motions generate multiple sentence candidates, and each candidate is converted to a motion in the same way as Sugita et al. An appropriate sentence is selected by comparing the generated motions with the original ones. This framework has been further improved to handle a broader variety of motions and sentences [21][22]. We have also presented a framework for integrating human whole-body motions with natural language by using probabilistic models [23][24][25]. This framework comprises two probabilistic modules for semantic and syntactic functions. One module represents relations between motions and their relevant words for the semantic function, and the other represents word order for the syntactic function. These two modules can compute the probability of a sentence being generated from a given motion, then search for the sentence with the greatest probability of describing the motion. A recent trend in deep learning [26][27][28] has been accelerating techniques for translating human motions into descriptive sentences. Plappert et al.
used deep-learning techniques to link human whole-body motions with descriptions [29]. One recurrent neural network computes context features from motions, and the other predicts words from previous words and their contexts. Iterating these computations produces descriptions of human motions. Yamada et al. also developed an approach toward bidirectional translation between motions and descriptions based on two autoencoders for perceptions and sentences. These two autoencoders share context features, which are tuned to reduce the error between them; the shared context features connect motions and sentences [30]. Ahn et al. also adopted the deep-learning technique of a generative adversarial network for linking human motions to descriptions [31].


Specifically, a generator extracts a text feature from a sentence and subsequently generates a human motion from the resultant feature. A discriminator extracts a text feature from a sentence and subsequently differentiates actual human motions from generated ones. These modules develop mutually as a pair by training on relations between the motions and sentences. The approaches described above have developed mathematical models for training on human whole-body motions. Human whole-body motions are indeed important features for classifying human actions, but they are not necessarily sufficient for their representation. These approaches need to be extended to handle more delicate movements so that they can distinguish human motions that consist of similar whole-body postures. As reported in several previous studies, an object with which a human body physically interacts is helpful for correctly and precisely classifying human actions [32][33]. Representing human-object interaction is thus crucial for recognizing human actions. Since human fingers directly interact with objects by holding, grasping, and manipulating them, finger movements would provide important information for action classification.


3. Representation and classification of human motions

3.1. Motion representation


Measurement techniques for human whole-body motions have been developed around widely used inertial measurement unit (IMU) sensors and optical motion capture systems. IMU sensors attached to a performer output the locations and orientations of the sensors, and a kinematic computation algorithm estimates the joint angles or positions of the human body. IMU sensors do not limit the measurement space to a laboratory setting but have the drawback of drift error. An optical motion capture system detects markers attached to performers in images of their actions viewed from multiple cameras, thereby estimating the marker locations. These marker locations reveal the posture of a human character with virtual markers attached, through forward and inverse kinematic computation algorithms. This computation allows estimation of the joint angles and positions of the human body. An optical motion capture system requires installing multiple cameras in a laboratory, limiting the measurement space. However, it does not suffer from measurement uncertainty due to drift error, so optical motion capture systems are superior to IMU sensors in terms of measurement accuracy. Although optical motion capture systems can accurately measure the angles or positions of all joints in the human body, it is difficult to measure the joint angles of a human hand. The human hand has many degrees of freedom, with many joints densely arranged within a small area. This makes it difficult to attach small markers to the fingers and to detect these densely arranged markers when attempting to measure the angles and positions of each joint in the human hand. The use of strain gauges is a long-established method for measuring deflection. A data glove with strain gauges fixed at the finger joints has been developed to record the flexion and extension of the fingers. Such a data glove can measure the angles of all finger joints. The human motion representation in this research integrates all joint angles in the human body with all finger joint angles in both hands. Specifically, we concatenate two feature vectors, x_W and x_H, into an integrated feature vector x_I, as shown in Fig. 1. x_W is a feature vector whose elements are the joint angles of the whole human body, and x_H is a feature vector whose elements are the finger joint angles of both human hands. The feature x_I is written as x_I = [x_W^T, x_H^T]^T. A sequence of features x_I forms a matrix X = [x_I(1), x_I(2), · · · , x_I(T)] for the integrated human motion.
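As a concrete illustration of this representation, the following sketch assembles the integrated feature matrix X from per-frame whole-body and hand joint angles. It is a minimal sketch, not the authors' implementation; the function name, the array shapes, and the use of NumPy are illustrative assumptions (the dimensions follow the figure model used later in the experiments: 34 body degrees of freedom and 25 per hand).

```python
import numpy as np

def integrate_motion(body_angles, hand_angles):
    """Concatenate whole-body and finger joint angles frame by frame.

    body_angles: (T, 34) array, rows are x_W(t) from the motion capture system
    hand_angles: (T, 50) array, rows are x_H(t) from the data gloves (both hands)
    Returns the integrated motion X as a (T, 84) array whose rows are
    x_I(t) = [x_W(t)^T, x_H(t)^T]^T.
    """
    assert body_angles.shape[0] == hand_angles.shape[0], "sequences must be synchronized"
    return np.hstack([body_angles, hand_angles])

# Toy usage with random data standing in for one captured trial of length T = 200.
T = 200
x_body = np.random.randn(T, 34)
x_hand = np.random.randn(T, 50)
X = integrate_motion(x_body, x_hand)
print(X.shape)  # (200, 84)
```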


Figure 1: Integrated human motion is defined as a sequence of features comprising angles of all joints in the human body and angles of all finger joints in both hands.

Jo

3.2. Motion encoding

Integrated human motion data are encoded into a set of parameters of a mathematical model. In this study, we adopted a hidden Markov model (HMM), a popular probabilistic graphical model that has the advantage of being able to handle spatiotemporal data. Figure 2 shows an overview of the HMM structure. The HMM is defined in compact notation as λ = {A, B, Π}, where A is the matrix whose entries a_ij are the probabilities of transitioning from the ith to the jth node, B is the set of data distributions generated from each node, and Π is a vector whose entries π_i are the probabilities of starting at the ith node. According to the transition probabilities A and the initial node probabilities Π, the model moves from node to node; these node transitions represent the temporal features. Each node subsequently generates data according to the output probabilities B, which represent the spatial features. In this manner, an HMM can encapsulate the spatiotemporal features of human motions. The model parameters are optimized by the Baum–Welch expectation-maximization algorithm to maximize the probability of generating the training data [34]. HMMs can be used as motion classifiers. An HMM should generate motion data similar to its training data (i.e., it should not generate data very different from its training data).
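For readers who want to reproduce this encoding step, the sketch below fits one Gaussian-output HMM per motion category with Baum–Welch training. It assumes the third-party hmmlearn package as a stand-in for the training described above; the number of hidden nodes, the diagonal covariance, and the function name are illustrative assumptions rather than settings reported in this paper.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # third-party package, assumed available

def train_motion_hmms(trials_by_category, n_nodes=10):
    """Fit one HMM lambda = {A, B, Pi} per motion category.

    trials_by_category: dict mapping a category name to a list of integrated
                        motions X, each an array of shape (T_i, D) (Sec. 3.1).
    Returns a dict of fitted GaussianHMM models, one per category.
    """
    models = {}
    for category, trials in trials_by_category.items():
        data = np.vstack(trials)                  # stack all training trials
        lengths = [len(trial) for trial in trials]
        hmm = GaussianHMM(n_components=n_nodes, covariance_type="diag", n_iter=100)
        hmm.fit(data, lengths)                    # Baum-Welch (EM) parameter estimation
        models[category] = hmm
    return models
```

Each fitted hmmlearn model exposes a log-likelihood through its score method, which is the quantity needed by the classification rule introduced next.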


Figure 2: Human motion is encoded as a hidden Markov model. This model is defined as a set of parameters of node transitions and data distributions.


Specifically, the probability P(X|λ) of motion data X being generated by an HMM λ should be large when that motion data is similar to the HMM's training data. The motion data is thereby classified as the HMM λ_R with the highest probability among a set of HMMs S = {λ_1, λ_2, · · · , λ_K}:

    λ_R = arg max_{λ ∈ S} P(X|λ)                        (1)

This probability P(X|λ) can be efficiently computed by the recursive forward algorithm, namely,

    α_i(1) = π_i b_i(x_I(1))                            (2)
    α_i(t) = Σ_j α_j(t−1) a_ji b_i(x_I(t))              (3)
    P(X|λ) = Σ_i α_i(T),                                (4)


where b_i(x) is the probability of data x being generated from the ith node, and α_i(t) is the probability of the partial sequence up to time t being generated with the model at the ith node at time t. T is the length of the motion data X.
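As a reference point, the following sketch implements the forward recursion of Eqs. (2)-(4) with per-step rescaling to avoid numerical underflow on long sequences, together with the classification rule of Eq. (1). It is a minimal sketch under the assumption that the per-node output probabilities b_i(x) are supplied as a callable; the function and variable names are not taken from the paper.

```python
import numpy as np

def forward_log_likelihood(X, pi, A, emission_logprob):
    """Return log P(X | lambda) for an HMM lambda = {A, B, Pi} via the forward algorithm.

    X: (T, D) integrated motion, pi: (N,) initial node probabilities Pi,
    A: (N, N) transition matrix with entries a_ij,
    emission_logprob(x): callable returning the (N,) vector of log b_i(x).
    Per-step rescaling keeps the recursion numerically stable.
    """
    log_b = emission_logprob(X[0])
    alpha = pi * np.exp(log_b - log_b.max())              # Eq. (2), up to a scale factor
    log_lik = log_b.max() + np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, len(X)):
        log_b = emission_logprob(X[t])
        alpha = (alpha @ A) * np.exp(log_b - log_b.max()) # Eq. (3), rescaled
        log_lik += log_b.max() + np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_lik                                        # logarithm of Eq. (4)

def classify(X, log_likelihoods):
    """Eq. (1): pick the category whose HMM assigns the largest log P(X | lambda).

    log_likelihoods: dict mapping a category name to a callable X -> log P(X | lambda),
    e.g. a closure around forward_log_likelihood or an hmmlearn model's score method.
    """
    return max(log_likelihoods, key=lambda name: log_likelihoods[name](X))
```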

4. Experiments

We investigated the importance of integrating human whole-body motions with finger motions of both hands for the classification of human actions, testing our proposed method on human whole-body motion data and finger motion data. We created a human figure model with 34 degrees of freedom in the body and 25 degrees of freedom in each hand. The body joint arrangement followed our previous research [35]. Twenty-one joints were arranged for each hand; Figure 3 shows the arrangement, joint names, and joint types. We converted the measurement data to the angles of all body and hand joints in the human figure model. Human whole-body motions were recorded using an optical motion capture system with 12 infrared cameras installed in a laboratory. The camera system captured the positions of 35 markers attached to a performer according to the Helen Hayes marker set placement [36]. These positions were converted to the angles of all body joints using forward and inverse kinematic algorithms. The performer also wore commercial data gloves provided by CyberGlove Systems, each equipped with 22 strain gauges. The data gloves output the joint angles of both hands from the measured strains.


[Figure 3 image: hand joint arrangement. 1. Palm; 2. ThumbMP; 3. ThumbPIP; 4. ThumbDIP; 5. ThumbTIP; 6. IndexMP; 7. IndexPIP; 8. IndexDIP; 9. IndexTIP; 10. MiddleMP; 11. MiddlePIP; 12. MiddleDIP; 13. MiddleTIP; 14. MedicinalMP; 15. MedicinalPIP; 16. MedicinalDIP; 17. MedicinalTIP; 18. LittleMP; 19. LittlePIP; 20. LittleDIP; 21. LittleTIP. Joint types: fixed, rotational, spherical.]

Figure 3: Twenty-one joints were arranged for each hand. The color of each circle denotes the joint type: fixed joint (red), rotational joint (green), or spherical joint (blue). The hand has 25 degrees of freedom.


The data gloves on both hands measured all finger joint angles in synchronization with the optical motion capture system. Figure 4 shows the setup for measuring body and finger motions. We applied these synchronized measurements to the human figure model to create the integrated human motion shown in Fig. 5. We recorded four trials of each of 32 motion patterns performed by one performer, resulting in a dataset containing 128 motions. Table 1 lists the motion patterns, and Fig. 6 shows examples of the measured motions "blow nose," "drink from a bottle," "sit on a chair," "close a bottle," "sweep with a broom," and "throw away garbage." We compared motion classifications based only on human whole-body motions with classifications based on both body and finger motions. We began by testing classification of human whole-body motions. We created datasets by dividing the whole-body motion data from the 128 trials into four sets, ensuring that each set included at least one trial of each motion pattern. We used three sets as training data and the remaining set as test data, and conducted cross-validation tests through iterations in which each set was used as test data once. Specifically, the three training trials of each motion pattern were encoded into an individual HMM, so 32 HMMs were created in the training phase. The motion data had been manually grouped into the 32 motion patterns in advance. In the testing phase, each test motion was classified into the HMM most likely to generate it, and a test motion classified into the same motion pattern as its manual grouping was counted as a correct classification. The classification rate was calculated as the ratio of the number of correctly classified test motions to the total number of test motions.
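To make the evaluation protocol concrete, the sketch below runs the four-fold cross-validation and accumulates the classification rate and the confusion matrices used in Figs. 7 and 8. It is a sketch, not the authors' code: the fold assignment (one trial of every pattern per fold) follows the text, while train_fn and loglik_fn are hypothetical hooks, for example the train_motion_hmms and forward_log_likelihood sketches from Section 3.

```python
import numpy as np

def cross_validate(trials, labels, folds, train_fn, loglik_fn, n_categories=32):
    """Cross-validate HMM-based motion classification over the 128 recorded motions.

    trials:  list of integrated motions X, each an array of shape (T_i, D)
    labels:  list of manually assigned category indices in 0..n_categories-1
    folds:   list of fold indices in 0..3, one trial of each pattern per fold
    train_fn(train_trials, train_labels) -> per-category models
    loglik_fn(models, category, X)       -> log P(X | lambda_category)
    Returns the overall classification rate and a confusion matrix whose entry
    (i, j) counts category-i test motions classified into category j.
    """
    confusion = np.zeros((n_categories, n_categories), dtype=int)
    for fold in sorted(set(folds)):
        train_idx = [k for k, f in enumerate(folds) if f != fold]
        test_idx = [k for k, f in enumerate(folds) if f == fold]
        models = train_fn([trials[k] for k in train_idx], [labels[k] for k in train_idx])
        for k in test_idx:  # classify each held-out motion with the rule of Eq. (1)
            scores = [loglik_fn(models, c, trials[k]) for c in range(n_categories)]
            confusion[labels[k], int(np.argmax(scores))] += 1
    rate = np.trace(confusion) / confusion.sum()
    return rate, confusion
```

Row-normalizing the returned confusion matrix gives the per-category probabilities visualized in Figs. 7 and 8.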



Figure 4: Thirty-five markers were attached to a performer. An optical motion capture system measured the marker positions for whole-body motions. The performer wore data gloves equipped with 22 strain gauges each. Finger postures were measured for finger motions.


In this experiment, 128 test motions were classified, with 117 classified into the correct motion category; the correct classification rate was thus 0.914. Figure 7 shows a confusion matrix illustrating the detailed classification results. The color of cell (i, j) is associated with the probability of motion data in category i being classified into category j. Of the four "drink from a mug" motions, two were classified into the correct category and two were wrongly classified as "drink from a bottle." Note, however, that both motion categories are related to a drinking motion. Finger movements when grasping a mug differ from those when grasping a bottle, but the whole-body posture is similar in both cases, which was the likely cause of this misclassification. Only one "open an umbrella" motion was correctly classified, with the three other motions misclassified as "close an umbrella." This too may be a result of the similar whole-body postures while holding an umbrella. We also tested classification of motions integrating human whole-body motions with finger motions. The datasets and procedures for the cross-validation tests were as described above. A total of 120 motions were correctly classified, giving a correct classification rate of 0.938. Figure 8 shows a confusion matrix for the classification of integrated human motions. All four "drink from a mug" motions were correctly classified, in contrast to the 0.5 classification rate when using only whole-body motion data. As another example, one "open a can" motion was misclassified as "pour water into a cup" when only whole-body motions were used, but including finger motions resulted in a correct classification.


Figure 5: Human whole-body motions were measured using an optical motion capture system. Finger motions were measured using data gloves synchronized with the whole-body motions. These motions were subsequently integrated for motion representation and classification.


The classification rate for the "open an umbrella" motion, however, remained unimproved at 0.25 even when finger motions were included. Overall, adding finger motions increased the classification rate by 0.023 compared with using whole-body motions alone. These experiments showed that introducing finger motion data aids human action classification, validating the use of integrated human motions.



Figure 6: Examples of integrated human motions ("blow nose," "drink from a bottle," "sit on a chair," "close a bottle," "sweep with a broom," and "throw away garbage"). Data gloves measured finger motions in synchronization with the optical motion capture system measuring whole-body motions, and the whole-body motions were integrated with the finger motions.


5. Conclusion

The results of this research are as follows:


1. We created a dataset of human whole-body motions and finger motions. The human whole-body motions were measured using an optical motion capture system, and the finger motions were measured using data gloves synchronized with those whole-body motions. The whole-body and finger motions were expressed by all the joint angles of a human figure model with 34 degrees of freedom in the body and 25 degrees of freedom in each hand model. The dataset included four trial motions for each of 32 motion patterns, resulting in 128 motion recordings.

2. We integrated the human whole-body motions with the finger motions and designed a representation of the integrated human motions. More specifically, the integrated human motions were expressed by sequences of feature vectors whose entries are all the joint angles of the human figure model and both hand models. These integrated human motions were subsequently encoded as probabilistic graphical models, namely hidden Markov models (HMMs), which should generate motions similar to the training data. The HMMs were used as motion classifiers according to the probability of each HMM generating an observed human motion, and each observed motion was classified into the HMM most likely to generate the observation.


3. We conducted experiments classifying observations of human motions, comparing classifications based on human whole-body motions alone with those based on integrated human motions including both whole-body and finger motions. The classifier trained only on human whole-body motions achieved a classification rate of 0.914, whereas the classifier trained on both whole-body motions and finger motions achieved a classification rate of 0.938. More concretely, the "drink from a mug" motion is similar to the "drink from a bottle" motion, and these motions were wrongly classified when only human whole-body motions were used; however, they were correctly discriminated when the finger motions were used in addition to the whole-body motions. We thus demonstrated that the addition of finger motions improves the classification of human actions, and that finger motions are helpful for differentiating between two human actions involving similar whole-body postures.


We have extended the representation of human actions to include both whole-body motions and finger motions, and have constructed an action classifier based on this representation using HMMs. An HMM is a probabilistic generative model and can potentially be used not only as an action classifier but also as an action synthesizer. In future work, we will develop an action synthesizer based on HMMs trained on whole-body and finger motions so that a humanoid robot can control its whole-body and finger motions to manipulate objects with high dexterity.

Acknowledgements

This research was partially supported by a Grant-in-Aid for Challenging Research (Exploratory) (No. 17K20000) and a Grant-in-Aid for Scientific Research (A) (No. 17H00766) from the Japan Society for the Promotion of Science.

References


[1] S. Sugano and I. Kato. Wabot-2: Autonomous robot with dexterous finger-arm coordination control in keyboard performance. In Proceedings of the 1987 IEEE International Conference on Robotics and Automation, volume 4, pages 90–97, 1987. [2] Y. Kuroki, M. Fujita, T. Ishida, K. Nagasaka, and J. Yamaguchi. A small biped entertainment robot exploring attractive applications. In Proceedings of the IEEE International Conference on Robotics and Automation, volume 1, pages 471–476, 2003. [3] K. Kaneko, F. Kanehiro, S. Kajita, K. Yokoyama, K. Akachi, T. Kawasaki, S. Ota, and T. Isozumi. Design of prototype humanoid robotics platform for HRP. In Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 3, pages 2431–2436, 2002. [4] G. Cheng, S. H. Hyon, J. Morimoto, A. Ude, J. G. Hale, G. Colvin, W. Scroggin, and S. C. Jacobsen. CB: A humanoid research platform for exploring neuroscience. Advanced Robotics, 21(10):1097–1114, 2007. [5] M. Okada, K. Tatani, and Y. Nakamura. Polynomial design of the nonlinear dynamics for the brain-like information processing of whole body motion. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 1410–1415, 2002. [6] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Learning control policies for movement imitation and movement recognition. Advances in Neural Information Processing Systems, 15:1547–1554, 2003. [7] J. Tani and M. Ito. Self-organization of behavioral primitives as multiple attractor dynamics: A robot experiment. IEEE Transactions on Systems, Man and Cybernetics Part A: Systems and Humans, 33(4):481–488, 2003.


[8] H. Kadone and Y. Nakamura. Symbolic memory for humanoid robots using hierarchical bifurcations of attractors in nonmonotonic neural networks. In Proceedings of the 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2900–2905, 2005. [9] T. Inamura, I. Toshima, H. Tanie, and Y. Nakamura. Embodied symbol emergence based on mimesis theory. International Journal of Robotics Research, 23(4):363–377, 2004. [10] A. Billard, S. Calinon, and F. Guenter. Discriminative and adaptive imitation in uni-manual and bi-manual tasks. Robotics and Autonomous Systems, 54:370–384, 2006. [11] T. Asfour, F. Gyarfas, P. Azad, and R. Dillmann. Imitation learning of dual-arm manipulation task in humanoid robots. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots, pages 40–47, 2006. [12] D. Kulic, H. Imagawa, and Y. Nakamura. Online acquisition and visualization of motion primitives for humanoid robots. In Proceedings of the 18th IEEE International Symposium on Robot and Human Interactive Communication, pages 1210–1215, 2009. [13] K. Sugiura, N. Iwahashi, H. Kashioka, and S. Nakamura. Active learning of confidence measure function in robot language acquisition framework. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1774–1779, 2010. [14] I. Mordatch, K. Lowrey, G. Andrew, Z. Popovic, and E. Todorov. Interactive control of diverse complex characters with neural networks. In Proceedings of Advances in Neural Information Processing Systems 28, pages 3132–3140, 2015. [15] C. Rose, B. Bodenheimer, and M. F. Cohen. Verbs and adverbs: Multidimensional motion interpolation. IEEE Computer Graphics and Application, 18(5):32–40, 1998. [16] O. Arikan, D. A. Forsyth, and J. F. O’Brien. Motion synthesis from annotations. ACM Transactions on Graphics, 22(3):402–408, 2003. [17] W. Takano, K. Yamane, and Y. Nakamura. Capture database through symbolization, recognition and generation of motion patterns. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 3092–3097, 2007. [18] W. Takano, D. Kulic, and Y. Nakamura. Interactive topology formation of linguistic space and motion space. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1416–1422, 2007. [19] Y. Sugita and J. Tani. Learning semantic combinatoriality from the interaction between linguistic and behavioral processes. Adaptive Behavior, 18(1):33–52, 2005. [20] T. Ogata, M. Murase, J. Tani, K. Komatani, and H. G. Okuno. Two-way translation of compound sentences and arm motions by recurrent neural networks. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1858–1863, 2007. [21] H. Arie, T. Endo, S. Jeong, M. Lee, S. Sugano, and J. Tani. Interactive learning between language and action: A neuro-robotics experiment. In Proceedings of the 20th International Conference on Artificial Neural Networks, pages 256–265, 2010. [22] T. Ogata and H. G. Okuno. Integration of behaviors and languages with a hierarchical structure self-organized in a neuro-dynamical model. In Proceedings of the IEEE Symposium Series on Computational Intelligence, pages 94–100, 2013. [23] W. Takano and Y. Nakamura. Integrating whole body motion primitives and natural language for humanoid robots. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots, pages 708–713, 2008. [24] W. Takano and Y. Nakamura. 
Statistically integrated semiotics that enables mutual inference between linguistic and behavioral symbols for humanoid robots. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 646–652, 2009. [25] W. Takano and Y. Nakamura. Statistical mutual conversion between whole body motion primitives and linguistic sentences for human motions. International Journal of Robotics Research, 34(10):1314–1328, 2015. [26] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015. [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012. [28] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proceedings of Advances in Neural Information Processing Systems 27, pages 3104–3112, 2014. [29] M. Plappert, C. Mandery, and T. Asfour. Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks. Robotics and Autonomous Systems, 109:13–26, 2018. [30] T. Yamada, H. Matsunaga, and T. Ogata. Paired recurrent autoencoders for bidirectional translation between robot actions and linguistic descriptions. IEEE Robotics and Automation Letters, 3(4):3441–3448, 2018. [31] H. Ahn, T. Ha, Y. Choi, H. Yoo, and S. Oh. Text2action: Generative adversarial synthesis from language to action. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 5915–5920, 2018. [32] A. Gupta, A. Kembhavi, and L. S. Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10):1775– 1789, 2009.


[33] B. Yao and L. Fei-Fei. Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1691–1703, 2012. [34] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall, 1993. [35] W. Takano, Y. Yamada, and Y. Nakamura. Linking human motions and objects to language for synthesizing action sentences. Autonomous Robots, 43(4):913–925, 2019. [36] M. P. Kadaba, H. K. Ramakrishnan, and M. E. Wootten. Measurement of lower extremity kinematics during level walking. Journal of Orthopaedic Research, 8(3):383–392, 1990.


Table 1: Thirty-two motion patterns were performed by a single individual.

Motion index   Motion category            Motion index   Motion category
 1             blow nose                  17             split chopsticks
 2             catch a ball               18             close a bottle
 3             close an umbrella          19             sweep with a broom
 4             drink from a bottle        20             swing a racket underhanded
 5             drink from a mug           21             swing a racket overhead
 6             eat rice                   22             swing a bat
 7             play with a smartphone     23             tear a piece of paper
 8             lift a box                 24             throw a ball
 9             use a mouse                25             throw away garbage
10             open a bottle              26             type on a keyboard
11             open a can                 27             use a fan
12             pour water into a cup      28             clean with a vacuum
13             pump a shampoo bottle      29             wipe a dish
14             open an umbrella           30             wipe face
15             read a book                31             wring a towel
16             sit on a chair             32             write with a pen



Figure 7: Whole-body motion data were classified into the motion category (HMM) with the highest probability of the motion data being generated. The color of cell (i, j) is associated with the probability of motion data in category i being classified into category j.



Figure 8: Integrated human motions were classified into the motion category (HMM) with the highest probability of generating that motion data. The color of cell (i, j) is associated with the probability of the motion data in category i being classified into category j.


Wataru Takano is a Specially Appointed Professor at the Center for Mathematical Modeling and Data Science, Osaka University. He was born in Kyoto, Japan, in 1976. He received B.S. and M.S. degrees in precision engineering from Kyoto University, Japan, in 1999 and 2001, and a Ph.D. degree in Mechano-Informatics from the University of Tokyo, Japan, in 2006. He worked as a Project Assistant Professor, an Assistant Professor, and an Associate Professor in the Department of Mechano-Informatics, the University of Tokyo, from 2006 to 2015, and as a Researcher for the Project of Information Environment and Humans, PRESTO, Japan Science and Technology Agency, from 2010 to 2014. His fields of research include the kinematics, dynamics, and artificial intelligence of humanoid robots, and intelligent vehicles. He is a member of the IEEE, the Robotics Society of Japan, and the Information Processing Society of Japan. He has been the chair of the Technical Committee on Robot Learning of the IEEE Robotics and Automation Society.


Yusuke Murakami received his B.S. and M.S. degrees in Mechano-Informatics from the University of Tokyo, Japan, in 2015 and 2017, respectively. His research interest is computation of the kinematics and dynamics of humanoid robots.


Yoshihiko Nakamura is a Professor in the Department of Mechano-Informatics, School of Information Science and Technology, the University of Tokyo. He was born in Osaka, Japan, in 1954. He received B.S., M.S., and Ph.D. degrees in precision engineering from Kyoto University, Japan, in 1977, 1978, and 1985, respectively. He was an Assistant Professor at the Automation Research Laboratory, Kyoto University, from 1982 to 1987. He joined the Department of Mechanical and Environmental Engineering, University of California, Santa Barbara, in 1987 as an Assistant Professor, and became an Associate Professor in 1990. He was also a co-director of the Center for Robotic Systems and Manufacturing at UCSB. He moved to the University of Tokyo as an Associate Professor in the Department of Mechano-Informatics in 1991. His fields of research include the kinematics, dynamics, control, and intelligence of robots, particularly robots with non-holonomic constraints, computational brain information processing, humanoid robots, human-figure kinetics, and surgical robots. He is a member of the IEEE, ASME, SICE, the Robotics Society of Japan, the Institute of Systems, Control and Information Engineers, and the Japan Society of Computer Aided Surgery. He was honored with a fellowship from the Japan Society of Mechanical Engineers. Since 2005, he has been the president of the Japan IFToMM Congress. He is a foreign member of the Academy of Engineering in Serbia and Montenegro.


Highlights

• This paper proposes a representation of integrated motion including human whole-body motion and finger motions.
• The integrated motions are encoded into probabilistic models that work as an action classifier.
• Experiments demonstrate the validity of including finger motions for action classification.


Robotics and Autonomous Systems

Dear Editors,

This manuscript has not been published or presented elsewhere in part or in entirety, and is not under consideration by another journal. This research was supported by a Grant-in-Aid for Challenging Research (Exploratory) (No. 17K20000) and a Grant-in-Aid for Scientific Research (A) (No. 17H00766) from the Japan Society for the Promotion of Science. There are no other conflicts of interest to declare.

Sincerely,
Wataru Takano
Osaka University, 1-3 Machikaneyamacho, Toyonaka, Osaka, Japan
Tel.: +81-6-6850-8298
[email protected]