Human–Robot Interaction based on Gesture and Movement Recognition

Xing Li*
State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, Liaoning 110819, China
[email protected]
*Corresponding author

Signal Processing: Image Communication (2019), doi: https://doi.org/10.1016/j.image.2019.115686


Abstract- Human-robot interaction (HRI) has become a research hotspot in computer vision and robotics due to its wide application in the human-computer interaction (HCI) domain. Based on gesture recognition and limb movement recognition algorithms for somatosensory interaction, an HRI model for robotic arm manipulation is proposed. More specifically, a 3D SSD architecture is used to locate and identify gestures and arm movements. Then, a DTW template matching algorithm is adopted to recognize dynamic gestures. Interactive scenarios and interaction modes are designed for the experiments and implementation. Virtual interactive experimental results demonstrate the usefulness of our method.

Index Terms—gesture recognition; movement recognition; DTW; 3D SSD; human-robot interaction.


I. INTRODUCTION

The rapid development of robotics and the acceleration of industrialization have made robots increasingly valuable helpers for humans. In recent years, the mobile robot arm (that is, a movable platform carrying one or several robot arms) has been widely used, as an important branch of robotics, in medical centers, home services, and space exploration. However, in some complicated environments or for difficult tasks, the mobile robot arm alone cannot complete the work, and human cooperation with the robot arm becomes necessary. Researchers have proposed human-computer intelligent fusion methods based on human-computer interaction, pointing out that in a human-machine-environment integrated system, human intervention and coordination can effectively improve system performance. Therefore, combining human-computer interaction technology with the mobile robot arm can effectively improve its intelligence level and working ability. The main research goal of human-computer interaction is to realize natural interaction between human and robot, so that the robot can complete its tasks efficiently. The interaction between humans and robots will open up new horizons for human-machine interfaces and will revolutionize people's lifestyles and environments. Therefore, how to realize efficient cooperation between humans and the mobile robot arm through human-computer interaction technology is both a hotspot and a difficulty of robotics research.

Somatosensory interaction is a new kind of human-computer interaction technology. Through this technology, users can directly use their limbs to interact with devices or scenes, i.e., they can interact with the target objects or content without any additional control devices. According to the sensing modality, somatosensory interaction can be divided into three main categories: inertial sensing, optical sensing, and joint inertial-optical sensing. The inertial sensing method measures the motion signals of the user through a gravity sensor or an acceleration sensor attached to the user, and then converts these motion signals into control signals that drive the interaction object. The advantages of this interaction mode are high accuracy, reliability, and sensitivity, whereas the sensors are either cumbersome and not user-friendly, or expensive and therefore difficult to use widely. The optical sensing method extracts user motion or state information from images taken by optical sensors (i.e., cameras), then converts the extracted motion or state information into control signals that manipulate the movement of interactive objects. This interaction mode is natural, intuitive, operable, and non-intrusive, but the images are easily contaminated by noise such as illumination changes, background clutter, and motion blur. The extraction step therefore places higher demands on the algorithm and is also susceptible to occlusion by other users or by the user's own joints.

The most important requirement in the interaction between humans and robots is that robots recognize human behavior. The correct perception of human behavior directly affects the quality and efficiency of human interaction with robots. Recognition of expressions, gestures, and language is an important direction for human-robot interaction and is the basis for robots to correctly recognize human intentions. The current interactive control method for robots is mainly command-based, which is sometimes delayed, inflexible, and inconvenient. Some complicated work tasks require precise operations, whereas simple command control can only execute pre-defined combinations of basic operations and thus cannot accomplish such complex tasks. Moreover, in the command control mode, the operation task first needs to be interpreted as commands, and during execution the commands are restored to specific operations, so interpretation deviations are inevitable in between. Therefore, in human-computer interaction, how robots correctly recognize human intentions through gesture recognition technology is an important factor in improving the performance of human-computer interaction systems.

In recent years, the release of Microsoft Kinect has brought new opportunities to this field. Kinect devices can collect depth maps in real time. Compared with traditional color images, depth maps have many advantages. Firstly, the depth map sequence is essentially a four-dimensional (space plus time) signal and is not sensitive to changes in lighting conditions. Moreover, it contains more action information and allows human contours and skeletons to be estimated more reliably. In this paper, based on gesture recognition and limb movement recognition algorithms for somatosensory interaction, a somatosensory HCI model for robotic arm manipulation is proposed. The main contributions of this paper are as follows:

(1) In the model, the features of RGB video frames and depth images are extracted by a 3D SSD architecture for the location and identification of gestures and the arm.

(2) A dynamic gesture segmentation method is designed to determine the start and end of a dynamic gesture through the pause time of the palm. Then, the DTW template matching algorithm is used to identify dynamic gestures effectively and efficiently.

(3) Interactive scenarios and interaction modes are designed for the experiments and implementation. The man-machine interaction scene of the robotic arm is designed, and simulation of the virtual robot arm under somatosensory interaction control is realized. The HCI modes include the detection and identification of the static and dynamic gestures of the hands and the limbs.

II. RELATED WORKS

Human motion recognition is based on the analysis and understanding of human motion and draws on interdisciplinary fields such as biology, psychology, and multimedia. Human motion recognition includes human motion feature extraction and classification.

A. Gesture Identification

A. Malima et al. [1] use the R/G slope as a color feature to establish a bandwidth model of gesture skin color, which enables detection of the human hand area. Based on the chrominance and saturation information in the HSI color space, E. Sánchez et al. [2] established a rectangular model of gesture skin color and realized gesture segmentation. C. C. Wang [3] proposed a gesture recognition method based on the discrete AdaBoost learning algorithm and SIFT features to improve the naturalness of human interaction with robots; it addresses background noise and rotational deformation in training images and realizes multi-view gesture detection and accurate recognition of multiple gestures. M. Tang et al. [4] proposed a Kinect-based gesture recognition model, which processes RGB image information and depth data with the SURF algorithm to achieve high-accuracy gesture recognition and improve the intelligence of human-computer interaction. Aiming at the problems of traditional gesture recognition systems, such as sensitivity to lighting conditions, a large amount of computation, and a long training process, A. Ramey et al. [5] build a human skeleton model from the depth data of an RGB-D camera, extract gestures from the 3D skeleton model, and use a finite state machine to encode the direction of the gesture in different states; finally, the model realizes natural human-computer interaction by using a template-based classifier to recognize the gesture.

M. Van den Bergh et al. proposed a Haarlet-based gesture recognition system, which detects the 3D pointing direction of gestures from the depth information acquired by Kinect and then converts gestures into interactive commands to realize real-time human-computer interaction [6]. B. Burger et al. [7] designed a multi-target tracker for a home robot to track two-hand gestures and the three-dimensional position of the head, giving the robot visual perception, speech processing, and multi-channel communication capabilities. J. Gast et al. [8] proposed a new real-time processing framework for multi-channel data, which is used to analyze the interaction factors and dynamic characteristics in human-computer interaction and to create an HCI system by recording speech, facial expressions, gestures, and other physiological data; the HCI system takes human factors into account and identifies and reacts to these parameters. P. H. Kahn et al. [9] studied pattern design in human-robot interaction and designed interaction patterns for robots ranging from initial social self-introduction, through teaching exchange and human-machine interaction, to physical intimate contact; these patterns laid the foundation for higher-level man-machine interaction modes.

Chen Q. et al. used Haar-like features and the low-level statistical AdaBoost learning algorithm to detect gestures and obtain a sequence of strings; then, based on high-level syntax analysis, the most probable gestures were determined through the corresponding rules [10]. Rautaray S. S. obtained the gesture image with an optical flow method, used PCA to obtain the feature vector of the gesture image, and finally applied KNN for gesture recognition to control a media player [11]. Thangali A. et al. used a deformed Bayesian network to supervise learning of the hand shape information of gestures and obtain the model parameters, then used a non-rigid image alignment algorithm to calculate the image observation probability of the model, finally enhancing robustness to hand shape changes; the hand shape inference algorithm was tested on a database containing 1500 sign language vocabularies with sufficiently good performance [12]. Cooper H. et al. recognize sign language based on three basic units of data: appearance data, 2D tracking data, and 3D tracking data; two gesture-level classifiers are then utilized to combine the basic units, and the correlation and discriminative features of the basic units, modeled with Markov models and sequential pattern boosting, achieve good robustness and recognition rates [13]. Wang X. et al. used the AdaBoost algorithm and HOG features for hand detection, adopted a condensation particle filter with partitioned sampling to track the hand, and exploited a cubic B-spline to fit the hand contour points into a curve that serves as the global feature; the trajectory of the gesture together with the direction of the hand serves as the local feature, and an HMM is used to identify the dynamic gesture [14].

B. Kinect-based Robotics Research

Due to its low price, high performance, and convenient development environment, many researchers have used Kinect for research and development in related fields. In particular, Kinect has produced notable research results as a new sensor for robotics.


Y. Zou et al. [15] proposed an indoor positioning and 3D scene reconstruction method using the Kinect sensor, which can simultaneously estimate the position and orientation of a handheld Kinect and generate a dense 3D model of the indoor environment; the system can process complex indoor environment data in real time. J. Biswas et al. [16] introduced the Fast Sampling Plane Filtering (FSPF) algorithm, which classifies local point sets by sampling from depth images, filters points on planes and points that do not match the specified error, and reduces the dimensionality of the 3D point cloud; an efficient real-time environmental map was constructed to realize the positioning and navigation of indoor robots. W. B. Song et al. [17] designed a humanoid robot control system based on remote operation with the Kinect sensor, in which the computer communicates with the robot directly by means of body motion, thereby controlling the remote robot in a real-time, coordinated, and unified way. J. Hartmann et al. [18] proposed a Kinect-based method for simultaneous localization and mapping; the oriented FAST and rotated BRIEF (ORB) algorithm is used to extract features from the depth information acquired by Kinect, which realizes efficient visual SLAM and accurate positioning. I. Siradjuddin et al. [19] proposed a position-based visual tracking system for a redundant robot; in the system, the Kinect is used to obtain the 3D information of the target object, a Kalman filter is used to predict the position and velocity of the target, and the position-based visual servo control algorithm is simplified; the stability of the method is demonstrated by controlling a 7-DOF PowerCube robot. M. Schröder et al. [20] proposed a real-time human hand tracking and pose estimation method to control a humanoid robot through gestures; based on the depth information obtained by Kinect, a color-sensitive iterative closest point algorithm on a triangle mesh accurately estimates the absolute position and orientation of the hand, finally realizing the human-computer interaction task of controlling a humanoid robot with 20 degrees of freedom.


III. METHODS

Fig 1. The architecture of the HCI model for the robotic arm based on somatosensory motion. The input data are captured by Kinect sensors observing the user's arm movements. The model then performs three tasks: static gesture recognition, arm movement recognition, and dynamic gesture recognition. Finally, the recognition results are used to control the robotic arm accurately.

We propose a vision-based HCI architecture for a robotic arm that identifies somatosensory motion, as shown in Fig. 1. The input of the model is collected by Kinect sensors observing human body movement. The model then captures static information by conducting static gesture recognition and arm movement recognition. This static information is used to move the robotic arm into an approximate gesture and posture. The ongoing dynamic gesture recognition task is then utilized to fine-tune the gesture and posture.

A. Architecture of the Identifier for Static Gesture and Arm Movement

Motivated by the advantages of the SSD architecture, such as fast recognition speed and high identification accuracy, and in order to extract not only the spatial features of single video frames but also the temporal features of short sequences, we adopt a 3D version of SSD to process image sequences as the detector and identifier for gestures and arm postures. The image sequences include an RGB image sequence and a depth image sequence.

Fig 2. The 3D SSD architecture for identifying static gestures and arm movements


In order to conduct three-dimensional image detection more efficiently, the model adopts convolutional and deconvolutional layers with kernels of size 3×3 and 1×1. Moreover, max-pooling layers with a stride of 2 are used to downsample the input feature maps to half their size. In addition, the architecture utilizes a fusion mechanism for feature maps of different resolutions. This mechanism adds two direct interlayer pathways, one between the third and fifth convolutional layers and one between the fourth and sixth convolutional layers. These pathways not only decrease the memory consumption for weight parameters but also improve detection accuracy at the pixel level. To further reduce memory cost, the architecture performs dimensionality reduction on the feature maps with 1×1 convolutional layers after feature fusion.
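To make this design concrete, a minimal PyTorch-style sketch is given below. It is not the author's exact network: the number of blocks, the channel widths, the input clip size, and the choice of fusing the second with the fourth block (the paper fuses the third with the fifth and the fourth with the sixth convolutional layers) are assumptions made for brevity. It only illustrates the three ingredients described above: 3D convolutions, stride-2 max pooling, and interlayer fusion followed by a 1×1×1 channel-reducing convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv3dBlock(nn.Module):
    """3x3x3 convolution + ReLU, then stride-2 max pooling that halves every dimension."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(F.relu(self.conv(x)))

class Fused3DBackbone(nn.Module):
    """Toy 3D backbone with one interlayer fusion pathway (block 2 -> block 4 here)."""
    def __init__(self, in_ch=4, widths=(16, 32, 64, 128)):
        super().__init__()
        self.blocks = nn.ModuleList()
        prev = in_ch
        for w in widths:
            self.blocks.append(Conv3dBlock(prev, w))
            prev = w
        # 1x1x1 convolution that reduces channels again after the fused concatenation.
        self.reduce = nn.Conv3d(widths[1] + widths[3], widths[3], kernel_size=1)

    def forward(self, x):
        feats = []
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)
        # Bring the earlier (higher-resolution) map to the later map's resolution,
        # concatenate along the channel axis, then reduce with the 1x1x1 convolution.
        skip = F.adaptive_max_pool3d(feats[1], tuple(feats[3].shape[2:]))
        fused = torch.cat([skip, feats[3]], dim=1)
        return self.reduce(fused)

# Example input: 2 clips, 4 channels (RGB + depth), 16 frames of 64x64 pixels.
clip = torch.randn(2, 4, 16, 64, 64)
print(Fused3DBackbone()(clip).shape)   # torch.Size([2, 128, 1, 4, 4])
```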

After the final feature maps are extracted, three anchors with scales of [5, 10, 20] are placed at each pixel of the last output feature map. In order to speed up the execution of the model, only two anchor shapes, with aspect ratios of 2:1 and 1:2, are considered for containing hand gestures or the arm. For each anchor, the classification loss Lcls and the regression loss Lreg are combined to update the parameters of the model. Lcls is related to the probability that the bounding box contains an object, while Lreg measures the regression loss of the offsets of the coordinates and side length of the bounding box. The multi-task loss function of the i-th anchor is defined as:

L(pi, ci) = γ·Lcls(pi, pi*) + pi*·Lreg(ci, ci*)    (1)

where pi and pi* are the predicted probability and the ground truth label, while ci and ci* are the predicted relative coordinates and relative side length of the i-th anchor and their ground truth. Without loss of generality, we define the vector ci as:

ci = ( (x − xa)/la , (y − ya)/la , log(l/la) )    (2)

where x and y denote the predicted coordinates of the bounding box and l is its side length, and (xa, ya, la) are the coordinate and scale values of the i-th anchor. The loss Lcls is calculated by the binary cross-entropy function, while Lreg employs an L1 regression loss.
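As a rough numerical illustration of Eqs. (1) and (2) (a sketch, not the training code used in the paper), the snippet below encodes a predicted box and a ground-truth box relative to an anchor and combines binary cross-entropy with an L1 term. The value of γ and the convention that both boxes are encoded against the anchor are assumptions.

```python
import torch
import torch.nn.functional as F

def encode_offsets(box, anchor):
    """Eq. (2): relative offsets of a square box (x, y, l) w.r.t. an anchor (xa, ya, la)."""
    x, y, l = box
    xa, ya, la = anchor
    return torch.stack([(x - xa) / la, (y - ya) / la, torch.log(l / la)])

def anchor_loss(p, p_star, c, c_star, gamma=1.0):
    """Eq. (1): binary cross entropy for objectness plus L1 regression for positive anchors.

    p        : predicted objectness probability (scalar tensor in (0, 1))
    p_star   : ground-truth label, 1.0 for a positive anchor, 0.0 otherwise
    c, c_star: predicted / target offset vectors from encode_offsets
    gamma    : weight balancing the two terms (value assumed; not given in the paper)
    """
    l_cls = F.binary_cross_entropy(p, p_star)
    l_reg = F.l1_loss(c, c_star)          # only counted when the anchor is positive
    return gamma * l_cls + p_star * l_reg

# Toy usage with one positive anchor.
pred_box = torch.tensor([52.0, 40.0, 22.0])   # predicted (x, y, l)
gt_box   = torch.tensor([50.0, 42.0, 20.0])   # ground-truth (x, y, l)
anchor   = torch.tensor([48.0, 44.0, 20.0])   # anchor (xa, ya, la)
c      = encode_offsets(pred_box, anchor)
c_star = encode_offsets(gt_box, anchor)
loss = anchor_loss(torch.tensor(0.8), torch.tensor(1.0), c, c_star)
```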

C. Dynamic Gesture Recognition

Dynamic gestures involve not only changes of the hand shape but also changes in the temporal and spatial dimensions, such as position and speed, which makes the recognition of dynamic gestures more complicated. However, compared with the change of hand shape, the position of the hand and its change have a more decisive effect on the meaning of a dynamic gesture. Therefore, in dynamic gesture recognition, the position of the hand is extracted and used to identify the gesture, while the hand shape information is usually ignored. The algorithms used for dynamic gesture recognition mainly include DTW [21] and HMM [22], [23].

The dynamic time warping (DTW) method is a template matching method that can calculate the distance between two temporal sequences of different lengths. The algorithm can be formalized as follows. There are two temporal sequences R and T, whose lengths are M and N, respectively: R = (r1, r2, ..., rM) is the reference template and T = (t1, t2, ..., tN) is the sequence to be tested. The DTW distance is a similarity measure: the higher the similarity between R and T, the smaller the DTW distance; conversely, the lower the similarity, the larger the DTW distance.

The distance between the two sequences is described by the distance matrix D = { dij | 1 ≤ i ≤ N, 1 ≤ j ≤ M }, where dij is the distance between components ti and rj. The DTW distance between R and T corresponds to the optimal path in matrix D, i.e., the path for which the sum of the costs of all points passed is smallest. The optimal path can be solved recursively. Let TC(i, j) denote the total cost of the optimal path from point (1, 1) to point (i, j). The recursive formula is:

TC(i, j) = d11,  if i = 1 and j = 1
TC(i, j) = min{ TC(i−1, j) + dij,  TC(i−1, j−1) + 2·dij,  TC(i, j−1) + dij },  otherwise    (3)

where out-of-range terms are excluded from the minimum. The final TC(N, M) is the DTW distance between the sequence R and the sequence T.
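A direct implementation of this recursion might look as follows. This is a plain NumPy sketch: the Euclidean distance between trajectory points is used as the local cost and out-of-range predecessors are simply skipped, neither of which is specified in the paper.

```python
import numpy as np

def dtw_distance(T, R, dist=lambda a, b: np.linalg.norm(a - b)):
    """DTW distance between a test trajectory T (length N) and a template R (length M),
    following the recursion in Eq. (3) with the (1, 2, 1) step weights."""
    N, M = len(T), len(R)
    d = np.array([[dist(np.asarray(T[i]), np.asarray(R[j])) for j in range(M)]
                  for i in range(N)])
    TC = np.full((N, M), np.inf)
    TC[0, 0] = d[0, 0]
    for i in range(N):
        for j in range(M):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append(TC[i - 1, j] + d[i, j])          # vertical step
            if i > 0 and j > 0:
                candidates.append(TC[i - 1, j - 1] + 2 * d[i, j])  # diagonal step (weight 2)
            if j > 0:
                candidates.append(TC[i, j - 1] + d[i, j])          # horizontal step
            TC[i, j] = min(candidates)
    return TC[N - 1, M - 1]

# Example: two 2-D palm trajectories of different lengths.
template = [(0, 0), (1, 1), (2, 2), (3, 3)]
test     = [(0, 0), (0.9, 1.2), (2.1, 1.9), (2.5, 2.4), (3.1, 3.0)]
print(dtw_distance(test, template))
```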

IV. EXPERIMENTS AND ANALYSIS

In this section, we present the implementation details and experimental results of the proposed model.

A. Gesture and Arm Motion Recognition Datasets

The MSRC-12 Kinect Gesture dataset and the 2013 ChaLearn Multimodal Gesture Recognition (CMGR) dataset [36] are chosen as the training sets for gesture and arm motion recognition.

The videos in both datasets were recorded with a Kinect camera. The MSRC-12 dataset consists of 594 sequences of human movements collected from 30 people performing 12 labeled gestures. The CMGR dataset is a large video database including 13,858 gestures from 20 Italian gesture categories. For arm motion recognition, the region coordinates and the labels of the arm motions are annotated manually.

For the identifier of static gestures and arm movement, each anchor is labeled with one of three values, [-1, 0, 1], for binary classification. Anchors with an IoU over 0.7 are labeled 1, anchors with an IoU less than 0.3 are considered negative and labeled -1, and the remaining anchors are ignored with label 0 during training. After the confidence coefficients of all anchors are calculated, the model applies non-maximum suppression to anchors with coefficients greater than 0.5 to merge the detected targets and obtain the final detection and identification results. During training, the Adam optimization method is adopted for 150 epochs with a batch size of 12. In order to increase the convergence rate of the model, the initial learning rate α was set to 0.01 and β to [0.9, 0.99].
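The IoU-based labeling rule can be sketched as follows. Square boxes are assumed to be given as centre coordinates plus side length, matching the (x, y, l) parameterization of Section III-A; the exact box format and the vectorized implementation used in the paper are not specified.

```python
import numpy as np

def iou_square(a, b):
    """IoU of two axis-aligned squares given as (x_center, y_center, side)."""
    ax, ay, al = a
    bx, by, bl = b
    ix = max(0.0, min(ax + al / 2, bx + bl / 2) - max(ax - al / 2, bx - bl / 2))
    iy = max(0.0, min(ay + al / 2, by + bl / 2) - max(ay - al / 2, by - bl / 2))
    inter = ix * iy
    return inter / (al * al + bl * bl - inter)

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label 1 (positive) if best IoU > 0.7, -1 (negative) if best IoU < 0.3, else 0 (ignored)."""
    labels = np.zeros(len(anchors), dtype=int)
    for k, anc in enumerate(anchors):
        best = max(iou_square(anc, gt) for gt in gt_boxes) if gt_boxes else 0.0
        if best > pos_thr:
            labels[k] = 1
        elif best < neg_thr:
            labels[k] = -1
    return labels

anchors = [(32, 32, 20), (32, 32, 14), (100, 100, 20)]
gts     = [(31, 33, 20)]
print(label_anchors(anchors, gts))   # -> [ 1  0 -1]
```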

DTW-based dynamic gesture recognition first uses the Kinect sensor to obtain the user's joint information and extract the palm position. After the palm position is obtained, dynamic gesture segmentation is performed. If the user demonstrates a dynamic gesture, the segmentation step yields a palm trajectory sequence, denoted by T. The next step is to compare T with the samples in the dynamic gesture library. If T is similar to a sample in the library, T is considered to belong to the same dynamic gesture category as that sample.

The p dynamic gesture template sequences in the sample library are represented as (r1, r2, ..., rp), corresponding to the dynamic gesture categories (c1, c2, ..., cp). Each template sequence has its own DTW distance threshold, and these thresholds are denoted (rt1, rt2, ..., rtp).

Classifying the trajectory sequence T into a particular dynamic gesture of the template library follows these steps. First, the DTW distance between T and the first template sequence is calculated. If the obtained DTW distance is less than the distance threshold rt1 of that template, T and template r1 are considered to belong to the same dynamic gesture category. Otherwise, the next template r2 is taken, and the DTW distance between T and r2 is calculated, continuing until the category of T is obtained. If no category has been assigned after the last template has been tested, the trajectory sequence T is considered not to be a meaningful dynamic gesture.
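The threshold test described above can be written compactly as below. Here dtw_distance stands for a DTW implementation such as the sketch in Section III, and the template trajectories, category names, and thresholds are placeholder values, not the ones used in the experiments.

```python
def classify_trajectory(T, templates, categories, thresholds, dtw_distance):
    """Return the category of the first template whose DTW distance to T falls below
    its own threshold, or None if no template matches (T is not a meaningful gesture)."""
    for r_k, c_k, rt_k in zip(templates, categories, thresholds):
        if dtw_distance(T, r_k) < rt_k:
            return c_k
    return None

# Hypothetical library of two templates with per-template thresholds.
templates  = [[(0, 0), (1, 0), (2, 0)],          # nominal "swipe right" trajectory
              [(0, 0), (0, 1), (0, 2)]]          # nominal "swipe up" trajectory
categories = ["swipe_right", "swipe_up"]
thresholds = [1.5, 1.5]

# 'dtw_distance' is the function from the earlier DTW sketch; 'palm_trajectory'
# would be the segmented palm trace obtained from the Kinect skeleton stream.
# category = classify_trajectory(palm_trajectory, templates, categories, thresholds, dtw_distance)
```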

C. Experimental Results

The virtual interaction of screwing a screw into a hole is shown in Fig 4. The virtual interaction window shows the changes of each target in the scene from multiple angles; the interactive interface is divided into a main window and multiple secondary windows.

In the real-time interactive experiments, static and dynamic gestures are performed with the left hand, while limb movements are performed with the right arm. The user's hands and limbs work together to complete the interaction process. The whole interaction process is as follows. First, the mechanical arm is controlled by arm movement and static gesture recognition and is moved close to the screw on the cabin; the motion mode of the mechanical arm is then switched to fine adjustment. After the screw is grasped, it moves together with the arm and is carried to the vicinity of the screw hole. After the screw is completely inserted into the screw hole, the motion mode of the arm is switched back to coarse movement; at the same time, the gripper of the robot arm is released by a dynamic gesture, and the mode is then switched again to fine adjustment. If the distance between the gripper and the target (screw or screw hole) is not appropriate, the wrist can be adjusted by dynamic gestures to control the back-and-forth movement. At any moment of the interaction, the main window can be switched by dynamic gestures so that the robot arm and the manipulated object can be observed from a better perspective to complete the interaction process.
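To make the mode-switching logic above concrete, the following is a hypothetical sketch of how recognized gestures could be mapped to arm commands. All gesture names, command fields, and the coarse/fine step sizes are invented for illustration and are not taken from the paper.

```python
class ArmInteractionController:
    """Toy controller: static gestures/arm motions drive the arm, while dynamic gestures
    switch the motion mode, grip or release, and change the active viewing window."""

    def __init__(self):
        self.mode = "coarse"                    # "coarse" or "fine" motion mode
        self.step = {"coarse": 5.0, "fine": 0.5}
        self.gripping = False

    def on_static(self, direction):
        # direction: e.g. "left", "right", "up", "down" from static gesture / arm recognition
        return {"command": "move", "direction": direction, "step": self.step[self.mode]}

    def on_dynamic(self, gesture):
        if gesture == "toggle_mode":
            self.mode = "fine" if self.mode == "coarse" else "coarse"
            return {"command": "set_mode", "mode": self.mode}
        if gesture == "grip_release":
            self.gripping = not self.gripping
            return {"command": "grip" if self.gripping else "release"}
        if gesture == "switch_window":
            return {"command": "switch_main_window"}
        return {"command": "noop"}

controller = ArmInteractionController()
print(controller.on_static("left"))          # coarse 5.0-unit move
print(controller.on_dynamic("toggle_mode"))  # switch to fine adjustment
print(controller.on_static("left"))          # fine 0.5-unit move
```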

Fig 4. Simulation of manipulating the robotic arm to screw into the hole


The experimental results demonstrate that the designed HCI robotic arm achieves the expected goal efficiently and effectively.

V. CONCLUSION

Human-robot interaction is a key technique in modern HCI systems. In this paper, in order to facilitate the development of human-computer interactive systems, we propose a vision-based HCI framework for robotic arm manipulation by recognizing somatosensory motion. Within the model, somatosensory interaction modes for dynamic and static gestures and body movements, based on SSD and DTW, are designed, and somatosensory interaction experiments with the robotic arm are conducted. Virtual interactive experimental results have demonstrated the usefulness of our method.

Acknowledgements

The authors acknowledge the support of the National Natural Science Foundation of China (Grants 61873054 and 61503070) and the Fundamental Research Funds for the Central Universities (N170804006).

REFERENCES

[1] Malima, A. K., Özgür, E., & Çetin, M. (2006). A fast algorithm for vision-based hand gesture recognition for robot control.
[2] Sánchez-Nielsen, E., Antón-Canalís, L., & Hernández-Tejera, M. (2004). Hand gesture recognition for human-machine interaction.
[3] Wang, C. C., & Wang, K. C. (2007). Hand posture recognition using AdaBoost with SIFT for human robot interaction. In Recent progress in robotics: viable robotic service to human (pp. 317-329). Springer, Berlin, Heidelberg.
[4] Tang, M. (2011). Recognizing hand gestures with Microsoft's Kinect. Palo Alto: Department of Electrical Engineering, Stanford University.
[5] Ramey, A., González-Pacheco, V., & Salichs, M. A. (2011, March). Integration of a low-cost RGB-D sensor in a social robot for gesture recognition. In 2011 6th ACM/IEEE International Conference on Human-Robot Interaction (HRI) (pp. 229-230). IEEE.
[6] Van den Bergh, M., Carton, D., De Nijs, R., Mitsou, N., Landsiedel, C., Kuehnlenz, K., ... & Buss, M. (2011, July). Real-time 3D hand gesture interaction with a robot for understanding directions from humans. In 2011 Ro-Man (pp. 357-362). IEEE.
[7] Burger, B., Ferrané, I., & Lerasle, F. (2008, May). Multimodal interaction abilities for a robot companion. In International Conference on Computer Vision Systems (pp. 549-558). Springer, Berlin, Heidelberg.
[8] Gast, J., Bannat, A., Rehrl, T., Wallhoff, F., Rigoll, G., Wendt, C., ... & Farber, B. (2009, May). Real-time framework for multimodal human-robot interaction. In 2009 2nd Conference on Human System Interactions (pp. 276-283). IEEE.
[9] Kahn, P. H., Freier, N. G., Kanda, T., Ishiguro, H., Ruckert, J. H., Severson, R. L., & Kane, S. K. (2008, March). Design patterns for sociality in human-robot interaction. In Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction (pp. 97-104). ACM.
[10] Chen, Q., Georganas, N. D., & Petriu, E. M. (2008). Hand gesture recognition using Haar-like features and a stochastic context-free grammar. IEEE Transactions on Instrumentation and Measurement, 57(8), 1562-1571.
[11] Rautaray, S. S., & Agrawal, A. (2010, December). A novel human computer interface based on hand gesture recognition using computer vision techniques. In Proceedings of the First International Conference on Intelligent Interactive Technologies and Multimedia (pp. 292-296). ACM.
[12] Thangali, A., Nash, J. P., Sclaroff, S., & Neidle, C. (2011, June). Exploiting phonological constraints for handshape inference in ASL video. In CVPR 2011 (pp. 521-528). IEEE.
[13] Cooper, H., Ong, E. J., Pugeault, N., & Bowden, R. (2012). Sign language recognition using sub-units. Journal of Machine Learning Research, 13(Jul), 2205-2231.
[14] Wang, X., Xia, M., Cai, H., Gao, Y., & Cattani, C. (2012). Hidden-Markov-models-based dynamic hand gesture recognition. Mathematical Problems in Engineering, 2012.
[15] Zou, Y., Chen, W., Wu, X., & Liu, Z. (2012, July). Indoor localization and 3D scene reconstruction for mobile robots using the Microsoft Kinect sensor. In IEEE 10th International Conference on Industrial Informatics (pp. 1182-1187). IEEE.
[16] Biswas, J., & Veloso, M. (2012, May). Depth camera based indoor mobile robot localization and navigation. In 2012 IEEE International Conference on Robotics and Automation (pp. 1697-1702). IEEE.
[17] Song, W., Guo, X., Jiang, F., Yang, S., Jiang, G., & Shi, Y. (2012, August). Teleoperation humanoid robot control system based on Kinect sensor. In 2012 4th International Conference on Intelligent Human-Machine Systems and Cybernetics (Vol. 2, pp. 264-267). IEEE.
[18] Hartmann, J., Forouher, D., Litza, M., Kluessendorff, J. H., & Maehle, E. (2012, May). Real-time visual SLAM using FastSLAM and the Microsoft Kinect camera. In ROBOTIK 2012; 7th German Conference on Robotics (pp. 1-6). VDE.
[19] Siradjuddin, I., Behera, L., McGinnity, T. M., & Coleman, S. (2012, June). A position based visual tracking system for a 7 DOF robot manipulator using a Kinect camera. In The 2012 International Joint Conference on Neural Networks (IJCNN) (pp. 1-7). IEEE.
[20] Schröder, M., Elbrechter, C., Maycock, J., Haschke, R., Botsch, M., & Ritter, H. (2012, November). Real-time hand tracking with a color glove for the actuation of anthropomorphic robot hands. In 2012 12th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2012) (pp. 262-269). IEEE.
[21] Jambhale, S. S., & Khaparde, A. (2014, February). Gesture recognition using DTW & piecewise DTW. In 2014 International Conference on Electronics and Communication Systems (ICECS) (pp. 1-5). IEEE.
[22] Wang, X., Xia, M., Cai, H., Gao, Y., & Cattani, C. (2012). Hidden-Markov-models-based dynamic hand gesture recognition. Mathematical Problems in Engineering, 2012.
[23] Elmezain, M., Al-Hamadi, A., & Michaelis, B. (2008). Real-time capable system for hand gesture recognition using hidden Markov models in stereo color image sequences.


Xing Li was born in 1982. She received the Dr. Eng. degree in power electronics and power transmission from Northeastern University. She is currently an Assistant Professor with the State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Liaoning, China. A number of her research results have gained the support of the National Natural Science Foundation and the Liaoning Natural Science Foundation. She is also a member of the Big Data Committee and the Process Control Committee of the Chinese Association of Automation. Her current research interests include adaptive/robust control, motion control, intelligent robot systems, etc.



There is no conflict of interest.