Robotics and Autonomous Systems 55 (2007) 643–657 www.elsevier.com/locate/robot
Adaptive visual gesture recognition for human–robot interaction using a knowledge-based software platform

Md. Hasanuzzaman a, T. Zhang a, V. Ampornaramveth a, H. Gotoda a, Y. Shirai b, H. Ueno a

a Intelligent Systems Research Division, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
b Department of Human and Computer Intelligence, School of Information Science and Engineering, Ritsumeikan University, 1-1-1 Nojihigashi, Kusatsu, Shiga 525-8577, Japan

Received 22 November 2005; received in revised form 20 December 2006; accepted 14 March 2007; available online 5 April 2007
Abstract

In human–human communication we can adapt to or learn new gestures and new users using intelligence and contextual information. To achieve natural gesture-based interaction between humans and robots, the system should be adaptable to new users, gestures and robot behaviors. This paper presents an adaptive visual gesture recognition method for human–robot interaction using a knowledge-based software platform. The system is capable of recognizing users, static gestures comprising face and hand poses, and dynamic gestures of the face in motion. It learns new users and poses using a multi-cluster approach, and combines computer vision and knowledge-based approaches in order to adapt to new users, gestures and robot behaviors. In the proposed method, a frame-based knowledge model is defined for person-centric gesture interpretation and human–robot interaction, and is implemented using the frame-based Software Platform for Agent and Knowledge Management (SPAK). The effectiveness of this method has been demonstrated by an experimental human–robot interaction system using the humanoid robot 'Robovie'.

© 2007 Elsevier B.V. All rights reserved.
Keywords: Adaptive visual gesture recognition; Human–robot interaction; Multi-cluster based learning; SPAK
1. Introduction

For the last few decades, robots have been used in factories to perform continuous, heavy or hazardous work. In recent years, human–robot symbiotic systems have been studied extensively due to the increasing demand for welfare services for the aged and handicapped as the younger population declines. To realize a symbiotic relationship between humans and robots, it is crucial to establish natural human–robot interaction. Ueno proposed the concept of Symbiotic Information Systems (SIS), as well as a symbiotic robotics system as one application of SIS, in which humans and robots communicate with each other in human ways using speech and gesture [1]. The objective of SIS is to allow non-expert users, who
might not even be able to operate a computer keyboard, to operate robots. It is therefore necessary that these robots be equipped with natural interfaces based on speech and gesture. Fig. 1 depicts an example scenario of a human–robot symbiotic system in which humans and robots interact with each other using the knowledge-based software platform SPAK (Software Platform for Agent and Knowledge Management) [2]. SPAK was developed at our laboratory as a platform on which various software components for different robotic tasks can be integrated over a networked environment. It coordinates the operation of these components by means of frame-based knowledge modeling [3], and works as a knowledge and data management system, a communication channel, an intelligent recognizer, an intelligent scheduler, and so on. Zhang et al. [4] developed an industrial robot-arm control system using SPAK, in which SPAK works as a communication channel and an intelligent scheduler of robot actions. Kiatisevi et al. [5] proposed a distributed architecture for knowledge-based interactive robots and, using SPAK, implemented dialogue-based human–robot ('Robovie') interaction for greeting scenarios. In the case of multiple robots,
Fig. 1. Diagram of a human–robot symbiotic system using a knowledge-based platform SPAK.
they can collaborate with each other, or a master robot can coordinate slave robots, through SPAK [6]. Although there is no doubt that the fusion of gesture and speech allows more natural human–robot interaction (HRI), as a single modality gesture recognition can be considered more reliable than speech recognition, since the human voice varies from person to person and a speech system must handle very large datasets [7]. Gestures are expressive, meaningful body motions such as physical movements of the head, face, fingers, hands or body, made with the intention of conveying information or interacting with the environment. Although most human gestures are made with the hands, the same hand gesture may have different meanings in different cultures. For example, joining the thumb and index finger to form an 'O' denotes 'OK' in the United States; in Latin America and France the same gesture is a rude sign, in Brazil and Germany it is obscene, and in Japan it means money [8]. Thus, the interpretation of a recognized gesture is user-dependent. As the skin color of the hand region and the hand shape also differ from person to person, the recognition process itself is user-dependent as well. As a result, person identification and adaptation are prime factors in realizing reliable gesture recognition and interpretation. Adaptivity is a necessary biological property of all creatures for survival in the biological world: it is the capability of self-modification that some agents have, which allows them to maintain a level of performance in the face of environmental changes, or to improve it when confronted repeatedly with the same situation [9]. A gesture-based human–robot natural interaction system should therefore be designed so that it can understand different users, their gestures, the meaning of those gestures and the robot behaviors. There has been a significant amount of research on hand, arm and facial gesture recognition for controlling robots or intelligent machines in recent years. Pavlovic et al. [10] have given a good review of recent research on visual hand gesture recognition systems, and Sturman et al. [11] have summarized
glove-based interface devices. Waldherr et al. have proposed a gesture-based interface for human and service-robot interaction [12]. They combined a template-based approach and a neural-network-based approach for tracking a person and recognizing gestures involving arm motion; they proposed an illumination-adaptation method but did not consider user or hand-pose adaptation. Kortenkamp et al. have developed a gesture-based human–mobile-robot interface [13] that uses static arm poses as gestures. Torras proposed a robot adaptivity technique using a neural learning algorithm [9]; this method is computationally inexpensive, but it provides no way to encode prior knowledge about the environment to gain efficiency. Bhuiyan et al. detected and tracked the face and eyes for HRI [14]. However, only the largest skin-like region was considered as the probable face area, which may not hold when two hands are present in the image. All of the above works focus primarily on visual processing and do not maintain knowledge of different users or consider how to deal with them. In our previous research, we combined computer vision and knowledge-based approaches for gesture-based HRI [15]. There we utilized a pose-specific subspace method (separate eigenspaces for each pose) for face and hand-pose classification, and segmented the three largest skin-like components from the images, assuming that two hands and a face may be present in the image at the same time. However, that system did not consider adaptation to new users, hand poses or gestures, which is essential for natural interaction. The system must cope with different users: a new user should be included through an on-line registration process, and once registered, the user may want to perform a new gesture that has never been used by other persons or by himself/herself. In that case, the system should include the new hand poses or gestures with minimal user interaction. This research combines computer vision and knowledge-based approaches for adaptive visual gesture recognition for HRI so that the system can learn new users, hand poses and gestures. The user can define or edit the knowledge frames for gesture recognition/interpretation and for the robot behaviors according to his/her desires; the knowledge base also contains information regarding the user (a profile) to improve the reliability and robustness of image classification. For example, the segmented skin regions are less noisy for known persons because the probable hand and face poses are segmented using person-centric threshold values for the YIQ components [15]. To adapt to different users and gestures, this system includes a multi-cluster-based interactive learning method together with hand and face pose classification. Dynamic gestures are included by tracing the transition of face poses along with the classification of static poses. As an application of this system, we have implemented a real-time HRI system using the humanoid robot 'Robovie' [16].

2. System architecture

Fig. 2 shows the overall architecture of our adaptive visual gesture-based HRI system using SPAK. This system integrates
Fig. 2. Architecture of adaptive visual gesture-based HRI system using SPAK.
various kinds of hardware and software components, such as a vision-based face and gesture recognizer and learner, a text-to-speech converter, robots or robotic devices (robot arms, robot legs, robot neck, robot mouth, etc.) and a knowledge-based software platform. In this system, frames for the known users, poses, robot actions and gestures are predefined in the knowledge base [15,17]. Since the face and two hands may be present in an image at the same time, the three largest skin-like regions are segmented from the input images using skin color information. The YIQ color representation (Y is the luminance and I, Q are the chrominance components) is used for skin-like region segmentation because it is typically used in video coding and makes effective use of chrominance information for modeling human skin color [18]. The system first detects the human face using multiple features extracted from the three largest skin-like regions [15,19] and recognizes the face using an eigenface method [20]. A skin-color-based filter reduces the number of comparisons: once a person is identified, his/her person-specific thresholds for the YIQ components narrow the search and make the system faster [15]. The user frame keeps the threshold values of the YIQ components for skin-like region segmentation, as shown in Table 1. A pixel is classified as skin if ((Y_Low < Y < Y_High) && (I_Low < I < I_High) && (Q_Low < Q < Q_High)) [15]. These thresholds (Y_High, Y_Low, I_High, I_Low, Q_High and Q_Low) are heuristically predetermined from the histogram of the YIQ components of a skin region and stored in the knowledge base with the corresponding user identity. Non-skin regions are set to black (R = G = B = 0). After thresholding, the segmented image may still contain noise in the three largest skin-like regions; to remove it, the segmented images are filtered by morphological dilation and erosion operations [18].
Table 1
Instance-frame 'Hasan' of class-frame 'User'

Frame: Hasan
Type: Instance
A-kind-of: User
Has-part: NULL
Semantic-link-from: NULL
Semantic-link-to: NULL

Slot1: Name = Y-High, Type = Integer
Slot2: Name = Y-Low, Type = Integer
Slot3: Name = I-High, Type = Integer
Slot4: Name = I-Low, Type = Integer
Slot5: Name = Q-High, Type = Integer
Slot6: Name = Q-Low, Type = Integer
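As a concrete illustration of this segmentation step, the following is a minimal sketch (not the authors' implementation): it converts an image to YIQ, applies person-specific thresholds of the kind stored in the user frame (Table 1), and cleans the mask with dilation and erosion. The threshold numbers and function names here are assumptions made for the example.

```python
import cv2
import numpy as np

# Hypothetical per-user YIQ thresholds; real values are derived from a
# histogram of the user's skin region and stored in the user frame.
HASAN_THRESHOLDS = {"Y": (80, 235), "I": (10, 90), "Q": (-20, 25)}

def segment_skin(bgr_image, th):
    """Return a binary skin mask using person-specific YIQ thresholds."""
    img = bgr_image.astype(np.float32)
    b, g, r = img[:, :, 0], img[:, :, 1], img[:, :, 2]
    # RGB -> YIQ (NTSC) conversion
    y = 0.299 * r + 0.587 * g + 0.114 * b
    i = 0.596 * r - 0.274 * g - 0.322 * b
    q = 0.211 * r - 0.523 * g + 0.312 * b
    mask = ((th["Y"][0] < y) & (y < th["Y"][1]) &
            (th["I"][0] < i) & (i < th["I"][1]) &
            (th["Q"][0] < q) & (q < th["Q"][1]))
    mask = mask.astype(np.uint8) * 255
    # Morphological dilation followed by erosion removes small noise regions
    kernel = np.ones((3, 3), np.uint8)
    return cv2.erode(cv2.dilate(mask, kernel), kernel)

def normalize_region(bgr_region):
    """Scale a segmented region to 60 x 60 and convert to gray, abs(R+G+B)/3."""
    patch = cv2.resize(bgr_region, (60, 60)).astype(np.float32)
    return np.abs(patch.sum(axis=2)) / 3.0
```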
Normalization is then performed: each segmented region is scaled to 60 × 60 pixels to match the size of the training images and converted to a gray image (abs(R + G + B)/3) [21]. The multi-cluster-based pose classification and learning method is then used to classify face and hand poses from these regions, and the gestures are interpreted/recognized using a frame-based approach [15,17]. The system is capable of learning new users and new hand poses using the multi-cluster-based
incremental learning method, and it registers new users and new poses in the knowledge base with minimal user interaction. After hand and face poses are classified, static gestures are recognized using a frame-based approach [15,17]. Known gestures are defined as frames in the SPAK knowledge base; when the required combination of pose components is found, the corresponding gesture frame is activated. Dynamic gestures are recognized by considering the transition of face poses over a sequence of time steps. After a gesture is recognized, the interaction between human and robot is determined by knowledge that is also modeled as a frame hierarchy in SPAK. Using the received gesture and user information, SPAK processes the facts and activates the corresponding behavior (action) frames to carry out predefined robot actions, which may include body movement and speech.

3. Adaptive visual gesture recognition

Images containing faces and hand poses are essential for vision-based HRI. Before a system can recognize face and hand poses, it must possess knowledge of the characteristic features of those poses. This means that either the system designer must build the necessary discriminating rules into the system, or the system must learn them. To adapt to new users and new hand poses, the system must be able to perceive and extract relevant properties from the unknown faces and hand poses, find common patterns among them, and formulate discriminatory criteria consistent with the goals of the recognition process. This form of learning is known as clustering, and it is the first step in any recognition process where the discriminating features of the objects are not known in advance [22]. Due to variations in orientation and illumination, the same hand pose may be composed of multiple clusters, and the same person's face may form different clusters. Satoh et al. have utilized multiple clusters per person for face-sequence matching in content-based video access [23]. Considering these situations, we propose a multi-cluster-based learning method for face and hand pose classification and adaptation.

3.1. Multi-cluster-based learning method

Fig. 3 shows the conceptual hierarchy of face and hand pose classification using the multi-cluster approach. A pose P_i may include a number of clusters, and each cluster C_j may include a number of images {X_1, X_2, ..., X_o} as its members. The multi-clustering method proceeds in the following steps.

Step 1. Generate eigenvectors [20] from training images that include all the known poses.

Step 2. Select the m eigenvectors with the largest eigenvalues; these are the principal components.

Step 3. Read the initial cluster image database (initialized with the known cluster images) and the cluster information table that holds the starting pointer of each cluster.
Fig. 3. Multi-cluster hierarchy for pose classification and learning.
Project each image onto the eigenspace and form its feature vector using Eqs. (1) and (2):

$$\omega_k^j = (u_k)^T X_j, \qquad 1 \le k \le m \qquad (1)$$

$$\Omega_j = [\omega_1^j, \omega_2^j, \ldots, \omega_m^j] \qquad (2)$$

where $u_k$ is the kth eigenvector and $X_j$ is the jth image (60 × 60) in the cluster database.

Step 4. Read the unlabeled images that are to be clustered or labeled.
(a) Project each unlabeled image onto the eigenspace and form its feature vector $\Omega$ using Eqs. (1) and (2).
(b) Calculate the Euclidean distance to each image in the known (cluster) dataset using Eqs. (3) and (4):

$$\varepsilon_j = \|\Omega - \Omega_j\| \qquad (3)$$

$$\varepsilon = \arg\min_j \{\varepsilon_j\} \qquad (4)$$
Step 5. Find the nearest class.
(a) If T_i ≤ ε ≤ T_c, add the image to the neighboring cluster and increment the insertion parameter α, where T_i is the Euclidean-distance threshold for identifying a pose and T_c is the Euclidean-distance threshold for selecting images that should be clustered.
(b) If ε < T_i, the image is already recognizable and there is no need to include it in the cluster database.

Step 6. If the insertion parameter α for the known clusters is greater than zero, update the cluster information table that holds the starting pointer of every cluster.

Step 7. Repeat Steps 3–6 until the insertion parameter α for the known cluster dataset is zero.

Step 8. When α is zero, check the unlabeled dataset for images that satisfy T_c < ε ≤ T_f, where T_f is the threshold defined for removing an image from the unlabeled dataset.
Fig. 4. Sample outputs of multi-clustering algorithm for learning new persons.
Non-face images or spurious hand poses that merely have skin color are removed from the dataset based on the specific threshold value (ε ≤ T_f).

Step 9. If the number of unlabeled images for a class exceeds a predefined value N, select one image (the one with the minimum Euclidean distance) as a member of a new cluster, and update the cluster information table.

Step 10. Repeat Steps 3–9 until the number of unlabeled images is less than N.

Step 11. If the height of a newly developed cluster (the number of member images in the cluster) is greater than L, add it as a permanent cluster. A permanent cluster is a cluster that is permanently stored in the training dataset and whose associated pose or person name is already defined in the knowledge base.

Step 12. After clustering, the user defines the associations of the clusters in the knowledge base. Each pose may be associated with multiple clusters; an undefined cluster has no associated link.
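As a rough illustration of Steps 3–9, the sketch below gives one simplified reading of the core loop (it is not the authors' implementation): unlabeled images are projected onto a precomputed eigenspace, assigned to the nearest cluster when close enough, or kept aside as candidates for a new cluster. The thresholds T_i, T_c, T_f and the value of N are placeholders.

```python
import numpy as np

def project(images, eig_vectors):
    """Project flattened 60x60 images onto the top-m eigenvectors (Eqs. (1)-(2))."""
    return images @ eig_vectors              # shape: (n_images, m)

def cluster_pass(cluster_feats, cluster_ids, unlabeled_feats,
                 T_i=2000.0, T_c=4000.0, T_f=6000.0, N=8):
    """One pass of Steps 4-9: update clusters, return leftover unlabeled features."""
    leftovers = []
    inserted = 0                              # insertion parameter alpha
    for f in unlabeled_feats:
        dists = np.linalg.norm(cluster_feats - f, axis=1)   # Eq. (3)
        j = int(np.argmin(dists))                            # Eq. (4)
        eps = dists[j]
        if eps < T_i:
            continue                          # already recognizable, do not store
        elif eps <= T_c:
            cluster_feats = np.vstack([cluster_feats, f])    # join neighboring cluster
            cluster_ids = np.append(cluster_ids, cluster_ids[j])
            inserted += 1
        elif eps <= T_f:
            leftovers.append((eps, f))        # candidate for a new cluster
        # eps > T_f: treated here as a spurious skin region and discarded
        #            (one possible reading of Step 8)
    if inserted == 0 and len(leftovers) > N:
        # Step 9: seed a new cluster with the closest leftover image
        _, seed = min(leftovers, key=lambda t: t[0])
        cluster_feats = np.vstack([cluster_feats, seed])
        cluster_ids = np.append(cluster_ids, cluster_ids.max() + 1)
    return cluster_feats, cluster_ids, [f for _, f in leftovers]
```

In the full method, this pass would be repeated (Steps 7 and 10) until no insertions occur and fewer than N unlabeled images remain, after which the user names the new clusters in the knowledge base (Step 12).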
3.2. User identification and learning

We have already argued that a robot should be able to recognize and remember users and learn about them. This gesture-based HRI system is person-centric; therefore, person identification and adaptation is one of its main features. A number of techniques have been developed to detect and recognize faces [24]. If a new user appears in front of the robot's eye camera or the system camera, the system identifies the user as unknown and asks for registration. For unknown persons, the system is capable of learning the new user's face information and registering the person, with the necessary attributes, in the knowledge base using minimal user interaction. Reasoning about images in face space provides a means to learn and subsequently recognize new faces in a semi-supervised manner. When an image is sufficiently close in face space (i.e., face-like) but is not classified as one of the familiar faces, it is initially labeled as an unknown face. The system maintains the pattern vectors of the corresponding unknown images; if a collection of unknown pattern vectors clusters in the pattern space, the presence of a new but undefined face is postulated [25]. The face is detected from the three skin-like regions using a multiple-feature-based approach [15,19]. The detected face is filtered to remove noise and normalized so that it matches the size and type of the training images [18]: it is scaled to a 60 × 60 square image and converted to a gray image (abs(R + G + B)/3) [21]. The face pattern is then classified using the eigenface method [20] as belonging to either a known or an unknown person. The face recognition method uses five face classes in the training images, as shown in Fig. 4: normal or frontal face (NF), right-directed face (RF), left-directed face (LF), up-state face (UF) and down-state face (DF). The eigenvectors are calculated from the known persons' face images for all face classes, and the m eigenvectors corresponding to the highest eigenvalues are chosen as principal components. The Euclidean distance is computed between the weight vectors generated from the training face images and the weight vector generated from the detected face by projecting them onto the eigenspace [20]. If the minimal Euclidean distance is below a predefined threshold, the person is known; otherwise the person is unknown. For an unknown person, the learning process is activated based on a judge function, and the system learns the new user using the multi-clustering approach. The judge function is based on the ratio of the number of unknown faces to the total number of detected faces within a specific time slot. The learning function develops new clusters corresponding to the new person, and the user defines the person's name and skin-color information in the user-profile knowledge base and associates them with the corresponding clusters.
Fig. 5. Sample outputs of the multi-clustering approach for pose classification.
Fig. 4 shows sample outputs of the multi-clustering approach for recognizing and learning new users. For example, the system is initially trained with the face images of person 1 ('Hasan') (100 face images covering the five directions). The system learns and remembers this person using five clusters (FC1, FC2, FC3, FC4, FC5), as shown in the top five rows of Fig. 4, and the cluster information table (which holds the starting position of each cluster) contains {1, 12, 17, 27, 44, 53}. In this case, if a face image matches a known member between positions 12 and 16, it is classified as face cluster 2 (FC2); if the face is classified into any of these five clusters, the person is identified as person 1 ('Hasan'). Suppose the system is then presented with 100 face images of a new person: it cannot identify the person and activates the learning function. The system develops six new clusters (FC6, FC7, FC8, FC9, FC10, FC11) and updates the cluster information table to {1, 12, 17, 27, 44, 53, 68, 82, 86, 96, 106, 117}, as shown in rows 6 to 11 of Fig. 4. The person is registered as person 2 ('Cho') and the associations with the clusters are defined. Whenever a detected face image is classified into a known cluster, the associated person is identified.
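A minimal sketch of the judge function described above, applicable both to unknown faces (this section) and to unknown poses (Section 3.3); the window length and ratio threshold are hypothetical, since the paper does not give exact values.

```python
from collections import deque

class JudgeFunction:
    """Decide when to trigger user/pose learning from recent classification results."""
    def __init__(self, window=10, ratio_threshold=0.8):
        self.results = deque(maxlen=window)    # True = unknown, False = known
        self.ratio_threshold = ratio_threshold

    def observe(self, is_unknown):
        """Record one frame's result and report whether learning should start."""
        self.results.append(is_unknown)
        if len(self.results) < self.results.maxlen:
            return False                        # not enough evidence yet
        unknown_ratio = sum(self.results) / len(self.results)
        return unknown_ratio >= self.ratio_threshold
```

For example, if the same unrecognized face or hand pose is observed for roughly 10 consecutive image frames, observe() returns True and the multi-cluster learning function is invoked.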
3.3. Pose classification and learning method

In real life, we can understand many hand and facial gestures on the spot by using other modalities, context or scene analysis, without prior knowledge of the new gestures. For a machine, it is difficult to understand new poses without prior knowledge, so it is essential to learn new poses based on a judge function or predefined knowledge. The judge function determines the user's intention, i.e., the intention to create a new gesture; it is based on the ratio of the number of unknown poses to the total number of poses within a specific time slot. For example, if the user shows the same hand pose for 10 image-frame times and it is unknown to the system, this indicates that he/she wants to use it as a new gesture. In this situation the judge function activates the learning function, and the system learns the new pose using the multi-clustering approach. The learning function develops one or more new clusters corresponding to the new pose; the user defines the pose name in the knowledge base and associates it with the corresponding clusters. Once a pose is identified, the corresponding pose frame is activated [15]. Fig. 5 shows sample outputs of the multi-cluster approach for learning new poses. The system is first trained with pose 'ONE' and forms one cluster for that pose. It is then trained with pose 'FIST UP', for which it forms two clusters because the user makes that pose with either hand. Other clusters are formed similarly for new poses.

3.4. Recognizing and learning gestures

Gesture recognition is the process by which the gestures made by the user are identified by the system. There are static gestures and dynamic gestures, and recognition is carried out in two phases. In the first phase, face and hand poses are classified from each captured image frame using the method described in the previous section.
Fig. 6. Example frame for the gesture ‘TwoHand’.
Then the sequence and combination of poses are analyzed to identify the occurrence of the gesture. Interpretation of an identified gesture is user-dependent, since the meaning of a gesture may differ from person to person based on their culture. For example, when user 'Hasan' comes in front of 'Robovie''s eyes, 'Robovie' recognizes the person as 'Hasan' and says "Hi Hasan! How are you?"; 'Hasan' then raises his thumb ('Thumb up') and 'Robovie' replies "Oh! You are not fine today". In a similar situation, for another user 'Cho', 'Robovie' says "Hi, you are fine today". That is, 'Robovie' must understand the person-centric meaning of the gesture. To accommodate different users' desires, our person-centric gesture interpretation is achieved using a frame-based knowledge representation approach [15,17]. The user predefines these frames in the knowledge base, and the system maintains frames with the necessary attributes (gesture components, gesture name) for all predefined gestures. Our current system recognizes 13 gestures: 11 static gestures and 2 dynamic facial gestures. These are: 'TwoHand' (raise left-hand and right-hand palms), 'LeftHand' (raise left-hand palm), 'RightHand' (raise right-hand palm), 'One' (raise index finger), 'Two' (form a V sign using index and middle fingers), 'Three' (raise index, middle and ring fingers), 'ThumbUp' (thumb up), 'Ok' (make a circle using thumb and index finger), 'FistUp' (fist up), 'PointLeft' (point left with index finger), 'PointRight' (point right with index finger), 'YES' (nod the face up and down or down and up) and 'NO' (shake the face left and right or right and left). More gestures can be recognized by adding new poses and new knowledge frames for the gestures: new poses can be included in the training image database using the learning method, and corresponding frames can be defined in the knowledge base to interpret the gesture. To teach the robot a new pose, the user should perform the pose several times (e.g., for 10 image-frame times); the learning method then detects it as a new pose, develops one or more clusters for it, and updates the knowledge base with the cluster information.

3.4.1. Static gesture recognition

Static gestures are recognized using a frame-based approach from the combination of the pose classification results of the three skin-like regions at a particular time. For example, if a left-hand palm, a right-hand palm and one face are present in the input image, it is recognized as the 'TwoHand' gesture and the corresponding gesture frame is activated. The user predefines these frames in the knowledge base using the SPAK knowledge editor (Section 4). The system maintains frames with the necessary attributes (gesture components, gesture name) for all predefined gestures.
Gesture components are the face and hand poses. When a pose is identified, its name is sent to SPAK and the corresponding instance of the pose frame is activated. A gesture frame is defined using three slots corresponding to the pose recognition results of the three skin regions. Fig. 6 shows the contents of the 'TwoHand' gesture frame in SPAK. If the image analysis and recognition module classifies the 'FACE', 'LeftHand' and 'RightHand' poses in an image frame, it sends the pose names to the SPAK knowledge module; the corresponding pose frames ('FACE', 'LeftHand' and 'RightHand') are activated, and from this combination the 'TwoHand' gesture is recognized and its frame is activated.

3.4.2. Dynamic gesture recognition

A dynamic gesture uses the motion of the hands or body to emphasize or help express a thought or feeling. It is difficult to recognize dynamic gestures visually because of the large variation in the speed with which the physical objects describing the gesture change position. This system recognizes two dynamic facial gestures, used for user acknowledgement, by considering the transition of face poses over a sequence of time steps. If the face shakes left to right or right to left, it is recognized as the 'NO' gesture; if the face moves up and down or down and up, it is recognized as the 'YES' gesture. The method uses a three-element FIFO queue that holds the last three distinct face-pose results, and defines five specific face poses: frontal face (NF), right-rotated face (RF), left-rotated face (LF), up-position face (UF) and down-position face (DF). For every image frame, the face pose is classified using the pose classification method; if the classified pose is one of the predefined face poses, it is pushed onto the queue, and if it is the same as in the previous frame, the queue remains unchanged. The gesture is then determined from the combination of the three queued values. For example, if the queue contains the pose set {UF, NF, DF} or {DF, NF, UF}, the 'YES' gesture is recognized; similarly, if it contains {RF, NF, LF} or {LF, NF, RF}, the 'NO' gesture is recognized. After a specific time period the queue is cleared. Fig. 7 shows example pose sequences for the dynamic gestures 'YES' and 'NO'. This queue applies only to these two facial gestures; other dynamic hand gestures would require different queues designed according to their state-transition diagrams. When a gesture is recognized, the corresponding gesture frame is activated.
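The following sketch shows one way to implement the three-element queue logic just described; the pose labels are taken from the paper, while the class and method names are illustrative assumptions.

```python
from collections import deque

YES_PATTERNS = [("UF", "NF", "DF"), ("DF", "NF", "UF")]
NO_PATTERNS = [("RF", "NF", "LF"), ("LF", "NF", "RF")]

class DynamicFaceGesture:
    """Recognize 'YES'/'NO' from transitions of classified face poses."""
    def __init__(self):
        self.queue = deque(maxlen=3)    # holds the last three distinct face poses

    def update(self, pose):
        """Feed one frame's face-pose label; return 'YES', 'NO' or None."""
        if pose not in ("NF", "RF", "LF", "UF", "DF"):
            return None                  # not a predefined face pose
        if self.queue and self.queue[-1] == pose:
            return None                  # same as previous frame: queue unchanged
        self.queue.append(pose)
        seq = tuple(self.queue)
        if seq in YES_PATTERNS:
            self.queue.clear()           # refresh the queue after recognition
            return "YES"
        if seq in NO_PATTERNS:
            self.queue.clear()
            return "NO"
        return None
```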
Fig. 7. Example face-pose sequences for the dynamic gestures: (a) 'YES' {UF, NF, DF}; (b) 'NO' {RF, NF, LF}.

4. Knowledge modeling and management for gesture-based human–robot interaction

Knowledge is the theoretical or practical understanding of a subject or a domain. The 'frame-based approach' is a knowledge-based problem-solving approach built on the so-called 'frame theory' first proposed by Minsky [3]. A frame is a data structure for representing a stereotyped unit of human memory, including definitive and procedural knowledge. Attached to each frame are several kinds of information about the particular object or concept it describes, such as a name and a set of attributes called slots, and collections of related frames are linked together into frame systems. A frame-based approach has been used successfully in many robotic applications [1], and we believe it is also flexible enough for realizing a human-friendly intelligent service robot. Our system employs SPAK, a frame-based knowledge engine, connected to a group of networked software agents such as face recognizers, gesture recognizers, voice recognizers and robot controllers. Using the information received from these agents, and based on the predefined frame knowledge hierarchy, the SPAK inference engine determines the actions to be taken and submits the corresponding commands to the target robot control agents. This kind of integration is applicable to many robotic applications, including industrial robots.

4.1. Knowledge modeling
Fig. 8 shows the frame hierarchy of the knowledge model for the gesture-based HRI system, organized by the IS-A relationship indicated by arrows connecting upper and lower frame boxes. An upper frame describes a group of objects with common attributes; there are upper frames for user, robot, behavior (robot action), gesture and pose [15,17]. An instance of a frame represents a particular object. The user frame includes instances for all known users ('Hasan', 'Cho', ...), and the robot frame includes instances for all robots used by the system ('Aibo', 'Robovie', ...). The behavior frame is further sub-classed into 'AiboAct' and 'RobovieAct', where the 'AiboAct' frame includes instances of all predefined 'Aibo' actions and the 'RobovieAct' frame includes instances of all predefined 'Robovie' actions. The gesture frame is sub-classed into 'Static' and 'Dynamic' for static and dynamic gestures, respectively; examples of static instances are 'TwoHand' and 'One', and examples of dynamic instances are 'YES' and 'NO'. The pose frame includes all recognizable poses, such as 'LeftHand', 'RightHand', 'FACE' and 'One'. Gesture and user frames are activated when SPAK receives information from a network agent indicating that a gesture or a face has been recognized; behavior frames are activated when their predefined conditions are met. Table 1 illustrates the instance-frame 'Hasan' under the class-frame 'User'. The user frame attributes include the threshold values for the luminance and chrominance components of that particular user's skin color: six slots are defined for the minimum and maximum values of the YIQ components (Y is luminance; I and Q are chrominance), and these are used for segmenting skin-like regions in the YIQ color space for that person. The other frames are defined similarly with the necessary attributes and slot information [15,17].
Fig. 8. Frame hierarchy for a gesture-based HRI system (SG = static gesture, DG = dynamic gesture).
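To make the frame model concrete, here is a minimal sketch of a frame with slots and a condition–action rule, in the spirit of the hierarchy in Fig. 8; the class names, slot values and matching logic are illustrative assumptions, not SPAK's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Frame:
    """A simplified frame: a named unit with typed slots and an IS-A parent."""
    name: str
    a_kind_of: Optional[str] = None
    slots: dict = field(default_factory=dict)

# Instance frame for user 'Hasan' with person-specific YIQ thresholds (cf. Table 1);
# the numeric values here are placeholders.
hasan = Frame("Hasan", a_kind_of="User",
              slots={"Y-High": 235, "Y-Low": 80, "I-High": 90,
                     "I-Low": 10, "Q-High": 25, "Q-Low": -20})

# Behavior frame: fires when the recognized user and gesture match its conditions.
raise_two_arms = Frame("RaiseTwoArms", a_kind_of="RobovieAct",
                       slots={"user": "Hasan", "gesture": "TwoHand",
                              "robot": "Robovie", "action": "raise_both_arms"})

def activate(behavior, facts):
    """Fire a behavior frame if all its condition slots are satisfied by the facts."""
    conditions = {k: v for k, v in behavior.slots.items() if k != "action"}
    if all(facts.get(k) == v for k, v in conditions.items()):
        return behavior.slots["action"]
    return None

print(activate(raise_two_arms,
               {"user": "Hasan", "gesture": "TwoHand", "robot": "Robovie"}))
```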
Fig. 9. Example of the SPAK knowledge editor.
4.2. Knowledge management system

The knowledge in this system is maintained by SPAK. SPAK consists of a frame-based knowledge management system and a set of extensible autonomous software agents representing objects inside the environment and supporting HRI and collaborative operation in a distributed working environment [2]. It is a Java-based software platform supporting knowledge processing and the coordination of tasks among several software modules and agents that represent the robotic hardware connected on a network. SPAK consists of the following major components, as shown in Fig. 2: a graphical user interface (GUI), a knowledge base (KB), a network gateway and an inference engine. SPAK allows TCP/IP-based communication with other software components in the network and provides knowledge access and manipulation via a TCP socket. Frame-based knowledge is entered into the SPAK system with full slot information: attributes, conditions and actions. Based on information from remote software components (e.g. pose classification, face recognition, etc.), the SPAK inference engine processes facts, instantiates frames and carries out the user's predefined actions. Fig. 9 shows an example of the SPAK knowledge editor (a part of this system) with the class frames robot, user, gesture, behavior (robot actions) and pose. The frame system can be defined using the GUI and is interpreted by the SPAK inference engine. New users, poses, gestures and robot behaviors (actions) can be added by creating corresponding frames in the knowledge base using the SPAK knowledge editor.
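As a sketch of how a recognition agent might report its results to SPAK over the TCP interface mentioned above (the host, port and message format here are invented for illustration; the paper does not specify SPAK's wire protocol):

```python
import socket

def send_to_spak(message, host="localhost", port=5000):
    """Send one newline-terminated text message to the SPAK knowledge server."""
    with socket.create_connection((host, port), timeout=2.0) as sock:
        sock.sendall((message + "\n").encode("utf-8"))

# Report the identified user and the classified poses of the three skin regions;
# SPAK would then activate the matching pose, gesture and behavior frames.
send_to_spak("user=Hasan;poses=FACE,LeftHand,RightHand")
```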
5. Experiments, results and discussion

This system uses a standard video camera or the 'Robovie' eye cameras for data acquisition. Each captured image is digitized into a matrix of 320 × 240 pixels with 24-bit color. The recognition approach has been tested with a real-world HRI system using the humanoid robot 'Robovie' (developed by ATR) [16]. The recognition and learning approach has been verified with real-time input images as well as static images.

5.1. Results of user recognition and learning

Seven individuals were asked to act out the predefined face poses in front of the camera, and the sequences of face images were saved as individual image frames. All training and test faces are 60 × 60 pixel gray images. The learning algorithm was verified on the seven people's frontal (normal, NF) face images and on five-directional face images (normal, left-directed, right-directed, up-directed and down-directed faces). Fig. 10 shows sample outputs of the learning method for normal (frontal) faces. In the first step, the system was trained using 60 normal face images of three persons and developed three clusters (top three rows in Fig. 10) corresponding to the three persons; the cluster information table (which stores the starting pointer of each cluster) contains {1, 11, 23, 30}. In this situation, if an input face image matches a known member between positions 1 and 10, the person is identified as person 1 ('Hasan'). In the second step, 20 face images of another person were fed into the system. The minimum Euclidean distances (the Euclidean distance, ED, is computed between the weight vectors generated from the training face images and the weight vector generated from the detected face by projecting them onto the eigenfaces) of the fourth person's 20 face images from the three known persons' eigenface images are shown by the upper line (B_adap) in Fig. 11. Based on the predefined threshold (the ED used for known/unknown classification), the system identified these faces as an unknown person and activated the user learning function. The bottom line (A_adap) in Fig. 11 shows the minimum ED for those 20 face images of the fourth person after learning (adaptation). For eight images the minimum ED is zero (perfect recognition), because these were included in the new cluster as member images (fourth row of Fig. 10).
Fig. 10. Sample outputs of the learning method for normal faces.
Fig. 11. Minimum Euclidian distance for 20 face images of person 4 (before and after adaptation).
Fig. 12. Cluster member distribution for five directed face poses.
This allows the system to recognize other frontal face images of that person. The user learning function develops a new cluster (fourth row of Fig. 10) and updates the cluster information table to {1, 11, 23, 30, 38}. The method was tested on seven persons, including two females, and formed clusters of different lengths (numbers of images per cluster) for the different persons, as shown in Fig. 10. The user learning system was also tested on five-directional face images of the seven persons, 700 test face images in total. Fig. 12 shows the distribution of the resulting 41 clusters: five clusters for person 1 'Hasan' (as shown in Fig. 4), six for person 2 'Cho' (as shown in Fig. 4), seven for person 3 'Osani', seven for person 4 'Satomi' (female), seven for person 5 'Huda', four for person 6 'Yosida' and five for person 7 'Yomiko' (female). Fig. 13 shows sample errors in the clustering process: in cluster 26, up-directed faces of person 6 and person 5 overlap (Fig. 13(a)); in cluster 31, up-directed faces of person 5 and normal faces of person 6 overlap
(Fig. 13(b)). This problem can be mitigated by using a narrower threshold, but in that case the number of iterations and the discarding rate of images both increase. In our previous research, we found that the accuracy of frontal face recognition is better than that for up-directed, down-directed and strongly left- or right-directed faces [19]; this system therefore prefers frontal and slightly left- or right-rotated faces for person identification. We tested the face recognition method on 680 faces of seven persons, two of whom are female. The average precision of face recognition is about 93% and the recall rate is about 94.08%.

5.2. Results of pose classification and learning

The system uses 10 hand poses of seven persons for evaluating the pose classification and learning method. All training and test images are 60 × 60 pixel gray images, and the method clusters the training images automatically. The system was first trained using 200 images of 10 hand poses of person 1 (20 images of each pose), which it automatically grouped into 13 clusters. Fig. 5 shows sample outputs of the hand-pose learning method for person 1 ('Hasan'). If the user uses both hands to make the same pose, two different clusters are formed for that pose; different clusters can also be formed due to variation in orientation even when the pose is the same. When the person changes, different clusters may be formed for the same hand poses (gestures) because of variations in hand shape and color. After training with the 10 hand poses of person 1, 200 images of the 10 hand poses of person 2 were fed into the system. The system developed eight more clusters for person 2, corresponding to eight hand poses; for the 'LeftHand' and 'RightHand' palms it did not develop new clusters but instead inserted new members into the existing ones. Table 2 shows the 22 clusters developed for the 10 hand poses of the two persons and their associations with the hand poses, and Fig. 14 shows the distribution of the clusters of the 10 hand poses for the two persons. In this study we also compared pose classification accuracy (%) using two methods: a multi-cluster-based approach without adaptation and a multi-cluster-based approach with adaptation. In this experiment we used a total of 840 training images (20 images of each pose of each person; 3 × 14 × 20) and 1660 test images of 14 ASL (American Sign Language) characters of three persons (Fig. 15) [26].
Fig. 13. Examples of errors in the clustering process: (a) cluster 26, in which a frontal face of person 5 overlaps with an up-directed face of person 6; (b) cluster 31, in which up-directed faces of person 5 overlap with frontal faces of person 6 (two images from the right side).

Table 2
Poses and associated clusters

Pose name: Associated clusters
ONE (raise index finger): PC1, PC15
FIST (fist up): PC2, PC3, PC19
OK (make circle using thumb and index fingers): PC4, PC20
TWO (make V sign using index and middle fingers): PC5, PC6, PC16
THREE (raise index, middle and ring fingers): PC7, PC17, PC18
LEFTHAND (left-hand palm): PC8
RIGHTHAND (right-hand palm): PC9
THUMB (thumb up): PC10, PC11, PC14
POINTL (point left): PC12, PC21
POINTR (point right): PC13, PC22
Fig. 14. Cluster member distribution of hand poses.
All images were segmented using skin color and normalized to 60 × 60 gray images. Table 3 compares the two methods (B_Adap = before adaptation, A_Adap = after adaptation). The table shows that the pose classification accuracy is better with the adaptation (learning) approach, because the learning function adds cluster members or forms new clusters as needed to classify new images. Fig. 16 depicts the graphical
representation of the 14 ASL character classification accuracies with the learning approach (after adaptation) and without it (before adaptation). The comparison curves show that including the incremental learning approach improves pose classification performance. However, the learning approach needs user interaction, which is the bottleneck of this method. Note that the accuracy of the gesture recognition system depends on the accuracy of the pose classification unit. For example, in some cases pose 'TWO' (V sign) was present in the input image but the pose classification method failed to classify it correctly and labeled it as pose 'ONE' (raised index finger) due to variation in orientation. The accuracy of dynamic gesture recognition likewise depends on the accuracy of face pose classification. Table 4 shows the success rate of dynamic gesture recognition. This test was performed on six image sequences, each consisting of 485 image frames. For correct recognition, each face pose should be stable for at least one image-frame time.
Fig. 15. Examples of 14-ASL characters.
Table 3
Comparison of pose classification accuracy with and without adaptation (B_Adap = before adaptation, A_Adap = after adaptation)

ASL char | Number of images | Correct (B_Adap) | Correct (A_Adap) | Accuracy % (B_Adap) | Accuracy % (A_Adap)
A | 120 | 81 | 119 | 67.50 | 99.16
B | 123 | 82 | 109 | 66.66 | 88.61
C | 110 | 73 | 103 | 66.36 | 93.63
D | 120 | 78 | 106 | 65.00 | 88.33
E | 127 | 80 | 96 | 62.99 | 75.59
F | 120 | 81 | 114 | 67.50 | 95.00
G | 120 | 118 | 120 | 98.33 | 100
I | 100 | 56 | 100 | 56.00 | 100
K | 120 | 107 | 116 | 89.16 | 96.66
L | 120 | 100 | 119 | 83.33 | 99.16
P | 120 | 79 | 119 | 65.83 | 99.16
V | 120 | 75 | 86 | 62.50 | 71.66
W | 120 | 74 | 101 | 61.66 | 84.16
Y | 120 | 85 | 98 | 70.83 | 81.66
Fig. 16. Comparison of pose classification accuracy.

Table 4
Evaluation of dynamic gesture recognition

Image sequence | Successful 'YES' gestures | Successful 'NO' gestures | Unsuccessful gestures | Average success rate (%)
Seq1 | 19 | 9 | 2 | 93.33
Seq2 | 18 | 20 | 0 | 100
Seq3 | 23 | 12 | 1 | 97.22
Seq4 | 16 | 20 | 2 | 94.73
Seq5 | 35 | 11 | 3 | 93.87
Seq6 | 18 | 14 | 2 | 94.11
5.3. Implementation of human–robot interaction

Real-time gesture-based HRI has been implemented as an application of this system on the humanoid robot 'Robovie'. The system assumes that the interaction takes place in a specific room with constant illumination, and that the user wears long sleeves of a non-skin color; if the user wears short sleeves or a T-shirt, the hand palms must be separated from the arms, and hand-palm segmentation against a cluttered background is one of the important issues for our future research. The system also assumes that a single user interacts with the robot at a time, and the robot behavior is defined as person-centric. If multiple users are present in the captured image, the system recognizes the most significant user, i.e., the one whose matching probability (based on minimum ED) is highest. Since the same gesture can mean different tasks for
different persons, we need to maintain person-to-task knowledge for each gesture. The robot and the gesture-recognition PC are connected to the SPAK knowledge server [15]. From the image analysis and recognition PC, the person identity and pose names (or the gesture name, for dynamic gestures) are sent to SPAK for decision making and robot activation. According to the gesture and the user identity, the knowledge module generates executable code for robot actions, and the robot then follows the speech and body-action commands. Fig. 17 shows an example 'Robovie' behavior frame, 'RaiseTwoArms'; this frame is activated if the user is 'Hasan', the gesture is 'TwoHand' and the selected robot is 'Robovie'. The system assumes that a gesture command remains effective until the robot finishes the corresponding action. The method has been implemented on the humanoid robot 'Robovie' for the following scenario:

User: 'Hasan' comes in front of the Robovie eye camera and the robot recognizes the user as 'Hasan'.
Robot: "Hi Hasan, how are you?" (speech)
Hasan: uses the gesture 'Ok'
Robot: "Oh good! Do you want to play now?" (speech)
Hasan: uses the gesture 'YES' (nods face)
Robot: "Oh, thanks" (speech)
Hasan: uses the gesture 'TwoHand'
Robot: imitates the user's gesture ('Raise Two Arms'), as shown in Fig. 18.

Robovie imitates the other gestures similarly: its actions 'Raise Left Arm', 'Raise Right Arm', 'Point with Left Arm' and 'Point with Right Arm' correspond to the gestures 'LeftHand', 'RightHand', 'PointLeft' and 'PointRight', respectively, for the user 'Hasan'.

Hasan: uses the gesture 'FistUp' (stops the interaction)
Robot: "Bye-bye" (speech)

For another user, 'Cho':

User: 'Cho' comes in front of the Robovie eye camera; the robot detects the face as unknown.
Robot: "Hi, what is your name?" (speech)
Cho: types his name, 'Cho'
Robot: "Oh, good! Do you want to play now?" (speech)
Cho: uses the gesture 'Ok'
Robot: "Thanks!" (speech)
Cho: uses the gesture 'LeftHand'
Robot: imitates the user's gesture ('Raise Left Arm')
Cho: uses the gesture 'RightHand'
Robot: imitates the user's gesture ('Raise Right Arm')
Cho: uses the gesture 'One'
Robot: "This is one" (speech)
Cho: uses the gesture 'Two'
Robot: "This is two" (speech)
Cho: uses the gesture 'Three'
Robot: "This is three" (speech)
Cho: uses the gesture 'TwoHand'
Robot: "Bye-bye" (speech)
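The person-centric mapping in the scenarios above can be pictured as a simple lookup keyed by (user, gesture); the sketch below is purely illustrative, with invented entries, and stands in for the behavior frames actually stored in SPAK (cf. Fig. 17).

```python
# Hypothetical (user, gesture) -> robot behavior table; in the real system these
# mappings are person-centric behavior frames in the SPAK knowledge base.
BEHAVIOR_TABLE = {
    ("Hasan", "TwoHand"): ("action", "Raise Two Arms"),
    ("Hasan", "FistUp"): ("speech", "Bye-bye"),
    ("Cho", "TwoHand"): ("speech", "Bye-bye"),
    ("Cho", "One"): ("speech", "This is one"),
}

def select_behavior(user, gesture):
    """Return the person-centric robot behavior for a recognized gesture."""
    return BEHAVIOR_TABLE.get((user, gesture),
                              ("speech", "Sorry, I do not know that gesture"))

print(select_behavior("Hasan", "TwoHand"))   # ('action', 'Raise Two Arms')
print(select_behavior("Cho", "TwoHand"))     # ('speech', 'Bye-bye')
```

The same gesture ('TwoHand') thus triggers different behaviors for different users, which is exactly the person-centric interpretation the scenarios illustrate.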
Fig. 17. Example of Robovie behavior frame ‘Raise Two Arms’.
Fig. 18. Example human–robot (Robovie) interaction.
The above scenarios show that the same gesture can carry different meanings, and that several gestures can carry the same meaning, for different persons. The user can design new actions for 'Robovie' according to his/her desires and can design the corresponding knowledge frames in SPAK to implement them. Robot behavior can be programmed or learned through experience, but it is difficult for a robot to perceive user intention. This system therefore proposes an experience-based, user-interactive learning method, in which the history of similar gesture-action mappings is stored in the knowledge base. The system recognizes and remembers the characteristics of different users, that is, their gestures and the corresponding robot actions. By generalizing over multiple occurrences of the same gesture, the robot selects an action and asks the user for acknowledgement; if the user makes the 'YES' gesture or types 'YES', the corresponding frame is stored permanently. For future multi-modal HRI, the integration of speech, text and gesture-based interaction, being more natural and friendly, is a focus of attention for our team and other researchers. An example scenario:

Huda: comes to Robovie.
Robovie: cannot recognize him, and asks "Who are you?"
Huda: types 'Huda'.
Robovie: says "Hello Huda. Do you want to play with me?"
Huda: shows the 'Ok' hand gesture.
Robovie: asks "Do you mean yes?"
Huda: nods his head.
Robovie: adds 'Ok = Yes, for Huda' to its knowledge base.

If there are already two examples of the same meaning of the 'Ok' gesture, Robovie adds 'Ok = Yes, for everybody' to its knowledge base. It is also possible to implement complex commands using the knowledge-based software platform: a complex gesture
frame is activated when its constituent gestures are recognized in sequence. A future task for our team is to develop gestures that are helpful for the elderly and handicapped. In the proposed system we do not train the elderly to learn gestures; rather, elderly or deaf users are assumed to already know the gestures for issuing commands.

6. Conclusions

This paper describes an adaptive visual gesture recognition method for HRI using a knowledge-based software platform. Frame-based knowledge modeling is flexible and powerful for gesture interpretation and person-centric HRI. Human skin color (its luminance and chrominance components) differs from person to person even under the same lighting conditions; therefore, person-centric threshold values for the YIQ components are very useful for skin-region segmentation. The system is able to identify and learn new users, poses, gestures and robot behaviors, and the user can define or update the knowledge frames or conditions for gesture recognition/interpretation and the robot behaviors corresponding to his/her gestures. We have proposed a multi-cluster-based interactive learning approach that increases the number of training images in order to adapt to new poses or new users. However, if a large number of users use a large number of hand poses, it becomes impossible to run the system in real time; to overcome this problem, in the future we will maintain person-specific subspaces (a separate PCA over all hand poses for each person) for pose classification and learning. Another limitation is that the system still needs minimal user interaction in the learning/registration phase, and the user should have some prior knowledge of the skin-color threshold selection method. In future research we will propose adaptive threshold selection methods that can define the threshold values
automatically. By integrating with a knowledge-based software platform, a gesture-based person-centric HRI system has also been successfully implemented using the robot 'Robovie'. Our future aim is to make the system more robust and dynamically adaptable to new users and new gestures for interaction with different robots such as 'Aibo', 'Robovie' and 'Scout'. The ultimate goal of our research is to establish a human–robot symbiotic society in which robots share their resources and work cooperatively with human beings.

References

[1] H. Ueno, A knowledge-based information modeling for autonomous humanoid service robot, IEICE Transactions on Information & Systems E85-D (4) (2002) 657–665.
[2] V. Ampornaramveth, H. Ueno, SPAK: Software platform for agents and knowledge systems in symbiotic robots, IEICE Transactions on Information and Systems E86-D (3) (2004) 1–10.
[3] M. Minsky, A framework for representing knowledge, MIT-AI Laboratory Memo 306, 1974.
[4] T. Zhang, M. Hasanuzzaman, V. Ampornaramveth, P. Kiatisevi, H. Ueno, Human–robot interaction control for industrial robot arm through software platform for agents and knowledge management, in: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, IEEE SMC, Netherlands, 2004, pp. 2865–2870.
[5] P. Kiatisevi, V. Ampornaramveth, H. Ueno, A distributed architecture for knowledge-based interactive robots, in: Proceedings of the 2nd International Conference on Information Technology for Application, ICITA, 2004, pp. 256–261.
[6] T. Zhang, V. Ampornaramveth, P. Kiatisevi, M. Hasanuzzaman, H. Ueno, Knowledge-based multiple robots coordinative operation using software platform, in: Proceedings of the 6th Joint Conference on Knowledge-Based Software Engineering, JCKBSE, Russia, 2004, pp. 149–158.
[7] T. Fong, I. Nourbakhsh, K. Dautenhahn, A survey of socially interactive robots, Robotics and Autonomous Systems 42 (3–4) (2003) 143–166.
[8] R.E. Axtell, The meaning of hand gestures, http://www.allsands.com/Religious/NewAge/handgestures wca gn.htm, 1990.
[9] C. Torras, Robot adaptivity, Robotics and Autonomous Systems 15 (1995) 11–23.
[10] V.I. Pavlovic, R. Sharma, T.S. Huang, Visual interpretation of hand gestures for human–computer interaction: A review, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 677–695.
[11] D.J. Sturman, D. Zeltzer, A survey of glove-based input, IEEE Computer Graphics and Applications 14 (1994) 30–39.
[12] S. Waldherr, R. Romero, S. Thrun, A gesture based interface for human–robot interaction, Journal of Autonomous Robots (2000) 151–173.
[13] D. Kortenkamp, E. Huber, P. Bonasso, Recognizing and interpreting gestures on a mobile robot, in: Proceedings of AAAI'96, 1996, pp. 915–921.
[14] M.A. Bhuiyan, V. Ampornaramveth, S. Muto, H. Ueno, On tracking of eye for human–robot interface, International Journal of Robotics and Automation 19 (1) (2004) 42–54.
[15] M. Hasanuzzaman, T. Zhang, V. Ampornaramveth, H. Gotoda, Y. Shirai, H. Ueno, Knowledge-based person-centric human–robot interaction by means of gestures, Information Technology Journal 4 (4) (2005) 496–507.
[16] T. Kanda, H. Ishiguro, M. Imai, T. Ono, K. Mase, A constructive approach for developing interactive humanoid robots, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2002, pp. 1265–1270.
[17] M. Hasanuzzaman, T. Zhang, V. Ampornaramveth, H. Ueno, Gesture-based human–robot interaction using a knowledge-based software platform, International Journal of Industrial Robot 33 (1) (2006) 37–49.
[18] M.A. Bhuiyan, V. Ampornaramveth, S. Muto, H. Ueno, Face detection and facial feature localization for human–machine interface, NII Journal 5 (1) (2003) 25–39.
[19] M. Hasanuzzaman, T. Zhang, V. Ampornaramveth, M.A. Bhuiyan, Y. Shirai, H. Ueno, Face and gesture recognition using subspace method for human–robot interaction, in: Proceedings of PCM 2004 (5th Pacific Rim Conference on Multimedia), Tokyo, Japan, in: LNCS, vol. 3331 (1), Springer-Verlag, Berlin, Heidelberg, 2004, pp. 369–376.
[20] M. Turk, A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience 3 (1) (1991) 71–86.
[21] M. Hasanuzzaman, V. Ampornaramveth, T. Zhang, M.A. Bhuiyan, Y. Shirai, H. Ueno, Real-time vision-based gesture recognition for human–robot interaction, in: Proceedings of the IEEE International Conference on Robotics and Biomimetics, ROBIO, China, 2004, pp. 379–384.
[22] D.W. Patterson, Introduction to Artificial Intelligence and Expert Systems, Prentice-Hall Inc., Englewood Cliffs, NJ, USA, 1990.
[23] S. Satoh, N. Katayama, Comparative evaluation of face sequence matching for content-based video access, in: Proceedings of the International Conference on Automatic Face and Gesture Recognition, 2000, pp. 163–168.
[24] M.H. Yang, D.J. Kriegman, N. Ahuja, Detecting faces in images: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (1) (2002) 34–58.
[25] L. Aryananda, Recognizing and remembering individuals: Online and unsupervised face recognition for humanoid robot, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, vol. 2, 2002, pp. 1202–1207.
[26] American Sign Language Browser, http://commtechlab.msu.edu/sites/aslweb/browser.htm.
Dr. Md. Hasanuzzaman received a Ph.D. degree from the Department of Informatics, National Institute of Informatics (NII), The Graduate University for Advanced Studies, Tokyo, Japan. He graduated (with Honors) in 1993 from the Department of Applied Physics and Electronics, University of Dhaka, Bangladesh, and received a Master of Science (M.Sc.) in Computer Science in 1994 from the same university. He joined the Department of Computer Science & Engineering, University of Dhaka, as a lecturer in 2000, and since April 2003 he has been serving there as an Assistant Professor. His research interests include human–robot interaction, computer vision, image processing and artificial intelligence.

Dr. T. Zhang received B.S., M.S. and Ph.D. degrees in Electrical Engineering from Tsinghua University, China, in 1993, 1995 and 1999, respectively. In 2002, he received his second Ph.D. degree in Electrical Engineering from the Department of Advanced Systems Control Engineering, Graduate School of Science and Engineering, Saga University, Japan. He was a postdoctoral researcher at the National Institute of Informatics of Japan until March 2006. Since April 2006 he has been serving as an Associate Professor in the Department of Automation, Tsinghua University, Beijing, China. His current research interests include artificial intelligence, symbiotic robotics, nonlinear system control theory, power system control and maintenance scheduling, and neural networks. He is a member of IEEJ, IEEE and RSJ.

V. Ampornaramveth received his B.Eng. degree (with honors) in Electrical Engineering from Chulalongkorn University, Thailand, in 1992, and M.Eng. and D.Eng. degrees in Control and Systems Engineering from the Tokyo Institute of Technology in 1995 and 1999, respectively. His research interests include knowledge management for symbiotic robots, control of multi-actuator robots, Web and mobile applications, and distance learning systems. He is a member of IEEE and RSJ.
H. Gotoda received B.Sc., M.Sc. and D.Sc. degrees from the Faculty of Science and the Graduate School of Science of the University of Tokyo, Japan, in 1989, 1991 and 1994, respectively. Since November 2000 he has been serving as an Associate Professor in the Human and Social Information Research Division, National Institute of Informatics, The Graduate University for Advanced Studies (Sokendai), Japan. His main area of research is the representation and recognition of 3D objects; his current research topics include image-based modeling of deformable objects and similarity of 3D geometries.

Y. Shirai received a B.Eng. degree in Mechanical Engineering from Nagoya University in 1964, and M.Eng. and D.Eng. degrees in Mechanical Engineering from the University of Tokyo in 1966 and 1969, respectively. From 1969 to 1988 he was with the Electrotechnical Laboratory, MITI, Japan, as research staff (1969–1972), senior research staff (1973–1979), Chief of the Computer Vision Section (1979–1985) and Director of the Control Section (1985–1988).
From 1971 to 1972 he was a Visiting Researcher at the Artificial Intelligence Laboratory, MIT. From 1988 to 2005 he was a Professor in the Department of Computer-Controlled Mechanical Systems, Graduate School of Engineering, Osaka University, Suita, Japan. Currently he is a professor in the Department of Human and Computer Intelligence, School of Information Science and Engineering, Ritsumeikan University, Japan. He is an active member of IEEE, RSJ, IPSJ, IEICE and AAAI, and president of JSAI. His research areas are computer vision, robotics and artificial intelligence.

Dr. H. Ueno received a B.E. in Electrical Engineering from the National Defense Academy in 1964, and M.E. and D.Eng. degrees in Electrical Engineering from Tokyo Denki University in 1968 and 1977, respectively. He is a professor at the National Institute of Informatics (NII), Tokyo, Japan, and also a professor in the Department of Informatics, The Graduate University for Advanced Studies (Sokendai), Japan. His research interests include knowledge-based systems, symbiotic robotics and distance learning for higher education. He is the originator of the International Joint Conference on Knowledge-Based Software Engineering (JCKBSE), SIG-KBSE, SIG-AI, etc. He is a member of the Engineering Academy of Japan (EAJ).