Automatic recognition of object size and shape via user-dependent measurements of the grasping hand


Int. J. Human-Computer Studies 71 (2013) 590–607 www.elsevier.com/locate/ijhcs

Radu-Daniel Vatavu, Ionuț Alexandru Zaiți
University Stefan cel Mare of Suceava, str. Universitatii nr. 13, 720229 Suceava, Romania
Received 4 July 2011; received in revised form 13 August 2012; accepted 3 January 2013. Communicated by E. Motta. Available online 24 January 2013.

Abstract

An investigation is conducted on the feasibility of using the posture of the hand during prehension in order to identify geometric properties of grasped objects such as size and shape. A recent study of Paulson et al. (2011) already demonstrated the successful use of hand posture for discriminating between several actions in an office setting. Inspired by their approach and following closely the results in motor planning and control from psychology (MacKenzie and Iberall, 1994), we adopt a more cautious and punctilious approach in order to understand the opportunities that hand posture brings for recognizing properties of target objects. We present results from an experiment designed to investigate recognition of object properties during grasping in two different conditions: object translation (involving firm grasps) and object exploration (which includes a large variety of different hand and finger configurations). We show that object size and shape can be recognized with up to 98% accuracy during translation and up to 95% and 91% accuracies during exploration by employing user-dependent training. In contrast, experiments show less accuracy (up to 60%) for user-independent training for all tested classification techniques. We also point out the variability of individual grasping postures observed during object exploration and the need for using classifiers trained with a large set of examples. The results of this work can benefit psychologists and researchers interested in human studies and motor control by providing more insights on grasping measurements, pattern recognition practitioners by reporting recognition results of new algorithms, as well as designers of interactive systems that work on gesture-based interfaces by providing them with design guidelines derived from our experiment. © 2013 Elsevier Ltd. All rights reserved.

Keywords: Hand posture; Grasping; Prehension; Object recognition; Experiment; Shape recognition; Size recognition; Measurements; Data glove; Gestures

1. Introduction

The human hand represents a remarkable instrument for grasping and manipulating objects but also for extracting useful information such as object weight, size, orientation, surface texture, and thermal properties. Hands therefore serve both executive and perceptive functions synchronously while they are employed to explore as well as to transform and change the environment. In MacKenzie and Iberall's own words: we use our hands as general purpose


devices, to pick up objects, to point, to climb, to play musical instruments, to draw and sculpt, to communicate, to touch and feel, and to explore the world (MacKenzie and Iberall, 1994, p. 4). The understanding of hand functions and of the inner workings of the fine mechanisms for planning and controlling movements at both muscular and neural (central nervous system) levels still represents a very active field of study for psychologists interested in motor control (Jones and Lederman, 2006; MacKenzie and Iberall, 1994; Wing et al., 1996). However, MacKenzie and Iberall's description of hands as instruments can prove to be extremely interesting to human-computer interaction: the hand becomes a specialized device that extracts information from the objects users are interacting with. This can be described as extrinsic-oriented exploration of the environment


in opposition to intrinsic-oriented exploration, where objects share identification information explicitly through various technologies such as RFID (Kim et al., 2007; Tanenbaum et al., 2011; Ziegler and Urbas, 2011) and Bluetooth (Farella et al., 2008; Siegemund and Flörkemeier, 2003). As the exploration is also voluntary and accomplished using the human hand, we can further refer to it as extrinsic-proprioceptive exploration, as the hand posture is the only available source of data from which information about target objects is inferred.

There are two main motivations in HCI for collecting and using hand posture information during object grasping and exploration. One of them is concerned with recognizing hand postures and designing proper interaction metaphors in order to build natural gesture-based interfaces. The goal in this case is to provide natural and intuitive ways for users to interact with computing systems by leveraging the information richness delivered by the hand pose (Baudel and Beaudouin-Lafon, 1993; Erol et al., 2007; Wachs et al., 2011). The other motivation is represented by recognizing activity patterns for context-aware applications, leading HCI developments towards ubiquitous computing. Within this direction, Paulson et al. (2011) showed that various activities in an office, such as dialing a number, holding a mug, typing at the keyboard, or handling the mouse, can be recognized using hand posture information alone. However, the application domain can be much extended and many advantages and interactive opportunities can be envisaged regarding the information obtained while the hand is grasping ambient objects.

First, the hand as a measuring instrument relieves the costly need for embedding identification technology into practically every ambient object, as would be the case for RFID tags (Ilie-Zudor et al., 2011). This is especially important when such tag installations become impractical for some scenarios in terms of cost, performance, and reliability (e.g., identification problems do occur because of tag orientation, material type, and reader collision), as well as in terms of the social acceptance of RFID technology, which is hindered by ethical concerns (Want, 2006). Second, knowledge of how objects are manipulated can be exploited for enhancing existing interactions with everyday objects: a firm grip on the phone ends the current call; a firm grip on the door knob locks the office door on the way out while a gentle grasp simply closes it without locking; grasping the remote turns on the TV, etc. Video games can also benefit from enriched interaction experiences in that players can grab any object from the real world and use it inside the game. For example, grasping a simple stick informs a sports game that the user is holding the baseball bat; a toy pistol can be used in an action game that senses the hand in the "trigger" posture; grasping a ball can make the game character grab another snow ball and throw it at the opponent in a winter game. Traditional learning games for children, such as wooden bricks that encourage hand motor development, can become highly interactive when the hand posture is retrieved and analyzed: virtual guiding tutors that know when the child grasped the


object with the right size and shape, or automated monitoring of the child's progress. All such interaction opportunities with everyday objects become viable once information can be inferred about the grasped object using the hand alone.

The idea of using hand measurements in order to identify properties of target objects has been investigated before in various forms and for various purposes outside the HCI community. For example, an early study of Buchholz et al. (1992) was concerned with proposing a model of the human hand in order to noninvasively predict postures as the hand grasps objects of different sizes. The model served to predict and evaluate the prehensile capacities of the hand when grasping ellipsoidal objects (Buchholz and Armstrong, 1992) in order to provide assistance for designers of tool handles. In the psychology line of motor control work, Santello and Soechting (1998) showed that it is possible to discriminate between concave and convex objects using the relative flexure between the index and little fingers of the grasping hand. Such previous works suggest the potential of using the hand in order to extract information about the environment and, more precisely, about the objects users are interacting with. The findings, although dispersed across specific contexts and research communities, suggest interesting opportunities for human-computer interaction. However, in order for the community to benefit from such a technique, solid evidence and analysis must be provided for researchers and practitioners to rely on when designing and evaluating their applications.

Inspired by the well-grounded results in psychology concerning the grasping hand (Jones and Lederman, 2006; MacKenzie and Iberall, 1994; Wing et al., 1996) and following the results of Paulson et al. (2011), we explore the feasibility of employing hand posture to automatically extract object properties. However, in contrast to Paulson et al. (2011), who only show that a number of distinct office activities can be recognized, we adopt a more thorough procedure. By following closely the results from motor control theory concerning the act of prehension and grasping target objects, we designed an experiment for determining whether size, shape, and size and shape together can be reliably identified for objects with basic geometries. The analysis was carried out in two different scenarios: object translation, for which stable and firm hand grasps are used, and object exploration, for which a large variety of different hand postures and finger configurations are employed. The exploration scenario was specifically introduced in order to evaluate the technique independently of the intended use of the object and therefore to better understand its feasibility. The contributions of our work include:

i. We show that object size and shape can be robustly inferred from measurements on the grasping hand, with up to 98% accuracy during object translation and 95% and 91% accuracies during object exploration in the user-dependent training scenario.

ii. We report design guidelines for the implementation of the classification algorithm, the size of the training set, and the training procedure.



iii. We introduce new tools for analyzing the variability in hand posture and for measuring the amount of shared postures when grasping different objects.

2. Related work

We are interested in previous works that come from the fields of psychology and motor control in order to understand the main results with respect to hand prehensile and grasping movements. We then use these results to inform the design of an experiment for acquiring hand measurements, as described in the next section. We are also interested in works from the pattern recognition and HCI communities that attempt to recognize hand postures and, consequently, we report results on acquisition and recognition technology. By relating to all these communities, we believe that this work could potentially benefit all of them by providing more insights on hand measurements, recognition results, as well as design guidelines for hand-based interfaces.

A vast amount of literature exists on hand motor control (Jones and Lederman, 2006; MacKenzie and Iberall, 1994; Wing et al., 1996) that provides a solid starting point for looking at the human hand as an instrument for extracting object properties. Also, this knowledge can be used to inform the design of recognition algorithms and interaction techniques for gesture-based interfaces, as motor theorists may have already (or partially) found answers to important questions:

- How does the hand adjust to objects during prehension and grasping?
- What postures do people use when exploring an object in order to learn about its properties?

2.1. The influence of object characteristics on hand posture

Objects with different geometries are grasped using different hand postures that depend on object size (Jakobson and Goodale, 1991), shape (Santello and Soechting, 1998), and intended use (Napier, 1956). Moreover, the hand shape changes during the reach-to-grasp movement in accordance with the shape, dimensions, and other properties of the object (Jones and Lederman, 2006). For example, the amplitude of the maximum grip aperture of the hand was found to correlate with the size of the target object during the transport phase (Jakobson and Goodale, 1991; Gentilucci et al., 1991), after which the hand shape comes to distinctly resemble that of the object to be grasped. The shape of the grasped object imposes constraints on the posture being used so that the applied forces coordinate and therefore prevent the object from slipping by maintaining a stable grasp (Jenmalm et al., 1998). This stable grasp is attained by positioning fingers at contact points on the object surfaces. Therefore, object size and shape will also influence both the number of fingers and their contact locations during grasping (MacKenzie and Iberall, 1994). These findings on hand adaptation to the

size and shape of target objects represent a solid base for suggesting the use of the hand as an automatic measurement device acting on the objects users are interacting with.

The intended use of the target object was also found to influence hand posture. For example, MacKenzie and Iberall (1994) note that: "when a person reaches out to grasp an object, the hand opens into some suitable shape for grasping and manipulation—suitable, in the sense that the person's understanding of the task influences the shape of the hand. As well, the hand posture adapted for drinking is dramatically different from that used to throw the mug at someone else." (p. 4). Therefore, a variety of postures are expected to be associated with a given object during manipulation. This has a huge impact on the algorithm used for recognition, suggesting that a nearest-neighbor approach as in Paulson et al. (2011) must be backed up with a properly sized set of training examples (as shown later in the Results section).

Researchers have found that people choose between different hand movement patterns when inspecting objects in order to learn about their various properties (such as surface texture, weight, temperature, etc.). These stereotypical movements were coined exploratory procedures in the study of Lederman and Klatzky (1987) and they include lateral motions for identifying texture, pressure for hardness, unsupported holding for weight, and contour following and enclosure for shape. The observation regarding the existence of such standard exploratory patterns is very interesting from an algorithmic perspective, as it informs on the postures people are likely to use when showing interest in some specific property of the target object.

Other researchers have tried to propose classificatory systems for the postures adopted by the hand while grasping (see MacKenzie and Iberall, 1994, pp. 19 and 20 for a review). Two postures occur frequently in most classifications: the power grip and the precision grip. The first one is used when the objective of the task is force (e.g., using a heavy tool such as a hammer). The posture in this case is characterized by a large contact area and little movement from the fingers, as the grasp needs to be stable. The precision grip is usually accomplished using the thumb, index, and sometimes middle fingers, so that the primary objective is precise control rather than force. The tip, palmar, and lateral pinches are examples of the precision grip (Cutkosky and Howe, 1990).

These findings represent a great starting point for understanding the hand as an instrument. We used knowledge from these works in order to inform the design of our experiment in accordance with the already established practices from psychology. Also, observations on hand grasp patterns were exploited in the Recognition and Results sections of this work in order to explore the design space of the Euclidean distance for computing the dissimilarity between hand postures.

2.2. Acquisition and recognition of hand postures

Many technologies exist today for capturing hand postures, including gloves with flexure sensors, special markers working with IR tracking systems, or video cameras. They


differ in acquisition resolution in terms of finger position accuracy and data sampling rate. For example, sensor gloves allow fine detection of finger flexure and precise hand orientation and are able to capture all the degrees of freedom exhibited by the human hand at accurate resolutions. For this reason, they are the preferred equipment for motor control experiments (Santello and Soechting, 1998; Santello et al., 2002). Also, they have been employed extensively for interactive applications (Baudel and Beaudouin-Lafon, 1993; Paulson et al., 2011; Sturman and Zeltzer, 1994) as they allow rapid prototyping, enabling HCI researchers to concentrate on interaction techniques rather than on the acquisition technology itself. Color gloves and computer vision have also been shown recently to work remarkably well (Wang and Popović, 2009). Rather than describing here the multiple options available today for acquiring measurements of the human hand for interactive purposes, we refer the reader to extensive surveys such as Erol et al. (2007) and Wachs et al. (2011).

When it comes to recognition algorithms, researchers have proposed different techniques depending on whether hand postures or gestures need to be recognized. Although advanced algorithms such as Hidden Markov Models (Chen et al., 2003) and neural networks (Tsai and Lee, 2011) are being used for recognizing dynamic hand gestures, classification of hand postures is frequently performed using simpler approaches such as the nearest-neighbor algorithm (Paulson et al., 2011; Wang and Popović, 2009). The approach consists of comparing the hand posture to be classified to previously recorded samples available as a training set. The comparison is guided by the use of a distance such as the Euclidean metric. The candidate hand posture is recognized as belonging to the class of its closest sample, or neighbor, in the training set. Although simple, this approach has been demonstrated to work very well when the training set is properly sized in order to model adequately the distribution of each posture class (a data set of around 100,000 entries was reported in Wang and Popović, 2009). Also, the nearest-neighbor classifier presents several advantages, such as flexibility and adaptability, which have encouraged its adoption for gesture recognition in the human-computer interaction community (Appert and Zhai, 2009; Li, 2010). The most notable advantage is represented by the fact that new or user-specific gestures can be added by simply providing training examples without the need to rethink the structure or retrain the classifier.

Although hand posture recognition has been shown to work with high accuracy rates, a distinction must be explicitly made between using hand posture in order to send commands (for which the number of commands is usually small and limited by the human ability to learn and recall) and the attempt to classify all distinct hand postures used for grasping. For the latter, the psychology community has expressed concerns with regard to developing classificatory systems for grasping hand postures. According to Jones and Lederman (2006), researchers have found it


challenging to develop hand function taxonomies for the purpose of predicting the hand posture during grasping by only considering function and object geometry. Such concerns must be carefully taken into consideration. First of all, from the recognition point of view, the training set needed for recognizing hand postures associated with the same object should probably contain a large number of examples, as we have already stated before. Second, the technique of inferring object properties from hand postures is very likely to have limitations which must be understood and investigated (and we show less accurate recognition results for user-independent training). Therefore, our research approach is to start investigating the technique with a thoroughly controlled experiment so that designers and practitioners of such an interactive technique would have a solid base to rely on. The challenge is to understand to what extent such a technique can be used reliably in practice. The design of our experiment, described in the next section of the paper, considers this aspect.

3. Experiment

3.1. Premises for experiment design

Findings from motor control theory show that the hand posture changes according to the size (Jakobson and Goodale, 1991) and the shape (Santello and Soechting, 1998) of the grasped object. However, variations in posture are also determined by the intended use of the object (Napier, 1956), which makes it difficult to create complete taxonomies of the grasping hand (Jones and Lederman, 2006) based on object geometry alone. However, even within this limitation, we hypothesize that object size and shape could still be robustly recognized if a large sample of data were available. Even if current hand taxonomies cannot be complete from this point of view (Jones and Lederman, 2006), we nonetheless rely on such classificatory systems as they represent the most reliable sources for informing the design of our experiment. Among all the existing taxonomies (see MacKenzie and Iberall, 1994, pp. 19 and 20 for a review), it is worth noting a very early and simple yet effective classification of prehensile postures into just six categories from Schlesinger, in a work from 1919 cited and described by MacKenzie and Iberall (1994, pp. 17 and 18). This minimal set was constructed by considering the specific functionality needed for grasping or holding various objects and is composed of the cylindrical, spherical, tip, hook, palmar, and lateral postures. As they are important to our experiment design, we briefly describe each of them below (see also Fig. 1):

1. Cylindrical: the open fist grip used for grasping tools, or a closed fist for thin objects.
2. Spherical: the fingers are spread and the palm arched while adapting to grasp spherical objects such as a ball.
3. Tip: fingers grasp sharp and small objects, such as a needle or pen.
4. Hook: used for heavy loads such as suitcases.



Fig. 1. A very early (1919) and simple yet effective classification of prehensile postures into six categories from Schlesinger (MacKenzie and Iberall, 1994, pp. 17 and 18): (1) cylindrical, (2) spherical, (3) tip, (4) hook, (5) palmar, and (6) lateral hand postures.

Fig. 2. Objects used for the experiment, consisting of six basic shapes (cube, parallelepiped, cylinder, sphere, pyramid, and surface) and three different sizes (small, medium, and large). All objects were made out of light wood.

5. Palmar: used for flat and thick objects.
6. Lateral: the thumb is primarily used in order to grasp thin and flat objects such as a piece of paper.

Schlesinger's classification incorporates three critical notions: object shape (cylindrical, spherical), hand surface (tip, palmar, lateral), and hand shape (hook, closed fist, open fist). Therefore, by considering basic object geometries in accordance with these criteria, an informed investigation on the feasibility of using hand posture for object or action recognition can be reliably conducted. As a side discussion, a problem with this simple classification is that it does not take into account the actual task for which objects are being used, as suggested by Napier (1956). For example, the postures employed to pick up a pen and to write with it are different. Napier therefore argued that the most important influence on the chosen posture is the goal of the task, although the influence on the posture comes from various sources (object shape, size, surface, etc.). However, for the purpose of this study we are interested in extracting object properties such as size

and shape rather than inferring the intended use. We believe it is more rigorous to start with simpler research questions which, once answered, form the basis for future work on the uses and applications of this technique. We therefore selected six basic shapes for our experiment (cube, parallelepiped, pyramid, sphere, cylinder, and surface) as informed by the classification of prehensile postures of Schlesinger (MacKenzie and Iberall, 1994, pp. 17 and 18). For each shape, three different sizes were considered and labeled for our reference as small, medium, and large; the actual size references were 2, 4, and 8 cm, applied to each object in accordance with its specific geometry. The geometries of the 18 objects used for the experiment are illustrated in Fig. 2.

3.2. Participants

Twelve right-handed participants volunteered to take part in the experiment (ages between 18 and 24). None of them had worked with a data glove before. All subjects were naive as to the goals of the experiment.


3.3. Apparatus

Hand postures were captured using the 5DT Data Glove 14 Ultra (http://www.5DT.com) equipped with 14 optical sensors. The glove measures finger flexure (two sensors per finger, placed at the first and second joints) as well as the abduction between the fingers (four sensors); see Fig. 8b and the 5DT glove user's manual (http://www.5dt.com/downloads/dataglove/ultra/5DTDataGloveUltraManualv1.3.pdf). The glove was connected to a desktop PC via USB. Data from the glove were calibrated for each user, with each sensor measurement being stored as a real value in [0..1] using 12 bits of precision. The material of the glove is stretch lycra so that it fits the user's hand and hence reduces the discomfort of wearing the equipment. Data were captured at a frequency of 60 Hz.

3.4. Tasks and design

The experiment consisted of two tasks for which hand postures were captured. For the first task we were interested in hand postures employed during object translation, while for the second we captured postures used during object exploration. The tasks are different with respect to the variation in hand posture that can be collected. For the translation task, the objects are likely to be held firmly in a stable grip while they are moved from one location to another, while for the second task a large variety of postures will presumably be employed while exploring the object. We therefore expect small variation for the first task and large hand posture variations for the second. The motivation for using the two tasks was to acquire a sufficient amount of samples that would be representative of grasping and manipulating a given object, as well as to compare the feasibility of the technique when either firm grasps or exploratory postures are being used.

3.4.1. Task 1: object translation

Participants stood in front of a table on which all the objects were placed at an easy arm reach distance. The initial location of the objects was on the participants' right side of the table. Participants were asked to grab the objects placed on their right and move them to the left side. Only the dominant hand (the right hand for all subjects) had to be used. The order in which objects had to be moved was randomized by a software application which displayed a text description concerning the current trial, e.g. "Move the small cylinder to your left". The experimenter was present during all the trials in order to record the grasping motion: the recording started with a key press of the experimenter at the moment when the participant first touched the object and ended (again with a key press) when the participant released the object. The objective of the task was to capture hand postures used when grabbing and maintaining a stable grip posture for reliable


translation of objects. The task took approximately 5 min to perform.

3.4.2. Task 2: object exploration

For the second task, participants were asked to pick up each object and perform a series of explorations on it. Again, only the dominant hand had to be used. The exploration consisted of performing a predetermined sequence of rotations on each object, which would require fingers to be used frequently. The objective of the task was to capture as many different postures as possible that can be associated with a given object of a specified shape and size. The order in which objects were manipulated was randomized by the software application, as was the sequence of rotations that had to be performed. Prior to the experiment, six different labels representing the numbers from 1 to 6 were glued on each object. For each trial, the application generated the rotations by randomly choosing a sequence of numbers that was displayed to the participant, e.g. "Grab the medium pyramid and search for the sequence 2-5-4-2-1-3-5-1". Participants had to search for each number in the sequence in order to complete the trial. Evidently, the sequence of numbers itself was not important: the main advantage of using it was that users were given the freedom to explore the object themselves (by looking for the next number in the sequence). The experimenter ensured that participants performed the entire sequence before moving to the next trial. This second task was used in order to capture multiple measurements of the postures that each participant chose to perform with a specific object. Participants were given the freedom of exploring the object with no instruction other than using one hand only. Postures were captured starting with the moment when the participant first touched the object until the object was released (as in the first task, the two events were explicitly marked by the experimenter with a key press). The task took around 15 min to complete.

3.5. Recognition and analysis apparatus

We describe below the definitions employed during our analysis. We introduce the definition of a hand posture, the distance between two postures, and the main algorithm employed for classifying a new posture. A hand posture p is represented as a 14-feature vector in which each feature p_i encodes either the flexure or the abduction distance measured by the glove sensors in the interval [0..1], with 1 denoting maximum flexure/abduction (see the previous section for the technical description of the glove):

p = (p_1, p_2, \ldots, p_{14}) \in \mathbb{R}^{14}    (1)


This representation can be used directly in order to compute the dissimilarity between two hand postures in the form of the Euclidean distance:

d(p, q) = \|p - q\| = \Big( \sum_{i=1}^{14} (p_i - q_i)^2 \Big)^{1/2}    (2)

A small distance between two posture measurements indicates a strong similarity. The maximum value for the Euclidean distance between two postures having 14 normalized features would be \big( \sum_{i=1}^{14} (1-0)^2 \big)^{1/2} = 14^{1/2} \approx 3.74.

Classification of a new hand posture p is computed using the nearest-neighbor rule (Paulson et al., 2011; Wang and Popović, 2009). If T = \{p_j \mid j = 1, \ldots, |T|\} represents a set of training samples, p will be classified to the class of p_k for which

k = \arg\min_{j = 1, \ldots, |T|} \{ d(p, p_j) \}    (3)
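A minimal Python sketch of the representation and nearest-neighbor rule in Eqs. (1)-(3), assuming postures are stored as 14-element arrays of normalized sensor values; the function names are illustrative and not part of the original implementation:

    import numpy as np

    def euclidean_distance(p, q):
        # Dissimilarity between two 14-feature postures, Eq. (2)
        return float(np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)))

    def classify_nn(posture, training_set):
        # Nearest-neighbor rule, Eq. (3): training_set is a list of
        # (posture_vector, label) pairs; the label of the closest sample wins.
        best_label, best_distance = None, float("inf")
        for sample, label in training_set:
            d = euclidean_distance(posture, sample)
            if d < best_distance:
                best_label, best_distance = label, d
        return best_label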

4. Results

We try to understand whether hand posture can be reliably used in order to discriminate between grasped objects that exhibit fine shape and size differences. Also, we analyze whether size and shape can still be inferred during an exploration task for which multiple finger configurations are likely to be used. The problem is subtle and was recognized before as being difficult because of the large variability of the grasping hand (Jones and Lederman, 2006): "because the hand usually adopts different grasps during the performance of a task, it has been difficult to develop taxonomies of hand function that can predict hand grasps from a specification of task requirements and object geometry" (p. 140). However, previous work (Wang and Popović, 2009; Wang et al., 2011) showed that nearest-neighbor classifiers work particularly well for recognizing hand postures when employing a large set of training samples, which gives hope for proposing a robust solution to the problem.

4.1. Data set

A data set of approximately 44,000 hand postures was collected from the 12 participants during the first task (object translation), with an average count of 200 postures per object (SD = 77), which is equivalent to firmly holding each object for approximately 3-4 s. For the second task, approximately 200,000 postures were collected, with an average of 925 postures per object (SD = 295), which is equivalent to approximately 15 s (SD = 4.9) of exploration time per object and participant.

4.2. Hand posture variation

We start by measuring the amount of variation in hand posture in relation to the variability in grasping objects observed in motor control theory (Jones and Lederman, 2006). For the first task, participants used a stable hand posture in order to hold the grasped object firmly while moving it to a new location. Therefore, the data acquired for each participant and each object should represent roughly the same posture and we expect a small amount of variation. However, for the second task, which involved object exploration, we expect that variation in hand posture will increase significantly because of the different finger configurations employed.

Let A = \{a_1, a_2, \ldots, a_{|A|}\} be the set of all hand postures acquired for object A for a given participant during either task 1 or 2. We define the within-object posture variation as follows:

w(A) = \frac{1}{|A|} \sum_{i=1}^{|A|} \| a_i - \bar{a} \|^2    (4)

where \bar{a} represents the average posture for set A and \| \cdot \| the Euclidean distance between two postures defined in the previous section. The average posture \bar{a} of the set is obtained by averaging the values of each sensor over all postures in the set:

\bar{a} = \Big( \frac{1}{|A|} \sum_{i=1}^{|A|} a_{i,1}, \; \frac{1}{|A|} \sum_{i=1}^{|A|} a_{i,2}, \; \ldots, \; \frac{1}{|A|} \sum_{i=1}^{|A|} a_{i,14} \Big)    (5)

where each posture a_i is described by 14 measurements a_i = (a_{i,1}, a_{i,2}, \ldots, a_{i,14}) as provided by the 5DT data glove. The within-object variation w(A) measures how much each individual hand posture deviates from the center of the set with respect to the Euclidean distance. If hand postures are roughly the same, then we expect to find a small amount of variation.

We computed the values of the variation for both the translation and exploration tasks. As expected, variation was significantly larger for the exploration task (mean 0.20, SD = 0.11) than for the translation task (mean 0.10, SD = 0.08), as shown by a Wilcoxon signed-rank test (z = 10.084, p < 0.001, r = 0.49). Interestingly, there was a significant effect of object size on variation, as indicated by a Friedman test for both translation (χ²(2) = 6.778, p < 0.05) and exploration (χ²(2) = 57.000, p < 0.001). Fig. 3 illustrates the variation values for the two tasks.
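A minimal Python sketch of the within-object variation measure in Eqs. (4) and (5), assuming the postures of one object are collected in a |A| x 14 array (names are illustrative):

    import numpy as np

    def within_object_variation(postures):
        # postures: (|A| x 14) array of glove measurements for one object
        A = np.asarray(postures, dtype=float)
        mean_posture = A.mean(axis=0)                         # average posture, Eq. (5)
        deviations = np.sum((A - mean_posture) ** 2, axis=1)  # squared Euclidean distances
        return float(deviations.mean())                       # within-object variation, Eq. (4)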

Fig. 3. Within-object variation in hand postures vs. object size. Error bars represent 95% CI.


Fig. 4. Within-object variation in hand postures vs. object shape. Error bars represent 95% CI.

The findings show that the larger the objects are, the more options there are for grabbing and holding them in a stable posture, both during translation and especially during exploration. A significant effect of object shape was also found for both translation (χ²(5) = 22.984, p < 0.001) and exploration (χ²(5) = 43.095, p < 0.001). Fig. 4 illustrates the variation measured for each shape.

Post-hoc tests revealed further insight on hand posture variation. Wilcoxon signed-rank tests (Bonferroni corrected) showed no significant differences between small and medium or medium and large sizes for the translation task. Also, nonsignificant differences were found between parallelepiped and cylinder or between cube and sphere (expected due to similar geometries). The differences were, however, significant for the exploration task.

4.3. Shape and size recognition accuracy

The measurements on variation confirmed the observations from motor control theory that grasping different sizes and shapes involves different hand postures. Also, different amounts of variation were found for the translation and exploration tasks. Under these conditions, we investigate in the following whether hand posture information is robust enough to discriminate reliably between the sizes and shapes of the objects from our set.

In order to estimate recognition accuracy for object shape and size, the following approach was adopted. The hand posture data set, representing a continuous recording for each participant, object, and task type (translation or exploration), was divided into training and testing. Specifically, a fixed time window of w postures was randomly extracted from the recorded timeline while all the remaining postures were used for training (see Fig. 5). Recognizers employed the information contained in the testing set of w postures in different ways (see below) in order to deliver a classification decision on the object shape and size. In order to estimate recognition accuracy, we repeated the window extraction procedure 100 times for each object and each participant and counted how many times each recognizer was correct:

\text{Recognition accuracy} = \frac{\text{number of correct classifications}}{100 \text{ random trials}} \times 100\%    (6)
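The window-based evaluation protocol can be sketched in Python as follows; this is a simplified illustration, not the original code, and it assumes one recorded timeline per object with the postures of all other objects available in full for training:

    import random

    def split_window(timeline, w=30):
        # Randomly extract a contiguous window of w postures for testing (Fig. 5);
        # the remaining postures of the same timeline are returned for training.
        start = random.randrange(len(timeline) - w + 1)
        return timeline[:start] + timeline[start + w:], timeline[start:start + w]

    def estimate_accuracy(timelines, classify, w=30, trials=100):
        # timelines: dict mapping an object label to its recorded posture timeline
        # classify(window, training_set): returns a predicted label for the window
        correct, total = 0, 0
        for label, timeline in timelines.items():
            for _ in range(trials):
                training, testing = split_window(timeline, w)
                training_set = [(p, label) for p in training]
                for other_label, other_timeline in timelines.items():
                    if other_label != label:
                        training_set += [(p, other_label) for p in other_timeline]
                correct += int(classify(testing, training_set) == label)
                total += 1
        return 100.0 * correct / total   # recognition accuracy, Eq. (6)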


This time window testing procedure corresponds exactly to how a recognition system would work in practice: posture data come as a continuous stream, with the recognizer buffering a number of w consecutive hand postures for which a classification decision needs to be delivered. When considering the high frequency response of today's data gloves (e.g., our 5DT glove delivers 60 measurements per second), the size of the window can be relatively large (e.g., up to 30-60 postures) for the classifier to take an informed decision by using more data. We performed trial tests with w = 15, 30, and 60 postures for the window size, which correspond to 0.25, 0.5, and 1 s of continuous recording, respectively, and found similar recognition results. Therefore, in the following we only report results for w = 30 postures, which is equivalent to outputting a classification response every half second in a real-time working system.

We start our recognition analysis from the results of Paulson et al. (2011), who tested different classifiers and feature spaces and found that nearest-neighbor classifiers working with either raw or principal component analysis features deliver the best recognition performance for the office tasks scenario. We therefore selected the nearest-neighbor approach for our own analysis as well. The technique of Paulson et al. (2011) was to average all the postures recorded while performing a task in the office (e.g., drinking from a cup) and use the averaged result as the training prototype for each class. For our problem, this led to all the training hand postures recorded during translation and exploration being averaged into a single representative posture. When we tried this approach we obtained recognition rates for shape and size (recognition rates were computed for each user and averaged values are reported) of 64.3% (SD = 12.5%) for the first task and 42.9% (SD = 12.0%) for the second task, which are considerably lower than the results reported by Paulson et al. (2011) for recognizing the 12 office activities. However, this result is somewhat expected, as we work with objects between which much finer differences exist, such as a cube and a sphere. Also, our data for the second task come from object exploration and therefore exhibit greater variation. Averaging out all these data practically ignores object specificity and, as shown by the results, does not capture accurately the structure of each class in the sample space. This also shows that, when going beyond rather distinct activities that only require firm postures, the particular problems identified by motor theorists with respect to constructing classification taxonomies for hand postures (Jones and Lederman, 2006) start to reveal themselves to pattern recognition researchers as well.

As the averaging technique did not output satisfactory results, we investigated additional enhancements for the nearest-neighbor classifier that would take into account all the measured variance in our data. Therefore, we tried multiple classifiers, which we describe below.



Fig. 5. The hand posture data set for a given participant, object, and task type is divided into training and testing. A randomly selected time window of size w makes up the testing set while all the remaining postures go to training.

We named the classifiers using the convention TRAINING-TESTING-TECHNIQUE, where TRAINING and TESTING reflect how the training and testing sets are used (e.g., whether we average all the postures as in Paulson et al. (2011) - AVG - or rather use every posture individually - RAW), while TECHNIQUE represents either the nearest-neighbor (NN) or the K-nearest-neighbor (KNN) classification approach. For example, the RAW-AVG-NN classifier employs the nearest-neighbor technique to classify a candidate hand posture obtained by averaging all the postures from the testing set (AVG) and compares the average posture against all the individual (RAW) data stored in the training set. The different versions of the classifiers are described below:

1. AVG-RAW-NN classifier: Training postures are averaged into a single posture for each object and task type, which serves as a prototype or representative pattern. A single posture is randomly selected from the testing set (the time window of size w) and classified by comparing it via the Euclidean distance against the prototypes of all objects in accordance with the nearest-neighbor technique. The random selection of the training/testing sets and the random selection of the tested posture are repeated 100 times in order to estimate the recognition accuracy of the technique as per Eq. (6).

2. AVG-RAW-KNN classifier: The training set is averaged while raw testing postures are used, as in the previous AVG-RAW-NN approach. However, this classifier uses the entire testing set (rather than one single posture) in order to deliver the classification result. The reported result corresponds to the most frequently detected class obtained by classifying each individual posture in the entire testing window of size w. This approach uses the principle of K-nearest-neighbor classification in order to return a more informed decision. To make a simple comparison with the previous technique, the AVG-RAW-KNN classifier holds out its decision for w = 30 postures rather than classifying and reporting the result for each new posture the data glove is streaming. Again, the random selection of the training/testing sets is repeated 100 times.

3. RAW-AVG-NN classifier: All the postures are used for training (raw data) but we average the postures from the testing set into a single candidate that is submitted to classification. The averaged posture is compared against all the samples from the training set using the nearest-neighbor approach. The idea behind this approach is that the nearest-neighbor classifier will operate better as the sample space is more populated (i.e., the class distributions of each object type are more dense). This acts as a brute force implementation.

4. RAW-RAW-NN classifier: All the postures are used for both training and testing without computing any averages. One single posture from the testing set is randomly selected and submitted to classification. The process is repeated 100 times as in the previous approaches.

5. RAW-RAW-KNN classifier: All the postures are used for both training and testing, but we output the most frequently detected class for the entire time window w (see the sketch after this list). This combines the brute force of RAW-RAW-NN with the group weighted decision of KNN. Training and testing sets are randomly selected 100 times in order to estimate recognition accuracy.

6. AVG-AVG-NN classifier: This is the classification procedure of Paulson et al. (2011) for which we already reported classification results. We include it here for completeness purposes.
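A possible Python sketch of the RAW-RAW-KNN variant referenced in item 5, reusing the classify_nn helper sketched earlier (illustrative only, not the original implementation):

    from collections import Counter

    def classify_raw_raw_knn(window, training_set):
        # Classify every posture in the w-sample testing window with the
        # nearest-neighbor rule against all raw training postures, then
        # report the most frequently detected class across the window.
        votes = Counter(classify_nn(posture, training_set) for posture in window)
        return votes.most_common(1)[0][0]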

Besides these classifiers, which are directly derived from the nearest-neighbor approach, we also tested two state-of-the-art machine learning techniques that have been employed before for recognizing static hand postures (Kelly et al., 2010; Pizzolato et al., 2010; Rashid et al., 2010).

7. Multilayer perceptron (MLP): Two MLP architectures were used for the size and shape classification problems. Pretests were conducted in order to determine the optimal number of neurons in the hidden layer that would deliver the best classification result. The final


MLP architectures for which we report results are 14 × 70 × 3 neurons for the size classification problem and 14 × 70 × 6 neurons for the shape problem. These architectures correspond to an input layer of 14 sensors, a hidden layer of 70 neurons, and a number of output neurons depending on the number of distinct classes (3 for the size and 6 for the shape classification problem).

8. Multiclass support vector machine (SVM): We used a single SVM architecture with a linear kernel, with imperfect separation of classes and a C = 1 multiplier for outliers. The model was trained using a k-fold cross-validation technique and has one output coding the class to which the object belongs (3 sizes × 6 shapes). A k factor of 20 was found during pretests to deliver the best results (see the configuration sketch below).
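For reference, comparable baseline classifiers can be configured with scikit-learn using the parameters reported above; this is a sketch under the assumption of a feature matrix X of glove readings and a label vector y, not the authors' original implementation:

    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    def train_baselines(X, y):
        # X: (n_samples, 14) array of glove readings; y: shape or size labels
        mlp = MLPClassifier(hidden_layer_sizes=(70,), max_iter=1000).fit(X, y)  # 14-70-k topology
        svm = SVC(kernel="linear", C=1.0)                                       # linear kernel, C = 1
        svm_scores = cross_val_score(svm, X, y, cv=20)                          # 20-fold cross-validation
        return mlp, svm_scores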

Results are displayed in Figs. 6 and 7 for both translation and exploration tasks. The highest recognition rates were achieved when combining the classification results of individual raw postures across the entire time window


(RAW-RAW-KNN). For the translation task, recognition rates were above 98% for both shape and size. For the exploration task, shape was recognized with 91% accuracy, size with 95.1%, while shape and size together were recognized with 90.1%. The Friedman test showed a significant difference between the recognition rates reported by the different techniques for both the translation (χ²(7) = 5635.862, p < 0.001) and exploration tasks (χ²(7) = 5956.323, p < 0.001).

4.4. Recognizer design and findings from motor control theory

The large corpus of hand posture data that was acquired allowed us to verify existing observations from psychology on the independence of finger movements and on preferred patterns in grasping objects, as well as to test whether these observations could help to increase the performance of our posture recognizers.

Fig. 6. Recognition rates for object shape for different techniques based on the nearest-neighbor approach. Error bars represent 95% CI.

Fig. 7. Recognition rates for object size for different techniques based on the nearest-neighbor approach. Error bars represent 95% CI.



We briefly summarize such findings in this section, report results on our own data, and devise new variants for the RAW-RAW-KNN recognizer as informed by these observations.

Although each finger taken alone can perform a wide range of motions, hand muscles act over many joints, which makes some fingers hard or impossible to control independently for some specific motions. This is referred to as co-activation (Schieber and Santello, 2004). The degree of independence of fingers in motor tasks has been quantified by Häger-Ross and Schieber (2000), who reported the thumb and index fingers as the most independent and the ring and middle fingers as the least independent. Correlations occur both for finger flexure and for abduction between fingers, as noted in the practical task of typing characters (Fish and Soechting, 1992).

We start by reporting results on correlation analysis for finger movements in order to detect shared variance between sensor output values. Fig. 8(a and b) illustrates the color-coded Pearson correlation coefficients computed between all 14 sensors on the entire data (N = 43,859 values for the translation task and N = 199,569 for the exploration task). About 96% of all coefficients (87 out of (13 × 14)/2 = 91) were significant at p = 0.01 (2-tailed). The sensor types and locations are shown in Fig. 8(d) for convenience. Average correlations were 0.20 (SD = 0.15) for the translation task and 0.14 (SD = 0.16) for the exploration task. Fig. 8(c) illustrates the distribution of these coefficients, with the majority being less than 0.20 but also showing that 12% of the sensor pairings are highly correlated (between 0.40 and 0.80). The largest correlations were found between sensors 7 and 10 (r = 0.73 for translation, r = 0.66 for exploration), 10 and 13 (r = 0.61 and r = 0.75), 11 and 14 (r = 0.61 and r = 0.55), and 8 and 11 (r = 0.43 and r = 0.46). These results confirm observations from motor control such as the ring and middle fingers being the least independent (Häger-Ross and Schieber, 2000): sensors 10 and 11 measure the ring finger and sensors 7 and 8 the middle finger.

These findings can be used to inform a better design of the distance reporting the dissimilarity between hand postures, one that takes into account shared variance between different sensors. The simplest way to achieve this is to adopt a weighting scheme, which makes the Euclidean distance become:

d(p, q) = \|p - q\| = \Big( \sum_{i=1}^{14} w_i \cdot (p_i - q_i)^2 \Big)^{1/2}    (7)

where w_i \in [0..1] represents the normalized weight controlling the contribution of sensor i to the final result. For each sensor we computed the average correlation with all the other 13 sensors (see Fig. 8(a and b)) and defined the weight for sensor i as the complement of the averaged correlation value with respect to 1.0. The resulting weights led to two new variants of the RAW-RAW-KNN classifier, as follows:

9. Translation-weighted RAW-RAW-KNN: The RAW-RAW-KNN technique is used together with the Euclidean distance employing weights derived by averaging the translation correlation results (see Fig. 8a):
w = {0.84, 0.85, 0.81, 0.80, 0.83, 0.78, 0.72, 0.78, 0.85, 0.73, 0.78, 0.79, 0.83, 0.85}

10. Exploration-weighted RAW-RAW-KNN: The RAW-RAW-KNN technique is used together with the Euclidean distance employing weights derived by averaging the exploration correlation results (see Fig. 8b):
w = {0.93, 0.93, 0.93, 0.82, 0.90, 0.88, 0.82, 0.88, 0.84, 0.78, 0.84, 0.81, 0.80, 0.84}

We continue our investigation and look further at studies that report on frequently used grasping patterns. For example, it was observed that people tend to use a three-finger grasp when reaching for most objects (the "tripod" grasp), while a pinch grip is sometimes employed for small objects (Jones and Lederman, 2006). Also, when the size or mass increases, four fingers are used to pick up the object (Cesari and Newell, 2000). Other researchers noted that the thumb position does not depend on the size of the object being grasped, which seems to influence mostly the position of the other two digits (index and middle fingers) (Gentilucci et al., 2003). The relative contribution of each individual finger force during grasping was studied by Kinoshita et al. (1995), who reported percentages of 42%, 27%, 18%, and 13% for the index, middle, ring, and little fingers, respectively, for five-finger grips. Informed by these observations on grasping patterns (Cesari and Newell, 2000; Gentilucci et al., 2003; Jones and Lederman, 2006; Kinoshita et al., 1995), we considered three new variants for the RAW-RAW-KNN classifier, as follows:

11. 3-finger-weighted RAW-RAW-KNN: The Euclidean distance is weighted with 0/1 values in order to focus solely on three-finger grasps:
w = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0}

12. 4-finger-weighted RAW-RAW-KNN: The Euclidean distance is weighted with 0/1 values in order to focus solely on four-finger grasps:
w = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0}

13. Force-weighted RAW-RAW-KNN: The Euclidean distance is weighted with force percentages as informed by the study of Kinoshita et al. (1995):
w = {0.0, 0.0, 0.0, 0.42, 0.42, 0.0, 0.27, 0.27, 0.0, 0.18, 0.18, 0.0, 0.13, 0.13}
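A short Python sketch of the weighted distance in Eq. (7); any of the weight vectors from variants 9-13 above plugs in the same way, and the three-finger weights are shown only as an example (names are illustrative):

    import numpy as np

    # 0/1 weights focusing on three-finger grasps (variant 11); the
    # correlation-derived weights of variants 9-10 plug in identically.
    THREE_FINGER_WEIGHTS = np.array(
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

    def weighted_euclidean_distance(p, q, w=THREE_FINGER_WEIGHTS):
        # Weighted dissimilarity between two 14-feature postures, Eq. (7)
        diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
        return float(np.sqrt(np.sum(w * diff ** 2)))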



Fig. 8. Correlation analysis between the values reported by sensors installed at various locations: (a) Pearson correlation coefficients computed on the translation task data (N = 43,859; 96% of coefficients were significant at p = 0.01). Darker colors show larger absolute values; (b) Pearson correlation coefficients computed on the exploration task data (N = 199,569; 97% of coefficients were significant at p = 0.01); (c) frequency histogram of individual correlation values showing that 12% of sensor pairs are highly correlated, between 0.40 and 0.80; and (d) actual location of the 14 sensors on the glove: 10 sensors measure finger flexion (shown in white) and four sensors measure abduction between fingers (shown in black). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The recognition rates for object shape and size for the new classifiers are displayed in Figs. 9 and 10, respectively. The performance of the unweighted RAW-RAW-KNN technique is shown for comparison convenience. The Friedman test showed a significant difference between the recognition rates reported by the different techniques for both the translation (χ²(5) = 351.851, p < 0.001) and exploration tasks (χ²(5) = 1098.048, p < 0.001). When analyzing size recognition accuracies, post-hoc Wilcoxon signed-rank tests showed nonsignificant differences in the translation task between the unweighted and four-finger grasp techniques (Z = 0.991, n.s.) and between the unweighted and correlation-weighted techniques (Z = 0.964, n.s.), and a significant but small-effect difference between the unweighted and three-finger grasp techniques (Z = 9.563, p < 0.001, Cohen's r = 0.20). For the exploration task, Wilcoxon tests revealed a nonsignificant difference between unweighted and correlation-weighted (Z = 0.739, n.s.) and a significant yet small-effect difference between unweighted and four-finger grasp (Z = 8.299, p < 0.001, r = 0.17). When analyzing shape recognition performance, tests showed nonsignificant differences between unweighted and four-finger grasp (Z = 0.790, p < 0.001) for the translation task and nonsignificant differences between unweighted and correlation-weighted for both the translation and exploration tasks. These results show little influence of the ring and little fingers on the classification



Fig. 9. Recognition rates for object shape using the RAW–RAW–KNN technique and weighted Euclidean distances. Error bars represent 95% CI.

Fig. 10. Recognition rates for object size using the RAW–RAW–KNN technique and weighted Euclidean distances. Error bars represent 95% CI.

result for the size problem, while these fingers bring useful information for recognizing shape. In the end, the unweighted RAW-RAW-KNN technique using all data seems to deliver the best performance with low implementation overhead.

4.5. Improving real-time classification performance

The previous sections identified RAW-RAW-KNN as the technique delivering the best classification performance by employing raw training and testing data. However, the size of these sets can be quite large in practice, which may impact the real-time performance of the system. Fig. 11 illustrates the response time needed for each technique to produce a classification result. Times were measured on a 2.40 GHz Intel CoreDuo Quad CPU for an average training set size of approximately 3000 postures for the translation task and 16,000 postures for the exploration task. Both RAW-RAW classification techniques show considerably larger execution times due to their use of each individual sample from the training and testing sets. Even though classifications are reported in 0.63 s, which can fulfill at the limit the requirements of a real-time working system, we are interested in the following in using a smaller training set for the same classification task

(with the purpose of reducing memory storage of samples and execution time for classification). Therefore, we explore filtering the time line of postures associated to a given object by eliminating postures that are too similar with respect to some threshold E. Procedure FILTER-TRAINING-SET(: :) lists the pseudo code describing this process. Algorithm 1. FILTER-TRAINING-SET (trainingSet). 1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

reference’trainingSet0 filteredSet’freferenceg for i’1 to sizeof(trainingSet) do distance’Euclidean-Distanceðreference; trainingSeti Þ if distance 4 E then reference’trainingSeti filteredSet’filteredSet [ freferenceg

end if end for return filteredSet
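As an illustration, a minimal Python sketch of this filtering step might look as follows; the sensor vectors are assumed to be lists of floats (14 values per posture), and the function and variable names are ours, not the authors'.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two hand-posture sensor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filter_training_set(training_set, epsilon=0.1):
    """Keep a posture only when it differs from the last kept posture
    by more than epsilon, as in FILTER-TRAINING-SET above."""
    reference = training_set[0]
    filtered = [reference]
    for posture in training_set[1:]:
        if euclidean_distance(reference, posture) > epsilon:
            reference = posture
            filtered.append(posture)
    return filtered
```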

The variation analysis for the first task from a previous subsection shows a mean value of 0.10, which represents the average squared Euclidean distance. This suggests a



Fig. 11. Classification times for different techniques measured on a 2.40 GHz Intel CoreDuo Quad CPU.

filtering value of ε = √0.10 ≈ 0.32 to use between consecutive postures. However, in order to test the effect of filtering, we experimented with various thresholds running from 0.1 to 0.5 in increments of 0.1. We only provide results for the RAW–RAW–KNN approach, which showed the best performance in the previous test. Figs. 12 and 13 illustrate the effect of filtering on both recognition rate and training set size. As expected, recognition rates become lower as the size of the training set decreases. The Friedman ANOVA test showed significant differences for size and shape recognition rates for both tasks (at p < 0.001). However, recognition rates are still high enough for practical purposes with low ε filters. For the translation task, the size recognition rate is above 97% and the shape rate above 95% for ε = 0.1 and 0.2 (which considerably reduces the number of training postures from 3000 to 268 and 131, respectively). Also, the size rate is above 93% and the shape rate above 90% for the exploration task using ε = 0.1 (corresponding to a reduction in training set size from 16,000 to 4000 hand postures). Filtering beyond ε > 0.2 accentuates the degradation of the recognition rate. When compared to the unfiltered training set, the ε = 0.1 option seems a good compromise. The classification time for the translation task dropped from 118 ms for the unfiltered set to 11 ms for the ε = 0.1 filtered set (see Fig. 14). For the exploration task, classification time was reduced from 629 ms to 160 ms. These results, even for the unfiltered training set, show the applicability of the classification technique for real-time scenarios.
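For concreteness, the following Python sketch shows one way a filtered training set could be used for real-time classification, with per-frame nearest-neighbor decisions combined over a sliding window of w = 30 frames (about half a second at 60 Hz, as noted later in the design guidelines). The per-frame 1-NN and simple majority vote are our assumptions about the RAW–RAW–KNN combination, which the text does not spell out here; all names are illustrative.

```python
import math
from collections import Counter, deque

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_label(posture, train_postures, train_labels):
    """Label of the closest training posture (1-NN on raw sensor vectors)."""
    best = min(range(len(train_postures)),
               key=lambda i: euclidean_distance(posture, train_postures[i]))
    return train_labels[best]

def classify_stream(stream, train_postures, train_labels, w=30):
    """Classify a live stream of postures: per-frame 1-NN decisions are
    combined by majority vote over a sliding window of w frames."""
    window = deque(maxlen=w)
    results = []
    for posture in stream:
        window.append(nearest_label(posture, train_postures, train_labels))
        results.append(Counter(window).most_common(1)[0][0])
    return results
```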

Fig. 12. The effect of filtering on recognition rate. The 95% CI error bars are too small to display.

4.6. User-independent recognition results

Fig. 13. The effect of filtering on the size of the training set. The 95% CI error bars are too small to display.

The above recognition results were reported for user-dependent training, for which both testing and training data come exclusively from the same user. However, interactive systems would greatly benefit from user-independent scenarios, which eliminate the need for training before using the system. We therefore performed another recognition test in order to see how the user-independent

training scenario would work for this technique. We used the RAW–RAW–KNN enhancement of the nearest-neighbor algorithm running on training sets filtered with ε = 0.1. In this scenario, hand postures from one participant were used for testing while those of the remaining participants were used for training, and the test was repeated with each participant's data acting as the testing set.



Fig. 14. The effect of filtering on classification times measured on a 2.40 GHz Intel CoreDuo Quad CPU.

However, the size recognition rate for the translation task was only 57.0% and the shape recognition rate was only 26.0%. Surprisingly, rates were similar for the exploration task (we expected them to be lower, as in the user-dependent scenario from the previous section). A total of 3400 and 46,000 postures, respectively, were used for the training sets of the two tasks. At this point of the analysis, we can conclude that user-dependent training is needed for robust results, although a task-oriented classification may provide better results, as suggested by the observations of Napier (1956), who considered task type to have more influence on posture than object characteristics alone.

4.7. Shared postures

It is interesting to understand how many hand postures are shared by two different objects during either translation or exploration. By analyzing the results of our recognition tests involving different nearest-neighbor techniques, it is clear that a certain amount of shared postures does exist. Also, looking at the low rates of the first three NN approaches (see Figs. 6 and 7), we can even hypothesize that there must be a relatively large percentage of such shared postures. In order to quantify the amount of shared postures, we must introduce an appropriate measure. Let $A = \{a_1, a_2, \ldots, a_{|A|}\}$ and $B = \{b_1, b_2, \ldots, b_{|B|}\}$ be the sets of postures associated to objects A and B. The amount of shared postures would be the cardinality of the intersection $A \cap B$. However, when two postures are compared, the Euclidean distance must be used, which does not provide a yes/no equality answer but a positive value corresponding to the dissimilarity of the two postures. Also, we are interested in this problem from the perspective of the recognition rate. Therefore, in order to count how many postures from B could also be part of A (or are shared by A), we compute the ordered list of distances from all postures $a_i \in A$ to some reference posture (e.g., the average $\bar{a}$ of A). We then compute the distance from $\bar{a}$ for every posture $b_j$ from B and count those for which the distance is less than a threshold. Equivalently put, we count how many postures from B are as close to the center of A as the postures of A themselves. The threshold could be chosen as $\max_{i=1,\ldots,|A|} \{\|a_i - \bar{a}\|\}$. However, in order not to be biased by outliers, we remove the 5% largest distances (and denote the resulting threshold $D_{0.95}$). The number of postures from B that are shared by A, which we denote $g(A|B)$, is therefore

$$g(A|B) = \left|\left\{ b_j \in B : \|b_j - \bar{a}\| < D_{0.95},\ j = 1, \ldots, |B| \right\}\right|$$

We can similarly define $g(B|A)$ and use the two measures for the final definition of shared postures between A and B (normalized in $[0, 1]$) as follows:

$$g(A, B) = \frac{g(A|B) + g(B|A)}{|A| + |B|} \qquad (8)$$
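A minimal Python sketch of this shared-postures measure follows; the 95th-percentile threshold is computed by a simple sort-and-index rule, which is one of several reasonable readings of "removing the 5% largest distances", and all names are illustrative.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_posture(postures):
    """Component-wise average of a set of sensor vectors."""
    n = len(postures)
    return [sum(p[i] for p in postures) / n for i in range(len(postures[0]))]

def shared_one_way(A, B, quantile=0.95):
    """g(A|B): number of postures of B within the D_0.95 radius of A's mean."""
    a_bar = mean_posture(A)
    dists = sorted(euclidean_distance(a, a_bar) for a in A)
    d95 = dists[max(0, int(quantile * len(dists)) - 1)]
    return sum(1 for b in B if euclidean_distance(b, a_bar) < d95)

def shared_postures(A, B):
    """Symmetric shared-postures measure g(A, B) of Eq. (8), in [0, 1]."""
    return (shared_one_way(A, B) + shared_one_way(B, A)) / (len(A) + len(B))
```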

Using this measure, we computed the shared postures between every pair of objects from our set. We found an average of 22.3% shared postures (SD = 24.4%) for the translation task, while for the exploration task the percentage was 65.1% (SD = 22.4%). A Wilcoxon signed-rank test showed a significant effect of task on the percentage of shared postures (z = 48.555, p < 0.001) with a large effect size (r = 0.57). Fig. 15 visually illustrates the amount of shared postures for every pair of objects in our set.

5. Discussion and design guidelines

Our findings show that hand posture can be used to recognize the size and shape of grasped objects, provided that user training procedures are run first. In order to assist designers and practitioners in implementing this technique in their applications, as well as to clarify the overhead of the user training procedure, we summarize below the training stage and the parameters of the RAW–RAW–KNN classification technique in the form of design guidelines. Results showed that user training is mandatory for obtaining high recognition rates. If the requirement of the application is simply to recognize object size and shape from a stable grasp, then approximately 5 s of continuous recording at 60 Hz proved sufficient for our translation task. If object properties are to be inferred during object exploration, then a manipulation time of about 20 s per object, consisting of repetitive rotations with the fingers, is sufficient. We arrived at the 20 s threshold by looking at the average manipulation time for the exploration task, which was 19.2 s (SD = 6.1 s). Fig. 16 illustrates the various manipulation times. We also found that large objects took more time to manipulate (21.6 s, SD = 7.8 s) than small (18.1 s, SD = 4.4 s) or medium-sized ones (18.0 s, SD = 5.1 s), as shown by an ANOVA test (F(2, 213) = 8.305, p < 0.001). However, the actual differences are small, and post-hoc tests confirm this conclusion. For example, Bonferroni post-hoc tests showed no significant difference between the time needed to manipulate small and medium objects. Also, no significant difference was observed between the manipulation times of different object types (F(5, 210) = 1.758, n.s.). In the end, as a rough guideline, ~20 s of continuous object exploration at a measurement rate of



Fig. 15. Shared postures between the objects used in the experiment for the translation (top) and exploration (bottom) tasks. The percentage of shared postures between two objects is computed as per Eq. (8). A darker color indicates a higher percentage. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

60 Hz will allow building a training set large enough for robust recognition results. Concerning the recognition technique, we found that the RAW–RAW–KNN approach worked best. This means using raw individual postures for both training and testing, but combining the classification results over a window of size w = 30 (half a second of recording). We also found that a considerable reduction in classification time (as well as in the memory occupied by the stored training samples) can be safely achieved by filtering the data stream of postures with the Euclidean distance and a threshold of ε = 0.1.

5.1. Public hand posture data set and source code

A large amount of data was acquired for this work. In order for other researchers to be able to replicate and advance

Fig. 16. Object manipulation time needed for acquiring the training set of hand postures: 5 s for translation vs. 20 s for exploration tasks considering posture data delivered at 60 Hz.



our results, we decided to make the data set public together with the source code used for the recognition analysis. We make available both the hand posture data files (sensor measurements and recording time lines) and the source code on the corresponding author's web page.

6. Conclusion and future work

The paper explored the use of hand posture for recognizing the size and shape of grasped objects during both translation and exploration tasks. The results of an experiment designed in relation with observations from motor control theory show that size and shape can be recognized with rates of 98% for firm grasping (translation) and of 95% and 91% during object exploration with user-dependent training. Recognition tests showed that large training sets are needed for accurate results. However, a practical procedure was described for delivering low classification times with fewer resources (filtering) as well as for simple training of the system (20 s of continuous finger-aided rotations). Additional definitions and measurements for hand posture variation and shared postures were introduced and discussed. We believe that the results presented in this paper shed more light on the opportunity of using the hand as a device for automatic detection of object properties in terms of size and shape. Future work will consider designing new recognition experiments in accordance with other taxonomies from motor control theory that also take into account the intended use of an object (Napier, 1956). We believe that restricting classification to the context of a given task might improve recognition rates for the user-independent case, as the range of possible finger movements will be constrained by the actual task. Future work on user-independent recognition could also consider employing additional sensors such as accelerometers. Also, it would be interesting to investigate whether object properties can be inferred even sooner than the moment when the hand actually touches the object, or whether other object properties can also be detected (e.g., texture, by recognizing the exploratory procedures described by Lederman and Klatzky, 1987). The experiment conducted in this work made use of a data glove as the most reliable technology today for retrieving fast and accurate hand pose and finger flexure data. As sensing technology advances, noninvasive acquisition devices such as Kinect (http://www.microsoft.com/en-us/kinectforwindows/) could be explored. In the end, we believe that the results presented in this paper can benefit practitioners from pattern recognition and human–computer interaction by providing more insights on recognition algorithms and design guidelines for using the hand as a measurement device, but they may also prove useful to researchers from psychology and motor control interested in new tools for hand posture analysis. We also hope that the large corpus of hand


posture data we provided will help advance the state-of-the-art techniques for inferring object properties from hand measurements, leading HCI developments in free-hand interfaces towards the future world of ubiquitous computing.

References

Appert, C., Zhai, S., 2009. Using strokes as command shortcuts: cognitive benefits and toolkit support. In: CHI '09. pp. 2289–2298.
Baudel, T., Beaudouin-Lafon, M., 1993. Charade: remote control of objects using free-hand gestures. Communications of the ACM 36 (July), 28–35.
Buchholz, B., Armstrong, T., 1992. A kinematic model of the human hand to evaluate its prehensile capabilities. Journal of Biomechanics 25 (2), 149–162.
Buchholz, B., Armstrong, T., Goldstein, S., 1992. Anthropometric data for describing the kinematics of the human hand. Ergonomics 35 (3), 261–273.
Cesari, P., Newell, K., 2000. Body-scaled transitions in human grip configurations. Journal of Experimental Psychology: Human Perception and Performance 26 (5), 1657–1668.
Chen, F.-S., Fu, C.-M., Huang, C.-L., 2003. Hand gesture recognition using a real-time tracking method and hidden Markov models. Image and Vision Computing 21 (8), 745–758.
Cutkosky, M.R., Howe, R.D., 1990. Human Grasp Choice and Robotic Grasp Analysis. Springer-Verlag New York, Inc., New York, NY, USA, pp. 5–31.
Erol, A., Bebis, G., Nicolescu, M., Boyle, R.D., Twombly, X., 2007. Vision-based hand pose estimation: a review. Computer Vision and Image Understanding 108 (October), 52–73.
Farella, E., Cafini, O., Benini, L., Ricco, B., 2008. A smart wireless glove for gesture interaction. In: ACM SIGGRAPH 2008 Posters. SIGGRAPH '08. pp. 44:1–44:1.
Fish, J., Soechting, J.F., 1992. Synergistic finger movements in a skilled motor task. Experimental Brain Research 91, 327–334. http://dx.doi.org/10.1007/BF00231666.
Gentilucci, M., Caselli, L., Secchi, C., 2003. Finger control in the tripod grasp. Experimental Brain Research 149, 351–360. http://dx.doi.org/10.1007/s00221-002-1359-3.
Gentilucci, M., Castiello, U., Corradini, M., Scarpa, M., Umilta, C., Rizzolatti, G., 1991. Influence of different types of grasping on the transport component of prehension movements. Neuropsychologia 29 (5), 361–378.
Häger-Ross, C., Schieber, M.H., 2000. Quantifying the independence of human finger movements: comparisons of digits, hands, and movement frequencies. Journal of Neuroscience 20, 8542–8550.
Ilie-Zudor, E., Kemény, Z., van Blommestein, F., Monostori, L., van der Meulen, A., 2011. Survey paper: a survey of applications and requirements of unique identification systems and RFID techniques. Computers in Industry 62 (April (3)), 227–252.
Jakobson, L.S., Goodale, M.A., 1991. Factors affecting higher-order movement planning: a kinematic analysis of human prehension. Experimental Brain Research 86, 199–208. http://dx.doi.org/10.1007/BF00231054.
Jenmalm, P., Goodwin, A.W., Johansson, R.S., 1998. Control of grasp stability when humans lift objects with different surface curvatures. The Journal of Neurophysiology 79 (April (4)), 1643–1652.
Jones, L.A., Lederman, S.J., 2006. Human Hand Function. Oxford University Press.
Kelly, D., McDonald, J., Markham, C., 2010. A person independent system for recognition of hand postures used in sign language. Pattern Recognition Letters 31 (August (11)), 1359–1368.
Kim, J.G., Kim, B.G., Lee, S., 2007. Ubiquitous hands: context-aware wearable gloves with a RF interaction model. In: Proceedings of the 2007 Conference on Human Interface: Part II. pp. 546–554.

Kinoshita, H., Kawai, S., Ikuta, K., 1995. Contributions and coordination of individual fingers in multiple finger prehension. Ergonomics 38 (6), 1212–1230.
Lederman, S.J., Klatzky, R.L., 1987. Hand movements: a window into haptic object recognition. Cognitive Psychology 19 (3), 342–368.
Li, Y., 2010. Protractor: a fast and accurate gesture recognizer. In: CHI '10. pp. 2169–2172.
Makenzie, C.L., Iberall, T., 1994. The Grasping Hand. North-Holland, Elsevier Science B.V., Amsterdam, The Netherlands.
Napier, J., 1956. The Prehensile Movements of the Human Hand, vol. 38B.
Paulson, B., Cummings, D., Hammond, T., 2011. Object interaction detection using hand posture cues in an office setting. International Journal of Human–Computer Studies 69 (January–February (1–2)), 19–29.
Pizzolato, E.B., dos Santos Anjo, M., Pedroso, G.C., 2010. Automatic recognition of finger spelling for LIBRAS based on a two-layer architecture. In: Proceedings of the 2010 ACM Symposium on Applied Computing. SAC '10. pp. 969–973.
Rashid, O., Al-Hamadi, A., Michaelis, B., 2010. Utilizing invariant descriptors for finger spelling American Sign Language using SVM. In: Proceedings of the Sixth International Conference on Advances in Visual Computing, vol. Part I. ISVC '10. pp. 253–263.
Santello, M., Flanders, M., Soechting, J.F., 2002. Patterns of hand motion during grasping and the influence of sensory guidance. The Journal of Neuroscience 22 (February (4)), 1426–1435.
Santello, M., Soechting, J.F., 1998. Gradual molding of the hand to object contours. Journal of Neurophysiology 79 (3), 1307–1320.
Schieber, M.H., Santello, M., 2004. Hand function: peripheral and central constraints on performance. Journal of Applied Physiology 96, 2293–2300.


Siegemund, F., Flörkemeier, C., 2003. Interaction in pervasive computing settings using Bluetooth-enabled active tags and passive RFID technology together with mobile phones. In: Proceedings of the First IEEE International Conference on Pervasive Computing and Communications. PERCOM '03. pp. 378–387.
Sturman, D.J., Zeltzer, D., 1994. A survey of glove-based input. IEEE Computer Graphics and Applications 14 (January), 30–39.
Tanenbaum, K., Tanenbaum, J., Antle, A.N., Bizzocchi, J., Seif el Nasr, M., Hatala, M., 2011. Experiencing the reading glove. In: Proceedings of the Fifth International Conference on Tangible, Embedded, and Embodied Interaction. TEI '11. ACM, New York, NY, USA, pp. 137–144.
Tsai, C.-Y., Lee, Y.-H., 2011. The parameters effect on performance in ANN for hand gesture recognition system. Expert Systems with Applications 38 (7), 7980–7983.
Wachs, J.P., Kölsch, M., Stern, H., Edan, Y., 2011. Vision-based hand-gesture applications. Communications of the ACM 54 (February), 60–71.
Wang, R., Paris, S., Popović, J., 2011. 6D hands: markerless hand-tracking for computer aided design. In: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. UIST '11. ACM, New York, NY, USA, pp. 549–558.
Wang, R.Y., Popović, J., 2009. Real-time hand-tracking with a color glove. ACM Transactions on Graphics 28 (3).
Want, R., 2006. An introduction to RFID technology. IEEE Pervasive Computing 5 (January (1)), 25–33.
Wing, A.M., Haggard, P., Flanagan, J.R., 1996. Hand and Brain: The Neurophysiology and Psychology of Hand Movements. Academic Press.
Ziegler, J., Urbas, L., 2011. Advanced interaction metaphors for RFID-tagged physical artefacts. In: IEEE International Conference on RFID-Technologies and Applications (RFID-TA). IEEE, pp. 73–80.