
Symbiosis of Human and Artifact Y. Anzai, K. Ogawa and H. Mori (Editors) © 1995 Elsevier Science B.V. All rights reserved.

Gesture Recognition for Manipulation in Artificial Realities

Richard Watsonᵃ* and Paul O'Neillᵇ

ᵃComputer Vision Group, Department of Computer Science, Trinity College, Dublin 2, Ireland
ᵇIona Technologies Ltd., 8-34 Percy Place, Dublin 4, Ireland

In [1], we conclude that the flexible manipulation, by a human operator, of virtual objects in artificial realities is augmented by a gesture interface. Such an interface is described here; it can recognise static gestures, posture-based dynamic gestures, pose-based dynamic gestures, a "virtual control panel" involving posture and pose, and simple pose-based trajectory analysis of postures. The interface is based on a novel, application-independent technique for recognising gestures. Gestures are represented by what we term approximate splines: sequences of critical points (local minima and maxima) of the motion of degrees of freedom of the hand and wrist. This scheme allows more flexibility in matching a gesture performance spatially and temporally, and reduces the computation required, compared with a full spline curve fitting approach. Training the gesture set is accomplished through the interactive presentation of a small number of samples of each gesture.

1. THE GESTURE INTERFACE

1.1. Input and Output Streams

The Gesture Interface receives two streams of input and produces one output stream:

- A stream of time-stamped homogeneous transformations describing the pose (position and orientation) of the wrist with respect to the Control Space Base Frame. This input stream is generated by the GESTURE (POSE) subsystem.

- A stream of time-stamped values describing the posture of the hand and arm². Each value gives the magnitude of a particular degree of freedom of hand/arm posture. This input stream is generated by the GLAD-IN subsystem (i.e. the instrumented glove and exoskeleton)³.

*This research was funded by the Commission of the European Communities under the ESPRIT II Framework.
²The pose and posture data may be provided from any source. During development of the Gesture Interface these input streams were produced from a high-level motion description simulation language [2]. In the later stages of development this simulation was replaced by input streams produced from pose/posture data recorded from the GLAD-IN (Glove-like Advanced Interface) and GESTURE (Wrist Pose calculation process) subsystems, and subsequently by the live data.
³The angular magnitudes received from the GLAD-IN subsystem are assumed to correspond (within given tolerances) to the true angular magnitudes of the hand/arm degrees of freedom (dofs). In other words, the GLAD-IN calibration procedure is assumed to be effective enough to reduce or remove the need for user-specific training of the Gesture Interface.

Each time the Gesture Interface recognises a physical gesture, it sends at least a start and an end gesture notification to the client application.
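For concreteness, a minimal sketch of the two input records and the output notification described above; the field names and dataclass layout below are assumptions for illustration, not the project's actual interfaces.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PoseSample:
    """Wrist pose from the GESTURE (POSE) subsystem."""
    timestamp: float
    transform: List[List[float]]   # 4x4 homogeneous transform, wrist w.r.t. Control Space Base Frame

@dataclass
class PostureSample:
    """Hand/arm posture from the GLAD-IN subsystem."""
    timestamp: float
    dof_values: List[float]        # one angular magnitude per hand/arm degree of freedom

@dataclass
class GestureNotification:
    """Sent to the client application when a gesture is recognised."""
    gesture_class: str
    event: str                     # "start" or "end"
    timestamp: float
```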

2. GESTURE RECOGNITION

In pattern recognition terms, the features extracted in this system are critical points of a degree of freedom's motion, termed discontinuities. A discontinuity is a peak, a trough, or the start or end of a plateau, as shown in figure 1.

Figure 1. Time-space pattern of a metacarpophalangeal joint (knuckle) in performing a gesture (peaks, troughs and plateau boundaries of the joint angle over time; observations start from time t0).
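As figure 1 suggests, the features are local extrema and plateau boundaries of a single degree of freedom over time. A minimal sketch of extracting them is given below. The paper states below that hand jitter is modelled as high-frequency motion and removed with a low-pass filter before critical points are taken; here a simple moving average stands in for that filter, and plateau_eps is an assumed jitter tolerance.

```python
def smooth(samples, window=5):
    """Crude low-pass filter: moving average over the raw dof trajectory."""
    out = []
    for i in range(len(samples)):
        lo, hi = max(0, i - window // 2), min(len(samples), i + window // 2 + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out

def extract_discontinuities(samples, times, plateau_eps=0.01):
    """Label peaks, troughs and plateau starts/ends of one degree of freedom."""
    x = smooth(samples)
    events = []
    for i in range(1, len(x) - 1):
        before, after = x[i] - x[i - 1], x[i + 1] - x[i]
        if before > plateau_eps and after < -plateau_eps:
            events.append(("peak", times[i], x[i]))
        elif before < -plateau_eps and after > plateau_eps:
            events.append(("trough", times[i], x[i]))
        elif abs(before) > plateau_eps and abs(after) <= plateau_eps:
            events.append(("start_plateau", times[i], x[i]))
        elif abs(before) <= plateau_eps and abs(after) > plateau_eps:
            events.append(("end_plateau", times[i], x[i]))
    return events
```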

The classification stage is a template matching process, where sequences of discontinuities for each degree of freedom (dof) are compared against those extracted. A further classification stage calculates whether the gesture is acceptable according to several fit metrics. Analysing the input data from the proprioceptive glove and the pose calculation module, discontinuity extraction can be performed by analysing the angular velocity of a degree of freedom. Hand jitter is modelled simply as high-frequency motion; thus the critical points are extracted using a low-pass filter.

2.1. Classification

The interface module maintains a set of gesture templates, composed of sequences of discontinuities for sequences of degrees of freedom. The templates may be viewed as the axes of a multi-dimensional gesture space; thus the aim of the classifier is firstly to calculate the axis to which a given set of observed motion discontinuities is closest, and then to decide whether this is close enough, given a set of distance metrics. The first process, mapping a set of observed discontinuities to a gesture subspace, i.e. matching sequences of discontinuities, can be formulated as a finite state acceptor (FSA), shown here as the 5-tuple M_j^c = <Q, I, δ, q0, F>. M_j^c accepts an instance of the correct discontinuity pattern for a degree of freedom j and a gesture class c, where the state set Q is the set of partial pattern matches, the input alphabet I is the set of discontinuity types, the transition function δ is determined by the temporal sequence of discontinuities trained for this template, the initial state q0 is the first discontinuity in the sequence, and F ⊆ Q, the set of acceptable halting states, is the final discontinuity.

An example discontinuity pattern and its representation in this formulation are shown in figure 2.

Figure 2. Template discontinuity pattern for a single degree of freedom, and a labelled digraph corresponding to its FSA.

Figure 3. Template pattern with a recurring discontinuity.

The matching process is made more complex by the small number of discontinuity types. Consider, for example, the problem that occurs when a template contains a recurring discontinuity, as in figure 3. Suppose the first two discontinuities have been matched. When a minimum is then observed, it is not clear whether this is the first or the third discontinuity of the template. Thus, a new matching attempt must be started as another instantiation of the FSA for this degree of freedom, to cover the former case.
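A minimal sketch, assuming a plain Python data structure rather than the authors' implementation, of the per-dof acceptor M_j^c with the multiple-instantiation behaviour just described: the template is the trained sequence of discontinuity types, and an incoming discontinuity that could also restart the pattern spawns a fresh match attempt.

```python
class DiscontinuityFSA:
    """Matches one dof's trained discontinuity sequence against observed events."""

    def __init__(self, template):
        self.template = template      # e.g. ["start_plateau", "trough", "end_plateau"]
        self.active = []              # each entry: index of the next expected discontinuity

    def observe(self, event_type):
        """Advance every live instantiation; return True if any reaches the final state F."""
        accepted = False
        next_active = []
        for state in self.active:
            if self.template[state] == event_type:
                if state + 1 == len(self.template):
                    accepted = True   # whole pattern matched
                else:
                    next_active.append(state + 1)
            # otherwise this instantiation dies
        # spawn a new instantiation if the event could also be the first discontinuity
        if self.template[0] == event_type:
            if len(self.template) == 1:
                accepted = True
            else:
                next_active.append(1)
        self.active = next_active
        return accepted
```

For the situation of figure 3, an observed minimum both advances the existing instantiation (as the third discontinuity) and spawns a new one that treats it as a possible first discontinuity.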

2.2. Feature Computation

Most gesture templates have a small number of discontinuities, so the set of gestures which can be unambiguously represented is correspondingly small. For example, the static gestures, each consisting of a start plateau followed by an end plateau for every degree of freedom, are all represented identically. A set of features and corresponding metrics further characterise and disambiguate gestures, at two levels of detail: per discontinuity i, and per degree of freedom (i.e. per sequence of discontinuities) j. These features are described formally below, where C represents the set of gesture classes or templates. Thus q^c_{i,j}(x) is the observed magnitude of discontinuity i of degree of freedom j for gesture template c, and Q^c_{i,j}(x) is the equivalent discontinuity magnitude in template c; q^c_{i,j}(t) and Q^c_{i,j}(t) are the corresponding timestamps. There are also interest conditionals: ψ(c, j), which is true when degree of freedom j is significant for classification of gesture class c, and φ(f, c, j), which is true when the metric f is significant for degree of freedom j and gesture class c. For each metric, gesture class, degree of freedom and discontinuity used there is a corresponding acceptability threshold, ε, computed by the Gesture Training Module.

Discontinuity (i) level metrics:

Absolute Magnitudes, Q(j, x): ∀i . |q^c_{i,j}(x) − Q^c_{i,j}(x)| < ε^{c,Q(x)}_{i,j} ∧ φ(Q(j, x), c, j)

Absolute Timestamps, Q(j, t): ∀i . |q^c_{i,j}(t) − Q^c_{i,j}(t)| < ε^{c,Q(t)}_{i,j} ∧ φ(Q(j, t), c, j)

Degree of freedom (j) level metrics:

Aggregate discontinuity level metrics, Π(i): ∀j . Q(j, x) ∧ Q(j, t) ∧ ψ(c, j)

Range of Motion, Δ(x): ∀j . |δ^q_j(x) − δ^Q_j(x)| < ε^{c,Δ(x)}_j ∧ φ(Δ(x), c, j) ∧ ψ(c, j), where δ^q_j(x) = |max_i q^c_{i,j}(x) − min_i q^c_{i,j}(x)| and δ^Q_j(x) = |max_i Q^c_{i,j}(x) − min_i Q^c_{i,j}(x)|

Spatial Scaling Uniformity, S(x): ∀j . (N(j, x) < ε^{c,S(x)}_j ∧ φ(S(x), c, j)) ∧ ψ(c, j)

Temporal Scaling Uniformity, S(t): ∀j . (N(j, t) < ε^{c,S(t)}_j ∧ φ(S(t), c, j)) ∧ ψ(c, j)

where N(j, x) = Σ_{i=1..I} q^c_{i,j}(x) Q^c_{i,j}(x) / ( sqrt(Σ_{i=1..I} q^c_{i,j}(x)²) · sqrt(Σ_{i=1..I} Q^c_{i,j}(x)²) ), and N(j, t) is defined identically over the discontinuity timestamps.
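A sketch of how the per-discontinuity absolute tests, the scaling-uniformity statistic N and the final conjunction might be computed. The normalised-correlation reading of N and the flat-list data layout are assumptions made to reconstruct the garbled formulas, not a quote of the paper's implementation.

```python
from math import sqrt

def absolute_errors_ok(observed, template, tolerances):
    """Q(j, .): every |q_i - Q_i| must fall below its trained tolerance."""
    return all(abs(q - Q) < eps for q, Q, eps in zip(observed, template, tolerances))

def scaling_uniformity(observed, template):
    """N(j, .): normalised correlation of observed values against template values.
    The paper compares a uniformity statistic against a per-dof tolerance epsilon."""
    num = sum(q * Q for q, Q in zip(observed, template))
    den = sqrt(sum(q * q for q in observed)) * sqrt(sum(Q * Q for Q in template))
    return num / den if den else 0.0

def gesture_matched(per_dof_checks):
    """Pi(i) and Delta(x) and S(x) and S(t): class c is accepted only if every
    enabled check passes for every significant degree of freedom."""
    return all(per_dof_checks)
```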

Hence the gesture class c is matched if Π(i) ∧ Δ(x) ∧ S(x) ∧ S(t). Figures 4 and 5 show the scope for spatial and temporal scaling in this approach: for one degree of freedom, a set of observed discontinuities is matched to corresponding template discontinuities. Disregarding the absolute values of degree of freedom magnitude and timestamp cannot strictly be called scale-invariance, since ignoring these values allows many types of pattern warping. The subset of these metrics to employ for a particular gesture is specified by the user in an interactive training procedure.

2.3. Wrist Pose

The pose of the wrist is provided by GESTURE as a homogeneous transformation from which three degrees of freedom for position and three for orientation may be extracted (a sketch of such an extraction follows the list below). Dynamic gestures involve movement, and hence naturally involve the position and orientation of the wrist. The pose of the wrist may be important in one of several ways:

- Translating static gestures (holding hand posture constant while changing hand pose) to add emphasis or parameters to the original meaning, or to easily multiply the number of gestures recognised by differentiating the direction of translation, as in Fels' system [3].

- In a gesture where the posture is a point, for example, the direction along which this point is made may be important, or it may be necessary to actually translate the posture in the desired direction.
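Since the wrist pose arrives as a 4x4 homogeneous transformation, six degrees of freedom can be recovered from it. A minimal sketch is given below; the ZYX Euler-angle convention and the row-major nested-list layout are assumptions, not taken from the paper, and gimbal lock is ignored for brevity.

```python
from math import atan2, sqrt

def pose_dofs(T):
    """Split a 4x4 homogeneous transform (row-major nested lists) into
    (x, y, z, roll, pitch, yaw); ZYX Euler angles are an assumed convention."""
    x, y, z = T[0][3], T[1][3], T[2][3]
    r00, r10, r20 = T[0][0], T[1][0], T[2][0]
    r21, r22 = T[2][1], T[2][2]
    pitch = atan2(-r20, sqrt(r00 * r00 + r10 * r10))
    roll = atan2(r21, r22)
    yaw = atan2(r10, r00)
    return x, y, z, roll, pitch, yaw
```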

Figure 4. Spatial scaling.

Figure 5. Temporal scaling.

Patterns traced out by the position of a fingertip are further examples of gestures: a circle, meaning rotate, or an X drawn over an object to mean remove it from view. Positional trace pattern gestures are handled within the framework provided by the classifier by treating discontinuities in (x, y, z) position identically to posture discontinuities. Thus circles and X patterns, for example, have templates consisting of patterns of temporally ordered discontinuities in the x, y and z axes. To prevent spurious matches it is necessary to apply fit metrics to the circle trace gesture: minimum diameter and diameter ratio are employed.
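A minimal sketch of the two fit checks mentioned for the circle trace. The paper does not define these metrics precisely, so the interpretation below (diameter extents of the wrist trace in two axes) and the min_diameter and max_ratio thresholds are assumptions.

```python
def circle_fit_ok(points, min_diameter=0.10, max_ratio=1.5):
    """Reject a 'circle' trace whose wrist positions are too small or too eccentric.

    points: list of (x, y, z) wrist positions sampled during the gesture.
    The trace is assumed to lie roughly in the x-y plane for this sketch.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    dx = max(xs) - min(xs)     # extent along x
    dy = max(ys) - min(ys)     # extent along y
    if min(dx, dy) < min_diameter:
        return False           # circle too small to be deliberate
    ratio = max(dx, dy) / min(dx, dy)
    return ratio <= max_ratio  # too elongated, so not circular enough
```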

3. GESTURE TRAINING MODULE

The purpose of the Gesture Training Module is to semi-automatically compute a representation for each physical gesture. The required representation will vary from user to user; usually this variation will lie only in discontinuity magnitudes and timestamps. The purpose of asking the user to perform multiple samples of each gesture is to capture the natural variation in the way the person makes the gesture. There are two principal points to note about the gesture training mechanism described in this section: it is only necessary to present a small number of samples of each physical gesture to the system (empirical tests show that five samples of each physical gesture are sufficient), and the end product of training is an explicit, understandable representation of each gesture.

The information required to fully describe a physical gesture may be broken into two categories:

(i) Automatically generated. This information is computed from the presented gesture samples. It consists of: discontinuity patterns, discontinuity magnitudes and timestamps, acceptance tolerances for metrics using the magnitudes and timestamps, and a jitter tolerance (used as a threshold during discontinuity extraction). Consider a single degree of freedom: if the discontinuity patterns based upon each of the samples are not identical to each other, then a majority voting algorithm is invoked.

(ii) User-supplied. This consists of decisions about the appropriateness of metrics to apply to gestures and degrees of freedom.

During training the user typically "refreshes" an existing set of physical gesture templates, through recomputation of the (user-specific) automatic information. In this case it is not necessary for the user to supply information about applicable metrics.
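A minimal sketch of the automatically generated part of training for one degree of freedom: majority voting selects the discontinuity pattern, and magnitudes, timestamps and acceptance tolerances are computed from the samples that agree with it. The exact tolerance rule is not given in the paper; the half-range-plus-margin used here is an assumption.

```python
from collections import Counter

def train_dof(samples):
    """samples: list of discontinuity sequences, one per performance of the gesture,
    each a list of (type, timestamp, magnitude) tuples for one degree of freedom."""
    # 1. Majority vote on the discontinuity pattern (the sequence of types).
    patterns = [tuple(d[0] for d in s) for s in samples]
    pattern, _ = Counter(patterns).most_common(1)[0]
    agreeing = [s for s, p in zip(samples, patterns) if p == pattern]

    # 2. Average magnitudes/timestamps and derive acceptance tolerances
    #    from the spread across the agreeing samples (assumed rule).
    template = []
    for i, dtype in enumerate(pattern):
        mags = [s[i][2] for s in agreeing]
        times = [s[i][1] for s in agreeing]
        template.append({
            "type": dtype,
            "magnitude": sum(mags) / len(mags),
            "timestamp": sum(times) / len(times),
            "mag_tolerance": (max(mags) - min(mags)) / 2 + 0.05,
            "time_tolerance": (max(times) - min(times)) / 2 + 0.05,
        })
    return template
```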


4. CONCLUSIONS

4.1. Results

An arbitrary number⁴ of static gestures can be recognised from the Irish single-handed deaf alphabet, as can posture-based dynamic gestures such as "Come Here"⁵ and "Thumb Click"⁶. The following pose-based dynamic gestures can be recognised based upon their discontinuity patterns: "Circle"⁷ and "X"⁸. By employing these gestures, artificial reality commands such as navigation, point and click ("mouse emulation"), viewpoint manipulation (zooming, panning, etc.), meta-commands (such as resetting the viewpoint or quitting the system), and manipulation of graphical objects (i.e. their grasping, creation and deletion) can be effected. These virtual world commands are documented in more detail in a further paper [4].

4.2. Future Work

Future work will concentrate on the development of a more flexible discontinuity pattern representation which allows variability to be expressed elegantly, and on orientation-invariant descriptions of pose-based gestures. At present the computational task of recognising gestures is O(n), where n is the number of gesture classes (or templates). A method of constructing a tree (or hash table) of partial discontinuity sequence matches would (in theory) reduce this complexity to O(log n).
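To illustrate the proposed speed-up, a sketch of a prefix tree keyed on discontinuity types: only the templates reachable from the observed prefix need to be checked, rather than every template. This merely illustrates the idea floated above; it is not an implementation from the paper, and the dictionary-based layout is an assumption.

```python
def build_prefix_tree(templates):
    """templates: dict mapping gesture class -> sequence of discontinuity types."""
    root = {"classes": [], "children": {}}
    for cls, seq in templates.items():
        node = root
        for dtype in seq:
            node = node["children"].setdefault(dtype, {"classes": [], "children": {}})
        node["classes"].append(cls)     # classes whose full pattern ends here
    return root

def candidates(root, observed):
    """Follow the observed discontinuity prefix, then collect every class below it."""
    node = root
    for dtype in observed:
        if dtype not in node["children"]:
            return []
        node = node["children"][dtype]
    stack, found = [node], []
    while stack:
        n = stack.pop()
        found.extend(n["classes"])
        stack.extend(n["children"].values())
    return found
```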

REFERENCES

1. Richard Watson. A Survey of Gesture Recognition Techniques. Technical Report TCD-CS-93-11, Department of Computer Science, Trinity College Dublin, July 1993. Available at ftp://ftp.cs.tcd.ie/pub/tcd/tech-reports/reports.93/TCD-CS-93-11.ps.Z.
2. Richard Watson. A Gesture Simulation Language. Technical Report TCD-CS-93-12, Department of Computer Science, Trinity College Dublin, July 1993. Available at ftp://ftp.cs.tcd.ie/pub/tcd/tech-reports/reports.93/TCD-CS-93-12.ps.Z.
3. S. Sidney Fels and Geoffrey E. Hinton. Building adaptive interfaces with neural networks: The glove-talk pilot study. In Human-Computer Interaction - INTERACT '90, pages 683-688. IFIP, Elsevier Science Publishers B.V. (North-Holland), 1990.
4. Richard Watson and Paul O'Neill. A Flexible Gesture Interface. In Wayne Davis, editor, Proceedings of Graphics Interface '95, Montreal, Canada, May 1995.

⁴The correct recognition of gestures based upon small differences in thumb position has proved difficult (largely due to calibration difficulties), and it is not physically possible to make some gestures while wearing the glove, due to physical interference between the sensors. An example of this type of gesture is one where one finger must lie flat upon another.
⁵The initial posture of this gesture is a flat hand. The forefinger is flexed and then extended again in one smooth motion.
⁶Thumb flexion and yaw are brought from their minimum values to their maximum values and then back to their minimum values in one smooth motion, while the other degrees of freedom maintain a static point gesture.
⁷The user traces a circle in space with his wrist. The circle gesture has been problematic in that it is difficult for the user to make a precise (or even approximate) circle. In addition, the discontinuity pattern observed during a circular motion in 3D space depends upon the orientation of the circle and the direction in which its boundary is traced.
⁸The user traces an X pattern in space with his wrist.