Pattern Recognition, Vol. 24, No. 9, pp. 895-907, 1991
Printed in Great Britain
0031-3203/91 $3.00 + .00
Pergamon Press plc © 1991 Pattern Recognition Society
A TRAINABLE GESTURE RECOGNIZER

JAMES S. LIPSCOMB

IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598, U.S.A.

(Received 14 November 1989; in revised form 1 October 1990; received for publication 5 February 1991)

Abstract--Gestures are hand-drawn strokes that do things. These things happen at distinctive places on the stroke. We built a gesture input filter and recognizer. The input filter is fast, because it does few computations per input point, because it can omit pre-filter data smoothing, and because wild points caused by hardware glitches are removed at the few output points of the filter, not at the many input points. The recognizer is a novel combination of two traditional techniques: angle filtering and multi-scale recognition. Because an angle filter does not produce well-behaved scaled output, the multi-scale treatment had to be unusual.

Key words: Cross product, Vector product, Gesture recognition, On-line recognition, Real-time recognition, Multi-scale recognition.
1. INTRODUCTION
Hand-drawn abstract symbols can command a computer. These symbols are sometimes called gestures. Usually, just a single stroke indicates both what to do and where to do it. Some commercial laptop computers can recognize handwriting drawn with an electronic stylus. A user interface made up of gestures would seem the natural next step. Examples of gestures from recent systems are:

(1) A gesture encircles objects to be copied. The tail shows where to put the copy (Fig. 1).(1)
(2) The height of a reverse letter h indicates the height of the roof to be added to a drawing of a temple (Fig. 2).(2)
(3) A delete, transpose, or mark-word gesture can be made as wide as needed to cover the text to be changed (Fig. 3).(3)
(4) An object is deleted by circling it and then drawing a nearby proof-reader delete mark (Fig. 4).(4) This is our project's old gesture recognizer, replaced by the one described in this paper.

Fig. 1. Copying objects.

Fig. 2. Height of temple roof © 1987 IEEE.

Fig. 3. Correction marks for editing text © 1988 IEEE.

Gestures vs. handwriting

Differences from natural language text place special demands on gesture recognizers. When a place on a gesture carries meaning, it is said to be a hot point, for example the corner of a check mark.
Fig. 4. Deleting characters and words.
Hot points are places significant to the user, but also must be places that the computer can reliably find. Hot points do not exist in traditional handwriting recognition. Some handwriting recognizers match features but these features are not distilled to points of action. Some gestures are intentionally distorted to convey meaning, such as scope or size (Figs 1-4). Handwriting recognition need only face accidental distortion. Some gestures can be drawn in many orientations or in mirror image. Examples are a circle and a proof-reader delete mark. Handwriting is drawn in a fixed orientation with respect to an established baseline. Most gestures are drawn in isolation with little context to aid recognition. Handwriting recognizers can put off a final decision on characters until a word is complete, and a spelling checker gives its opinion. Gestures, by contrast, cause actions when drawn. Therefore, a decision must be made immediately. Finally, when correcting a misrecognized gesture, one must also undo its action. This is an additional burden that handwriting recognizers need not shoulder. To offset these new demands, the artificial nature of gesture languages helps recognition. Gesture languages avoid the woes of facing a large and predetermined natural language alphabet and of finding the boundaries between letters. Gesture languages are designed with easily recognized shapes that are fast to draw. As a result, gesture languages consist largely of single strokes.
Gesture training and recognition

Training by example is not possible, as far as can be told from publications of the recent gesture recognizers shown in Figs 1-3(1-3) or from their ancestors.(5-7) This probably reflects the immaturity of gesture work. Most handwriting recognizers are trained by example. For us, gesture technology is new. Appropriate training took time to develop. We began by writing
a program fragment to recognize each gesture (Fig. 4).(4) Sometimes, this was frustratingly hard. To convert to training by example, we use a multi-scale recognizer driven by a table of prototypes. Our target languages are the symbols used in proof-reading, mathematics, music, etc., for the paper-like interface (Fig. 4).(4,8,9) This is a research prototype of a laptop computer operated by an electronic stylus. A Photron tablet simulates the portable machine. It both digitizes the stylus input and displays on the writing surface directly under the stylus, as if it were leaving an ink trail. An attached IBM PC/RT computer recognizes the strokes. A spreadsheet application program runs on an attached IBM PC/AT computer. Only single-stroke gestures are recognized. A stroke is a continuous line from pen-down to pen-up. After this recognizer was built, we implemented multi-stroke gestures by using a high-level dialog manager, which oversees this single-stroke recognizer. Multi-stroke gesture recognition is a separate topic not discussed further here. Section 2 of this paper describes the input filter that selects candidate features and hands them off to the recognizer. Section 3 describes the recognizer and shows how strokes are trained by example.
2. THE INPUT FILTER
Input filters are fast, simple algorithms that make later recognition easy. There is nothing unusual about a fast input filter feeding a slow recognizer. This is a traditional division of labor. Input filters reduce noise and quickly distill the many input points of a stroke (Fig. 5(a)) to a few candidate features (Fig. 5(b)). Later, a recognizer uses a sophisticated and slow feature finder to decide which candidate features are significant (Fig. 5(c)-(e)). These features match a stored prototype stroke, triggering recognition. The input filter must be tuned to find all features important to the recognizer with a minimum of extra,
incorrect candidate features. This project's recognizer uses an angle filter,(10) which produces output points where stroke curvature is high, at the corners. The input filter was designed to behave likewise. Many input filters have been built that concentrate their output points at the corners. Usually, the pen moves slowly there. Nex sampling(10) takes advantage of this. It selects some n and then throws away every nth input point. When the pen stops, however, points can pile directly on top of each other, which is an error case for an angle filter. Used in moderation, nex sampling can speed up any filter. Another way to get points at the corners is to set a threshold for the area between a curve and its successively longer approximating line segment.(11) Each input point accumulates area until the threshold is passed, and then that point becomes an output point. The area accumulates faster where the curvature is high, so the output points are closer together there. However, output is likely to be triggered when just passing and missing the deepest part of the corner. Missing the corner slightly will not degrade recognition accuracy, but the gesture's action will occur away from where the user expected it. Our cross-product input filter concentrates output points near high curvature (Fig. 6) too, but it has a symmetrical outlook. It prefers output points in the middle of corners.

Fig. 5. Successive filtering steps: (a) stroke input from device; (b) after initial filtering; (c) after eliminating small-scale features; (d) after eliminating medium-scale features; (e) after eliminating larger-scale features.

Fig. 6. Input filter output: equal cross products between vectors, close to the original curve.
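Nex sampling is simple enough to sketch. The fragment below is our illustration in C, not code from the paper; the function name and the in-place compaction are assumptions.

    /* Sketch of nex sampling: discard every nth input point and compact
       the arrays in place. Returns the number of points kept. Used in
       moderation, this thins the data before any filter; n must be large
       enough that real corners are not starved of points. */
    int nex_sample(int x[], int y[], int npts, int n)
    {
        int kept = 0;
        for (int i = 0; i < npts; i++) {
            if ((i + 1) % n != 0) {        /* Drop points n, 2n, 3n, ... */
                x[kept] = x[i];
                y[kept] = y[i];
                kept++;
            }
        }
        return kept;
    }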
Input filter algorithm fundamentals

The cross-product input filter approximates curved strokes by line segments.(12,13) It selects three points from the input stroke (Fig. 7(a)). One point tells which input point last went to the output (point a), one tells where the algorithm is (point b) and one tells where the algorithm is going (point c). Point b is the candidate for output, and it is most likely to be sent when it is at a corner. Having point b at a corner maximizes the area of triangle abc, which is one half the cross product of line segments ab and bc. The cross product is a dimensionless scalar. It has the same value as the area but not the units of area. If the cross product of vectors ab and bc is below the threshold, then advance points b and c, with b moving slower than c (Fig. 7(b)). If the new cross product is greater than the threshold, save point b (Fig. 7(c)) for later output, and establish new points a, b and c (Fig. 7(d)). Repeat until point c reaches the end of the stroke. The right half of Fig. 7 gives the program code for the computation-intensive parts of the input filter.

The output (Fig. 6) is a sparse set of points defining head-to-tail vectors. The cross products between successive output vectors are about the same. Points are far apart in straight regions and closer together along the curves. This happens because the cross product of two vectors is high when vectors are long or have a sharp bend between them. It may seem easy to arrange that where the curvature is constant, vectors are equally long, but a similar, simpler input filter went into oscillation. It produced alternately long and short vectors. These vectors had equal cross products between them, but that property alone does not guarantee stability. The three-point input filter above does the job. Enough detail on the input filter is now in hand to explain its benefits.
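For reference, the relations used above can be written out. With ab = (a1, b1) and bc = (a2, b2), as in Fig. 7, the quantity the filter thresholds is

$$ \lvert \vec{ab} \times \vec{bc} \rvert \;=\; \lvert a_1 b_2 - a_2 b_1 \rvert \;=\; \lVert \vec{ab} \rVert \, \lVert \vec{bc} \rVert \sin\theta \;=\; 2\,\mathrm{area}(abc), $$

where theta is the angle between the two segments. These are standard vector identities restated here, not formulas quoted from the paper; they show why the test needs neither trigonometry nor division.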
Benefits

Cross product fast to compute. The input filter as described so far needs only two multiplications and eight additions per input point. Four subtractions compute the vectors between a, b and c. Subtractions will count as additions in the final sum of computation cost. The cross product uses two multiplications, a subtraction and an absolute value (which will count as an addition). Moving points b and c requires two more additions. A later section will call for two more multiplications and two more additions. Fig. 7 shows that there is program logic overhead as well. This input filter is suited for small, portable computers that lack fast floating point. Computations are inherently fixed-point, because the input points are integers and because the cross products and dot product have no divisions.

Pre-filter smoothing unneeded. Most other input filters need their input smoothed by a pre-filter to eliminate x-y jitter.
    cpfilt (xin, yin, nin, symsizeq, thcp, thdp)
    /* xin, yin:  x and y input arrays. Zero origin.            */
    /* nin:       Number of input points in xin and yin.        */
    /* symsizeq:  Square of stroke size (dx**2 + dy**2).        */
    /* thcp = 100:  Threshold for cross product test.           */
    /* thdp = -100: Threshold for dot product test.             */
    /* Output will be xout, yout, nout; like xin, yin, nin.     */
    /* All variables integers. XXXX: statements omitted.        */
    {
      /* Normalize thresholds to stroke size. Same breakpoints  */
      nthcp := (thcp * symsizeq) / 10000;  /* whether stroke is */
      nthdp := (thdp * symsizeq) / 10000;  /* large or small.   */
      xout[0] := xin[0]; yout[0] := yin[0]; nout := 1; /* Save first. */
      a := 0; b := 1; c := 2;   /* Start pointers at first 3 points. */
      while ( c < nin ) {       /* c has not reached end of input.   */
        /* a ---(a1,b1)---> b ---(a2,b2)---> c */
        a1 := xin[b]-xin[a];  b1 := yin[b]-yin[a];
        a2 := xin[c]-xin[b];  b2 := yin[c]-yin[b];
        abscp := abs(a1*b2 - a2*b1);   /* Cross product. */
        dp := a1*a2 + b1*b2;           /* Dot product.   */
        if (abscp < nthcp .and. dp > nthdp) {
          b := b+1;  c := c+2;
        }
        else {
          /* Look for corner b between old b and c. */
          while (b+1 < c) { XXXXXXXXXXXXXXXXXXXXX } /* while b+1 */
          /* Save point b for output. */
          xout[nout] := xin[b];  yout[nout] := yin[b];  nout := nout+1;
          /* Restart a,b,c pointers at b, b+1, b+2. */
          a := b;  b := a+1;  c := b+1;
        } /* else ... b */
      } /* while c < nin */
      /* Look for trailing hook. */
      XXXXXXXXXXXXXXXXXXXXX
      /* Save last input point if distinct from last b saved. */
      if (xout[nout-1] != xin[nin-1] .or. yout[nout-1] != yin[nin-1]) {
        xout[nout] := xin[nin-1];  yout[nout] := yin[nin-1];  nout := nout+1;
      }
      return (xout, yout, nout);
    } /* cpfilt */
Fig. 7. Input filter algorithm fragment.

Small input jitter from hardware noise, quantization, or wavering of the user's hand produces wiggles too small to be features. These should not be passed to the recognizer. A moving-average pre-filter(14) traditionally smooths away this jitter. Moving-average pre-filters are slow, because all input points must be processed and because each point is part of several overlapping computations. Worse, moving-average filters move points. Handwriting recognizers do not care if a letter moves
slightly, but gesture recognizers have hot points, places where things happen. A gesture recognizer would have to maintain an index of its processed data back to the original data, so that the original screen location of processed hot points could be reconstructed. This input filter needs no pre-filtering, because mathematical properties of the cross product ignore x-y jitter. The filter selects input points for further
processing without moving them. Short lines have small cross products. The cross product has the value of the length of one vector, times the length of another, times the sine of the angle between them (without needing to compute the sine). If the vectors are short, no angle can make the cross product larger than the threshold, so small jitter is ignored.

Figure 8(a) shows input from a hypothetical noisy tablet as the user draws a line. Nine input points are numbered in order. Algorithm points a, b and c form vectors too small for their cross product to exceed the threshold no matter what the angle between them. By Fig. 8(b), points b and c have advanced, but little has changed. By Fig. 8(c), points a, b and c are far enough apart that a big angle would put the cross product over the threshold, but the points are beginning to lie along the line. By this time, the x-y jitter has been successfully ignored, and the next thing the cross product finds will be a candidate feature. Figure 8(d) shows the cross product exceeding the threshold and producing point 5 as output. The input filter has found a potential feature. The data appear to have turned a corner at point 5. Later, the recognizer decides whether or not this potential feature really is a feature. Point 5 might be a small noise spike. Noise is ignored only in the sense that its back-and-forth character is not passed on to trouble later recognition. Small noise-generated rough spots on otherwise smooth curves do seem to be more likely than other points to be passed on to the recognizer. Remember, though, that the alternative of using a moving-average filter introduces comparable changes, as all points on the curve are shifted to fall into line.

Fig. 8. Cross product ignores jitter: at first, for threshold T, vectors ab and bc are so short that |ab × bc| > T is impossible and jitter is ignored; as b and c advance, |ab × bc| approaches the threshold while pen motion is still ignored; finally |ab × bc| > T, and point b (point 5) is saved for algorithm output.

Fig. 9. Wild points detected and removed: (a) stroke input with a wild point; (b) cross-product filter output just before the dot-product test; (c) wild point removed.
Wild points eliminated quickly. Hardware glitches occasionally produce spurious, outlying wild points. Input filters that remove wild points with pre-filter smoothing are especially slow, because they must examine all of the many input points. These filters must remove wild points in a separate pre-pre-filter
pass through the data before the pre-filter smoother. This is because, if the pre-filter ever saw a wild point, it would distribute the error with a moving-average filter to a half-dozen neighboring points. This would leave the good data torn up in a way difficult to untangle later. This input filter, by contrast, needs neither the slow pre-pre-filter, nor the slow pre-filter. A post-filter finds and removes wild points quickly, because it checks the few filter output points (Fig. 9), not the many input points. Each potential output point is compared to only the input points immediately adjacent to it (Fig. 7, "Check b - 1, b, b + 1 . . . ."). There is not even the overhead of a separate pass through the output data. The dot product is a quick check for wild points. A large negative dot product between successive output vectors suggests that the point in the middle is wild. A simple test against a threshold may wrongly convict as wild one of the points just before or just after the wild one (Fig. 9(b) and (c)). But this may be desirable anyway to avoid the output of two points very close together. Another popular way to find wild points is to test for acceleration beyond the power of the human hand. This is slow, because it involves calculating the magnitude of a vector. Our dot product test added less than 1% to an input filter execution time of 10 ms. That comes to less than 100 µs per stroke.
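The wild-point test can be sketched as follows. This is our reconstruction from the description above, with hypothetical names; nthdp is the negative, stroke-scaled dot-product threshold of Fig. 7.

    /* Sketch: output point i is suspected wild when the vectors into and
       out of it point in nearly opposite directions, that is, when their
       dot product is strongly negative. Integer arithmetic only. */
    int is_wild(const int xout[], const int yout[], int i, int nthdp)
    {
        int ux = xout[i]   - xout[i-1];    /* Vector into point i.   */
        int uy = yout[i]   - yout[i-1];
        int vx = xout[i+1] - xout[i];      /* Vector out of point i. */
        int vy = yout[i+1] - yout[i];
        return (ux*vx + uy*vy) < nthdp;    /* nthdp < 0: only a strong
                                              reversal convicts.     */
    }

A caller would scan i over the interior output points and delete any point the test convicts, accepting that occasionally a neighbor of the true wild point is removed instead (Fig. 9).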
Can select the size of candidate features. All feature-finding input filters must be tuned. They must let through just enough candidate features to be sure that the real features get through, but not so many that the recognizer gets confused by insignificant wiggles. This input filter is different from similar input filters since it can be tuned in two ways. A large cross-product threshold tunes away large wiggles. This is similar to threshold behavior on other input filters. But unlike handwriting filters, this one scales the threshold to the size of the stroke (top of Fig. 7). For example, if this recognizer set the height of a roof (Fig. 2) from the height of a backwards h, the stroke would need to have the same
number of output points whatever height it is drawn. Then, the prototype's hot points would match up to the correct places on the input stroke, and the height could be found from them. The second way to tune this input filter is to change the relative speed of the two traveling points, b and c. If small wiggles in the input data should be found, then c would travel only slightly faster than b (Fig. 10(a)). If small wiggles should be ignored, then c should travel much faster than b (Fig. 10(b)). Experience suggests that c should travel 1.5 to 2 times the speed of b. Having this second tuning parameter helps recognition accuracy a little.

Fig. 10. Ignoring small wiggles: (a) c close to b finds small wiggles, because angle abc changes greatly as c slides past them, so |ab × bc| eventually exceeds T and b is saved for output; (b) c far ahead of b ignores small wiggles, because |ab × bc| stays small as c slides forward, and the wiggle is overlooked.
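The second tuning knob can be sketched as follows, with assumed names; the paper's listing hard-codes the ratio at 2 by advancing b by 1 and c by 2 per step.

    /* Advance the traveling points with a configurable speed ratio.
       num/den is c's speed relative to b's: 3/2 gives the 1.5x ratio the
       text reports works well, and 2/1 reproduces Fig. 7. The accumulator
       keeps everything in integer arithmetic. */
    void advance(int *b, int *c, int num, int den, int *acc)
    {
        *b += 1;
        *acc += num;           /* Accumulate c's fractional speed...  */
        *c += *acc / den;      /* ...and emit whole steps when due.   */
        *acc %= den;
    }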
Input filter algorithm details

A few more details, although minor, are essential for the input filter to work correctly. These are described below and noted by "XXXXX" in Fig. 7.
Finding thin loops. The cross product, by itself, cannot detect a change of direction (Fig. 11(a)) because it tests the sine of the direction change. The sine function cannot tell the difference between direction changes of 0° or of 180°. Thin loops are missed. A dot product test of each input point finds the reversal (Fig. 11(b)), which triggers the output of point b and restarts. This is similar to traditional dot-product cusp detection.(15) Here, the threshold is negative (Fig. 7) to detect only substantial direction reversal, not small jitter. The cost of the dot product is 2 multiplications and 2 additions. This brings the arithmetic cost to 4 multiplications and 10 additions per input point.
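In equation form (standard identities restated, not quoted from the paper), the dot product catches what the cross product cannot:

$$ \vec{ab} \cdot \vec{bc} \;=\; \lVert \vec{ab} \rVert \, \lVert \vec{bc} \rVert \cos\theta \;<\; 0 \quad\Longleftrightarrow\quad \theta > 90^\circ, $$

while $\lvert \vec{ab} \times \vec{bc} \rvert \propto \sin\theta$ takes the same value at $\theta$ and $180^\circ - \theta$, so a near-reversal ($\theta$ near $180^\circ$) is indistinguishable, by cross product alone, from a near-straight continuation ($\theta$ near $0^\circ$).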
Finding corners. The input filter as described so far nearly always puts an output point at or very close to the natural corners of the stroke, but occasionally it misses (Fig. 12(a)). It misses more often when point c moves much faster than point b. Fig. 12(b) gives an algorithm that makes doubly sure that the corner is found; a reconstruction is sketched below. Few input points need be inspected, so any other corner-finding algorithm would also be computationally cheap. This input filter suffered only a 7% increase in execution time. The improvement in corner-finding was small and occasional, but anything that improves accuracy is welcome.

Finding features at the end. Small features at the end of the stroke can be missed, because sometimes they produce only a small change in the cross product (Fig. 13(a)). Also, when point c hits the end of the stroke, the algorithm has not had a chance to consider saving points between b and c. One fix is to run the algorithm backwards until it hits the old b stopping point (Fig. 13(b)). This computation cannot be considered extra cost, since it only closes a gap produced when the algorithm stopped early. This finishes the description of the input filter, except for its speed, which is covered later. Next, the input filter passes its candidate features to the recognizer.
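The corner-search statements themselves are among those omitted from Fig. 7. The following is a hypothetical reconstruction consistent with Fig. 12(b) (advance b until angle abc is at a minimum); it is not the paper's actual code.

    #include <stdlib.h>

    /* Slide a candidate corner k from b toward c and return the k at
       which the interior angle a-k-c is smallest (the sharpest turn).
       Angles are compared in integer arithmetic: for angles between 0
       and 180 degrees, the angle is smaller when cot = dot/|cross| is
       larger, and dot1/cp1 > dot2/cp2 <=> dot1*cp2 > dot2*cp1 when
       both cp are positive. */
    int find_corner(const int x[], const int y[], int a, int b, int c)
    {
        int best = b;
        long best_dp = 0, best_cp = 0;             /* No candidate yet. */
        for (int k = b; k < c; k++) {
            long ux = x[a] - x[k], uy = y[a] - y[k];    /* k -> a */
            long vx = x[c] - x[k], vy = y[c] - y[k];    /* k -> c */
            long dp = ux*vx + uy*vy;
            long cp = labs(ux*vy - uy*vx);
            if (cp == 0) continue;                 /* Collinear: no angle. */
            if (best_cp == 0 || dp*best_cp > best_dp*cp) {
                best = k; best_dp = dp; best_cp = cp;
            }
        }
        return best;    /* Becomes the output point b in Fig. 7. */
    }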
Fig. 11. Finding thin loops: (a) the cross product by itself misses thin loops, because |ab × bc| stays below T even after the direction reverses, the vectors being nearly parallel; (b) the dot product finds thin loops missed by the cross product: when |ab × bc| < T but ab · bc < 0, a direction reversal is indicated, so b is saved for output and the loop is found.

Fig. 12. Finding corners: (a) the cross product combined with the dot product sometimes misses corners; (b) fix: once |ab × bc| > T, do not save b immediately; advance b until angle abc is at a minimum, then save b for output.
3. MULTI-SCALE RECOGNITION
Multi-scale recognizers are popular for image processing. Their large-scale data generalize, which allows quick training. Their small-scale data memorize fine detail needed to distinguish similarly-shaped objects.(16) Multi-scale recognizers automatically find the scale that best distinguishes any two objects.
Stroke data, however, are rarely processed by multi-scale algorithms. Hong(17) joins broken pieces of a curve, caused by pen-skips, using a pyramid structure. Each level of the pyramid corresponds to a scale. His is an input filter that cleans up the pen track for later recognition. It is not a recognizer. Shojima(18) filters the input stroke once and compares it to a single-scale prototype of a trained stroke. His data are single-scale, but his algorithm is multi-scale. He compares the first line segment in the input stroke to the first in the prototype. If they match, he compares the second line segments in each. If they do not match, he combines input line segment 2 with either input line segment 1 or 3, depending on which has a smaller direction difference with line segment 2. Then, he compares this to the prototype, etc. Shojima's algorithm combines angles at will, without reference to thresholds for small, medium, and large features. This contrasts with traditional multi-scale recognizers, which compare a large-scale version of the input scene to a large-scale version of the known scene, then medium-scale to medium-scale, etc. This points out another unusual thing about Shojima's version of multi-scale processing; it is applied to the input stroke only. His prototype processing, by contrast, is single-scale, because angles are not combined. This paper's recognizer uses an angle filter(10) much like Shojima's. However, we use preset thresholds as our criteria for direction changes to keep. An angle filter seems appropriate for gestures, because it approximates a stroke by a few points at the corners in the original stroke, which are perceptually important.(19) These are logical candidates for the hot points of gesture action.
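The angle filter itself is not listed in the paper. Below is a minimal sketch of the standard technique in our own rendering, using floating point for clarity where the paper's implementation is fixed-point.

    #include <math.h>

    /* Drop every interior point whose direction change is below a
       threshold in degrees (27, 38, 54 or 75 in this paper). The first
       and last points are always kept. Compacts the arrays in place and
       returns the new point count. Assumes n >= 2. */
    int angle_filter(double x[], double y[], int n, double min_turn_deg)
    {
        int m = 1;                        /* Keep the first point. */
        for (int i = 1; i + 1 < n; i++) {
            double in  = atan2(y[i] - y[m-1], x[i] - x[m-1]);
            double out = atan2(y[i+1] - y[i], x[i+1] - x[i]);
            /* Absolute turn angle, folded into [0, pi]. */
            double turn = fabs(fmod(out - in + 3*M_PI, 2*M_PI) - M_PI);
            if (turn >= min_turn_deg * M_PI / 180.0) {
                x[m] = x[i]; y[m] = y[i]; m++;     /* Keep: a corner. */
            }
        }
        x[m] = x[n-1]; y[m] = y[n-1];     /* Keep the last point. */
        return m + 1;
    }

Running the same point list through this filter at 27, 38, 54 and 75 degrees produces the four versions stored, or compared, at scales 1-4 below.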
Fig. 13. Feature at end of stroke: (a) the cross product misses a small, significant feature at the end of the stroke, because angle abc changes little before the algorithm reaches the end; (b) fix: restart the algorithm backwards from the end and stop where the forward pass stopped, producing the missed output point.

Unlike Shojima, we pre-compute the input stroke once at a particular angle setting, rather than recomputing angles with each comparison to a prototype. Pre-computation gains efficiency. Also, we do the same for the prototypes, so that both types of data are treated in a multi-scale way. It is novel that this recognizer combines angle filtering with multiple scales in a way close to traditional multi-scale recognition. Perhaps the reason this was not done before is that treating angle thresholds like scales requires a storage organization and an algorithm different from the usual multi-scale treatment. This will be shown first in training new gestures.
Fig. 14. Train new gestures.
Recognizer algorithm

Training gestures. Figure 14 shows the table being trained to recognize a circle gesture and a triangular gesture shaped like a tepee. The angle filter's output at different direction-change thresholds finds features at different scales, in a sense. These are prototype gestures, which are stored in different parts of the table. Filter 1 eliminates direction changes less than 27°. Filters 2-4 look for coarser distinctions by eliminating direction changes less than 38°, 54° and 75°, respectively. These values give the best recognition rate against a set of test gestures. All points on a large-scale version of a gesture are present on its smaller-scale version too, because large-scale filters only remove points from smaller-scale filterings. This is called coarse-to-fine tracking.(16) The gesture designer need only mark
one prototype point hot, and all smaller-scale prototypes can have the same hot point. A prototype hot point can be quickly matched up with a point on an input stroke. Recognition succeeds only when the known and the unknown have the same number of points. For example, if prototype point 3 is hot, then filtered input stroke point 3 is made hot. The ease with which hot points are handled by this point-by-point matching is one of the chief benefits of this particular multi-scale algorithm. Directions are quantized to the hours of a 12-hour clock. Two scalars, per type of gesture, code orientations and reflections. Some gestures, for example a proof-reader delete mark, can be drawn in any orientation or in mirror-image. This is different from handwriting, which is oriented to a base line. The trainer in Fig. 14 always stores the first filter's
prototype plus all larger-scale prototypes up to the largest-scale one that is the same as filter 4's output. For example, the tepee gesture is unchanged by filtered removal of medium-scale features, because it has none. The three versions unchanged by filtering are not stored. Not storing some prototypes solves a problem. The circle prototype stored in Fig. 14, part 3, looks just like the tepee prototype in part 1. The two must be separated in the table so that they will not be mistaken for each other when it comes time to recognize unknown gestures. The lower parts of the table are large distances, in some abstract information-measuring space, from the original input gestures. This distance affects recognition.

Fig. 15. Recognize the circle gesture.
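The training rule just described can be sketched as follows, with our own hypothetical types and names; the rule of stopping when filtering no longer changes the stroke is our reading of the tepee example above.

    #include <string.h>

    #define NSCALES  4      /* Filters 1-4: 27, 38, 54, 75 degrees. */
    #define MAXSEG  32
    #define MAXPROT 64

    /* A prototype: per-segment direction (clock hours 0-11) and length
       class (0 short, 1 medium, 2 long), plus the gesture's name. */
    typedef struct {
        int nseg;
        int dir[MAXSEG];
        int len[MAXSEG];
        const char *name;
    } Proto;

    Proto part[NSCALES][MAXPROT];   /* part[k]: table part for scale k. */
    int   nprot[NSCALES];

    static int proto_equal(const Proto *p, const Proto *q)
    {
        return p->nseg == q->nseg
            && memcmp(p->dir, q->dir, sizeof(int) * p->nseg) == 0
            && memcmp(p->len, q->len, sizeof(int) * p->nseg) == 0;
    }

    /* filtered[k] is the example stroke after angle filter k. Store the
       scale-1 prototype always; store each larger scale only while
       filtering is still changing the stroke. Versions identical to the
       previous scale's are not stored. */
    void train(const Proto filtered[NSCALES])
    {
        part[0][nprot[0]++] = filtered[0];
        for (int k = 1; k < NSCALES; k++) {
            if (proto_equal(&filtered[k], &filtered[k-1]))
                break;
            part[k][nprot[k]++] = filtered[k];
        }
    }

Retraining after a misrecognition (Fig. 16) would additionally delete the conflicting prototype and stop inserting at the scale where the misrecognition occurred; that bookkeeping is omitted here.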
Recognizing gestures. A circle is recognized in Fig. 15. Successive filtering discards details until the algorithm finds a match. In this case, there are two matches. The recognizer prefers the close match in Fig. 15 between an unknown that took three filterings and a stored prototype that also took three filterings. There is more of a chance that they started out looking similar before filtering, than the distant match to the tepee in part 1 of the table. If a tepee were the input, it would be recognized after one filtering step. It would prefer the close match to a tepee in part 1 of the table to the distant circle in part 3. Retraining misrecognitions. If a proof-reader delete mark were intended instead of a circle, then the table must be trained (Fig. 16). The new delete gesture prototype in part 3 of the table is identical to the old circle in part 3, so the circle prototype is deleted as a potential troublemaker. Prototypes cause trouble when they are widely misrecognized as other things. Troublemakers must be deleted, because their misrecognitions are hard to correct by adding more prototypes of the good gestures. It is hard for the program to know when troublemakers exist and when they do not. Fortunately, there is little penalty for always assuming
a troublemaker and deleting. At worst, repeated retraining continues inappropriate deletions. If this were the case in Fig. 16, then repeated retraining would fill in the smaller-scale part 2, and recognition would eventually always happen by part 2. The trainer inserts prototypes only up to the scale at which misrecognition occurs, not beyond. This is one reason why there is no filter 4 in Fig. 16. The misrecognition of a delete as a circle occurred in part 3 of the table, so a delete gesture is not added to part 4, no matter what. Recognition continues to be fast, after retraining misrecognitions, because few prototypes are trained into the table and one prototype is deleted. This leaves few prototypes to inspect on subsequent recognition.

Fig. 16. Train the delete gesture.

Fig. 17. Two gestures that differ only in length: (a) down arrowhead; (b) check mark.
Comparing lengths. To keep the recognizer algorithm description simple, line segment length has been ignored so far. Length is a weighted factor in recognition but does not fundamentally affect algorithm logic. There are three ways that length comes into play: distinguishing between otherwise similar gestures, deciding when a gesture is too distorted, and eliminating short line segments as small-scale features. Sometimes, length is the only way to distinguish between gestures, for example, a down pointer and a check mark (Fig. 17). At other times, a gesture may be so length-distorted that it is not reasonable to recognize it. However, large length distortions seem to be acceptable (Fig. 18). A useful cut-off for recognition seems to be to prevent a long line segment from matching
a short one and vice versa (Fig. 19). Lengths are coded short, medium and long. Short line segments look like small features and therefore should be eliminated, just like small direction changes. Performance against test data verifies this. Figure 5(d) shows a line segment eliminated in Fig. 5(e) purely because it is short. Filter 1 (27° direction change) eliminates line segments smaller than 1/10 the size of the gesture's maximum x or y dimension. Length thresholds for other scales are in step 3 below.

Fig. 18. Acceptable distortions of summation sign.

Fig. 19. Short lines prevented from matching long lines and vice versa: (a) prototype summation; (b) acceptable distortion; (c) unacceptable distortion. s = short; m = medium; l = long.

Recognizer pseudo-code. This is the recognition algorithm shown graphically in Fig. 15; a code sketch follows the list.

(1) Read the input data points and reduce them to a manageable number using a fast input filter.
(2) Set j = 1. This sets the scale of the angle filter applied to the input.
(3) Extract features from the unknown gesture at scale j. (Scale 1: 27°, 1/10. Scale 2: 38°, 1/9. Scale 3: 54°, 1/7. Scale 4: 75°, 1/5.)
(4) Compare the unknown gesture at scale j with all stored prototype gestures. To match, the unknown must have the same number of line segments as the prototype, the directions of the line segments must be the same to within plus or minus one, and short lines in the unknown cannot be long lines in the prototype or vice versa. Immediately declare no match on the first mismatch beyond these thresholds.
(5) If no matches and j < 4, then set j = j + 1 and go to step 3.
(6) If no matches and j >= 4, then declare failure and quit.
(7) Have at least one match between the unknown and the prototypes. If there is only one match, then declare success and quit.
(8) There are other matches in the table. Prefer matches to scale j; if no match there, try scale j - 1, j + 1, j - 2, j + 2, etc. If there is just one match at the first scale to have any matches, then declare success and quit.
(9) There is more than one match. Prefer the best match of line segment lengths. Unless there is a tie, declare success and quit.
(10) There is still more than one match. Prefer the best match of direction. Unless there is a tie, declare success and quit.
(11) There is still more than one match. Prefer the more recently-defined prototype (coded by position in the table). Declare success and quit.
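As promised above, here is a sketch of the matching loop, steps 2-8, reusing the Proto table from the training sketch. The tie-breaking of steps 9-11 is omitted, and step 8 is simplified to search the nearest table scales first and accept the first match; both simplifications are ours.

    /* Step 4: segment-by-segment comparison of unknown u against
       prototype p. */
    static int segments_match(const Proto *u, const Proto *p)
    {
        if (u->nseg != p->nseg) return 0;
        for (int i = 0; i < u->nseg; i++) {
            int d = u->dir[i] - p->dir[i];
            if (d < 0) d = -d;
            if (d > 1 && d != 11) return 0;    /* Within one clock hour,
                                                  with wraparound. */
            if ((u->len[i] == 0 && p->len[i] == 2) ||
                (u->len[i] == 2 && p->len[i] == 0))
                return 0;                      /* Short vs long: reject. */
        }
        return 1;
    }

    /* Steps 2-8: try the unknown at scales 1-4; at each, prefer
       prototypes stored at the nearest table scale. Returns the name of
       the recognized gesture, or 0 on failure (step 6). */
    const char *recognize(const Proto unknown[NSCALES])
    {
        for (int j = 0; j < NSCALES; j++) {
            for (int off = 0; off < NSCALES; off++) {
                for (int s = -1; s <= 1; s += 2) {
                    int k = j + s*off;
                    if (off == 0 && s == 1) continue;   /* Visit j once. */
                    if (k < 0 || k >= NSCALES) continue;
                    for (int i = 0; i < nprot[k]; i++)
                        if (segments_match(&unknown[j], &part[k][i]))
                            return part[k][i].name;
                }
            }
        }
        return 0;    /* Step 6: declare failure. */
    }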
Recognizer lessons

Angle-filter thresholds are not scales. This recognizer seeks the traditional accuracy of multi-scale recognition, while using the traditional angle filter. But angle filter output is only sometimes like proper scale information. The customary idea of scale is a notch filter. Small-scale features have high spatial frequency (short lines and small direction changes). Sometimes, an angle filter behaves like a notch filter. For example, a hand-drawn circle (Fig. 14) has constant curvature. At a small direction-change threshold (filter 1), the angle filter produces many short line segments, which are small-scale features. At a large direction-change threshold (filter 3), the angle filter produces long lines connected with large direction changes, which are large-scale features. A traditional multi-scale recognizer could use an angle filter to recognize circles. For other shapes, an angle filter is a low-pass filter.

Gesture curvature switches the angle filter between its two behaviors. This seriously complicates multi-scale recognition. For example, the tepee (Fig. 14) consists only of large-scale features, long lines and sharp corners. When the angle filter looks for small-scale features, it finds these large-scale features (filter 1). This is the behavior of a low-pass filter. The switch between behaviors can occur within a single gesture, for example the distorted circle in Fig. 15, filter 2. The small-scale angle filter finds a mix of long lines (large-scale features) where the original stroke curvature is low and short lines (small-scale features) where curvature is high. The large-scale part of the table does hold large-scale features only. But the small-scale part holds both small-scale and large-scale features. We still call part 1 of the table the small-scale part, and the things it holds small-scale features, because that is true to an approximation, and makes discussion simple.

Fast small-scale to large-scale searching. The fastest multi-scale search direction has been thought to be from large scale to small scale, which is thought to avoid searching the many small-scale features. This recognizer must search the other way, from small scale to large scale, to avoid searching the entire table. Only the small-scale part of the table has prototypes for all trained gestures. Efficiency does not seem to suffer for three reasons. Firstly, one can avoid searching most small-scale
A trainable gesture recognizer features if the recognizer aborts early (step 4 in the recognizer pseudo-code above) when it becomes clear that there is no hope of a match. Customarily, large-scale to small-scale searching has the efficiency of testing fewer prototypes at smaller scales. Small-scale to large-scale searching can gain similar efficiency, when training omits redundant large-scale prototypes from the table (Fig. 14). This is necessary to avoid confusion caused by the angle filter not producing properly scaled output. Finally, searching from small scale to large scale need not search many scales. The search is longest for curvy gestures, but instrumentation shows that even curvy gestures are typically recognized 60% of the time at scale 1 and 80% of the time by scale 2. Three similar claims for speed can be made for customary large-scale to small-scale searching. The speed contest seems to be a draw, but only when each method uses a filter suited to its search direction. Applications whose filters extract proper scale information should continue to search from large scale to small scale. Applications whose filters sometimes mix scales might benefit from omitting largescale prototypes and searching from small scale to large scale.
Accuracy

Other gesture research papers have not published error rates, but then, comparison of different test data sets would not be meaningful anyway. The multi-scale recognizer is both more reliable and faster than the hand-coded recognizer it replaces, as well as being trainable by example. Several applications are running with user-independent recognition. Even curvy gestures are user independent, although that takes considerable training.

An alternative to a table-driven multi-scale recognizer is to write a small, hand-coded, single-scale program for each gesture. Our experience is that such a hand-coded program can have a lower error rate for curvy gestures, like a circle or a proof-reader delete mark, in exchange for a higher error rate for gestures made from straight line segments, like a rectangle or a summation sign. The overall error rate dropped from 8.4% with the hand-coded recognizer(4) to 5.0% with the multi-scale recognizer on a set of 804 curvy and angular gestures taken from human-factors testing sessions. The multi-scale recognizer was trained to data different from the test set.

Angular gestures (Figs 14 and 18) are quick to train, because their straight lines match the straight lines in the recognizer's data structure. Only 1 to 5 training examples are needed. These are the gestures for which this recognizer was built. A gently curving, nearly straight line can be easily distinguished from a straight line too (Fig. 20) with about 4 repetitions of training. Curvy gestures, for example a g-clef (Fig. 21),
take longer to train, because of the inherent mismatch between their curvy nature and the recognizer's line-segment quantization. A g-clef can be reliably recognized after training 6-20 non-recognitions. Sometimes, curvy strokes, like a g-clef or a proof-reader delete mark, are so familiar that the application demands them, and so curvy gestures are accepted by the recognizer as a hardship.

Fig. 20. Similar gestures that training distinguishes quickly: a gently curving line and a straight line.

Fig. 21. Curvy gesture that takes longer to train.
Speed

Total recognition time with the old hand-coded recognizer is 77 ms. The new multi-scale recognizer reduced recognition to 53 ms. The recognizer runs on an IBM PC/RT computer without an APC card. This computer executes about one million instructions per second in fixed point.

The old hand-coded recognizer's 77 ms consist, on average, of: (1) 15 ms of moving-average input smoothing. (2) 29 ms of selecting vectors according to velocity and length criteria. (3) 31 ms of dehooking these vectors (removing insignificant glitches at the beginning and end of the strokes) and giving vectors direction codes. (4) 2 ms of hand-coded recognition.

The new multi-scale recognizer's 53 ms consist, on average, of: (1) 0 ms of moving-average input smoothing. The cross-product filter does not need this. (2) 13 ms of filtering (10 ms for cross-product filtering and 3 ms for angle filtering). (3) 18 ms of dehooking these vectors and giving them direction codes. Dehooking is somewhat faster in this new version of the system, but little has been done to speed up this module. (4) 22 ms of multi-scale recognition. This takes ten times longer than hand-coded recognition, because of the many prototype table entries and because of the many scales to be processed. Also, there is an inefficient, linear search, which could be improved. Nevertheless, time is saved overall, because the cross-product input filter is fast.

Not included in these timings are stroke capture by the combination tablet/display, stroke transmission to the RT computer, application processing time, transmission of graphical response back to
the tablet/display, and liquid-crystal display time. Informal visual observation suggests that total system response time is less than 200 ms. Other gesture research papers have not published recognition times for comparison. This gesture recognizer is, however, much faster than our handwriting recognizer. Gesture shapes can be designed to be quickly recognized. Also, gestures have semantic actions, for which the user accepts a short delay. Handwriting recognizers, by contrast, often have a speed problem. They have to contend with a large and inconveniently-shaped alphabet, with multi-stroke characters, with hard-to-find character boundaries, and with raised expectations of speed by the user, since echoing characters is a lexical activity.
Problems

Any quantization of a continuous phenomenon can produce quantization error. Angle filter quantization of a curve into line segments sometimes causes the system to fail to recognize a curvy gesture that looks, to the user, similar to a gesture trained earlier. This still happens for curvy gestures, even though large-scale prototypes often hide quantization error by matching a wide range of input.

Future work

The remaining puzzles are user-interface issues that can be summarized by the concept of empowering. While our interactive gesture training is adequate for system builders, it is too rudimentary to empower end-users to train. After the example stroke is drawn, further training instructions, including hot-point selection, are issued with keyboard commands, instead of a modern pick and menu-select interface. The user should be empowered to diagnose and to correct problems graphically, now done with an incomplete set of keyboard commands. The system should be empowered to advise the user at all stages of this work. The system and the user together should be empowered to distribute new or changed prototypes to other applications and users.

4. SUMMARY

Gestures are hand-drawn strokes that do things. These things happen at distinctive places on the stroke. We built a gesture input filter and recognizer. The input filter for hand-drawn strokes can be the dominant computation cost for recognition, because input points vastly outnumber features. The cross product and dot product produce a filter that is fast because: (1) The algorithm requires only 4 multiplications and 10 additions per input point. (2) Pre-filter smoothing of input points is unneeded, because mathematical properties of the cross product ignore small jitter. (3) Wild points caused by hardware glitches are found and removed at the few output points, not the many input points, of the filter. This turn-about is possible because pre-filter smoothing is omitted.

The recognizer is a novel combination of two traditional techniques: angle filtering and multi-scale recognition. Because an angle filter does not produce well-behaved scaled output, the multi-scale treatment had to be unusual. The relation of angle filter output to the customary use of the term scale depends as much on the input to the filter as on the filter threshold. Tradition has it that the fastest multi-scale search direction is from large scale to small scale. This algorithm searches the other way, from small scale to large scale, without appearing to be slower.

Acknowledgements--I thank Jim Rhyne, who directed this work, and the other members of the team. Joonki Kim wrote the initial, hand-coded gesture recognizer. His underlying program structure and data structure were extended, not rebuilt, to make them table driven and multi-scale. Kim also implemented the interactive hot-point training designed in this paper.

Note added in proof. Figures 1-4 are reprinted with permission. The cross-product input filter and the multi-scale recognizer have been granted European patents 90117526.5 and 90117622.2. Another trainable gesture recognizer has appeared.(20) Rubine's algorithm is the opposite of the multi-scale recognizer. Recognition algorithms concentrate their intelligence. The intelligence in Rubine's algorithm is in its many characteristics, 13 in all, which characterize a stroke, and in the algorithm that sets the weights of these characteristics in a process similar to neural-net weight adjustment. But his data structure is simple. The multi-scale recognizer, by contrast, concentrates its intelligence in its multi-scale data structure, not in its stroke characteristics or weighting. Strokes have only 4 characteristics: the number of line segments, the direction of each line segment, the length of each line segment, and the distance from the beginning to the end of the stroke, which in recognition is treated like any other distance. The weights of these characteristics are fixed. Buxton and Kurtenbach have updated their work(1) to include more gestures and to take an early look at human-factors issues.(21)

REFERENCES
1. W. A. S. Buxton and G. Kurtenbach, Editing by contiguous gestures: a toy test bed, ACM CHI '87 Poster, 1-5 (1987).
2. R. Makkuni, A gestural representation of the process of composing Chinese temples, IEEE Comput. Graph. & Appl. 7, 45-61 (December 1987).
3. A. Kankaanpaa, FIDS: a flat panel interactive display system, IEEE Comput. Graph. & Appl. 8, 71-82 (March 1988).
4. J. Kim, On-line gesture recognition by feature analysis, Proc. Vision '88 Interfaces, pp. 51-55 (June 1988).
5. M. Coleman, Text editing on a graphic display device using hand-drawn proofreader's symbols, Pertinent Concepts in Computer Graphics, M. Faiman and J. Nievergelt, eds, pp. 282-290 (1969).
6. W. A. S. Buxton, R. Sniderman, W. Reeves, S. Patel and R. Baecker, The evolution of the SSSP score editing tools, Comput. Music J. 3, 14-25, 60 (1979).
7. W. A. S. Buxton, E. Fiume, R. Hill, A. Lee and C. Woo, Continuous hand-gesture driven input, Proc. Graph. Interface '83, pp. 191-195 (1983).
8. J. R. Rhyne and C. G. Wolf, Gestural interfaces for information processing applications, IBM Res. (RC 12179) (September 1986).
9. C. G. Wolf, Can people use gesture commands?, ACM SIGCHI Bull. (October 1986).
10. M. Berthod and J. P. Maroy, Morphological features and sequential information in real time handprinting recognition, Proc. 2nd Int. Jt Conf. Pattern Recognition, pp. 358-362 (August 1974).
11. H. Shojima, S. Kuzunuki and K. Hirasawa, On-line recognition method and apparatus for a handwritten pattern. U.S. Patent 4,653,107 (March 1987).
12. V. M. Powers, Pen direction sequences in character recognition, Pattern Recognition 5, 291-302 (1973).
13. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. Wiley, New York (1973).
14. B. Blesser, Multistage digital filtering utilizing several criteria. U.S. Patent 4,375,081 (February 1981).
15. M. K. Brown, Preprocessing techniques for cursive script word recognition, Pattern Recognition 16, 447-458 (1983).
16. A. P. Witkin, Scale-space filtering, Proc. Int. Jt Conf. Artif. Intell., pp. 1019-1022 (1983).
17. T. H. Hong, M. Shneier, R. Hartley and A. Rosenfeld, Using pyramids to detect good continuation. University of Maryland, Computer Science TR-1185 (1982).
18. H. Shojima, T. Mifune, J. Mori and S. Kuzunuki, Method and apparatus for on-line recognizing handwritten patterns. U.S. Patent 4,718,103 (January 1988).
19. F. Attneave, Some informational aspects of visual perception, Psych. Review 61, 183-193 (1954).
20. D. Rubine, Specifying gestures by example, ACM SIGGRAPH '91, Comput. Graphics 25, 4 (1991).
21. G. Kurtenbach and W. Buxton, GEdit: a test bed for editing by continuous gestures, ACM SIGCHI Bull. 23, 22-26 (1991).
About the Author--JAMES S. LIPSCOMB received a B.S. degree in Physics from Lafayette College in 1972. He received an M.S. in 1978 and a Ph.D. in 1981, in Computer Science, from the University of North Carolina at Chapel Hill. Dr Lipscomb is a Research Staff Member at the IBM T. J. Watson Research Center, which he joined in 1987. Besides gestural input, his experience includes 3D input devices, hand-motion analysis, visual perception of motion, stereoscopic display, molecular graphics, and computational watching and steering.