Int. J. Man-Machine Studies (1976) 8, 329-336
Spatial reference and natural-language machine control
NORMAN K. SONDHEIMER
Department of Computer and Information Science, The Ohio State University, 2036 Neil Avenue Mall, Columbus, Ohio 43210, U.S.A.

(Received 26 February 1976)

Current research on natural-language speech understanding provides encouragement for the development of systems for the vocal control of mechanical devices. However, the designer of such systems faces a variety of difficulties in allowing for references to the position, orientation, and direction of motion of objects and actions in space. This paper analyzes the sources of these difficulties and conceivable solutions to them.
1. Introduction

In recent years, there has been considerable interest in developing systems that allow vocal communication with machines. A number of major research projects exist (Newell et al., 1973; Erman, 1974). Commercial products have already been introduced, including several by Threshold Technology of the United States.

What makes man-machine interaction in natural language desirable is its naturalness. The use of forms from his native language simplifies a machine controller's training and allows him more spontaneous input. The use of speech frees him from physical contact with the machine. These advantages are important in many applications where mechanical devices are controlled. For example, a computerized mechanical arm might, with vocal control, be capable of performing the basic operations in radioactivity laboratories or in other dangerous environments. Where industrial workers often have their hands occupied, speech can provide supplementary control.† Most importantly, for a crippled or bedridden patient, voice control may be the only effective way to control his environment.‡

† Rosen et al. (1974) report on the progress of a system that uses speech understanding to this end.

‡ Heer et al. (1975) report on the development of a voice-controlled wheelchair and manipulator system.

These examples fall within an area that can be called "machine control". At the moment, its future looks promising. However, as with other new undertakings, study of the domain uncovers problematic phenomena that have not been considered in the literature. One of the most important of these phenomena is "spatial reference". This can roughly be defined as the way people refer to the direction, orientation, and relation of objects and activities in space. The typical forms used in these references are the locative prepositions, e.g. "up", "over", "behind", and "in front of". Also used are terms such as "left", "top", and "upper". This type of usage is essential for any natural-language machine control.

The goal of this paper is to bring to light some of the problems arising from spatial reference. Problems derived from independent sources of orientational systems are considered in section 2. Difficulties in establishing these systems' structure are discussed in section 3. Section 4 discusses the use of nonstandard, nonlinear, orientational systems. Some methods of dealing with these problems are presented and analyzed in section 5.
2. Frame of reference

Problems in making spatial reference can be seen in such simple utterances as "move to the left" and "turn right". The response to these requests is often hesitation or misunderstanding. For example, people have found themselves asked by someone walking towards them in a hallway to "please move to the left" and then found that the other person moved in the same absolute direction they did.† The problem here is not that the speakers and addressees do not know their "left" from their "right", but that they have different lefts and rights. Each can be thought of as possessing left/right axes. Because they are facing each other, these axes are parallel and opposite.‡ The addressee's movement can reasonably be to his "left" or the speaker's "left", and confusion can arise.

† If you do not have this experience in your past, the probable reason is the phenomenon of pointing. When people give directions, they often accompany them with a physical gesture that helps to specify the reference. In the hallway situation, a gesture with the hand and fingers, or the eyes and head, would show you where you should move. Similarly, the sight of the other person starting to move in one direction would help you to deduce that the reference was to another. If some mechanical pointing device were available, it might be included in a machine control system as an "analogue control" to complement the language, which is a "symbolic control", to use the language of Ferrell & Sheridan (1967). However, the language itself remains the primary tool for machine control, and the problems that are identified here are inherent in its use in this way.

‡ There are problems with the use of this axes analogy. Some are discussed in section 5. Others include the fact that "turn left" cannot be meant the same way as "turn 90 degrees counterclockwise". If someone turns 89 degrees he will probably feel that he has turned "left". A complete understanding system will need a formalism based on something like fuzzy logic (Zadeh, 1973). Similarly, relational phrases like "to the left of" need more than just the direct application of the axes system to show their meaning. For example, in Fig. 1, the box labelled I is to the left of the box labelled II. However, the left/right axis by which the relationship is established cannot be drawn from the center of II to the center of I. Instead, the two boxes must be projected onto the axis and then their projections can be compared. This leads into the realm of fuzziness, since box III is formally to the left of box II according to the definition, but one would like to say the relationship is less strong than the one between II and I. Nevertheless, the axis analogy appears to be sufficient to analyze the problems discussed in sections 2 and 3, and no further analysis is made of its shortcomings there.

[FIG. 1. Boxes I, II, and III positioned relative to a Left/Right axis.]

Every reference to "left" or "right" must be framed by some orientational system. We can call this the "frame of reference" phenomenon.§ Its existence becomes a problem whenever it is possible for axes to differ. Included here are man-machine interactions, since machines can have their own "lefts" and "rights".

§ In the linguistic literature this type of phenomenon has been labelled "place deixis" (Fillmore, 1966, 1975). This title is limited to uses of the speaker's or addressee's frame of reference. Its use is avoided here since the problems discussed in the remainder of section 2 go beyond this limit.

More than just speaker and addressee have axes. Those possessed by other people must also be considered in establishing a frame of reference. Consider where you would look if, when you were watching television, you saw a golfer at a tee through a camera behind the green and the announcer said "he hit the ball to the left". You could look for the ball on the side to your "left", or you could attempt to determine where the announcer is and use his left/right axis. But since the golfer has his own left/right axis, you could also pick it. A survey of 36 people showed 40% taking this option.

Objects also possess axes that must be considered. The best examples of problems here come from the other spatial axes, top/bottom and front/back. For example, a cereal box has an intrinsic "top" and "bottom". If one were lying on its side, an order to "stamp the price on top of the box" would be ambiguous. The "top" referenced could be the one inherent to the box or the one which is highest with respect to an observer's gravity-defined axis. Similarly, a classroom with its podium and blackboards has an identifiable "front" and "back". If a person in the classroom is not facing its "front" when he is asked to "move back", he could either move to the room's "back" or "back away" from what he is facing.

As the last example shows, the source of the frame of reference need not be one of the obvious participants in the speech act, i.e. the speaker, addressee, or referent. Instead, it can be an object in the environment. For another example, consider "John is sitting in front of Jane". This is a valid description of John being in a row ahead of the row in which Jane is sitting, even if she is facing the rear of the theater. This interpretation is based on the theater's front/back axis even though the theater is not mentioned anywhere. This indicates the important point that frame of reference is not a syntactic phenomenon.

The final dimension of the frame of reference problem involves time. You might tell a friend to "get onto Elm Drive and then take a right onto Fig Road". If your friend gets onto Elm Drive on the opposite side of Fig Road from where you expect him to, the turn he makes could easily be the opposite of the one you intended. The problem arises because your reference to "right" uses a non-current frame of reference.

To summarize, the source of the orientational systems that define spatial references can be the speaker, the addressee, and referenced entities, as well as objects simply existent in the environment. These systems can be established by sources current to the observation or even non-current ones. Deciding which system is being used is called the frame of reference problem.
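The hallway example can be made concrete with a small sketch. It is an illustration only, not part of the original analysis: the `Frame` class, the `direction` and `interpretations` helpers, and the convention of measuring headings counterclockwise in a flat two-dimensional world are all assumptions introduced here. The sketch enumerates the world direction that "left" would denote under each candidate frame and groups the frames by the direction they yield.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    """Hypothetical orientational system: who supplies it and which way it faces.
    heading_deg is measured counterclockwise from the world +x axis."""
    source: str
    heading_deg: float

# Angular offset of each term from the frame's own "front" (an assumed convention).
OFFSETS = {"front": 0.0, "left": 90.0, "back": 180.0, "right": -90.0}

def direction(frame: Frame, term: str) -> tuple:
    """World-coordinate unit vector that `term` denotes within `frame`."""
    angle = math.radians(frame.heading_deg + OFFSETS[term])
    # Round away floating-point noise; adding 0.0 turns -0.0 into 0.0.
    return (round(math.cos(angle), 6) + 0.0, round(math.sin(angle), 6) + 0.0)

def interpretations(term: str, frames: list) -> dict:
    """Group candidate frames by the world direction they assign to `term`."""
    readings = {}
    for frame in frames:
        readings.setdefault(direction(frame, term), []).append(frame.source)
    return readings

# Two people approaching each other in a hallway that runs along the x axis.
speaker = Frame("speaker", heading_deg=0.0)
addressee = Frame("addressee", heading_deg=180.0)

readings = interpretations("left", [speaker, addressee])
for vector, sources in readings.items():
    print(vector, sources)
```

More than one key in `readings` means the utterance is ambiguous until a frame is chosen. Further candidate frames (a golfer, a classroom, a remembered position on Elm Drive) could be added to the list in the same way.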
3. Conventions

In the last section, animate and inanimate objects were described as having "fronts", "backs", "tops", "bottoms", "lefts", and "rights". However, only humans possess language. It is up to speakers to decide how to apply these terms. There are a number of conventions on which these decisions can be based. Hence, a speaker's and an addressee's understanding of a spatial reference can differ even when the frame of reference is known. This section presents the conventions and how they can conflict.

The most common conventions are based on anthropomorphosis. This is the ascribing of human attributes to non-human beings or objects. These conventions center on the identification of human characteristics and the assignment of the axes in light of them. These characteristics include facial features, a standard orientation vis-à-vis gravity, and a predilection to movement in one direction with respect to the object's body. These properties are easiest to identify in other mammals.† Plants have their standard orientation with respect to gravity and are accordingly assigned top/bottom axes. Man-made objects are subject to these processes. The fact that cars generally move a certain way suggests frontness and hence the front/back axis. Some boxes, such as cereal boxes, have more prominent graphics on the side we label "front". Objects that rest most easily on one side have a standard vertical orientation and hence are given a "top" and "bottom".

† The fact that people speak of an animal's "left ear" or "right side" serves to point out that people are assigning properties where they are not inherent. Animals do not, in fact, naturally make a distinction between left and right (Corballis & Beale, 1971).

Opposed to the above conventions are what can be called familiarity conventions. When people spatially relate themselves to an object in one normal or predominant way, they tend to assign axes to it according to that relation. The relation could, for example, be based on viewing the object, wearing it, or operating it. The "top" of a page or painting is the edge that would normally be highest in people's retinal image when they view it. The "left" burner on a stove is the one that would be to the human's left when he operates the stove. Containers have their top/bottom axes defined by the side which is kept highest when they are open.

The front/back axis conventions show two different distributions. When a human makes a close association between himself and an object, it tends to pick up his front/back axis arrangement directly. This can happen when the predominant spatial relation is such that the human is contained in the object or it is attached to him. For examples, consider chairs, lecture halls, pants, shirts, glasses, telescopes, and trumpets. When the relation is "looser", the side the human faces in his normal position is the "front" and the side opposite is the "back". Examples here are paintings, televisions, stoves, dressers, and pianos.

The above conventions can also be seen to apply to objects' unessential extrinsic properties. For example, a plain symmetrical table such as a folding or bridge table has no intrinsic "front" or "back". If such a table were pushed up against a wall, people would feel free to refer to something as being on the "front" or "back" of the table. Rocks have no inherent front/back axis. If a rock were partially buried in a hillside, people could refer to someone who is buried "behind" it or standing "in front of" it.

The familiarity conventions often apply to extrinsic properties with reference to the front/back axis. If a photographer were heard to say to his subject "stand in front of the tree", it would be clear, since trees have no intrinsic front/back axis, that he was telling his subject to stand on the side of the tree the photographer was facing. Here a loose relationship arises between the human and tree. The relation that defines the axis is simply the present position of the observer vis-à-vis the object.

The existence of the varied criteria for making spatial references leads to conflicts. Some ice cream cartons have two sides which can be opened. The one a particular person is familiar with becomes its "top" to him. The side from which a person usually approaches a desk becomes its "front". The way in which a person usually orients himself with respect to a stage gives him his idea of its "left" and "right". An anthropomorphic convention conflicts with a familiarity convention in identifying the "front" of a mobile home where the side by which it is towed is different from the one by which it is entered. Finally, 10% of people asked anthropomorphize a dresser and identify its left drawer as the one to their right as they are opening it.

The sensitivity to extrinsic properties brings many conflicts. If a truck were rolling down a hill backwards, many people would say that a person in its path was "in back of" it. However, based on the truck's actual motion, many other people would say that the person was "in front of" it. The most common problem arises with loose connections between observer and object. Apparently any being or thing is subject to this convention. This means that when an object has an inherent front, ambiguities can arise. If a photographer says "stand in front of the car", he could intend for the subject to stand between him and the car, or next to the side with the grill and headlights.
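The photographer examples suggest a simple check a system could run before acting on "in front of". The sketch below is a hypothetical illustration under strong simplifying assumptions: objects are points in a plane, an intrinsic front is stored as a unit vector, and only two conventions (intrinsic and observer-facing) are modelled. The names `Thing`, `front_readings`, and `is_ambiguous` are introduced here and do not come from any existing system.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vector = Tuple[float, float]

@dataclass
class Thing:
    """Hypothetical object record for this sketch."""
    name: str
    position: Vector
    intrinsic_front: Optional[Vector]  # unit vector, or None if no inherent front

def front_readings(thing: Thing, observer_pos: Vector) -> dict:
    """Map each applicable convention to the direction it calls the thing's 'front'."""
    readings = {}
    if thing.intrinsic_front is not None:
        # Anthropomorphic or close-familiarity convention: the object's own front.
        readings["intrinsic"] = thing.intrinsic_front
    dx = observer_pos[0] - thing.position[0]
    dy = observer_pos[1] - thing.position[1]
    norm = (dx * dx + dy * dy) ** 0.5
    if norm > 0:
        # Loose familiarity convention: the side currently facing the observer.
        readings["observer-facing"] = (dx / norm, dy / norm)
    return readings

def is_ambiguous(readings: dict, tol: float = 1e-6) -> bool:
    """True if any two conventions disagree about where the 'front' is."""
    vectors = list(readings.values())
    return any(
        abs(a[0] - b[0]) > tol or abs(a[1] - b[1]) > tol
        for a in vectors
        for b in vectors
    )

photographer = (0.0, -5.0)
tree = Thing("tree", (0.0, 0.0), None)       # no inherent front
car = Thing("car", (0.0, 0.0), (1.0, 0.0))   # grill points along +x

print(is_ambiguous(front_readings(tree, photographer)))
print(is_ambiguous(front_readings(car, photographer)))
```

For the tree only the observer-facing reading exists, so the request is clear. For the car the two readings differ unless the grill happens to face the photographer, which is the conflict described above.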
4. Path structures

The discussion in the last two sections uses the analogy of a triple set of axes to describe orientational systems. The intended impression was that the relations being discussed could be handled by thinking of straight lines either projecting from an object or floating in space. This analogy breaks down in many cases.

Consider the example of one person being "in front of" another in line. It is a common occurrence for lines at popular shows to bend around corners. In this case, one person could be said to be "in front of" another when not straight in front. Consider a highway intersection where either a turn is possible or the road could be followed further. Imagine that the road curves after the intersection such that a driver coming out of the intersection must turn his steering wheel to stay on the road. In this situation, the instruction to the driver to "go straight ahead" should not be interpreted as implying that a straight line of movement is required. Likewise, "St. Louis is ten miles down the road" refers to measurements taken along the highway.

"Back" and "backward" present a unique problem. On a roadway, the instruction "back up" can be interpreted, depending on the situation, in the same way as "move in the direction opposite to the way you are facing", "go backward following the road", or, interestingly, "go back along the way you went forward". Hence, "back" and "backwards" have a special interpretation that involves the object's previous movement.

All these examples require that allowance be made for an orientational system separate from the axial one. The name "path structures" for these systems serves to indicate their nature. With this structure it is possible to describe the meaning of the terms "before", "after", "past", and one sense of "beyond". "Get off before the bridge" means to get off the road at some point less far along the path than the bridge.

There is a frame of reference problem in a different sense here. To talk of a point on a journey and relate other points to it requires some way of establishing the direction of approach to that point. This can be done by identifying a point as a frame of reference. For example, the frame is clear in such sentences as "from here, the turn off is before the bridge" and "go from the crossroad till past the bridge". As before, there can be conflicts when the assumptions made by a speaker and addressee differ.
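One way to picture a path structure is as a polyline along which positions are compared by distance travelled rather than by straight-line geometry. The following sketch makes that assumption explicit. The function names, the example road, and its landmarks are hypothetical, and a real road model would need far more than vertex lists.

```python
import math

def arc_lengths(path):
    """Cumulative distance travelled along a polyline, one value per vertex."""
    totals = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        totals.append(totals[-1] + math.hypot(x1 - x0, y1 - y0))
    return totals

def is_before(path, i, j):
    """True if vertex i is 'before' vertex j for a traveller starting at path[0]."""
    lengths = arc_lengths(path)
    return lengths[i] < lengths[j]

# A road that bends twice: start, turn off, bridge, town (assumed landmarks).
road = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
START, TURN_OFF, BRIDGE, TOWN = 0, 1, 2, 3

along_road = arc_lengths(road)[TOWN]
straight_line = math.hypot(road[TOWN][0] - road[START][0],
                           road[TOWN][1] - road[START][1])
print(along_road, straight_line)            # "down the road" versus as the crow flies
print(is_before(road, TURN_OFF, BRIDGE))    # approaching from the start

# The same road approached from the town: the direction of approach is the frame.
reverse = list(reversed(road))
print(is_before(reverse, reverse.index(road[TURN_OFF]), reverse.index(road[BRIDGE])))
```

Reversing the polyline flips the answer to "is the turn off before the bridge?", which is the path-structure form of the frame of reference problem: the relation is only defined once the direction of approach is fixed.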
5. Consequences for machine control

In the last three sections, some of the bases for establishing the meaning of spatial references were presented. These were seen to allow ambiguities to arise from many different sources. The outstanding question is how these ambiguities can be allowed for in natural-language machine-control systems. Three types of approaches and their failings will be considered.

One extreme family of solutions is to avoid utterances from which any ambiguities could arise. Terms which potentially allow multiple interpretations and those structures which are syntactically ambiguous would have to be disallowed. This is standard with programming languages. For example, the numerical-control language APT only allows references to the six major directions of movement in a few constructs.† APT, however, is used only for reference to the position of a single cutting tool in space. The frame of reference is always the tool and the axes are always defined by its most recent movement. Unfortunately, all but the simplest environments will require the abandonment of even the references APT allows. Included here are any of the environments described in the introduction. To strictly avoid problems, the only alternative is to use references to a fixed co-ordinate system, e.g. "go to 10 comma 3 comma 5". This, however, removes much of the advantage of natural-language control, since adaptation to such a co-ordinate system would require extended training and inhibit spontaneous control. Hence this type of approach does not appear feasible.

† Through the commands GOLFT, GORGT, GOFWD, GOBACK, GOUP, and GODOWN, the APT programmer has access to the triple axis system.

An approach at the other extreme is to allow whatever language is natural to the situation. It can be expected that a natural sub-language that minimizes possible ambiguities exists or will arise for each environment. However, the key word here is "minimize". Left with more than one possibility, the system must either give up or guess at the meaning. Taken too often, the first alternative would make the system too frustrating to use. Two difficulties are incumbent upon the second alternative. First, a wrong guess would be unacceptable in many situations, e.g. in a radioactivity laboratory. Here powerful heuristics must be developed to watch for error conditions. Secondly, however, the development of the necessary heuristics for disambiguation has proven difficult in many situations.‡ This type of approach does not presently appear practical in general.

‡ The problem of developing heuristics for disambiguation is the central argument for the infeasibility of practical natural-language understanding. Practical systems for limited domains have been argued for (Thompson & Thompson, 1975). However, the arguments have involved the identification of a natural subclass of the complete language. This can be seen to be more like our third alternative.

A compromise appears to be the best alternative. The language used must strictly avoid ambiguities but it must be composed of English forms. To do this, we must include semantic restrictions as well as lexical and syntactic ones. If these restrictions are based on natural tendencies and expressiveness is maintained, the language will have many of the advantages of unrestricted English. However, this goal is not easy to achieve.

There are some syntactic structures which, though subject to ambiguities, have strong tendencies towards specific meanings. In particular, references that concretely relate to the speaker's position strongly suggest the speaker's frame of reference. For example, "the box to my left" is very clear. With respect to conventions, it is easy to train a controller that the machine will assume the controller's inherent axes system.

A controller must, however, be able to relate locations to positions other than his own. Overt reference to the machine's location, as in "the box to your left", does have a preferred frame of reference. However, the ordering conventions require more training. With some expressions, only minimal training in a predefined axes system is necessary. Avoiding productive phenomena, such as the establishment of the front/back axes on extrinsic properties, will force more training for some other expressions. It is possible that acceptable performance is not achievable for these. Even if it were, it is unlikely that this technique could be extended. To relate a reference to any overtly mentioned object, a controller must be trained to associate a specific axes system with each object. Clearly, there is a limit to how many he could easily be taught. Some other means must be available if expressiveness is to be extended.

The most reasonable method of allowing a larger number of references requires using language that places the addressee in a determinate location. This allows his "new" perspective to be referenced, e.g. "the side of the room on your left when you face the blackboards". But as this last example shows, this language is considerably more structured and "artificial" than earlier forms. The situation becomes worse with references to the future, e.g. "turn towards what would be your left if you were on Elm coming from Newton Street".

It is also necessary to train controllers to identify uses of the path and axes systems when confusing them is consequential. Extra language could be added to both, e.g. "straight ahead" or "ahead on the road". Alternatively, the axes system could be the standard case and substitute language used for reference to path, e.g. "on the road, between us and the bridge" and "ahead of us". Either way the language must become more complex.

The conclusion that can be reached from these examples is that for unambiguous spatial reference based on semantic restrictions on natural forms, expressiveness is inversely proportional to the naturalness of phrasing. The consequence for the designer of a system using this approach is that he must decide how expressive a controller in his environment must be and how much unnaturalness the controller can accept.
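The compromise approach can be illustrated with a deliberately tiny restricted language. The sketch below is hypothetical: it accepts only references of the form "the X to my/your left/right", where the possessive names the frame overtly, and it rejects the bare form "the X to the left" instead of guessing. The pattern, the `FRAME_OF` table, and `parse_reference` are assumptions made for illustration and do not reproduce any system discussed in the paper.

```python
import re

# Accepted form: "the <noun> to my|your left|right" (an assumed mini-grammar).
REFERENCE = re.compile(
    r"^the (?P<object>\w+) to (?P<whose>my|your) (?P<side>left|right)$"
)

# The possessive fixes the frame of reference; the machine is the addressee.
FRAME_OF = {"my": "controller", "your": "machine"}

def parse_reference(text: str) -> dict:
    """Parse a spatial reference, refusing any form that leaves the frame open."""
    match = REFERENCE.match(text.strip().lower())
    if match is None:
        raise ValueError(f"not in the restricted language: {text!r}")
    return {
        "object": match["object"],
        "frame": FRAME_OF[match["whose"]],
        "side": match["side"],
    }

print(parse_reference("the box to my left"))
print(parse_reference("the box to your left"))

try:
    parse_reference("the box to the left")
except ValueError as err:
    print("rejected:", err)
```

Each extension of the pattern, such as "when you face the blackboards", would buy expressiveness at the cost of a longer and less natural phrase, which is the trade-off described in this section.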
6. Conclusion

It has been shown that spatial references can be framed by orientational systems supplied by a large variety of sources. The systems were shown to have their structure established by different conventions. Anthropomorphic properties and the orientation in which people are familiar with an object were the two classes of conventions described. These were shown to apply to intrinsic and extrinsic properties. The systems in which comparisons are made were shown to be axes-like or path-like in nature.

The existence of these spatial reference phenomena was shown to be important in natural-language man-machine interaction because of the many possibilities they present for ambiguous interpretation of references. It was argued that the best means of allowing for this in natural-language machine-control systems is by restricting the syntactic and semantic structure of spatial references. However, this was shown to introduce trade-offs between naturalness and expressiveness.

These results are positive in that they help allow for spatial reference. They are also in part negative. They indicate that spatial reference is more complex and more difficult to allow for than perhaps thought before. It might be argued that there must be other phenomena with similar trade-offs where the necessary compromises leave practical systems impossible. However, this has not yet been proven. Until it has, the benefits from their development make the continued investigation of natural-language machine-control systems worthwhile.

The guidance of Richard L. Venezky is gratefully acknowledged.
References

CORBALLIS, M. C. & BEALE, I. L. (1971). On telling left from right. Scientific American, 224 (3), 96-104.

ERMAN, L. D. (ed.) (1974). IEEE Symposium on Speech Recognition: Contributed Papers. New York: IEEE.

FERRELL, W. R. & SHERIDAN, T. B. (1967). Supervisory control of remote manipulation. IEEE Spectrum, 4 (10), 81-88.

FILLMORE, C. J. (1966). Deictic categories in the semantics of "come". Foundations of Language, 2, 219-227.

FILLMORE, C. J. (1975). Santa Cruz Lectures on Deixis. Bloomington: The Indiana University Linguistic Club.

HEER, E., WIKER, G. A. & KARCHAK, A., JR. (1975). Voice controlled adaptive manipulator and mobility systems for the severely handicapped. Second Conference on Remotely Manned Systems (RMS): Technology and Applications, June 9-11.

NEWELL, A., BARNETT, J., FORGIE, J., GREEN, C., KLATT, D., LICKLIDER, J. C. R., MUNSON, J., REDDY, R. & WOODS, W. (1973). Speech-Understanding Systems: Final Report of a Study Group. Amsterdam: North-Holland Publishing Company.

ROSEN, C. A., NITZAN, D., AGIN, G., ANDEEN, G., BERGER, J., ECKERLE, J., GLEASON, G., HILL, J., KREMERS, J., MEYER, B., PARK, W. & SWORD, A. (1974). Exploratory Research in Advanced Automation. Second Report, Grant GI-38100X, SRI project 2591, Stanford Research Institute, Menlo Park, California, December.

THOMPSON, F. B. & THOMPSON, B. H. (1975). Practical natural language processing: the REL system as prototype. Advances in Computers, 13 (M. Rubinoff and M. C. Yovits, eds.).

ZADEH, L. A. (1973). Outline of a new approach to the analysis of complex systems and decision processes. IEEE Transactions on Systems, Man and Cybernetics, SMC-3 (1), 28-44.