Copyright © IFAC Analysis, Design and Evaluation of Human-Machine Systems, Kassel, Germany, 2001
INTERACTIVE 3D OBJECT MINING

Jochen D. Wickel, Pablo Alvarado, Thomas Krüger, Karl-Friedrich Kraiss

Lehrstuhl für Technische Informatik, RWTH Aachen, Ahornstr. 55, 52074 Aachen, Germany
Abstract: This paper presents an object recognition system which alleviates the task of finding information related to 3D objects by acting as a mediator between user and database. The system can be adapted to various automation and information retrieval tasks. Experimental results show that the system is able to discriminate up to approximately 200 different objects.

Keywords: object recognition, computer vision, visual pattern recognition, user interfaces, automation
1. INTRODUCTION
Data mining is concerned with finding information in large databases. There has been a considerable amount of research involving the retrieval of text. Recently, progress has been made in the area of image retrieval or picture mining (del Bimbo, 1999). Databases containing data of three-dimensional objects will play an important role in the near future, especially in areas like CAD, virtual warehouses, or interactive product catalogues. Traditional database-query mechanisms, however, are inadequate for these kinds of data.

Suppose, for instance, a user has a database containing information on some kind of three-dimensional objects. Presently, retrieving this information requires a textual description of the object. Such a description is often difficult to formulate and does not conform to the way humans usually refer to 3D objects. A more intuitive way to perform a query is presenting the object itself or some kind of visual representation as a key to access the database (Fig. 1) and formulating a query relating to the visual description ("Show me the record of this object").
Fig. 1. Accessing a database of 3D objects by a visual description of the object.

The proposed approach for reaching that goal is to use object recognition technology as the first stage in a database search engine. The user presents arbitrary views of the object to the recognition system, which returns a list of numerical IDs of similar objects. If the presented object is among the objects returned as the recognition result, the query can be continued as a traditional query on the database using the discovered object identity (Fig. 2). If the recognition is not successful on the first try, the user can either perform additional presentations,
thereby supporting the object recognizer, or submit additional data such as the object weight, thereby reducing the search space in the visual domain.

Fig. 2. The application framework for object recognition as a tool for user interfaces.

Most object recognition systems are fine-tuned to special tasks. In automation, for example, 2D pattern matching mechanisms are usually sufficient to detect and identify specific objects in a production line. In other applications, where the recognition system is supposed to assist a human user, it is not possible to restrict the viewpoint from which the object is to be identified. This situation arises, for example, when items remitted from the dealer to the manufacturer must be identified by humans.

Applications for such a system can be found in those areas of logistics where objects need to be identified. Usually, this is accomplished by simply evaluating a bar code label attached to the object. If such a label is missing, however, it is very difficult to acquire the object's identification code. For example, there are objects which are too small to be labeled or whose labels can get lost during transportation. In these cases, identification has to be performed manually, which is both difficult and error-prone. A human has to compare the object to depictions contained in a printed or electronic catalogue and find the correct identification.

The system presented here aims to alleviate this task by offering an intuitive user interface for 3D object catalogues. The system consists mainly of a kernel containing modern object recognition technology which can be accessed by various kinds of client modules. Using a Web-based client, a user is able to search for objects in remote electronic catalogues by presenting a sample view.

2. RELATED WORK

The system described here stems from a combination of two widely spread topics of research: Object Recognition (OR) and Visual Information Retrieval (VIR). Both fields try to make a computer system recognize objects or images the same way humans do. A considerable amount of work has already been done by many research groups. A summary of important topics is presented in the next paragraphs.

2.1 Object recognition

One can distinguish between object-centered and viewer-centered object recognition. Object-centered recognition matches the three-dimensional models contained in the database with a view of the presented object. In contrast, viewer-centered object recognition does not use any explicit 3D information, but rather correlates the query image with stored views of the objects labeled with the object identification. The most prominent approach does not store the images directly, but rather computes feature vectors from these views which are invariant against in-plane transformations and robust against distortions like partial occlusion. During recognition, the same kinds of features are extracted from the query image and subsequently classified using statistical or artificial intelligence algorithms. The result of this process is either the identification of the recognized object or a list of possible objects.

A system that can cope with the presentation of an arbitrary view of an object is MIT's DyPERS (Schiele and Pentland, 1999), which uses multidimensional receptive field histograms for statistical object recognition and localization. SEEMORE (Mel, 1997) uses a combination of color, texture and shape descriptions for viewer-centered object recognition. A system which combines approaches from OR and VIR is Nefertiti (Paquet and Rioux, 1999), an access engine for content-based retrieval of images and three-dimensional models.

AXON 2 (Elsen, 2000; Elsen et al., 1999; Walter, 2000) is a very flexible system that is able to combine different features and classification methods. Unlike most other systems, it is easily adaptable to any kind of object set. It is the basis for the system presented here, which uses AXON 2's object recognition technology but improves the practical applicability of the system by providing a network interface and separating the object recognition functions from the user interaction modules.

2.2 Visual Information Retrieval

Retrieving multimedia information has become an increasingly active area of research in recent years. The appearance of multimedia data has spawned a tremendous amount of interest in the retrieval of non-textual data of any kind, including visual data like images or video data, and aural data like speech, music, or sound samples. The techniques employed in Visual Information Retrieval are closely related to the ones used in viewer-centered object recognition. Both fields rely heavily on the extraction of meaningful features which can be used to correlate the query image with some kind of stored data. Examples of research systems using such methods are VisualSEEk (Smith and Chang, 1996), Photobook (Pentland et al., 1994) and Piction (Srihari, 1995). Prominent commercial VIR systems include Virage (Gupta and Jain, 1997) and QBIC (Flickner et al., 1995). All these systems share the common goal of allowing database queries by image content. However, none of them addresses the retrieval of 3D objects.

Fig. 3. The structure of the OR server and two different kinds of user interfaces.

Fig. 4. An example of a plan macro in AXON 3. The topmost plan item receives an image from the environment. The image is segmented into object and background by another plan macro. The segmented, illumination-invariant image is then passed on to feature extraction. Each feature is classified by an RBF network, whose results are combined and sent back to the environment.
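As a toy illustration of such a viewer-centered pipeline (segmentation omitted), the sketch below ranks stored object IDs by comparing color histograms, which are invariant against in-plane rotation and translation. The histogram feature and nearest-view matching are illustrative stand-ins for the actual features and RBF classifiers; all names are hypothetical.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize an RGB image (H x W x 3, uint8) into a normalized color
    histogram. The histogram is invariant against in-plane rotation and
    translation of the object, and robust against small distortions."""
    idx = (image.astype(int) // (256 // bins)).reshape(-1, 3)
    flat = idx[:, 0] * bins * bins + idx[:, 1] * bins + idx[:, 2]
    hist = np.bincount(flat, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def recognize(query_image, training_views):
    """Return object IDs sorted by decreasing similarity to the query.
    training_views: list of (object_id, image) pairs, possibly several
    views per object; the best-matching view scores for its object."""
    q = color_histogram(query_image)
    scores = {}
    for obj_id, view in training_views:
        # histogram intersection: 1.0 for identical color distributions
        s = float(np.minimum(q, color_histogram(view)).sum())
        scores[obj_id] = max(s, scores.get(obj_id, 0.0))
    return sorted(scores, key=scores.get, reverse=True)
```

During recognition the same feature is extracted from the query view and matched against all stored views, yielding a ranked list of candidate objects rather than a single identification.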
3. THE OBJECT RECOGNITION SERVER

The object recognition system presented here is designed using a client-server structure, allowing the integration of different kinds of object recognition techniques in a user-friendly interactive environment for retrieving information on real-world 3D objects. The system has been called AXON 3 (AXON 2 for Networks) and has been derived from the AXON 2 system (Adaptive eXpert system for Object recognition with Neural Networks) described in (Elsen et al., 1999). It is an adaptive viewer-centered object recognition system. The system is trained on several sample views of each object. By generalization, it is able to recognize the object from an arbitrary viewpoint.

3.1 System structure

The OR server consists of the following modules (Fig. 3):
(1) an input/output module
(2) an image preprocessing module
(3) a feature extraction module
(4) a classification module, deploying various kinds of neural networks
(5) a combination and analysis module

The server is connected to a client, which may either be a dedicated client program running locally as a stand-alone application, or a WWW gateway acting as a data relay station for receiving queries from clients and sending back the query results.

Each of the server modules provides a variety of basic blocks, each of which can compute a function of its input arguments. There are blocks that control data flow; others perform input/output of images, feature vectors, or both. A functional block has zero or more input arguments and zero or more output arguments. The whole system is organized as a reconfigurable data flow machine. Each configuration can be created using a graphical user interface, the so-called plan editor. As an example, Fig. 4 shows the graphical representation of a part of the data flow graph for an object recognition task. The recognition is initiated by a client, which sends the recognition request and an image showing the object to the OR server. This image enters the graph at the node labeled "environment". The OR server then performs some image processing and classification functions, resulting in several recognition results based on different feature
types. These single results are combined into an overall result and sent back to the gateway (not shown). The gateway can either perform additional queries to a traditional database or it can simply send the result back to the user.
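The combination of single-feature results into an overall result can be sketched as follows; the per-feature score dictionaries and the simple averaging rule are illustrative assumptions, not the combination scheme actually implemented in the analysis module.

```python
def combine_results(per_feature_scores):
    """Merge several single recognition results, one per feature type,
    into one overall ranking. Each input maps object IDs to a
    similarity score in [0, 1]; objects missing from a result score 0."""
    objects = set()
    for scores in per_feature_scores:
        objects.update(scores)
    combined = {
        obj: sum(scores.get(obj, 0.0) for scores in per_feature_scores)
             / len(per_feature_scores)
        for obj in objects
    }
    # overall result: object IDs sorted by decreasing combined score
    return sorted(objects, key=combined.get, reverse=True)
```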
The object recognition procedure is defined by a querying protocol. Therefore, it is possible to create different kinds of GUIs for different types of users, or different user locations. As an example, consider the WWW gateway: It receives a request from the user, selects the corresponding data flow configuration, sends all the data to the OR server and collects the result. Then it presents the result to the user and processes any further queries.
3.2 Prototype implementation
A screenshot of the Web interface prototype is shown in Fig. 5.¹ In order to allow using the system without any need for special hardware or software, the Web demonstration uses URLs for obtaining the query image.
Fig. 5. A screenshot of the Web page showing the user interface of AXON 3 after acquisition of the query image. The user can select the action that is to be performed with the displayed image.
The user may either choose a demo image from the test image repository or provide a URL of any image accessible via HTTP. Currently, there are two requirements for an image to be processable by AXON 3: the background has to be black with a patch of white in the upper right corner, and it must have a size of 256 × 256 pixels. In a next step, the user is prompted to select the action the OR server is supposed to perform (Fig. 5). Each action is internally represented by a different plan.
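A preflight check for these two requirements could look like the sketch below; the patch geometry and the intensity thresholds are guesses for illustration, not values from the actual system.

```python
import numpy as np

def check_query_image(image, patch=16, dark=40, bright=215):
    """Verify the two stated requirements for a query image: the image
    must be 256 x 256 pixels, with a black background and a white patch
    in the upper right corner. Patch size and thresholds are
    illustrative assumptions."""
    if image.shape[:2] != (256, 256):
        return False
    gray = image.mean(axis=2) if image.ndim == 3 else image
    # the upper right corner patch must be bright ...
    if gray[:patch, -patch:].mean() < bright:
        return False
    # ... while the image border elsewhere should be close to black
    border = np.concatenate([gray[0, :-patch], gray[-1, :], gray[:, 0]])
    return bool(border.mean() < dark)
```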
If an object recognition task is performed, the system will present a catalog of all objects, sorted by similarity to the presented object (Fig. 6).

4. EXPERIMENTAL RESULTS

The applicability of object recognition as an interface tool for 3D databases obviously depends on its recognition correctness. However, for interactive applications it is not generally necessary to perform a unique identification. As outlined in the introduction, we are primarily concerned with finding similar objects. Therefore, it is sufficient if the system presents the most similar objects at the top of the result list. For evaluating recognition performance, it is important to incorporate the position of the desired object in the ranking list. For that purpose, the concept of k-best recognition is used, which is basically a function that maps the rank k to the rate of the desired object being at rank k or higher in the result list.
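The k-best rate defined above can be computed directly from the ranked result lists; this minimal sketch assumes each test query returns a full ranking of object IDs.

```python
def k_best_rate(results, k):
    """k-best recognition rate: the fraction of test queries whose true
    object appears at rank k or higher (rank 1 = most similar) in the
    ranking returned by the recognizer.
    results: list of (true_id, ranked_ids) pairs."""
    hits = sum(1 for true_id, ranking in results if true_id in ranking[:k])
    return hits / len(results)
```

By construction the rate is non-decreasing in k, which matches the behaviour of the rates reported in Table 1.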
_ 1 : 0l0Z4G-r
RaIl< 3: ZZOJOZseepf.."
!IShOW successors I
Fig. 6. A recognition result derived by AXON 3. The top image is a shrunken version of the source image. The row of images at the bottom represents the three most similar objects. The object ranked first (shown at the left) is the presented object.
¹ The prototype system can be accessed at the URL http://www.techinfo.rwth-aachen.de/Forschung/Axon/demo/
Table 1. Recognition rates for three different object sets.

object set           1-best   2-best   3-best
COIL-100             100%     100%     100%
42 plush objects     98.8%    99.9%    100%
202 plush objects    87.9%    93.8%    95.8%
The system has been tested on three different object sets. The first one is the COIL database (Nene et al., 1996), consisting of 100 objects, each one represented by 72 views. 36 of these views were used for training, the remaining ones for testing. The second set is composed of views of 42 plush animals. Training was performed using 35 random views for each object; the test set consisted of 840 novel random views, 20 for each object. The third set contains views of 202 plush animals. Figure 7 shows catalog images of some of the objects. 40 views of each object, taken from camera points uniformly distributed on an upper hemisphere above the object, served as training set. The test set consisted of 36 novel views for each object. Tests were performed using only a single recognition cycle, i.e. the system only used one view.

Fig. 7. Catalog images showing some of the 202 plush objects used for the recognition tests. Some objects exhibit strong similarities, e.g. the two teddy bears in the top right corner, which differ only in object size.

Table 1 shows the k-best recognition rates for these three sets. The objects in the COIL database could be identified completely without error. The 42 plush objects are harder to recognize because of strong similarities among some of them. This effect is even stronger in the database containing 202 plush objects. Considering this, the results obtained for this database suggest that viewer-centered object recognition is a promising approach for object sets containing several hundred objects. For larger object sets, an easy way of improving the recognition performance is to use two different query images showing the same object.

Without this technology, a human user would be required to look for the object identification in some kind of catalog manually, which may take several minutes. If user interface and server are located on the same machine, the OR system enables the user to find the object in less than two seconds.

5. CONCLUSIONS AND FUTURE WORK

There are applications which require humans to retrieve information concerning three-dimensional objects. If these objects lack a correct product identification, e.g. a bar code, this task currently can only be performed in a non-intuitive and rather cumbersome way.

This paper suggests a mechanism which uses the intuitive notion of presenting a visual representation of an object as a search key for the database. For this purpose, an object recognition system can support the user in formulating queries by identifying an arbitrary view of the object. Using such a system as a mediator between the user and the database can alleviate difficult identification tasks.

The system has demonstrated its applicability for an object set of medium size. To handle object sets of thousands of different objects, as required for real-world applications, it is going to be extended and improved.

ACKNOWLEDGEMENTS

This research has been funded by the Heinz-Nixdorf Foundation. The test objects were kindly provided by Margarethe Steiff GmbH.
6. REFERENCES

del Bimbo, A. (1999). Visual Information Retrieval. Morgan Kaufmann, San Francisco.
Elsen, I. (2000). Ansichtenbasierte 3D-Objekterkennung mit erweiterten selbstorganisierenden Merkmalskarten. VDI-Verlag, Düsseldorf.
Elsen, I., K.-F. Kraiss, D. Krumbiegel, P. Walter and J. Wickel (1999). Visual Information Retrieval for 3D Product Identification. Künstliche Intelligenz 1/99, 64-67.
Flickner, M., H. Sawhney, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele and P. Yanker (1995). Query by image and video content: The QBIC system. IEEE Computer 28(9), 23-32.
Gupta, A. and R. Jain (1997). Visual Information Retrieval. Communications of the ACM 40(5), 71-79.
Mel, B. W. (1997). SEEMORE: Combining Color, Shape, and Texture Histogramming in a Neurally Inspired Approach to Visual Object Recognition. Neural Computation 9(4), 777-804.
Nene, S. A., S. K. Nayar and H. Murase (1996). Columbia Object Image Library (COIL-100). Technical Report CUCS-006-96, Columbia University.
Paquet, E. and M. Rioux (1999). Nefertiti: A query by content system for three-dimensional model and image databases management. Image and Vision Computing 17, 157-166.
Pentland, A., R. W. Picard and S. Sclaroff (1994). Photobook: Tools for content-based manipulation of image databases. In: SPIE Storage and Retrieval of Image and Video Databases II. San Jose.
Schiele, B. and A. Pentland (1999). Probabilistic object recognition and localization. In: Proc. Int. Conf. on Computer Vision - ICCV'99. pp. 177-182.
Smith, J. R. and S.-F. Chang (1996). VisualSEEk: A fully automated content-based image query system. In: Proceedings ACM Multimedia '96 Conference. Boston, MA.
Srihari, R. K. (1995). Automatic Indexing and Content-Based Retrieval of Captioned Images. IEEE Computer, pp. 49-56.
Walter, P. (2000). Verfahren der sequentiellen Merkmalsanalyse für die Mustererkennung. Shaker-Verlag, Aachen.