Expert Systems With Applications, Vol. 7, No. 3, pp. 427-439, 1994
Copyright © 1994 Elsevier Science Ltd. Printed in the USA. All rights reserved. 0957-4174/94 $6.00 + .00

Pergamon

A Framework for the Integration of Expert Systems With Multimedia Technologies

ARCOT DESAI NARASIMHALU
Institute of Systems Science, National University of Singapore, Singapore 0511

Abstract--Expert system (ES) technology allows the capture and replication of expertise in an application domain. So far, the type of data that is used by an expert system for its reasoning process remains primarily alphanumeric. Image, graphics, and other dynamic data such as voice and video are part of an emerging technology generally termed multimedia technology. This article highlights the critical issues in building a successful application combining the two technologies. Based on these critical issues, it also provides a framework for the integration of multimedia technology with expert systems technology.

1. INTRODUCTION

MULTIMEDIA (MM) technology is defined as the combination of two or more component technologies such as audio, video, image, graphics, and alphanumeric data (Narasimhalu & Christodoulakis, 1991). What sets multimedia technology apart from television and other technologies is the extent of interactivity and user modeling it provides.

Building an expert system involves a series of procedures that are generally divided into two steps. The first step, called registration, includes extracting salient features from an application data set, classifying them, building indexes on them, and clustering related pieces of information together for optimal performance. The second step, called recall, involves parsing a query, deciding the type of inferencing or reasoning, retrieving relevant data for reasoning, matching the query with the retrieved data through a relevant reasoning process, and presenting the results.

The main challenge in integrating expert system technology with multimedia technology lies in understanding the unique and new requirements (discussed later) that the new data types impose on each of the aforementioned processes. Proper representation methods for the salient features of a multimedia application's data set for purposes of reasoning are a new requirement in the design of an architecture for integrating expert systems with multimedia technologies.

Extracting features from an image, such as an X-ray of a tumor, requires expert knowledge. Feature extraction is application dependent and requires the capture of appropriate expertise. Accurate classification of features extracted from multimedia data is a nontrivial task; it can be fuzzy and would require expert assistance. Inference engines using such classification have to reckon with fuzzy values and may indeed produce fuzzy results. Building such inference engines and interpreting these fuzzy results into something useful to end users are also new requirements.

Consider, for example, an expert system that can help police identify a suspect in a crime based on the description provided by the victim. In this application, the first difficulty is that the description provided by the victim may be incomplete because of the trauma that he or she underwent during the crime. Second, the description may not be accurate, given that the assailant would not have given the victim sufficient time to study him or her in any detail. Thus, the input to the expert system is likely to be imprecise. In this example, the attributes that a victim can easily remember, such as hairstyle, may not be the most reliable ones. It requires experts in crime investigation to identify what dominant attributes should be extracted and what confidence factors should be assigned to each of these attributes. Any attempt to classify the assailant using an artist's sketch and to compare such a sketch with the photos of known criminals requires development of new technologies that can successfully map a graphical sketch of a human face to an image.

In some applications (such as the one just described), while an expert system is narrow in its focus, the underlying data may be shared among other related expert


systems. For example, criminals are also citizens of a country and hence the country's national registry will have some overlapping data. The two sets of data will have to be linked for building applications by a forensic department, in which an expert system will use information about victims, criminals, crime type, etc. to confirm the identity of a victim.

We will use expert systems built around a collection of photographs of human faces, fingerprints, text descriptions, and graphical sketches of human faces as the examples for discussions throughout this article. Such an application is definitely multimedia in nature as per our definition (it has more than one data type). The data set can be enriched further by adding voice prints and genetic codes to make it even more diverse. We will discuss more than one expert system built on the same data collection to highlight the potential for building views into expert system technology.

In discussing various issues with respect to expert systems built around a collection of human faces, we will be referring to three sets of landmarks as described later. The first set of landmarks is called visual landmarks. These are basically a user's observations about some salient features on a face. These features can be scars, moles, or even verbal descriptions of a feature such as an eye described as being big, medium, or small. A second set of landmarks is called anthropometric landmarks (Jurgens, Aure, & Pieper, 1990). These refer to physical measurements on the skin of certain predefined points on a human face. A third set of features is referred to as cephalometric landmarks. These are the physical measurements of certain predefined parts on the bone structure of a human face. They give measurements of a face based on the skull only and are normally measured using either X-rays or CT-scans.

Multimedia technology can help expert systems technology by providing a variety of means of visualizing interaction between users and an expert system for input/output, processing, and presentation of results. There are reasonably well-defined guidelines for designing presentations that are good to look at (i.e., look and feel) and effective to use (i.e., usability) that multimedia technology can offer to expert systems technology (Thimbleby, 1990). Considerable research has been carried out on designing usable user interfaces and on tools for good graphical user interface designs (André & Rist, 1990; Arens & Hovy, 1992; Arens, Hovy, & Mulken, 1993; Cohen, 1984; Feiner & McKeown, 1990; Maybury, 1991; Roth & Mattis, 1991; Singh & Green, 1991). These results will be subsumed in the framework presented later in this article.

Expert systems technology will provide analytical power to multimedia technology. Multimedia objects are generally very unstructured and often support complex interrelationships among their components. In our example, there are three data types--image,


graphics, and text. The interlinkages of different types of data into a semantically consistent record would require complex representations such as nested trees and graphs. Furthermore, data are not linearly ordered since several records can simultaneously participate in more than one relationship. For example, every face in the collection can belong to more than one subclass; these subclasses have been defined based on attributes such as ethnic background; type of eye, nose, lips; or sex. Some data, such as verbal descriptions of visual features, that are captured in free-text form are either semistructured or unstructured. Hence, the application requires new types of representations.

The rest of this article is organized along the following lines: Issues related to the integration of multimedia technology with expert system technology are discussed in detail in Section 2. Section 3 presents a framework based on the issues raised in Section 2. Section 4 summarizes the discussions on the framework for the integration of these two technologies.

2. ISSUES IN THE INTEGRATION OF MULTIMEDIA AND EXPERT SYSTEM TECHNOLOGIES

We will divide the various issues into three categories: infrastructural, expert system (ES) technology, and application related. The infrastructural issues will address the problems related to storage, retrieval, analysis, clustering, and synthesis in the integrated system. The ES technology issues will address the new technologies that may be required or the reengineering of present technologies to address new challenges posed by the integration of MM and ES technologies. The application-related issues will address problems such as knowledge acquisition, handling subjectivity, conflict resolution among domain experts, customization, inference using abstract concepts, and prioritization of features in an application.

2.1. Infrastructural Issues in the Integration of MM and ES Technologies

In this section we will discuss storage, retrieval, analysis, and synthesis.

2.1.1. Storage. Multimedia data needs large storage capacity--a true color image on a screen occupies about 3 MB of storage. Reasoning on multimedia data will use several component data. For example, an expert system for identifying a human face will require image attributes such as different classes of eyes and noses along with other attributes such as ethnicity, sex, profession, and address for its reasoning purposes. Hence, MM/ES systems will benefit from efficient

storage techniques that will allow clustering of related data at the physical level (Kim, Chou, & Banerjee, 1988) so that when one data item is requested for the purposes of reasoning, all the related data items are also brought within the same disc access. The clustering will be easy only if attribute pairs (such as eyebrows and eyes) are strongly correlated. When many features (or attributes) are strongly correlated, complete retrieval within a single disc access becomes difficult and minimizing the number of disc accesses will become the objective. Disc array technology using disc striping on parallel arrays of discs connected to a high-bandwidth I/O bus architecture may be an alternative.
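As a minimal sketch of the clustering idea, the following Python fragment greedily groups attributes whose co-access correlation exceeds a threshold, so that one disc read can fetch related items together. The attribute names, correlation values, and threshold are hypothetical illustrations, not values from the paper.

```python
# Illustrative sketch (not the paper's algorithm): group strongly correlated
# attribute pairs into the same physical cluster. Names and numbers are made up.
def cluster_attributes(correlation, threshold=0.7):
    """correlation: dict mapping frozenset({a, b}) -> co-access correlation in [0, 1]."""
    clusters = []                      # each cluster is a set of attribute names

    def find(attr):
        for c in clusters:
            if attr in c:
                return c
        c = {attr}
        clusters.append(c)
        return c

    # Merge the clusters of every pair whose correlation passes the threshold.
    for pair, corr in sorted(correlation.items(), key=lambda kv: -kv[1]):
        if corr < threshold:
            break
        a, b = tuple(pair)
        ca, cb = find(a), find(b)
        if ca is not cb:
            ca |= cb
            clusters.remove(cb)
    return clusters

# Example: eyes and eyebrows are fetched together far more often than eyes and address.
corr = {
    frozenset({"eyes", "eyebrows"}): 0.9,
    frozenset({"nose", "lips"}): 0.8,
    frozenset({"eyes", "address"}): 0.1,
}
print(cluster_attributes(corr))        # e.g. [{'eyes', 'eyebrows'}, {'nose', 'lips'}]
```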

2.1.2. Retrieval. Categorization (Kim et al., 1988; Rosch, 1973) or classification is very important for retrieval. Information retrieval techniques (Salton, 1991) use classification trees (Breiman et al., 1984). When information retrieval uses classification trees, the retrieval is called class-based retrieval. Class-based retrieval can feed pruned data into expert system technologies such as case-based reasoning (CBR) (Bain, 1986), which is also organized based on classification.

Categorization of multimedia data cannot always be easy since such data are so rich in their information content that each MM object can lend itself to multiple and sometimes fuzzy classifications. In the case of human faces, the categorization may be based on ethnic divisions, geographical divisions, or professional classes. Also, intermarriages among races and nationalities result in humans with mixed features that defy strict and complete inclusion into only one group, thus resulting in the use of fuzzy sets and logic. Hence, class-based retrieval of MM data for the purposes of reasoning by an expert system must use fuzzy logic (Zadeh, 1983).

A forensic expert system and an expert system for criminal investigation may be working on the same set of human faces. The former will use cephalometric data for reasoning, whereas the latter will use dominant visual data. Since multimedia objects can lend themselves to concurrent usage by different expert systems, it is important that retrieval provides for views on a data set. The role of retrieval in this architecture is to prune the search space based on classification and present the most relevant data to the inference engine for reasoning purposes. Most of the inference rules for the two expert systems will, of course, be different.

Except for dominant features such as a scar, other descriptions of the same face by different users can be different. Hence, there is no use building a retrieval engine that will look for an exact match. The retrieval engine must look for closest matches based on some similarity measure (Tversky, 1977). While similarity measures (based on text primitives) for text retrieval systems have been researched extensively, there are still

plenty of research opportunities for developing similarity measures for multimedia applications.
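The following sketch illustrates how class-based retrieval with fuzzy memberships and a similarity measure might fit together. The feature weights, membership values, and similarity formula are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch: prune candidates by fuzzy class membership, then rank the
# survivors by a weighted similarity measure. All numbers are hypothetical.
def fuzzy_class_prune(records, query_class, min_membership=0.3):
    """Keep records whose fuzzy membership in the queried class is high enough."""
    return [r for r in records if r["memberships"].get(query_class, 0.0) >= min_membership]

def similarity(query, record, weights):
    """Weighted feature agreement in [0, 1]; an exact match is not required."""
    score, total = 0.0, 0.0
    for feature, weight in weights.items():
        total += weight
        if feature in query and query[feature] == record["features"].get(feature):
            score += weight
    return score / total if total else 0.0

def retrieve(records, query, query_class, weights, top_k=3):
    candidates = fuzzy_class_prune(records, query_class)
    return sorted(candidates,
                  key=lambda r: similarity(query, r, weights),
                  reverse=True)[:top_k]

faces = [
    {"id": 1, "memberships": {"asian": 0.8}, "features": {"eye": "small", "nose": "flat"}},
    {"id": 2, "memberships": {"asian": 0.2}, "features": {"eye": "large", "nose": "sharp"}},
]
query = {"eye": "small"}               # a possibly incomplete victim description
print(retrieve(faces, query, "asian", {"eye": 2.0, "nose": 1.0}))   # face 1 ranks first
```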

2.1.3. Analysis. Analysis refers to the process used for identifying and extracting the salient attributes of a multimedia application data set. Analysis is included under infrastructural issues since it affects categorization significantly. Different multimedia objects will require different types of analytical tools. For example, a human face can be described using different landmarks--visual, anthropometric, and cephalometric. While a forensic expert system will be interested in cephalometric landmarks, a criminal identification expert system will require anthropometric landmarks. The analytical tools for extracting cephalometric landmarks are different from those used for extracting anthropometric landmarks. In fact, the source of data and the methodologies used for extracting these two data sets are also different.

Sometimes an initial version of an application may be built using a subset of inference rules or data types. Once successful, the application will be expanded to include the rest of the data and rules. The architecture should have the notion of graceful extensibility built into it. Graceful extensibility would allow seamless introduction of new data types, media, and rules without having to alter an application significantly.

2.1.4. Synthesis. Most of today's expert systems use alphanumerics or (business) graphics for presenting their conclusions. With MM technology, one can envisage expert systems that will take input comprising different data types, apply reasoning, and then synthesize an output that may be very different from the data types of the input parameters.

For example, consider an expert system that can synthesize the face of a person from the skull. The examination of a skull can identify the sex, ethnic background, and estimated age of the person. Some additional information can be derived using the cephalometric landmarks obtained from the skull and the expertise (stored in a set of rules of an expert system) of correlating these with anthropometric landmarks. Anthropometric data can be used either by a computer or an artist to synthesize the human face corresponding to the skull. The media type of the output is an image while the input consists of different media types. In this example, an expert will establish a correlation between the anthropometric and the input data. Since it is not possible to obtain this correlation for a large number of sample cases, it may be necessary to have a limited number of norms and then interpolate the correlations for other cases not in the limited collection. The rules for such interpolation will be stored in an application's knowledge base.
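A sketch of the interpolation step, under stated assumptions: the norm table, the landmark names, and the use of simple linear interpolation between the two nearest norms are all illustrative placeholders rather than the paper's actual rules.

```python
# Sketch: interpolate an anthropometric value from a cephalometric measurement
# using a small table of stored norms. The norms below are hypothetical.
def interpolate_anthropometric(ceph_value, norms):
    """norms: list of (cephalometric_value, anthropometric_value) pairs."""
    norms = sorted(norms)
    if ceph_value <= norms[0][0]:
        return norms[0][1]
    if ceph_value >= norms[-1][0]:
        return norms[-1][1]
    for (x0, y0), (x1, y1) in zip(norms, norms[1:]):
        if x0 <= ceph_value <= x1:
            t = (ceph_value - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

# Hypothetical norms: nasal aperture width (skull) vs. soft-tissue nose width, in mm.
norms = [(20.0, 31.0), (24.0, 35.0), (28.0, 40.0)]
print(interpolate_anthropometric(25.5, norms))   # roughly 36.9
```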

Sometimes synthesis will not only consider static features but will also have to comprehend changes over time. For example, to articulate text descriptions or graphical/image characteristics of a video clip, it is important to preview the different frames of the clip, record the presence or absence of different features across many frames, and try to comprehend the storyline in the scene. While such a skill appears to be relatively simple for human beings, it remains a tough challenge for computer-based integration of ES/MM technologies.

2.2. ES Technology Issues in the Integration of MM/ES

This section will address the new unification procedures required and their implication for the inference engine. For the purposes of discussion, we define a multiplace, multimedia predicate to represent some or all of the multimedia attributes of an object. An example can be the predicate Human Face (Ethnic background, Sex, Nose type, Eye type, ...). We will use such a multiplace, multimedia predicate throughout this section to discuss the impact of such predicates on unification. We refer to the predicate that is submitted for consideration as the test predicate and the set of predicates already stored in the knowledge base as reference predicates. Unification will be applied to these two predicates.

2.2.1. Unification. Many expert systems use unification for deriving conclusions. It is central to the inference engine in an integrated ES/MM system. Unification is the symbolic binding of two atomic propositions, variables, or abstracts. Traditionally, the binding succeeds if the atoms, variables, or abstracts are the same and fails if they are different. Most expert systems use unification algorithms that carry out the operation on alphanumeric strings. In the integration of multimedia technology with expert systems, there are new demands on unification algorithms. One such demand will be for the unification to be carried out on ranges as opposed to point values. This technique is generally known as relaxed unification. Two other types of demands will be media-sensitive unification and unification on fuzzy data. Unification of two predicates representing the same object in different media (graphical, text, image, or video forms) should be possible (i.e., the unification operator will have to be media sensitive). Another example of media-sensitive unification is unification based on colors or on dominant features across two images. Adoption of fuzzy values will require replacement of quantitative values with qualitative values. A preunification process will be to map the quantitative value of an input feature into a valid abstract (or fuzzy) value. The output of a fuzzy unification process can itself be a valid fuzzy value.

Since the features in two MM predicates (the first one the test predicate and the second one the reference predicate) considered for unification may represent the same object from different spatial viewpoints, the unification algorithm should be viewpoint sensitive. Also, given that two predicates can correspond to the same object at different distances, unification will have to be based on scale-independent reasoning on the two multimedia predicates. The preceding example clearly establishes the need for context sensitivity in the unification process. The other dimension in which the unification process needs to be extended is semantics. For example, if a reference predicate has the value "Beautiful" and a test predicate has the value "Lovely," unification should be successful. This is clearly an example of the unification process being able to relate to synonyms. As an extension of this argument, one can also propose that the synonyms can be user dependent. Certain words or symbols may be used by younger (or naive) users in place of equivalent words or symbols used by older (or experienced) users. Since the unification engine has to adapt itself to the different media and contexts, we call it an adaptive unification engine. The multiplace, multimedia predicates contain features in each of their places, and hence the unification process can also be called feature-based unification.

Another issue that merits consideration is unification of a multiplace predicate in the absence of complete information. Consider the example in Figure 1. It is obvious that only some parts of a face are fed into the adaptive unification engine. This input can be considered to be a multiplace, multimedia predicate of which four places have been instantiated: nose, eye, lip, and chin. Some other places, such as forehead, hairstyle, ears, etc., have not been instantiated or defined. The adaptive unification engine should take the incomplete set of anatomical features of a human face (in graphical form), categorize them, cast them into a multiplace, multimedia predicate, and then retrieve different faces from a repository one at a time and try the unification. Such unification will have to be scale and rotation independent. Whether a nose is fed in a vertical or horizontal mode, the unification engine should be able to accept it. When feature-based unification is carried out, it is important that two closely similar features should be considered the same. Developing mechanisms to handle similarities (within specified tolerance levels or similarity measures) in unification is an important research issue. The result of such a unification may be fuzzy.
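To make the idea of relaxed, feature-based unification concrete, the following sketch unifies a test predicate against a reference predicate: uninstantiated places impose no constraint, synonyms and near-equal numeric values are tolerated, and the result is a fuzzy score rather than true/false. The predicate places, synonym table, and tolerance are my illustrative assumptions, not the paper's algorithm.

```python
# Illustrative sketch of relaxed, feature-based unification (assumptions noted above).
SYNONYMS = {("beautiful", "lovely"), ("big", "large")}        # hypothetical synonym table

def values_unify(test, ref, tolerance=0.1):
    if test is None or ref is None:                # uninstantiated place: no constraint
        return None
    if isinstance(test, (int, float)) and isinstance(ref, (int, float)):
        return abs(test - ref) <= tolerance * max(abs(ref), 1e-9)   # relaxed range match
    t, r = str(test).lower(), str(ref).lower()
    return t == r or (t, r) in SYNONYMS or (r, t) in SYNONYMS

def relaxed_unify(test_pred, ref_pred):
    """Return a fuzzy match score in [0, 1] over the instantiated places."""
    scores = []
    for place in ref_pred:
        outcome = values_unify(test_pred.get(place), ref_pred.get(place))
        if outcome is not None:
            scores.append(1.0 if outcome else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

test = {"eye": "big", "nose": None, "lip": "thin", "chin": "round"}     # partial description
ref  = {"eye": "large", "nose": "flat", "lip": "thin", "chin": "square"}
print(relaxed_unify(test, ref))   # 0.67: eye and lip unify, chin does not, nose is skipped
```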

FIGURE 1. Unification of multiplace, multimedia predicates. (The figure shows a test face cast as a multiplace, multimedia predicate, a retrieval engine supplying reference faces from a repository of faces, and an adaptive unification engine that matches the two.)

2.2.2. Inference Engine for MM/ES. In light of the preceding discussion on the unification operator, it is important to discuss the role of an inference engine in the framework. Given that users' queries can be imprecise and that the categorization of multimedia objects can be fuzzy, the inference engine for such applications will have to have reasoning processes very different from the previous ones. The inference engine for multimedia technology will have to cater to handling knowledge bases containing different types of data. Some of the data will be numeric, others will be patterns, and many of them will be nonnumeric. Knowledge about multimedia data will embed different types of uncertainties--incomplete information, ignorance, or vagueness. Since the inference engine will have to be sensitive to different types of context, uncertainties, and media, it will be made up of a collection of reasoning processes somewhat similar to the inference engine described in Lui, Tan, Lim, and Teh, 1990. The first part handles knowledge that comes in the form of patterns, the second part handles knowledge that comes as rules, and the third part is an exception handler that can handle nonnumeric knowledge. The reasoning process in all three levels will have to provide for handling uncertainties as defined in Cohen, 1985; Goodman and Nguyen, 1985; and Graham and Jones, 1988.

The first layer of the inference engine will use neural networks (Kosko, 1992; Lau & Widrow, 1990). It will have a suite of neural network training engines (back propagation, self-organizing, Hopfield, etc.). Users will have to decide on the type of neural network that is best suited for a given application and invoke the desired type of training engine. The complexity of training depends on the number of input nodes and the number of data for each of these nodes. Training time of a neural network can be reduced by reducing either the number of features (i.e., the input nodes) or the number of data points for each node. Once the number of input nodes is a minimum, further reduction can be obtained only by reducing the number of valid values for each of the input nodes. This is achievable by converting continuous values at each of the input nodes into range values representing some abstract subclass for that node. In our example, one of the nodes can be eyes, and there can be several types of eyes defined in this data space. Instead of storing every possible occurrence of eye, we can define a limited number of abstract subclasses for eyes (see Figure 2). Neural networks allow for some imprecision in inputs and are still known to converge to the desired output through proper training. Hence, the definition of the ranges does not have to be very accurate. The amount of explanation that can be provided by neural network engines is limited.

The second layer of the inference engine handles rules. These rules can be used both for reasoning and for providing explanations and will use an adaptive, feature-based unification process that will handle uncertainty as well. The second layer will use certainty theory (Buchanan & Shortliffe, 1984) for handling confidence measures, Dempster-Shafer's theory of evidence (Shafer, 1976) for handling ignorance, and fuzzy

FIGURE 2. Subclassification of eyes.

logic (Zadeh, 1983) for handling vagueness. Handling multiple reasoning processes within a system is neither new nor infeasible. CYC (Lenat & Guha, 1990) uses such a mechanism for its inference engine. This layer can be invoked independently, when the inferencing fails using the neural network layer, or when an explanation for a given conclusion is desired. The exception handler will address nonmonotonic reasoning (Dix, Jantke, & Schmitt, 1991; Ginsberg, 1986). Such reasoning is useful when the reasoning process involves nonnumeric computation and/or can handle exceptions to the reasoning provided by the neural network and rule layers. This layer can be used to handle new evidence and reasoning that have accrued since the last training of the neural network in the first layer. When sufficient additional material is accumulated as exceptions to the knowledge in the neural network, a fresh training of the neural network can be initiated and the corresponding evidence re-

moved from the third layer. Conversely, one could take the cases in the neural network layer that defy convergence and store them as part of the exception handler.

Given that the knowledge base for multimedia applications is likely to be incomplete and will embed different types of uncertainties, the inference engine will not stop with the first successful unification. Since the result of a unification may be fuzzy, the inference engine will have to present a ranked order of the results of unification to a user. The number of such results will not be too many, since class-based retrieval techniques are used to present a limited set of inputs to the inference engine.

The three layers of the inference engine may be run either in tandem or in parallel, depending on the nature of the application and platform. All the different types of inference mechanisms do not have to coexist in every application. Rather, this engine can be used as a library

FIGURE 3. Adaptive unification. (The figure shows the unification engine taking test features, a reference face from the retrieval engine, and a context as input, and producing a result marked successful, partially successful, or failed.)

and only those types of inferencing required for a specific application can be compiled into a customized inference engine for that application. For example, there may be one customized inference engine each for human face recognition, geometric shape recognition, etc. (using adaptive unification as shown in Figure 3).
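One way to picture the "engine as a library" idea is sketched below: only the reasoning strategies an application needs are compiled into its engine, and the layers are tried in order. The layer interfaces and placeholder logic are hypothetical; a real system would plug in the trained network, rule base, and exception store described above.

```python
# Sketch of composing a customized inference engine from a library of layers.
class CustomInferenceEngine:
    def __init__(self, layers):
        self.layers = layers          # ordered list of callables: query -> result or None

    def infer(self, query):
        for layer in self.layers:
            result = layer(query)
            if result is not None:    # the first layer that reaches a conclusion wins
                return result
        return None                   # no layer could reach a conclusion

def neural_layer(query):
    return None                       # placeholder: a trained network would classify here

def rule_layer(query):
    # Placeholder rule: adaptive, feature-based unification would normally run here.
    return "class_A" if query.get("eye") == "narrow" else None

def exception_layer(query):
    return "needs_expert_review"      # nonmonotonic fallback for unhandled cases

# A face-recognition application compiles only the layers it needs.
face_engine = CustomInferenceEngine([neural_layer, rule_layer, exception_layer])
print(face_engine.infer({"eye": "narrow"}))   # class_A
```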

2.3. Application-related Issues in the Integration of ES/MM Technologies

2.3.1. Customization. The word customization can be interpreted from three different perspectives: personalization, culturalization, and feature prioritization.

1. Feature prioritization: In the Introduction, we discussed how the same subset of multimedia data may be shared among related expert systems. Thus, for the crime investigation application, photographs, an artist's sketch of an assailant, and a crime pattern may be interesting. For a forensic expert, the cephalometric landmarks may be of additional and primary interest (see Figure 4). A profile (or view) definition mechanism will allow the extraction of appropriate multimedia attributes for a given expert system.
2. Personalization: While a given expert system determines the specific set of features and their relative order of importance, different users of an expert system may want some personalization. For example, two different forensic experts may somewhat differ in the set of cephalometric measures they would use in superimposing a photo on a skull. While one may use 20 landmarks, the other may use 56 landmarks in the reasoning process. Moreover, they may use different sets of rules to arrive at their conclusions. Of course, all the landmarks and rules in the two cases will not be mutually exclusive. (See Figure 5.)
3. Culturalization: This is an important feature in multimedia systems given that the content is audiovisual. Multimedia data are amenable to multiple and biased interpretations. For example, what may be considered small eyes in the United States may indeed be categorized as regular or even large in Japan or some other country in which the feature descriptions vary based on the local population. Hence, the inference engine should be enabled to interpret abstract values based on cultural context. (A small illustrative sketch of such customization rules follows this list.)
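The sketch below represents the three flavors of customization as small look-up rules consulted before inference. The specific mappings, locales, and measurement ranges are illustrative assumptions rather than the paper's rules.

```python
# Illustrative sketch only: feature prioritization, personalization, and
# culturalization as simple rule tables. All values are hypothetical.
FEATURE_PRIORITIES = {           # per-application view on the shared data
    "forensic": ["cephalometric", "anthropometric", "visual"],
    "crime_investigation": ["visual", "anthropometric"],
}
PERSONAL_LANDMARK_SETS = {       # hypothetical per-expert preferences (number of landmarks)
    "examiner_A": 20,
    "examiner_B": 56,
}
CULTURAL_NORMS = {               # interpret abstract size terms against a local norm
    ("eye_size", "US"): {"small": (0, 28), "regular": (28, 34), "large": (34, 99)},
    ("eye_size", "JP"): {"small": (0, 25), "regular": (25, 31), "large": (31, 99)},
}

def interpret_abstract_value(feature, value_mm, locale):
    """Map a raw measurement to an abstract term relative to the cultural context."""
    for term, (lo, hi) in CULTURAL_NORMS[(feature, locale)].items():
        if lo <= value_mm < hi:
            return term
    return "unknown"

# The same measurement reads differently depending on cultural context.
print(interpret_abstract_value("eye_size", 32, "US"))   # regular
print(interpret_abstract_value("eye_size", 32, "JP"))   # large
```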

FIGURE 4. Applicationwise feature prioritization. (The figure shows national registry core data such as name, address, date of birth, and blood group linked to application-specific data such as fingerprints, gene prints, cephalometric data, and anthropometric data.)


2.3.2. Knowledge Acquisition. Knowledge acquisition in MM/ES systems will probably use fuzzy values such as "small," "big," "interesting," and "boring" as descriptors. With these abstract values, experts may find it even more difficult to articulate the precise rules of thumb used in their reasoning. A neural network approach to building inference engines will prove to be a viable method because the parameters or features used for decision making by an expert can be modeled as a neural network's input nodes and the conclusions reached by the expert can be represented as its output nodes. When the neural network is trained to satisfaction, it can be used to replicate expertise built around abstract concepts. This approach to knowledge acquisition does not require experts to articulate the inference rules. When such articulation is available, it can be used for purposes of explanation.

2.3.3. Conflict Resolution. As explained previously, two forensic experts are likely to use different cephalometric measures to derive their conclusions. Hence, the inference engine will have to allow for such differences in reasoning processes until one of them is decisively proven to be better. For more information on conflict resolution, see Lenat and Guha, 1990, and Newell, 1973.

3. FRAMEWORK FOR THE INTEGRATION

Figure 6 gives a framework for integrating ES technology with MM technology. The framework is divided into six major modules:

1. Metaknowledge base: The metaknowledge base will store knowledge that is common across all applications. Since any organization will have a number of concurrent expert systems in use, the metaknowledge base will contain the (application-independent) metarules for all the expert systems. One should consider this as a library of knowledge, some of which will be appended to an application's knowledge base in order to arrive at a complete knowledge base for that application.
2. Application knowledge base: This part of the knowledge base will be application specific. Even though our human face example is based on image, graphics, and text data types, the knowledge using cephalometric landmarks is more relevant for the forensic expert system than for the criminal

FIGURE 5. Feature prioritization: Crime investigation example. (Annotations in the figure indicate that eyes and eyebrows, nasal points, and the outline of the face are features of special interest for crime investigation, while the ear is not an easy feature to describe and clothing can be changed and hence is not a reliable feature.)




FIGURE 6. An architecture for an expert system tool for multimedia applications. (Recoverable labels in the figure include the metaknowledge base; the application knowledge base; an inference engine comprising a fuzzy neural engine, a unification engine, and an exception handler; a visualization engine with a window manager and data transformation utilities; media integration/decomposition; and I/O devices such as 3D mice, 3D joysticks, pens, 3D scanners, 3D visors, plasma displays, video players, and surround-sound speakers.)

identification expert system. Hence, a forensic expert system will contain cephalometric landmarks and the corresponding inference rules, while a criminal investigation expert system will not. These can be considered as views in expert systems analogous to views in data base systems.
3. Inference engine: This is the three-layer inference engine that will be context, media, user type, and uncertainty sensitive. The generic inference engine will have several reasoning strategies and should be considered as a library. Some of these will be compiled with an application's knowledge base and the relevant part of the metaknowledge base to form the expert system for that application.
4. Visualization engine: This module will have subcomponents for human computer interface processing. These subcomponents will be for both input and output, and there will be one for each media type. These are not I/O devices but utilities such as device drivers, user interface (UI) builders, compression/decompression engines, graphic editors, etc. Only a limited set of utilities is shown in the framework due to space constraints. When individual display elements are visualized, they are passed on to the media integrator for making a composite presentation. On the other hand, when decomposed input is received from a media

decomposer, it is compressed, reformatted (if necessary), and stored in the application's knowledge base.
5. Media integration/decomposition: This block will also address both input and output. As an example in the output direction, video will have to be decomposed into picture and sound for being directed to the screen and the speakers. On the input side, the converse will be true (i.e., the audio output from the microphones will have to be mixed with a clip from a video camera; not all video cameras come with microphones). In some instances, such integration or decomposition will not be automatic and will require editing by application developers.
6. I/O interface: These are the actual I/O devices. The lists provided in the boxes are not exhaustive due to space constraints. They will include gloves, head-mounted helmets for virtual reality, and other devices for other types of human computer interfaces.

The following sections will describe each of these blocks in some detail.

3.1. Metaknowledge Base

The metaknowledge base will have at least the following components. All the application-independent rules described here will be derived from different but related sources of expertise. Many multimedia systems may


be the result of several experts applying different sets of rules to an application, and this module explicitly captures all this expertise.

• Unification rules: This module will contain metarules corresponding to different types of media, uncertainty, and context. For example, for color as a medium, there can be a rule to say that the color blue may be considered a near enough match to the color indigo, thus allowing for a unification of other related facts and variables. It will also contain rules on when to use what type of inferencing in the presence of uncertainty. For example, it can contain a rule that when a user's response is "I don't know the significance of the relationship between eyebrows and eyelids," then Dempster-Shafer inferencing ought to be suggested to the user. There can be a rule that when the caption of a picture mentions the big eyes of a Japanese child, the corresponding context should be set as "Asian features" so that the inferencing can be directed properly. This section will also have different strategies for conflict resolution among multiple experts. (A small illustrative sketch of such metarules appears after this list.)
• Data transformation rules: There will be different kinds of data transformation rules. For example, there can be generic rules for transforming digital data into analog data, or rules for transforming from one unit of measure into another or two-dimensional data into three-dimensional data. These transformations will carry with them information on side effects from the transformations. For example, when 3D data is transformed into 2D data, there is some loss of information, such as depth of field. The potential side effects will be documented so that users can relate better to the results of the transformations. Image and data compression algorithms are some additional examples of data transformation rules. Low-level (application-independent) feature extraction algorithms used for analysis and rules for their use will also be stored here. Computing color or gray scale distributions of an image, segmentation, and shape recognition are some examples of feature extraction algorithms. An example of this class of rules is relating a segmentation algorithm to classes of images for which it is best suited.
• Visualization rules: These can also be called cognitive rules for user interface presentation. This block will contain user models, task models, discourse models, and rules that relate these models to different contexts. User models will be used to decide the type of vocabulary that can be used for a discourse. For example, if a user is an expert, the discourse can be very terse as opposed to a more verbose discourse model for a novice. The task model will assist in the proper choice of a user interface design. If the task is more interactive, then the user interface will need to provide proper checkpoints for user interaction.


Such knowledge underlying multimedia presentation has been addressed by Arens and Hovy, 1992. The rules and structure required for automatic layout involving different combinations of media explored by a number of researchers (Arens & Hovy, 1993; Feiner & McKeown, 1990; Maybury, 1991; Singh & Green, 1991) are examples of visualization rules. Some rules can be context sensitive, such as "The presentation to the Asian audience should be more visual than verbal." Notice that this rule has nothing to do with the application itself. It is a general rule across applications.
• Media integration rules: Multimedia technology has aural and visual components. Media integration rules are rules that will define spatial and temporal synchronization of different types of media. The spatial synchronization will provide rules of thumb for expertise such as multiwindow layout design (Singh & Green, 1991), overlaying different media both within a single window and across multiple windows, and synchronization of results in some windows in response to an action in another window. The temporal synchronization will have rules of thumb for designing video clips (synchronizing the picture with audio), animation sequences, etc.
• Repository of media objects: This repository will consist of various collections of sample multimedia objects that can be combined easily for presentation purposes. The objects could be icons, pieces of music, graphic patterns, and video clips. An application developer can explicitly invoke them during application development.
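The following sketch shows one possible representation of application-independent metarules of the kind listed above, such as the blue/indigo near-match rule and a context-setting rule; the representation itself is my assumption, since the paper only describes the kinds of rules the metaknowledge base should hold.

```python
# Sketch of an application-independent metarule store (representation assumed).
NEAR_COLOR_PAIRS = {frozenset({"blue", "indigo"}), frozenset({"crimson", "red"})}

def colors_unify(c1, c2):
    """Media-specific metarule: near-enough colors are allowed to unify."""
    return c1 == c2 or frozenset({c1, c2}) in NEAR_COLOR_PAIRS

CONTEXT_RULES = [
    # (condition on the input, context to set for downstream inferencing)
    (lambda caption: "japanese" in caption.lower(), "Asian features"),
]

def set_context(caption, default="unspecified"):
    for condition, context in CONTEXT_RULES:
        if condition(caption):
            return context
    return default

print(colors_unify("blue", "indigo"))                # True
print(set_context("Big eyes of a Japanese child"))   # Asian features
```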

3.2. Application Knowledge Base

The application knowledge base will contain the following information specific to an application:

• Patterns: These are decision patterns exhibited by an expert and are represented in a neural network as its input and output nodes. The input nodes will represent the different salient features required to make a decision. In the case of an expert system for human face identification, the input nodes can be features such as types of eyes, noses, and chins, some spatial relationships among them, and the overall shape of a face. The output nodes will represent different classes of human faces. When such a system is used for criminal detection, there may be more input nodes describing (in text form) the type of crime, sex, the nature of weapons used, etc.; the output nodes will represent different classes of criminals. Patterns need to be separated from rules since they can be fed into a neural network directly. A trained neural network can be perceived as a compiled form of knowledge patterns. One can always question the need for storing the patterns since a trained neural network would already


have compiled this information. Recall that patterns that emerge after the training of a neural network will be stored as exceptions. When these exceptions become large enough, there will be a need to retrain the neural network. The initial set of patterns will be combined with the new patterns to be used as the new initial set of patterns for a retraining exercise. It is for the purposes of retraining that patterns need to be stored in this module. However, since retraining is not done every day, there will be a rule in the rule base that will ensure that these patterns are automatically stored in a warm or cold storage and not in a hot storage. By hot we mean online, whereas cold and warm refer to secondary and tertiary archival storages.
• Rules: There will be several types of rules stored in the application's knowledge base. The first set is for representing the semantics of the application. This may be to represent the semantics of the features extracted using the data extraction rules in the metaknowledge base. In the case of human face recognition, such rules might link a type of eye to an ethnic group (such as narrow eyes for Mongoloids) or a type of chin to age (a double chin for those over 50 years old). A second set of rules can represent culturalization. An example of such a rule is, "If the context is 'Japanese,' interpret big eyes to mean the average of the norm for Americans." In our example, it can also be used to define the norms for an average face for each of the cultures considered by an expert system. A third set of rules will capture feature prioritization. An example of this rule for a forensic expert system is "Cephalometric landmarks should have a higher priority than the anthropometric landmarks." Another set of rules can represent personalization. Such a rule could say, "If the examiner is 'Philip,' present only subset A of the cephalometric landmarks to the inference engine." A fifth set of rules will be for synthesis. A rule of this kind can say which of the cephalometric measures will identify the nature of a crime. If a crime does not alter the structure of the skull, then the synthesis will require a rule based on the anthropometric landmarks and the visual landmarks. A sixth set of rules will be for storage-related issues. These will contain knowledge of the resolution (permitted compression levels) and granularity (the lowest level of detail, such as eye vs. pupils of an eye) at which an application's data must be stored in the system. They will also contain rules for clustering features to facilitate disc caching and fetch-ahead policies.
• Similarity measures: A similarity measure for a feature defines the range within which a test sample of a feature can be considered similar to a reference sample of the same feature (Tversky, 1977). It is the basis on which CBR is developed. In CBR, the reasoning for a query proceeds by first determining


which of the previously defined classes is closest to the case under consideration. Since the cases are preordered according to some classification, an inference engine can converge to a conclusion faster than when there was no order. In fact, similarity measures can be adaptive in the sense that a measure can be fine tuned based on feedback from users on whether the conclusions reached were useful. In our example, if a user was not happy with the set of faces presented against a query and changed the similarity range, this change could automatically be recorded in the algorithms for calculating similarity measures. This dynamic reordering of a similarity measure can keep a data space optimally sorted for most queries. These measures will be used by a retrieval engine to supply the required data to an inference engine in an order based on the similarity measures defined for a feature set. (A small sketch of such an adaptive measure appears after this list.)
• Exceptions: Exceptions will store the patterns not already used for training the neural network, exceptions to the rules in the rule base and, in the case of an adaptive unification engine, the rules for nonnumeric reasoning. In general, exceptions will house the nonmonotonic application knowledge that will be used by the exception handler component of the inference engine.
• Features/attributes/facts: These are actual facts or data corresponding to an application. In the case of the human face example, this block will contain all the faces, with their features extracted and identified, along with the personal details of the individuals to whom the faces belong. It will also contain the personalization information. In other words, view definitions on the expert system will be stored here.
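A minimal sketch of an adaptive similarity range is shown below. The paper only states that user feedback should fine-tune the measure; the particular update rule, feature name, and tolerance values here are assumptions.

```python
# Sketch: a similarity tolerance that is nudged by user feedback (update rule assumed).
class AdaptiveSimilarityRange:
    def __init__(self, feature, tolerance):
        self.feature = feature
        self.tolerance = tolerance        # how far apart values may be and still match

    def similar(self, a, b):
        return abs(a - b) <= self.tolerance

    def feedback(self, user_accepted_range):
        # Move the stored tolerance toward the range the user actually found useful.
        self.tolerance = 0.8 * self.tolerance + 0.2 * user_accepted_range

nose_width = AdaptiveSimilarityRange("nose_width_mm", tolerance=3.0)
print(nose_width.similar(34.0, 36.5))     # True
nose_width.feedback(1.5)                  # user judged the returned faces too dissimilar
print(round(nose_width.tolerance, 2))     # 2.7
```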

3.3. Inference Engine

The inference engine was discussed in great detail in Section 2.2. The framework does not explicitly reflect the fact that the inference engine will have the capability to resolve potential conflicts arising from multiple differing inputs from experts. The metaknowledge base described in Section 3.1 will contain different methods for conflict resolution, and the application knowledge base described in Section 3.2 will specify which of the methods to use for a given situation. While an inference engine in most expert systems would carry out unification using data (or facts) that are not ordered, the input to the inference engine in this framework is expected to come from a retrieval engine that would have already filtered out the unrelated data based on factors such as context, similarity measures, etc.

3.4. Visualization Engine

The visualization engine will take in the conclusions of the expert system and apply different types of



visualization rules (being sensitive to user, task, and discourse models and to context) and constraints for preparing a suitable presentation to users. In the case of input from a user, it will be responsible for directing the user's input to the application knowledge base and the inference engine. When information is for storage in the application knowledge base, it will apply various compression, storage, and other relevant rules as specified in the application knowledge base. The visualization engine will also contain the drivers and other utilities required for the I/O devices.
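As an illustration of how user, task, and context models might drive a presentation choice, the decision table below is a sketch; the models and mappings are assumptions, not the paper's visualization rules.

```python
# Illustrative sketch: pick a presentation style from user, task, and context models.
def choose_presentation(user_model, task_model, context):
    verbosity = "terse" if user_model.get("expertise") == "expert" else "verbose"
    interaction = "checkpointed" if task_model.get("interactive") else "linear"
    modality = "visual" if context.get("audience") == "Asian" else "mixed"
    return {"verbosity": verbosity, "interaction": interaction, "modality": modality}

print(choose_presentation({"expertise": "novice"},
                          {"interactive": True},
                          {"audience": "Asian"}))
# {'verbosity': 'verbose', 'interaction': 'checkpointed', 'modality': 'visual'}
```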

3.5. Media Integrator/Decomposer

This module has two roles. First, it will take the different media corresponding to the conclusions supplied by the expert system in the form presented to it by the visualization engine and apply spatial and temporal synchronization. It will use the expertise stored in the media integration rules of the metaknowledge base in performing these synchronizations. Its second function is to take an incoming compound multimedia object and decompose it according to rules in the data transformation block of the metaknowledge base. If natural language is one of the I/O modes supported in an expert system, then the parsing, morphological analyses, and so on will be carried out in the media integrator/decomposer. The actual parser and related tools will be available in the metaknowledge base. Similarly, feature extraction from images will be carried out in this section of the architecture.

3.6. I/O Interface

The I/O interface consists of a number of input and output devices that can be used by the user for interaction with the system. Not all of them have been listed in Figure 6 due to space constraints. Natural language input will also be one of the I/O modes available to users.

4. SUMMARY

The integration of multimedia and expert system technologies augurs well for handling new varieties of expertise. However, the integration process is still in its infancy and requires considerable study for the building of robust operational systems. This article attempts to provide a framework for such an integration; it is hoped that the research community will refine this framework. As identified in this article, considerable research needs to be done on the adaptive inference engine and the related unification processes. The integration of fuzzy logic and neural network technology is another research issue that is worthy of pursuing because it allows for

imprecise or incomplete knowledge acquisition without the need for an expert to provide a structure to it. The issues of usability and visualization remain the key elements in ensuring that a well-designed system is also well received and accepted by users. It is hoped that this article will generate further ideas and discussions in the relevant research communities and focus their efforts toward the new challenges envisioned in the integration of multimedia and expert systems technologies.

Acknowledgements--The thoughtful comments of two unknown reviewers, which provided the motivation to streamline and reinforce an earlier version of this article, are acknowledged.

REFERENCES

André, E., & Rist, T. (1990). Towards a plan-based synthesis of illustrated documents. Proceedings of the 9th European Conference on Artificial Intelligence, pp. 25-30.
Arens, Y., & Hovy, E. (1992). On the knowledge underlying multimedia presentations. In M.T. Maybury (Ed.), Intelligent multimedia interfaces. AAAI Press.
Arens, Y., & Hovy, E. (1993). Structure and rules in automated multimedia presentation planning. Submitted to IJCAI-93.
Arens, Y., Hovy, E., & Mulken, S. (1993). Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1253-1259.
Bain, W. (1986). Case-based reasoning: A computer model of subjective assessment. PhD thesis, Department of Computer Science, Yale University, New Haven, CT.
Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International.
Buchanan, B.G., & Shortliffe, E.H. (Eds.). (1984). Rule-based expert systems: The MYCIN experiments of the Stanford Heuristic Programming Project. Reading, MA: Addison-Wesley.
Cohen, P.R. (1984). The pragmatics of referring and the modality of communication. Computational Linguistics, 10(2), 97-146.
Cohen, P.R. (1985). Heuristic reasoning about uncertainty: An artificial intelligence approach. Pitman Advanced Publishing Program.
Dix, J., Jantke, K.P., & Schmitt, P.H. (Eds.). (1991). Nonmonotonic and inductive knowledge. Springer Verlag.
Feiner, S.K., & McKeown, K.R. (1990). Coordinating text and graphics in explanation generation. Proceedings of AAAI-90, Boston, MA, pp. 442-449.
Ginsberg, M.L. (1986). Counterfactuals. Artificial Intelligence, 30(1), 35-79.
Goodman, I.R., & Nguyen, H.T. (1985). Uncertainty models for knowledge-based systems. North Holland.
Graham, I., & Jones, P.L. (1988). Expert systems: Knowledge, uncertainty and decision. Chapman and Hall.
Jurgens, H.W., Aure, H.A., & Pieper, U. (1990). International data on anthropometry. International Labour Office, Geneva.
Kim, W., Chou, H.T., & Banerjee, J. (1988). Operations and implementations of complex objects. IEEE Transactions on Software Engineering, 14(7), 985-996.
Kosko, B. (1992). Neural networks and fuzzy systems. Englewood Cliffs, NJ: Prentice Hall.
Lau, C.G.Y., & Widrow, B. (1990). Neural networks, I: Theory and modeling. Proceedings of the IEEE, 78(10), 1411-1413.
Lenat, D.B., & Guha, R.V. (1990). Building large knowledge-based systems. Reading, MA: Addison-Wesley.
Lui, H.C., Tan, A.H., Lim, J.H., & Teh, H.H. (1990). Practical application of a connectionist expert system--the inside story. Institute of Systems Science, technical report no. TR90-52-0.
Maybury, M.T. (1991). Planning multimedia explanations using communicative acts. Proceedings of the National Conference on Artificial Intelligence, AAAI-91, Anaheim, CA, pp. 61-66.
Narasimhalu, A.D., & Christodoulakis, S. (1991). Multimedia information systems: The unfolding of a reality. IEEE Computer, 24(10), 6-8.
Newell, A. (1973). Production systems: Models of control structures. In W.G. Chase (Ed.), Visual information processing. New York: Academic Press.
Rosch, E.H. (1973). Natural categories. Cognitive Psychology, 4, 328-350.
Roth, S.F., & Mattis, J. (1991). Automating the presentation of information. Proceedings of IEEE Conference on AI Applications, Miami Beach, FL, pp. 90-97.
Salton, G. (1991). Developments in automatic information retrieval. Science, 253, 974-980.
Shafer, G. (1976). A mathematical theory of evidence. Princeton, NJ: Princeton University Press.
Singh, G., & Green, M. (1991). Automating the lexical and syntactic design of GUIs: The UofA* UIMS. ACM Transactions on Graphics, 10(3), 213-254.
Thimbleby, H. (1990). User interface design. ACM Press, Frontier Series.
Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327-354.
Zadeh, L.A. (1983). The role of fuzzy logic in the management of uncertainty in expert systems. Fuzzy Sets and Systems, 11, 199-227.