European Journal of Operational Research 111 (1998) 1-14

Theory and Methodology

Perspectives on operations research in data and knowledge management

Amit Basu
Owen Graduate School of Management, Vanderbilt University, Nashville, TN 37203, USA
E-mail: [email protected]

Received 12 June 1995; accepted 27 May 1997
Abstract

A number of problems in the design and management of database systems and knowledge base systems (KBSs) can be addressed using techniques from operations research (OR). This article provides a perspective on these problems and the types of models that have been applied to them, and identifies some areas that pose interesting modeling and analysis questions for researchers working in areas such as mathematical programming, stochastic modeling, dynamic programming and simulation. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Database systems; Knowledge base systems; Models; Performance analysis
1. Introduction

Business information processing is very data intensive, since it involves computations based on large data sets. It is not unusual to find companies that process thousands of transactions every day, and sometimes every hour. Furthermore, the number of instances of entities such as products, components, personnel and machines used by a company is often in the thousands or more. Business processes that require manipulation of data about such entities increasingly depend upon efficient use of large databases. This has led to significant interest in both academia and industry in the
effective design and management of database management systems (DBMSs). Another technology that is gaining importance is knowledge base system (KBS) technology. A KBS combines a (possibly large) database with domain-specific knowledge in the form of IF-THEN rules or other constructs such as frames, objects and semantic nets, to create a knowledge base. This knowledge base is manipulated using heuristic inference methods from artificial intelligence to solve poorly structured problems (such a problem cannot be solved using preset algorithms, and requires instance-specific procedures composed of several component rules and procedures stored in the knowledge base). KBS technology is being used both to enhance traditional DBMSs to make them more "intelligent" and to build expert systems.
There are a variety of issues in the design and management of DBMSs and KBSs for which operations research (OR) methods such as mathematical programming, queuing theory, dynamic programming and simulation analysis are very useful. Even though DBMS technology is only about 30 years old and KBS technology is even younger, there is a sizable literature on the application of OR methods to these issues. The goal of this article is to examine areas where these applications have been most successful, and to identify problems that present promising opportunities for exploration by operations researchers. The article is organized in four sections. In Section 2, areas in data and knowledge management where OR has had the greatest impact are reviewed. Section 3 describes some other areas where there is a significant need for OR modeling and where such techniques hold significant promise, and finally, Section 4 presents conclusions.
2. OR successes in data and knowledge management

Much of the early literature on data and knowledge management is based on approaches such as mathematical logic, algebra, graph theory and algorithmic complexity theory. The primary thrust of such research has been on defining and testing the functionality of DBMSs and KBSs, as well as on guidelines for the logical design of such systems. Design considerations in these approaches include semantic structure and relationships between data, and the expressive capabilities of the languages devised for the definition and manipulation of the formal constructs used to represent data and knowledge. OR techniques, which are most applicable when considerations of cost effectiveness and efficiency are addressed, have not been utilized much in these efforts. At the same time, there has always been interest in computational and storage efficiency in system design, and in particular, there have been a number of performance analyses based on methods such as queuing and probabilistic analysis as well as simulation studies. Furthermore, as databases and knowledge bases grow larger and more complex, the use of OR models as a basis for designing cost-effective information systems is becoming more attractive. Over the past decade or so, OR methods have been applied to a variety of problems in data and knowledge management. In this section, areas where such models have been used extensively are examined, starting with DBMS design and management, followed by KBS design and management. Tables 1 and 2 provide a categorization of OR-based research in various aspects of data and knowledge management, respectively.

Table 1
OR applications in data management

Application                       Stochastic models   Simulation   IP and LP    Graph algorithms   DP
Logical design of databases                                        [29]
Physical design of databases      [62,74]             [62]         [83,29]      [77]
Design of distributed databases   [22]                [22]         [78,81,82]   [99]
Query processing                  [54,33]                          [38,33]      [67,92]            [47,94]
Concurrency control               [58,5]              [5,80]       [81,82]
Object-oriented databases         [15,66]             [72]
Multimedia databases              [100,84]            [84,39]

2.1. OR in data management

Consider the problem of designing a database (for ease of exposition, the terminology of relational databases is used [26], although many of the studies are not limited to this data model). A database contains information about a collection of entities, such as people, products, firms, machines and factories, and about the relationships between them.
Table 2
OR applications in knowledge management

Application                        Stochastic models   Simulation   IP and LP     Graph algorithms   DP
Design of rule bases               [20,42]                                        [97]
Implementing inference processes                                    [16,30,60]    [11,13,46]         [16,96]
Deductive DBs                                                       [91]
In traditional relational databases, the data is organized as a collection of tables called relations. Logical design of the database involves choosing the set of relations to be included in the database and specifying the structure of each, based on the semantic relationships between various data elements. Although OR techniques are not easily applicable here, an interesting approach to combining logical and physical design of a database using an integer programming model is presented in [29]. The objective of this integrated model is to design the database so that a predetermined set of query types can be processed at minimal cost. While the model provides useful insight into the design problem, it also shows that the problem is very difficult to solve, since it is NP-hard. A more common approach to database design is to start with logical design, and then move to physical design. The physical design problem then involves deciding how to implement the chosen set of relations using appropriate file structures, based on considerations of storage and data access efficiency. Mechanisms such as indexing, clustering and hashing can be used to accelerate key-based access to randomly selected records in files. This is an area where OR methods have had a significant impact. Examples of OR-based models for physical design include a 0-1 nonlinear programming model for the design of inverted file indexes in [49], optimization models for trie indexes in [83], and a semi-Markov model for database reorganization in [73]. In the area of performance analysis of file structures, examples of OR-based approaches include a probabilistic analysis of random access files in [74], a queuing-system based analysis of B-tree indexing mechanisms and B-tree access algorithms in [62], a graph-theoretic analysis of parallel clustering algorithms in [77], probabilistic analysis of hashing methods in [68,71], and a queuing network-based analysis of data caching mechanisms in multiple-disk systems in [9].
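To illustrate the flavor of such probabilistic analyses, the following Python sketch (a minimal illustration, not taken from any of the cited studies; all parameters are hypothetical) compares a Monte Carlo estimate of bucket overflow under uniform random hashing with the Poisson approximation that analytical file-design models typically use:

```python
import math
import random

def overflow_fraction(num_records, num_buckets, bucket_capacity, trials=20):
    """Empirically estimate the fraction of records that overflow their
    primary bucket under uniform random hashing."""
    total_overflow = 0
    for _ in range(trials):
        counts = [0] * num_buckets
        for _ in range(num_records):
            counts[random.randrange(num_buckets)] += 1
        total_overflow += sum(max(0, c - bucket_capacity) for c in counts)
    return total_overflow / (trials * num_records)

def poisson_overflow_fraction(num_records, num_buckets, bucket_capacity):
    """Analytical approximation: each bucket's load is roughly
    Poisson(num_records / num_buckets), so the expected overflow per
    bucket is E[max(0, N - capacity)]."""
    lam = num_records / num_buckets
    term = math.exp(-lam)  # P(N = 0), updated iteratively to avoid overflow
    expected_overflow = 0.0
    for k in range(1, bucket_capacity + 200):
        term *= lam / k    # now term = P(N = k)
        if k > bucket_capacity:
            expected_overflow += (k - bucket_capacity) * term
    return expected_overflow * num_buckets / num_records

if __name__ == "__main__":
    for capacity in (1, 2, 4):
        sim = overflow_fraction(10_000, 4_000, capacity)
        approx = poisson_overflow_fraction(10_000, 4_000, capacity)
        print(f"capacity={capacity}: simulated={sim:.4f}, Poisson={approx:.4f}")
```

The analytical approximation lets a designer size buckets without simulation, which is exactly the kind of trade-off the cited performance models exploit.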
When the users of a database are physically distributed across multiple locations interconnected by a computer network, an additional design issue is whether these users should interact with a centralized database stored at a single site, or whether some (or all) of the data should be distributed across multiple sites. Choices in the design of a distributed database include the degree of replication (the number of sites where the same data is stored), the amount and type of fragmentation (horizontal fragmentation involves splitting a large relation into subsets of rows or records, while vertical fragmentation involves splitting the relation into multiple relations, each of which has a subset of the attributes of the original relation), and the specific location for each fragment or copy. There are some significant design trade-offs here. From a data access standpoint, it may be optimal to fully replicate the database, so that the entire database is stored at each site. This avoids the need for any communication costs across sites during query processing. Unfortunately, full replication is the most expensive choice in terms of storage space requirements as well as in database update costs, since every update initiated at a particular site forces communication with all other sites and updates at all those sites. Clearly, from this standpoint, it is much better to have a centralized database or a partitioned database (where each fragment is located at a single site and there is no replication). Thus, distributed database design is a complex optimization problem, influenced by considerations such as expected
patterns of database usage (query versus update frequencies), costs of data communications and data storage, topology and capacity of the interconnection network, processing and storage capacities at each site, reliability of system resources at each site and along each link, distribution and characteristics of the user population, and locality of reference of data (the extent to which users at a site access only a specific subset of the database). Not surprisingly, OR optimization models have made significant contributions in this area. As shown in [95], the models that have been proposed range from linear programming (LP) models to mixed integer, 0-1 and nonlinear programming models, depending on the type of network (local area network (LAN), metropolitan area network (MAN) or wide area network (WAN)), replication and fragmentation constraints, locality of reference and optimization criteria (communication cost, storage cost, response time, reliability, transaction throughput, etc.). Examples of such models include those in [40,78,79,99]. Also, a 0-1 integer programming model is used in [82] to address the problem of data allocation in a distributed database that is using majority consensus-based concurrency control (see also [81]). A number of other researchers have addressed related problems; for example, nonlinear 0-1 programming models for access control in distributed databases are presented in [10], a closed queuing network-based analysis of the data allocation problem is presented in [98], and a queuing model and simulation-based analysis of replication under alternative concurrency control protocols is described in [22].
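The structure of such allocation models can be illustrated on a deliberately tiny instance (all costs and rates below are hypothetical): each fragment must be stored at one or more sites, reads are served by the nearest copy, and every copy must be written on each update. Real formulations solve this as a 0-1 program; for three fragments and three sites, exhaustive enumeration suffices:

```python
from itertools import chain, combinations, product

NUM_SITES = 3
SITES = tuple(range(NUM_SITES))

# Hypothetical toy data (all names and numbers are illustrative).
storage_cost = [[2, 3, 4], [3, 2, 5], [4, 4, 2]]   # storage_cost[f][s]
query_rate   = [[8, 1, 0], [0, 6, 2], [1, 1, 7]]   # query_rate[s][f]
update_rate  = [4, 3, 5]                           # updates per fragment
comm_cost    = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]   # per-access cost s -> t

def nonempty_subsets(items):
    """All possible placements of one fragment: every nonempty site set."""
    return list(chain.from_iterable(combinations(items, r)
                                    for r in range(1, len(items) + 1)))

def total_cost(assignment):
    """assignment[f] = tuple of sites holding a copy of fragment f."""
    cost = 0.0
    for f, copies in enumerate(assignment):
        cost += sum(storage_cost[f][s] for s in copies)   # storage
        cost += update_rate[f] * len(copies)              # every copy written
        for s in SITES:                                   # reads use nearest copy
            cost += query_rate[s][f] * min(comm_cost[s][t] for t in copies)
    return cost

best = min(product(nonempty_subsets(SITES), repeat=3), key=total_cost)
print("optimal placement:", best, "cost:", total_cost(best))
```

Even this toy objective exhibits the replication trade-off discussed above: adding copies lowers read costs but raises storage and update costs, and the optimum balances the two.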
In addition to the logical and physical design of a database, there are a number of issues in the design of a DBMS for which OR models have been used. In particular, such models have been successful in the areas of query optimization and concurrency control. When a query is posed to a DBMS, the query is analyzed by a query optimization component in the DBMS and an access plan is formulated, which specifies the data access steps needed to execute the query. The term query optimization is somewhat of a misnomer, since in practice the particular plan chosen for a query is rarely optimal. This is because the efficiency of the query execution process depends upon the actual instances of relevant data in the database. Since these instances are not known until the database is actually accessed (otherwise the query would not be necessary), query optimization is essentially a heuristic process based on available information about the structure of the stored relations and the physical design of the underlying file structures. Excellent surveys of query optimization methods are provided in [44,59]. Query optimization occurs at two levels. First, algebraic operators in the query can be analyzed in terms of the logical design of the database to determine the sequencing of the corresponding operations. Approaches to query optimization at this level typically represent the query as a query graph, and apply appropriate transformations on the graph. The use of dynamic programming for query optimization based on analysis of the query graph is explored in [47]. Also, in this issue, optimization algorithms based on page connectivity graphs are presented in [43]. Other examples of similar work include [67,86]. The second level of query optimization is based on the physical design of the database. The major considerations at this level are effective use of indexes and hash files, as well as efficient implementation of the join operation, which is typically the most expensive operation in query processing. A variety of OR techniques have been used here. For example, in [56], efficient use of hash files is investigated, and in [94], both dynamic programming algorithms and greedy heuristics are analyzed for symmetric multiprocessing systems (many database machines and servers today utilize multiple processors). An interesting finding in this study is that while dynamic programming generates better plans, it becomes intractable for queries involving more than seven or eight relations; on the other hand, the greedy heuristics yield reasonable solutions over a broader range of queries. Another interesting approach is presented in [66], where a simulated annealing algorithm is shown to perform quite well. Other examples of work on query optimization at this level include [2,54,104].
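The contrast between dynamic programming and greedy join ordering noted above can be sketched as follows (the cost model and statistics are hypothetical simplifications, not drawn from the cited studies): the DP explores all subsets of relations, which is exponential in the number of relations, while the greedy heuristic simply keeps the intermediate result small at each step:

```python
from itertools import combinations

# Hypothetical base relation cardinalities and pairwise join selectivities.
card = {"A": 1000, "B": 100, "C": 10_000, "D": 50}
sel = {frozenset(p): s for p, s in [
    (("A", "B"), 0.01), (("B", "C"), 0.001), (("C", "D"), 0.05),
    (("A", "C"), 0.1),  (("A", "D"), 1.0),   (("B", "D"), 1.0)]}

def join_size(subset):
    """Estimated cardinality of joining all relations in `subset`."""
    rels = sorted(subset)
    size = 1.0
    for r in rels:
        size *= card[r]
    for a, b in combinations(rels, 2):
        size *= sel[frozenset((a, b))]
    return size

def dp_best_order(relations):
    """Dynamic programming over subsets (left-deep plans);
    cost = sum of intermediate result sizes."""
    best = {frozenset([r]): (0.0, [r]) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            for r in subset:
                rest = subset - {r}
                cost = best[rest][0] + join_size(subset)
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, best[rest][1] + [r])
    return best[frozenset(relations)]

def greedy_order(relations):
    """Join in, at each step, the relation that keeps the intermediate
    result smallest; linear number of decisions, no optimality guarantee."""
    remaining = set(relations)
    current = min(remaining, key=lambda r: card[r])
    remaining.remove(current)
    joined, order, cost = {current}, [current], 0.0
    while remaining:
        nxt = min(remaining, key=lambda r: join_size(joined | {r}))
        joined.add(nxt)
        remaining.remove(nxt)
        cost += join_size(joined)
        order.append(nxt)
    return cost, order

rels = list(card)
print("DP    :", dp_best_order(rels))
print("greedy:", greedy_order(rels))
```

The DP table has one entry per subset of relations, which is why it becomes intractable beyond seven or eight relations, while the greedy heuristic scales but may miss the best plan.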
In the context of distributed databases, query optimization becomes more complicated. Not only does the access plan have to identify the appropriate database fragments to access for the query, but it also has to determine which copy of each fragment should be used. In particular, it is not always advisable to bring all the necessary fragments to the query site and then process them as for a centralized database. Rather, it may be preferable to perform some of the query operations at the locations of some of the necessary fragments; for complex queries involving several operations, intermediate operations and transmission of intermediate results could occur at several sites. In databases where there are a large number of sites as well as high levels of fragmentation and replication, the number of possible choices can become too large for enumeration, and optimization models become necessary. Such models are typically based on estimates of file and result sizes, and the use of semi-join operations. A semi-join operation allows two relations at different sites to be joined without transmitting either relation completely to the other site; this is particularly useful if only a small subset of the rows (records) in each relation qualify for the result of the query [26]. A number of approaches based on mathematical programming models have been reported [38,48]. For example, in [38], the query optimization problem is formulated as an integer programming problem that attempts to minimize the communication costs between sites. This problem is shown to be NP-complete, and reasonably efficient heuristics are presented. A number of other approaches have also been used, including stochastic models in [33] and spanning tree algorithms on hash-partitioned relation fragments in [92].
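The semi-join idea can be illustrated on a toy pair of fragments (relation and attribute names are hypothetical): only the join-column values travel in one direction, and only the matching rows travel back:

```python
# Hypothetical fragments at two sites.
orders_site1 = [  # (order_id, customer_id) stored at site 1
    (1, "c1"), (2, "c7"), (3, "c2"), (4, "c9")]
customers_site2 = [  # (customer_id, region) stored at site 2
    ("c1", "east"), ("c2", "west"), ("c3", "east")]

# Step 1: ship only the join-column values (a small projection) to site 1.
cust_ids_shipped = {cid for cid, _ in customers_site2}

# Step 2: the semi-join at site 1 keeps only orders that can match.
orders_reduced = [o for o in orders_site1 if o[1] in cust_ids_shipped]

# Step 3: ship the (usually much smaller) reduced fragment to site 2
# and complete the join there.
result = [(oid, cid, region)
          for oid, cid in orders_reduced
          for cid2, region in customers_site2 if cid == cid2]
print(result)  # [(1, 'c1', 'east'), (3, 'c2', 'west')]
```

If few rows qualify, the bytes transmitted in steps 1 and 3 together are far fewer than shipping either relation whole, which is precisely why the cited optimization models build plans around semi-join reductions.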
Another area in DBMS design where OR models have been used is concurrency control. When a database is concurrently accessed by multiple users, there is a danger of the data being corrupted. Consequently, appropriate measures must be taken to ensure that database integrity is maintained across any valid transaction schedule. There are several well-established methods for managing concurrent access to a single database, including time stamping and two-phase locking [26]. A number of probabilistic and queuing analyses of such protocols have been developed, including [52,101]. There have also been several specialized algorithms developed and analyzed for specific contexts, such as [4,36]. As with query optimization, concurrency control in a distributed database is more complex than in a centralized database. Users at different sites may access different copies of the same fragment. Also, the transaction queues at each site may be different, and may change during the concurrency control process itself (since the latter involves communication delays between sites). Furthermore, even if time stamping methods are used, there may be synchronization problems between the clocks at different sites. A large number of algorithms for this context have been developed [14]. The bases for these algorithms range from centralized coordination (where a single supervisory site manages concurrency), which is simple but can result in relatively low transaction throughput and reliability, to distributed control, using mechanisms such as majority consensus, hierarchical locking and quorum-based control [3,17,21,25,57]. Concurrency control is further complicated when node and link failures in a network are considered, particularly when the resulting network is partitioned. A survey of approaches for such contexts is presented in [27]; [58] presents dynamic voting algorithms for partitioned networks, along with stochastic models for analyzing the performance of these algorithms. Other examples of the use of stochastic models, queuing models and simulation for performance analysis of concurrency control mechanisms in distributed databases include [5,18,80]. In a transaction-oriented database environment, another important issue in data management is transaction routing and scheduling. To some extent, this problem is similar to that of job or process scheduling in operating systems. The algorithms that have been developed for database transaction scheduling typically focus on factors such as transaction priorities, deadlines, service order and concurrency control protocols. Examples of studies that have investigated transaction scheduling methods and the analysis of these methods using queuing models and simulation include [1,89,106].
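As a simple illustration of the kind of probabilistic analysis applied to locking protocols, the following sketch (all parameters hypothetical, not taken from the cited analyses) estimates how the probability of a lock conflict grows with the number of concurrently active transactions:

```python
import random

def conflict_probability(num_items, active_txns, locks_per_txn, trials=10_000):
    """Monte Carlo estimate of the chance that a newly arriving transaction
    requests at least one lock already held by an active transaction."""
    conflicts = 0
    for _ in range(trials):
        held = set()
        for _ in range(active_txns):
            held.update(random.sample(range(num_items), locks_per_txn))
        requested = set(random.sample(range(num_items), locks_per_txn))
        if requested & held:
            conflicts += 1
    return conflicts / trials

for mpl in (2, 5, 10, 20):  # multiprogramming level
    p = conflict_probability(num_items=1_000, active_txns=mpl, locks_per_txn=8)
    print(f"{mpl:2d} active transactions -> conflict probability {p:.3f}")
```

Models of this kind expose the thrashing behavior studied in the literature: beyond some multiprogramming level, added concurrency produces mostly blocking rather than throughput.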
2.2. OR in knowledge management

The field of KBSs has evolved from a number of separate streams of research in computer science. Fig. 1 shows the relationships between three fields: database systems, artificial intelligence, and distributed computing. As the figure indicates, KBS research involves the interaction between database systems and artificial intelligence, while distributed KBSs additionally include consideration of distributed computing. Early research on KBSs primarily focused on issues of knowledge representation, which involves the development of formal constructs for representing domain knowledge in a machine-interpretable form. These constructs range from sentences in symbolic logic to IF-THEN rules in production systems, frames, scripts, semantic nets and objects [76]. Methods for accessing and manipulating knowledge represented in terms of such constructs were largely simple heuristics such as the Rete algorithm [35] and various graph-based search algorithms [76]. Only in recent years has there been a growing awareness of the need for storage and access efficiency in KBSs, and more evidence of the use of OR techniques in this context. In rule-based systems, when the number of rules in the knowledge base becomes quite large,
the efficiency of the system can be significantly limited by the efficiency of rule access. One important design problem in such systems is the choice of the storage and organization mechanisms used for the knowledge base. Methods that have been proposed for this purpose include indexing and clustering methods, as well as methods for precompiling collections of rules into more complex structures. In [20], a Markov chain model is used to analyze the performance of different rule groupings in a real-time expert system (an expert system is a KBS whose knowledge base embodies domain expertise in a specific problem domain, such as the diagnosis of a particular disease or the design of a particular type of machine). Another example of the use of Markov chain models for analyzing rule groupings is [42].
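The essence of such Markov chain models can be sketched as follows (the transition matrix is hypothetical, not taken from the cited studies): the stationary distribution of a chain whose states are rule groups estimates the long-run access frequency of each group, which can then drive storage and clustering decisions:

```python
# Hypothetical 3-group transition matrix: P[i][j] is the probability that
# control moves from rule group i to rule group j during inference.
P = [[0.5, 0.4, 0.1],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

def stationary(P, iters=1000):
    """Power iteration: pi <- pi * P until convergence; pi[i] then
    approximates the long-run fraction of accesses that hit group i."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

pi = stationary(P)
print([round(x, 3) for x in pi])
# Groups with high long-run access probability are candidates for
# main-memory residence or for placement in the same cluster.
```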
Fig. 1. Fields related to data management and knowledge management (legend: DB, database systems; AI, artificial intelligence; DC, distributed computing; KBS, knowledge base systems; DAI, distributed AI; DDB, distributed DB; DKBS, distributed KBS).
Another approach to designing efficient KBSs is through the use of special knowledge representation constructs. In this area, graph theory has been particularly useful. Since rule-based knowledge bases can be characterized using graph-theoretical constructs such as dependency graphs, AND/OR graphs and influence diagrams, the properties and formal structure of such constructs can be utilized in efficiently manipulating such knowledge bases. This approach has been used by several researchers [24,34], and graph search algorithms have been proposed as a basis for inference mechanisms in KBSs [11,76]. OR models have also been used in the design of heuristic inference mechanisms. In particular, a number of researchers have performed comparative studies of the relative efficiency of heuristics for symbolic manipulation of knowledge bases versus mathematical programming methods. This work is motivated by the fact that domain knowledge represented using certain knowledge representation constructs can be reformulated as constraints in integer programs. To see this, consider the following rule stated in terms of three propositions x1, x2, x3:

"IF x1 is true AND x2 is false THEN x3 is true."    (1)

Inference using a knowledge base consisting of a set of rules such as Eq. (1) typically involves either determining whether a given expression can be inferred to be a logical consequence of the knowledge base, or determining whether the knowledge base is satisfiable (i.e., that there is at least one assignment of Boolean truth values to the propositions in the knowledge base that satisfies all the rules). There are a number of heuristic inference methods to solve this inference problem. However, it turns out that this expression can also be restated as the following inequality involving the 0-1 variables x1, x2, x3:

(1 - x1) + x2 + x3 ≥ 1.    (2)
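The translation from rules to 0-1 constraints can be made concrete with a small sketch (an illustrative toy, not any of the cited formulations): each rule becomes a clause whose literals must sum to at least 1, and satisfiability can then be checked, here by brute-force enumeration:

```python
from itertools import product

# A rule is (antecedents, consequent); a literal is (variable, polarity).
# The example rule from Eq. (1): IF x1 AND NOT x2 THEN x3.
rules = [((("x1", True), ("x2", False)), ("x3", True))]

def rule_to_clause(rule):
    """IF a AND NOT b THEN c becomes the clause (NOT a OR b OR c), i.e.
    the 0-1 constraint (1 - a) + b + c >= 1 of Eq. (2)."""
    ants, consequent = rule
    return [(v, not pol) for v, pol in ants] + [consequent]

def holds(clause, assignment):
    """A clause is satisfied iff at least one literal is true, which is
    exactly the requirement that its 0-1 terms sum to at least 1."""
    return any(assignment[v] == pol for v, pol in clause)

variables = sorted({v for ants, c in rules for v, _ in list(ants) + [c]})
clauses = [rule_to_clause(r) for r in rules]
count = sum(
    all(holds(c, dict(zip(variables, bits))) for c in clauses)
    for bits in product([False, True], repeat=len(variables)))
print(f"{count} of {2 ** len(variables)} assignments satisfy the rule base")
# -> 7 of 8 (only x1=True, x2=False, x3=False violates the rule)
```

An IP solver replaces the enumeration with branch-and-bound over the same constraints, which is exactly the connection the comparative studies below examine.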
This can be used to reformulate inference problems as dynamic programming problems as well as integer programming problems. There have been a number of comparative analyses of these methods. For example, in [16], a dynamic programming approach is shown to perform better than a linear program formulation. Similarly, a survey of different approaches to the satisfiability problem is made in [45]. The different methods are classified in terms of whether they are based on discrete or continuous search spaces, and whether they are based on constrained or unconstrained search. The methods range from discrete constrained algorithms such as resolution [85] and the Davis-Putnam algorithm [28] to continuous constrained models such as branch-and-bound [16] and cutting plane methods [50], and even continuous unconstrained search models such as neural net models [61]. In other studies, a performance comparison of an integer programming (IP) solver and an expert system for a course scheduling problem is reported in [30] (where it was found that the expert system's performance was less volatile than the IP solver), and LP formulations of constraint satisfaction problems are examined in [24]. These and other studies [51,60] indicate that inference problems in KBSs can often be solved using traditional mathematical programming tools. As discussed in [30], the choice of method depends upon factors such as the viability of a single objective function, the need to model complex preferences using simple parametric constraints and the importance of explanations of the solution process (which are a side product of reasoning mechanisms in expert systems, but are difficult to obtain from IP or LP solvers). Also, in [102], alternative modes of normalization of logical expressions (namely, conjunctive normal form (CNF) and disjunctive normal form (DNF)) are shown to be interchangeable. This implies that if an algorithm for learning rules from examples works better with one form or the other, efficiency gains can be achieved by transforming the examples in the knowledge base to the preferred form. Two other areas in KBS research where OR methods have made an impact are the support of multiple knowledge bases and the maintenance of consistency in knowledge bases. As KBSs are used more widely and have larger knowledge bases, two design enhancements for both ease of maintenance and use of the system are (a) the partitioning of the knowledge base into several smaller components, and (b) the use of inference engines that can combine knowledge from multiple knowledge bases. A number of efforts have been made to achieve such enhancements. For example, in [13], structural properties of a graph representation of multiple knowledge bases are used to construct methods for combining knowledge from these sources to solve problems. Knowledge base partitioning is addressed in [97], where min-cost graph partitioning algorithms are used to partition a knowledge base
in a way that minimizes interdependencies between partitions. Consistency maintenance is also an important issue, even in a single knowledge base. Determination of consistency essentially involves finding a model for the knowledge base (i.e., an assignment of values to the elements or propositions in the knowledge base that satisfies all the rules and constraints). In [96], a dynamic programming approach is used to detect inconsistencies in a KBS. The approach is quite general, and can be applied to a variety of knowledge representation frameworks. In the special case where the knowledge representation framework is propositional (i.e., does not use quantifiers or predicates containing variables), even IP techniques have been shown to solve the consistency checking problem [37,60]. In the context of multiple knowledge bases, an additional consideration is whether they are mutually consistent. The graph-theoretic approach in [13] is shown to also apply to the consistency checking problem in KBSs containing multiple knowledge bases (see [12] for related work using symbolic logic).
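A minimal sketch of the partitioning idea (the dependency graph and the local-search procedure are illustrative simplifications, not the algorithm of [97]) swaps rules between two balanced partitions whenever that reduces cross-partition dependencies:

```python
import random

# Hypothetical rule dependency graph: an edge means one rule's consequent
# feeds another rule's antecedent.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
NUM_RULES = 6

def cut_size(part):
    """Number of dependencies that cross the two partitions."""
    return sum(part[a] != part[b] for a, b in edges)

def balanced_partition(num_rules, restarts=20):
    """Local search over balanced 2-way splits: swap one rule from each
    side whenever the swap reduces the cut, with random restarts."""
    best = None
    for _ in range(restarts):
        part = [i % 2 for i in range(num_rules)]
        random.shuffle(part)
        improved = True
        while improved:
            improved = False
            left = [r for r in range(num_rules) if part[r] == 0]
            right = [r for r in range(num_rules) if part[r] == 1]
            for a in left:
                for b in right:
                    before = cut_size(part)
                    part[a], part[b] = 1, 0
                    if cut_size(part) < before:
                        improved = True
                        break            # re-derive left/right and rescan
                    part[a], part[b] = 0, 1   # undo the swap
                if improved:
                    break
        if best is None or cut_size(part) < cut_size(best):
            best = part[:]
    return best

part = balanced_partition(NUM_RULES)
print("partition:", part, "cross-partition dependencies:", cut_size(part))
```

For this toy graph the two rule clusters {0,1,2} and {3,4,5} are separated by a single dependency, so a good partition leaves only one cross-partition edge.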
3. Promising areas for OR modeling

In addition to the various areas discussed in Section 2, there are a number of other areas in data and knowledge management where OR models and methods are valuable. In this section, some of the areas where OR methods hold significant promise for future work are briefly discussed, and where available, relevant research efforts are identified.
3.1. Data management

3.1.1. Physical design
While there has been considerable work on performance modeling of file structures and file access mechanisms, emerging trends in database technology such as multimedia databases and object-oriented databases have led to the development of new file organization methods, such as R+-Trees [90] and hB-Trees [69]. While some efforts have
been made to analyze such structures, their complex and hierarchical forms are difficult to simulate or model using queuing and probabilistic models. Nevertheless, given that multimedia databases comprise one of the fastest growing areas of database technology and application, performance analysis studies using OR models would have great value for the implementation of such databases in industry.

3.1.2. Query processing

Although query optimization is an issue that has been studied extensively, it still remains an important area for research, especially as databases become larger and more complex. A striking indication of this is that 17 of the 40 research papers presented at the 1994 ACM SIGMOD International Conference on Management of Data, a major database research conference, were on query processing issues. OR models can be used in a variety of query processing issues and contexts. To start with, access plans generated by traditional query optimizers are static, since they are typically generated at compile time. Thus the efficiency of these plans is very sensitive to changes in database state. Recently, there have been some efforts to develop dynamic query optimizers that can exploit run-time information to select between multiple candidate access plans. Methods such as dynamic programming may be particularly useful here, as evidenced by [23]. A second promising area for application of OR models is in query optimization for newer types of databases, such as object-oriented databases, multimedia databases and temporal databases. Given that such databases do not have the structural simplicity of flat relational databases, physical design parameters such as size and selectivity estimates are far more difficult to determine at compile time. Here again, dynamic query optimization algorithms using methods such as dynamic programming and exploiting features such as parallel processing and pipelining are sorely needed. Finally, full-text databases in contexts such as medical systems involve dense indexing of unstructured documents, possibly across a large number of servers. Query processing in such a database requires new algorithms for efficient use of indexes in access plans, and techniques such as
integer programming, Boolean programming and dynamic programming may be effective in these algorithms.

3.1.3. Main-memory databases

With rapidly falling prices of main-memory components, main-memory databases are becoming more feasible. The performance characteristics of such databases are quite different from traditional disk-based databases, since the I/O bottlenecks that dominate design considerations in the latter are no longer as significant in main-memory resident databases [105]. Two issues that are particularly significant are (a) the performance analysis of main-memory databases, and (b) the development and analysis of database recovery methods. Examples of research on the use of OR methods in these areas include [41], where multivariate stochastic models are used to analyze failures, and [65], where simulation analysis is used to compare the performance of some popular recovery algorithms in main-memory database systems.

3.1.4. Object-oriented databases

Object orientation has gained tremendous popularity in information systems and computer science in a number of contexts. These include programming languages (e.g., Smalltalk, LOOPS, C++), systems analysis and design, database systems, and KBSs. The appeal of object-oriented databases (OODBs) stems from three factors. First, new applications for databases such as computer-aided design (CAD), geographical information systems (GISs) and multimedia applications all require data types such as graphical objects, audio and video objects, complex objects and lists of values, that are not supported easily in traditional DBMSs. Second, databases increasingly have to interface with application programs developed using object-oriented languages and tools, and having a common conceptual model facilitates this process. Finally, object orientation includes features such as encapsulation (embedding information about an object's behavior within its structure), classification (organization of objects into hierarchies or networks of classes), inheritance
(the ability to infer general properties or attributes of an object from those of the class(es) to which it belongs), and polymorphism (the ability of an object to display multiple, context-specific behaviors), which are powerful tools for information system design and use. Given that an object-oriented DBMS (OODBMS) is expected to provide all the functionality of a relational DBMS, the issues in data management that have been discussed so far all apply to an OODBMS as well. The literature on the use of OR methods for the design and analysis of OODBs and OODBMSs is still quite sparse. However, there is clearly growing interest. For example, stochastic models have been used in [15] to analyze indexing techniques in an OODBMS [64], simulation models of object clustering techniques have been used to analyze their performance in [72], and simulated annealing is explored as a basis for query optimization in an OODBMS in
[66].

3.1.5. Databases for special applications

As mentioned earlier, databases in domains such as CAD/CAM, geographical information systems and multimedia information systems require richer data models. In addition, DBMSs for such applications have to support special features and operators. For example, even simple spatial queries, such as one for finding all objects located in a specified physical region, cannot be processed efficiently using traditional query processing operators and algorithms. An excellent discussion of spatial data structures and operations is provided in [107]. An example of OR-based work in this area is a simulation study of different hierarchical data structures in multidimensional databases, in [75]. Similar techniques can also be used for multimedia databases, which include nontraditional data types such as voice, audio, graphics and video. For example, a closed queuing network model is used to analyze a multimedia database in [100], nonlinear stochastic models are used to analyze the effects of varying network delays on data synchronization in multimedia databases in [84], and [39] presents a simulation analysis of a parallel processing architecture for a distributed multimedia database.
3.2. Knowledge management

The KBS field is still relatively young, and there is need for more work on almost every front. Better understanding of the cost factors underlying the design and use of a KBS should motivate the use of optimization models for knowledge base design (especially in multiple knowledge base systems), as well as better integration of KBS inference methods and mathematical programming methods. In addition to the KBS research issues discussed in the previous section, there are several areas where OR methods have not yet been used extensively, but where they have significant potential value.

3.2.1. Deductive databases

A deductive database (also referred to as an intelligent database) complements large sets of stored data with a collection of rules, which represent relationships between stored data. Such rules not only facilitate efficient data management (by avoiding the need for storing data that can be inferred from other stored data), but can also be used as integrity constraints to maintain database integrity. In recent years, a number of DBMS products have been augmented with capabilities for rule processing. Ideally, a deductive database system should include all the functionality of a traditional DBMS. However, in the presence of rules and integrity constraints, some of the basic processes have to be modified. For example, query optimization has to include inference steps as well as data access steps [19]. Examples of work in this area include a branch-and-bound algorithm for minimizing the cost of a set of queries in such a system in [91], a data-driven optimization algorithm based on transformation rules in [93], and a conservative compile-time approach to query optimization in [63]. A problem of particular interest in deductive databases is the generation of the transitive closure of a database (i.e., the set of facts that can be inferred from the available knowledge base). Here, graph traversal algorithms have proved to be quite effective [6,53].
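The graph traversal view of transitive closure can be sketched directly (relation and fact names are hypothetical): breadth-first search from each node derives exactly the facts that repeated application of a recursive rule would produce:

```python
from collections import defaultdict, deque

# Hypothetical stored facts for a rule pair like:
#   ancestor(X, Y) :- parent(X, Y).
#   ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
parent = [("ann", "bob"), ("bob", "cal"), ("cal", "dee"), ("ann", "eve")]

def transitive_closure(pairs):
    """BFS from each node over the fact graph; equivalent to repeatedly
    applying the recursive rule until no new facts are derived."""
    graph = defaultdict(list)
    for x, y in pairs:
        graph[x].append(y)
    closure = set()
    for start in list(graph):
        queue = deque(graph[start])
        while queue:
            y = queue.popleft()
            if (start, y) not in closure:
                closure.add((start, y))
                queue.extend(graph[y])
    return closure

print(sorted(transitive_closure(parent)))
# 7 derived ancestor facts: ann->bob, ann->cal, ann->dee, ann->eve, ...
```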
Both deductive databases and expert systems are types of KBSs. However, one way in which they differ is in the composition of their knowledge bases. The knowledge base of a deductive database consists of a large database, along with a relatively small set of rules and constraints. On the other hand, the knowledge base of an expert system typically has a large collection of rules, with a relatively small database. This implies that mechanisms for information access in one type of system are not always efficient for the other. For example, problem solving in an expert system is often top down, starting with rule-based inference, and accessing data only when necessary for a rule to fire. Conversely, a DBMS might use rules only when the necessary data are not stored, or when the set of data forming the query result has to be tested against a set of integrity constraints. OR techniques for performance analysis may be extremely useful in understanding the trade-offs between these methods, and determining which method is best suited to any given problem. Furthermore, optimization models may be applicable in query optimizers to decide which method, or combination of methods, should be used for each problem (or query) instance.

3.2.2. Exploiting parallel processing

Since inference in any KBS can be represented as heuristic search over some type of graph structure, the availability of multiple processors allows such search to proceed concurrently along multiple fronts in parallel. In knowledge bases that are tree-structured (i.e., with little or no recursion or cycles), parallel processing can significantly improve the efficiency of the inference process. A number of researchers have developed and analyzed algorithms for parallel search [32,46,55]. More comprehensive performance analysis of such algorithms using queuing networks and simulation studies would be of great value.

3.2.3. Knowledge discovery

The problem of extracting models of reality from large bodies of data is a classical problem in statistics, and there is a rich body of research in this area. For the most part, statistical methods are used to validate models built from other theories. With the evolution of large databases such as point-of-sale scanner databases, there is a growing
need to augment traditional statistical analysis with methods that are more constructive and dynamic (i.e., methods that enable new models to be discovered from data, and facilitate evolution of such models over time). In recent years, the field of knowledge discovery (also known as data mining, and data dredging) has attracted much attention [70]. This field draws upon methods from neural networks, machine learning, database systems and statistics to develop databases that have richer semantics than those available at the initial database design stage. Applications of knowledge discovery methods range from business transaction databases such as scanner databases [7] to biological and medical research [103]. OR methods can play a role both in the development of new algorithms for building models [31,93], as well as in the performance analysis of knowledge discovery methods [8] using stochastic models and simulation. The latter area is particularly important if such methods are to be used in real-time databases (e.g., where the models in a scanner database may be modified as transactions are being processed). In such situations, the data to be accessed from the database includes not only that needed for transaction processing, but also the data needed by the knowledge discovery algorithms. This poses interesting query optimization problems for which mathematical programming algorithms can be used (including extensions of existing algorithms for query optimization).
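A sketch of the level-wise support counting that underlies association rule mining of the kind reported in [7] is given below (the baskets and threshold are hypothetical); the key observation is that a k-itemset can be frequent only if all of its (k-1)-item subsets are, so each pass prunes the candidate set:

```python
from itertools import combinations
from collections import Counter

# Hypothetical point-of-sale transactions.
baskets = [{"bread", "milk"}, {"bread", "beer", "eggs"},
           {"milk", "beer", "cola"}, {"bread", "milk", "beer"},
           {"bread", "milk", "cola"}]
MIN_SUPPORT = 3  # an itemset must appear in at least 3 baskets

def frequent_itemsets(baskets, min_support, max_size=3):
    """Level-wise (Apriori-style) counting with candidate pruning."""
    frequent = {}
    current = {frozenset([i]) for b in baskets for i in b}
    for k in range(1, max_size + 1):
        counts = Counter(c for c in current for b in baskets if c <= b)
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Candidates for the next level: unions of frequent k-itemsets
        # all of whose k-subsets are themselves frequent.
        current = {a | b for a in level for b in level
                   if len(a | b) == k + 1
                   and all(frozenset(s) in level
                           for s in combinations(a | b, k))}
    return frequent

for itemset, support in sorted(frequent_itemsets(baskets, MIN_SUPPORT).items(),
                               key=lambda kv: -kv[1]):
    print(set(itemset), support)
```

On this toy data the pass structure finds, for example, that {bread, milk} co-occurs in three baskets while {bread, beer} falls below the threshold, which is the raw material for association rules such as "bread implies milk".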
4. Conclusions

The design and analysis of efficient DBMSs and KBSs is an important research area where OR methods such as mathematical programming, stochastic modeling and simulation have a significant role to play. To appreciate this, consider that it is only in the last decade that large organizations have started adopting relational DBMSs for critical applications, primarily because of concerns about the efficiency of relational systems. Algorithms that better address the significant costs of data processing, as well as better models for the analysis of the performance of DBMSs, are essential for making the technology more acceptable and reliable. Furthermore, today relatively few organizations use KBSs for domains where the knowledge bases are large, or where response time is critical. Again, as with DBMSs, the primary constraint is efficiency and reliability. OR methods for optimization can be used to improve the performance of such systems, and OR methods for performance analysis can help answer critical questions about their efficiency and reliability. This article provides a perspective on the area of data and knowledge management that will hopefully help both researchers in OR as well as researchers in DBMS and KBS. The accompanying cluster of articles in this issue presents interesting examples of the use of OR in data and knowledge management. The articles address issues ranging from established ones such as distributed database design, query optimization and rule processing to emerging areas such as main-memory databases. Beyond these areas, and others discussed in this article, there are also some nontraditional methods in OR, such as learning algorithms and genetic algorithms, which are potentially valuable tools in data and knowledge processing. As with any technology, as the critical open questions in designing and managing DBMSs and KBSs move from "how can we do this?" to "can we do this in a cost-effective and value-adding manner?", the role of OR methods in these areas should grow even larger.

References

[1] R.K. Abbott, H. Garcia-Molina, Scheduling real-time transactions: A performance evaluation, ACM Transactions on Database Systems 17 (3) (1992) 513-560.
[2] M. Abdelguerfi, A.K. Sood, Computational complexity of sorting and joining relations with duplicates, IEEE Transactions on Knowledge and Data Engineering 3 (4) (1991) 496-503.
[3] D. Agrawal, S. Sengupta, Modular synchronization in distributed, multiversion databases: Version control and concurrency control, IEEE Transactions on Knowledge and Data Engineering 5 (1) (1993) 126-137.
[4] D. Agrawal, A. El Abbadi, A.E. Lang, The performance of protocols based on locks with ordered sharing, IEEE Transactions on Knowledge and Data Engineering 6 (5) (1994) 805-818.
[5] R. Agrawal, M.J. Carey, M. Livny, Concurrency control performance modeling: Alternatives and implications, ACM Transactions on Database Systems 12 (4) (1987) 609-654.
[6] R. Agrawal, S. Dar, H.V. Jagadish, Direct transitive closure algorithms: Design and performance evaluation, ACM Transactions on Database Systems 15 (3) (1990) 427-458.
[7] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, Proceedings of the ACM SIGMOD International Conference on Management of Data, 1993, pp. 207-216.
[8] R. Agrawal, T. Imielinski, A. Swami, Database mining: A performance perspective, IEEE Transactions on Knowledge and Data Engineering 5 (6) (1993) 914-925.
[9] R. Alonso, D. Barbara, H. Garcia-Molina, Data caching issues in an information retrieval system, ACM Transactions on Database Systems 15 (3) (1990) 359-384.
[10] Y.M. Babad, J.A. Hoffer, A mathematical programming model for the location of access controls in a distributed database environment, Operations Research 32 (1) (1984) 23-40.
[11] A. Bagchi, A. Mahanti, Three approaches to heuristic search in networks, Journal of the ACM 32 (1) (1985) 1-27.
[12] C. Baral, S. Kraus, J. Minker, Combining multiple knowledge bases, IEEE Transactions on Knowledge and Data Engineering 3 (2) (1991) 208-220.
[13] A. Basu, A knowledge representation model for multiuser knowledge-based systems, IEEE Transactions on Knowledge and Data Engineering 5 (2) (1993) 177-189.
[14] P. Bernstein, N. Goodman, Concurrency control in distributed database systems, ACM Computing Surveys 13 (2) (1981) 185-222.
[15] E. Bertino, W. Kim, Indexing techniques for queries on nested objects, IEEE Transactions on Knowledge and Data Engineering 1 (2) (1989) 196-214.
[16] C.E. Blair, R.G. Jeroslow, J.K. Lowe, Some results and experiments in programming techniques for propositional logic, Computers and Operations Research 5 (1986) 633-645.
[17] F. Bukhari, S.L. Osborn, Two fully distributed concurrency control algorithms, IEEE Transactions on Knowledge and Data Engineering 5 (5) (1993) 872-881.
[18] M.J. Carey, M. Livny, Conflict detection tradeoffs for replicated data, ACM Transactions on Database Systems 16 (4) (1991) 703-746.
[19] U.S. Chakravarthy, J. Grant, J. Minker, Foundations of semantic query optimization for deductive databases, in: J. Minker (Ed.), Foundations of Deductive Databases and Logic Programming, Morgan Kaufmann, Los Altos, CA, 1988.
[20] I.-R. Chen, B. Poole, Performance evaluation of rule grouping on a real-time expert system architecture, IEEE Transactions on Knowledge and Data Engineering 6 (6) (1994) 883-891.
[21] S.Y. Cheung, M.H. Ammar, M. Ahamad, The grid protocol: A high performance scheme for maintaining replicated data, IEEE Transactions on Knowledge and Data Engineering 4 (6) (1992) 582-591.
[22] B. Ciciani, D.M. Dias, P.S. Yu, Analysis of replication in distributed database systems, IEEE Transactions on Knowledge and Data Engineering 2 (2) (1990) 247-261.
[23] R.L. Cole, G. Graefe, Optimization of dynamic query evaluation plans, Proceedings of the ACM SIGMOD International Conference on Management of Data, 1994, pp. 150-160.
[24] A.E. Croker, V. Dhar, A knowledge representation for constraint satisfaction problems, IEEE Transactions on Knowledge and Data Engineering 5 (5) (1993) 740-752.
[25] P. Dasgupta, Z.M. Kedem, The five-color concurrency control protocol: Non-two-phase locking in general databases, ACM Transactions on Database Systems 15 (2) (1990) 281-307.
[26] C.J. Date, An Introduction to Database Systems, 6th ed., Addison-Wesley, Reading, MA, 1994.
[27] S.B. Davidson, H. Garcia-Molina, D. Skeen, Consistency in partitioned networks, ACM Computing Surveys 17 (3) (1985) 341-370.
[28] M. Davis, H. Putnam, A computing procedure for quantification theory, Journal of the ACM 7 (2) (1960) 201-215.
[29] R. Dewan, B. Gavish, Combined logical and physical design of databases, IEEE Transactions on Computers 38 (7) (1989) 955-967.
[30] V. Dhar, N. Ranganathan, Integer programming vs. expert systems: An experimental comparison, Communications of the ACM 33 (3) (1990) 323-336.
[31] V. Dhar, A. Tuzhilin, Abstract-driven pattern discovery in databases, IEEE Transactions on Knowledge and Data Engineering 5 (6) (1993) 926-938.
[32] V.V. Dixit, D.I. Moldovan, Minimal state space search in parallel production systems, IEEE Transactions on Knowledge and Data Engineering 3 (4) (1991) 435-443.
[33] P.E. Drenick, E.J. Smith, Stochastic query optimization in distributed databases, ACM Transactions on Database Systems 18 (2) (1993) 262-288.
[34] A. Dutta, A. Basu, An artificial intelligence approach to model management in decision support systems, IEEE Computer 17 (9) (1984) 89-97.
[35] C.L. Forgy, Rete: A fast algorithm for the many pattern/many object pattern match problem, Artificial Intelligence 19 (1982) 17-37.
[36] P.A. Franaszek, J.T. Robinson, A. Thomasian, Concurrency control for high contention environments, ACM Transactions on Database Systems 17 (3) (1992) 304-345.
[37] G. Gallo, G. Urbani, Algorithms for testing the satisfiability of propositional formulae, Journal of Logic Programming 7 (1) (1989) 45-61.
[38] B. Gavish, A. Segev, Set query optimization in distributed database systems, ACM Transactions on Database Systems 11 (3) (1986) 265-293.
[39] S. Ghandeharizadeh, L. Ramos, Continuous retrieval of multimedia data using parallelism, IEEE Transactions on Knowledge and Data Engineering 5 (4) (1993) 658-669.
[40] D. Ghosh, I. Murthy, A solution procedure for the file allocation problem with file availability and response time, Computers and Operations Research 18 (6) (1991) 557-568.
[41] P. Goes, A stochastic model for performance evaluation of main memory resident database systems, INFORMS Journal on Computing 7 (3) (1995) 269-282.
[42] M.M. Gooley, B.W. Wah, Efficient reordering of Prolog programs, IEEE Transactions on Knowledge and Data Engineering 1 (4) (1989) 470-482.
[43] R.D. Gopal, R. Ramesh, S. Zionts, Access path optimization in relational joins, INFORMS Journal on Computing 7 (3) (1995) 257-268.
[44] G. Graefe, Query evaluation techniques for large databases, ACM Computing Surveys 25 (2) (1993) 73-170.
[45] J. Gu, Global optimization for satisfiability (SAT) problem, IEEE Transactions on Knowledge and Data Engineering 6 (3) (1994) 361-381.
[46] A. Gupta, C.L. Forgy, A. Newell, R. Wedig, Parallel algorithms and architectures for rule-based systems, Proceedings of the International Symposium on Computer Architecture, 1986, pp. 28-37.
[47] P. Helman, A. Rosenthal, A mass production technique to speed multiple query optimization and physical database design, ORSA Journal on Computing 3 (1) (1993) 33-55.
[48] A.R. Hevner, S.B. Yao, Query processing in distributed databases, IEEE Transactions on Software Engineering 5 (3) (1979) 177-187.
[49] J.A. Hoffer, A. Kovacevic, Optimal performance of inverted files, Operations Research 30 (2) (1982) 336-354.
[50] J.N. Hooker, Generalized resolution and cutting planes, Annals of Operations Research 12 (1988) 217-239.
[51] J.N. Hooker, A quantitative approach to logical inference, Decision Support Systems 4 (1) (1988) 45-69.
[52] M. Hsu, B. Zhang, Performance evaluation of cautious waiting, ACM Transactions on Database Systems 17 (3) (1992) 477-512.
[53] Y. Ioannidis, R. Ramakrishnan, L. Winger, Transitive closure algorithms based on graph traversal, ACM Transactions on Database Systems 18 (3) (1993) 512-576.
[54] Y. Ioannidis, S. Christodoulakis, Optimal histograms for limiting worst-case error propagation in the size of join results, ACM Transactions on Database Systems 18 (4) (1993) 709-748.
[55] T. Ishida, Parallel rule firing in production systems, IEEE Transactions on Knowledge and Data Engineering 3 (1) (1991) 11-17.
[56] R. Jagannathan, Optimal partial-match hashing design, ORSA Journal on Computing 3 (2) (1991) 86-91.
[57] S. Jajodia, D. Mutchler, Dynamic voting algorithms for maintaining the consistency of a replicated database, ACM Transactions on Database Systems 15 (2) (1990) 230-280.
[58] S. Jajodia, D. Mutchler, Integrating static and dynamic voting protocols to enhance file availability, Proceedings of the Fourth International Conference on Data Engineering, 1988, pp. 144-153.
[59] M. Jarke, J. Koch, Query optimization in database systems, ACM Computing Surveys 16 (2) (1984) 111-152.
[60] R.G. Jeroslow, J. Wang, Solving propositional satisfiability problems, Annals of Mathematics and Artificial Intelligence 1 (1990) 167-187.
[61] J.L. Johnson, A neural network approach to the 3-satisfiability problem, Journal of Parallel and Distributed Computing 6 (1989) 435-449.
[62] T. Johnson, D. Shasha, The performance of current B-tree algorithms, ACM Transactions on Database Systems 18 (1) (1993) 51-101.
[63] M. Kifer, E.L. Lozinskii, On compile-time query optimization in deductive databases by means of static filtering, ACM Transactions on Database Systems 15 (3) (1990) 385-426.
[64] W. Kim, K.C. Kim, A. Dale, Indexing techniques for object-oriented databases, in: W. Kim, F. Lochovsky (Eds.), Object-Oriented Concepts, Databases and Applications, Addison-Wesley, Reading, MA, 1989.
[65] Kumar, A. Berger, Performance measurement of some main memory database recovery algorithms, Proceedings of the Seventh International Conference on Data Engineering, Kobe, 1991, pp. 436-443.
[66] R. Lanzelotte, P. Valduriez, M. Zait, Optimization of object-oriented recursive queries using cost controlled strategies, Proceedings of the ACM SIGMOD International Conference on Management of Data, 1992, pp. 256-265.
[67] L. Li, Fast in-place verification of data dependencies, IEEE Transactions on Knowledge and Data Engineering 5 (2) (1993) 266-281.
[68] D.B. Lomet, A simple bounded disorder file organization with good performance, ACM Transactions on Database Systems 13 (4) (1988) 525-551.
[69] D.B. Lomet, B. Salzberg, The hB-tree: A multiattribute indexing method with good guaranteed performance, ACM Transactions on Database Systems 15 (4) (1990) 625-658.
[70] C.J. Matheus, P.K. Chan, G. Piatetsky-Shapiro, Systems for knowledge discovery in databases, IEEE Transactions on Knowledge and Data Engineering 5 (6) (1993) 903-913.
[71] G. Matsliach, Performance analysis of file organizations that use multibucket data leaves with partial expansions, ACM Transactions on Database Systems 18 (1) (1993) 157-180.
[72] W.J. McIver, R. King, Self-adaptive, on-line reclustering of complex object data, Proceedings of the ACM SIGMOD International Conference on Management of Data, 1994, pp. 407-418.
[73] H. Mendelson, U. Yechiali, Optimal policies for database reorganization, Operations Research 29 (1) (1981) 23-36.
[74] H. Mendelson, U. Yechiali, Physical design for a random-access file with random insertions and deletions, Computers and Operations Research 13 (4) (1986) 489-505.
[75] Y. Nakamura, S. Abe, Y. Ohsawa, M. Sakauchi, A balanced hierarchical data structure for multidimensional data with highly efficient dynamic characteristics, IEEE Transactions on Knowledge and Data Engineering 5 (4) (1993) 682-694.
[76] N.J. Nilsson, Principles of Artificial Intelligence, Tioga Press, Palo Alto, CA, 1980.
[77] E. Omiecinski, P. Scheuermann, A parallel algorithm for record clustering, ACM Transactions on Database Systems 15 (4) (1990) 599-624.
[78] H. Pirkul, An integer programming model for the allocation of databases in a distributed computer system, European Journal of Operational Research 26 (3) (1986) 401-411.
[79] H. Pirkul, H.-P. Hou, Allocating primary and backup copies of databases in distributed computing systems: A model and solution procedures, Computers and Operations Research 16 (3) (1989) 235-245.
[80] E. Rahm, Empirical performance evaluation of concurrency and coherency control protocols for database sharing systems, ACM Transactions on Database Systems 18 (2) (1993) 333-377.
[81] S. Ram, S. Narasimhan, Data allocation in a distributed environment: Incorporating concurrency control and queueing costs, Management Science 40 (8) (1994) 969-983.
[82] S. Ram, S. Narasimhan, Incorporating the majority consensus concurrency control mechanism into the database allocation problem, INFORMS Journal on Computing 7 (3) (1995) 244-256.
[83] R. Ramesh, A.J.G. Babu, J.P. Kincaid, Variable-depth trie index optimization: Theory and experimental results, ACM Transactions on Database Systems 14 (1) (1989) 41-74.
[84] K. Ravindran, V. Bansal, Delay compensation protocols for synchronization of multimedia data streams, IEEE Transactions on Knowledge and Data Engineering 5 (4) (1993) 589-674.
[85] J.A. Robinson, A machine-oriented logic based on the resolution principle, Journal of the ACM 12 (1) (1965) 23-41.
[86] Y. Sagiv, O. Shmueli, Solving queries by tree projections, ACM Transactions on Database Systems 18 (3) (1993) 487-511.
[87] K. Salem, H. Garcia-Molina, J. Shands, Altruistic locking, ACM Transactions on Database Systems 19 (1) (1994) 117-165.
[88] H. Samet, The Design and Analysis of Spatial Data Structures, Addison-Wesley, Reading, MA, 1989.
[89] O.T. Satyanarayanan, D. Agrawal, Efficient execution of read-only transactions in replicated multiversion databases, IEEE Transactions on Knowledge and Data Engineering 5 (5) (1993) 859-871.
[90] T.K. Sellis, N. Roussopoulos, C. Faloutsos, The R+-tree: A dynamic index for multidimensional objects, Proceedings of the 13th VLDB Conference, 1987.
[91] T.K. Sellis, Multiple query optimization, ACM Transactions on Database Systems 13 (1) (1988) 23-52.
[92] D. Shasha, T.-L. Wang, Optimizing equijoin queries in distributed databases where relations are hash-partitioned, ACM Transactions on Database Systems 16 (2) (1991) 279-308.
[93] S. Shekhar, B. Hamidzadeh, A. Kohli, M. Coyle, Learning transformation rules for semantic query optimization: A data-driven approach, IEEE Transactions on Knowledge and Data Engineering 5 (6) (1993) 950-964.
[94] E.J. Shekita, K.-L. Tan, Multi-join optimization for symmetric multiprocessors, Proceedings of the 19th VLDB Conference, 1993, pp. 479-492.
[95] O.R.L. Sheng, H. Lee, Data allocation design in computer networks: LAN versus MAN versus WAN, Annals of Operations Research 36 (1) (1992) 125-150.
[96] P.P. Shenoy, Consistency in valuation-based systems, ORSA Journal on Computing 6 (3) (1994) 281-291.
[97] R. Srikanth, A graph theory-based approach for partitioning knowledge bases, INFORMS Journal on Computing 7 (3) (1995) 286-297.
[98] M.M. Srinivasan, K. Kant, The file allocation problem: A queueing network optimization approach, Computers and Operations Research 14 (5) (1987) 349-361.
[99] A.B. Stephens, Y. Yesha, K. Humenik, Optimal allocation for partially replicated database systems on ring networks, IEEE Transactions on Knowledge and Data Engineering 6 (6) (1994) 975-982.
[100] G. Sudhakar, A. Karmouch, N.D. Georganas, Design and performance evaluation considerations of a multimedia medical database, IEEE Transactions on Knowledge and Data Engineering 5 (5) (1993) 888-894.
[101] A. Thomasian, Two-phase locking performance and its thrashing behavior, ACM Transactions on Database Systems 18 (4) (1993) 579-625.
[102] E. Triantaphyllou, A.L. Soyster, A relationship between CNF and DNF systems derivable from examples, INFORMS Journal on Computing 7 (3) (1995) 283-285.
[103] J.T. Wang et al., Combinatorial pattern discovery for scientific data: Some preliminary results, Proceedings of the ACM SIGMOD International Conference on Management of Data, 1994, pp. 115-125.
[104] K.-Y. Whang, B.T. Vander-Zanden, H.M. Taylor, A linear-time probabilistic counting algorithm for database applications, ACM Transactions on Database Systems 15 (2) (1990) 208-229.
[105] K.-Y. Whang, R. Krishnamurthy, Query optimization in a memory resident domain calculus database system, ACM Transactions on Database Systems 15 (1) (1990) 67-95.
[106] P.S. Yu, A. Leff, Y.-H. Lee, On robust transaction routing and load sharing, ACM Transactions on Database Systems 16 (3) (1991) 476-512.
[107] H. Samet, The Design and Analysis of Spatial Data Structures, Addison-Wesley, Reading, MA, 1989.