Architecture of a Relational Decision Support System *

Carlo DELL'AQUILA, Ezio LEFONS and Filippo TANGORRA
Istituto di Scienze dell'Informazione, Università di Bari, I-70126 Bari, Italy

Luigi COLAZZO
Istituto di Informatica, Università di Trento, I-38100 Trento, Italy
The architecture of the POLD relational system for distributed data analysis is presented. POLD was designed as the kernel of a decision support information system. The peculiarity of POLD consists in allowing the user to view and analyze the data information as a couple (data, semantics), the semantics being a user's suitable classification of the data. This approach furnishes the user with a powerful tool for the semantic integration of data coming from different and/or distributed sources. Both semantics and user views are dynamically definable, that is, they may change with user, current time, and application. The POLD system supports an extended version of the relational operations in order to fit the proposed approach. Moreover, this system provides features for statistical analyses, reports and graphical representations.
Keywords: Data Semantics, Decision Support Systems, Relational Data Bases, Database Management.

1. Introduction

Statistical and scientific information systems [22,33-35] and decision support systems [5] often require that large quantities of data be managed and manipulated. Some research fields (e.g., earth resources, meteorology, astronomy, nuclear physics) involve huge volumes of raw data and time series which must be automatically produced and collected. The modelling and organizing of whole sets of data in a database is a storage- and time-consuming effort, often useless for the actual current needs of the user because, normally, the user's analysis/research is concerned with only a small part of the data. The decisional analyses in non-scientific environments require that the user mainly accesses
Carlo dell'Aquila is Associate Professor of Computer Science at the University of Bari, Italy. He received his degree in Physics in 1969 from the University of Bari, where he taught Physics in the Engineering and in the Science Departments. His research interests are in the area of database management, decision support systems, and computer-based learning systems.

Ezio Lefons received his degree in Mathematics from the University of Pisa in 1973. At the present time, he is Associate Professor of 'Information Organization Techniques' at the University of Bari, and Coordinator of the doctoral course of research on 'Science of Human Relations'. His current scientific interests include models of databases and analytic databases, many-valued logics, and human relational interaction system modelling.

Filippo Tangorra received his degree in Computer Science from the University of Bari, Italy, in 1975. He is currently Associate Professor of Computer Science at the same University. His research interests are in the area of database management systems, scientific and statistical database management, and decision support systems.

Luigi Colazzo received his degree in Computer Science from the University of Bari, Italy, in 1977. He is Research Assistant at the University of Trento, in the Institute of Information Science. His scientific interests regard conceptual models of analytic databases, and research activity support systems.

* In memoriam of Professor Alberto Silvestri, the first proponent of the POLD system.

North-Holland Decision Support Systems 5 (1989) 65-78
0167-9236/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland)
existing (structured) data, such as administrative data, in order to obtain basic statistical data. Moreover, the correlation and the integration of data coming from different sources often become necessary. Practically speaking, this may result in a physical, logical and/or semantic bottleneck [24,27,36]. Administrative as well as scientific data also cover very large sets: frequently this is the case when the data potentially useful to the current analysis are contained in collections gathered without a well-established or finalized purpose. Naturally, the organization of all primary available data in a database is of a significant cost, and conceptual modelling, even when realistic, may be sometimes an unprofitable or a misleading 'window on the real world' for the user. People may interpret data differently, even in the presence of a conceptual data model. Moreover, a data model can produce a distortion of reality because it forces the user to view reality as presented in that model [20]. The principle of semantic relativism of information, that is, allowing greater user flexibility in the interpretation of data and data model as compared to the real world (cf., e.g., [6]), must be extended for each scientific and decisional user. The data model has
[Fig. 1 shows the definition levels of a real database: level 0 (raw data), level 1 (intermediate), and level 2 (operational final data).]
Fig. 1. Definition levels of a real database.
to adapt or derive from the user, not vice-versa [25]. Also, in the presence of a growing quantity of data, regardless of whether it comes from external databases or from off-line raw collections, the totality of the (distributed) data may not be reasonably viewed and conceptually modelled to form a representation of a captured slice of the real world. The same user may differently resolve semantic ambiguities existing in the data according to problems, data, bases of reference, space and time [25]. On the other hand, a conceptual data model generally has to be stable and suitable for more than one user. The fact remains that it is difficult to define reality and to define the rules for human interpretations [13]. Consequently, scientific/decisional users must be furnished with a 'flexible' tool for defining and manipulating personal interpretations of real world phenomena, provided that the interpretations are beneficial for the solution of the user's problem. This paper describes our approach.
2. The 'Virtual Database' Approach The conventional relational database systems (e.g., [2,8,19,30,37,38]) do not allow the user to
assign an application-dependent semantics to the data. However, it must be said that these systems are not generally designed to answer the complex queries that decisional applications may involve. In our approach, the whole distributed set of available raw data is called the 'virtual database' as opposed to a 'real database'. Whereas the real database concept may be similar to the external database one in the ANSI/SPARC architecture, the virtual database concept has no correspondent in standard architecture and simply refers to the totality of data, possibly un-modelled, from which each user can extract a suitable subset by using selection/aggregation processes. The subset, properly finalized to the application by the user (semantics assigned), can be updated and queried as in an effective database [fig. 1]. The real database, on the other hand, consists of one or more updatable 'sessions', each one corresponding to a particular semantic view of the data with respect to the application by the user.
2.1. The User's Session

More sessions can coexist within the same real database. For example, they can express alternative frames for interpreting the phenomena under investigation or they may contain the temporal versions of the phenomena. Finally, they may serve to separately analyze phenomena, which may be related subsequently. The POLD system prototype that we utilized [12] allows the user to construct as many sessions as he deems useful from the virtual database. In order to furnish all the facilities for the query process, a user session consists of a set of objects called compounds. From the conceptual point of view, a compound corresponds to an interpreted phenomenon. From the logical point of view, the compound is a normalized relation [9] or a multirelation (i.e., duplication may occur). Physically, it consists of the following four files:
(a) The .Data. file. It is a sequential file and contains the primary data selected from the virtual database. In what follows, we assume
[Fig. 2 shows the four files of the FLAT compound: the FLAT.Data. tuples, the FLAT.Schema. attribute names, the FLAT.Test. classification conditions, and the FLAT.String. bitstrings, together with the Luxury-flat catalogued procedure in query-matrix form, corresponding to the condition (#2 and #8 and #30 and not(#36)) or (#2 and #8 and #31 and not(#36)).]
Fig. 2. Example of urbanistic census compound.
that the .Data. file be a relation or, possibly, a multirelation.
(b) The .Schema. file. It is the usual schema, that is, it contains the description of the .Data. file: attribute names, primary keys, types and formats of the data.
(c) The .Test. file. This file contains the definition of the suitable semantics that the user assigns to the primary data stored in the .Data. file. The semantics is specified by means of conditions of classification. The aim, as will be discussed later, is to furnish information for the creation of the .String. file.
(d) The .String. file. This is a file of bitstrings and is dense with respect to the .Data. file (that is, each bitstring refers to the corresponding tuple in the .Data. file). A bitstring is the result of the application of the classification (defined in the .Test. file) on the corresponding primary tuple. The .String. file represents all the information in the .Data. file that the user currently deems relevant to future queries.
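As a minimal sketch of the four files above (in Python, with illustrative names and data; POLD itself exposes no such API), the .String. file is simply the result of applying every .Test. condition to every .Data. tuple:

```python
# Hypothetical in-memory stand-ins for a compound's files.

def materialize_strings(data, tests):
    """Apply each classification test to each tuple, yielding one bitstring
    per tuple (dense with respect to the .Data. file)."""
    return [[1 if test(row) else 0 for test in tests] for row in data]

# .Schema.: attribute names; .Data.: primary tuples selected from the virtual DB
schema = ("district", "year", "surface")
data = [
    {"district": "centre", "year": 1940, "surface": 248},
    {"district": "suburb", "year": 1967, "surface": 86},
]

# .Test.: user-defined conditions of classification (illustrative conditions)
tests = [
    lambda r: r["district"] == "centre",   # e.g. test #1
    lambda r: r["surface"] > 200,          # e.g. test #2
]

# .String.: boolean results of the tests, one bitstring per tuple
strings = materialize_strings(data, tests)
print(strings)  # [[1, 1], [0, 0]]
```

Queries then read only the compact `strings` list, touching `data` just to retrieve the selected tuples.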
2.2. Example of the Compound Structure

In order to clarify the structure of the compound, we will consider an example taken from a real application for the urbanistic planning of a town based on the census of buildings. For the sake of simplicity, we consider a session composed of only the FLAT compound shown in fig. 2. A tuple in the FLAT.Data. refers to a flat in a building. Only the attribute names are shown in the related FLAT.Schema.. The FLAT.Test. contains the definition of the user's set of conditions of primary data classification that is considered relevant to a certain investigation. The FLAT.String. contains the boolean values of the classification conditions defined in the FLAT.Test. and applied to the FLAT.Data.. For instance, the second bit in the ith tuple of the FLAT.String. represents the truth value of the second condition (test #2) on the ith tuple of the FLAT.Data. file, or, in other words, whether or not the ith flat is located in the central district.

3. POLD Architecture Overview

Before discussing in detail the user's definition and manipulation of the real database, in this section we give an architecture overview of the POLD system. The architecture is shown in fig. 3.
[Fig. 3 shows one node of the distributed architecture: user applications on top of the POLD Relational System Run Module, the local filing system with its physical file processor, the network filing system with its physical file processor connected to the network, and the POLT system linked through a local communication system.]
Fig. 3. Architecture of the POLD distributed system.
[Fig. 4 shows the POLD command menu tree: Data Editor commands (network and session status, catalog, directory, list, transfer, rename, delete), Query Editor commands (session definition and status, compound management, test management, procedures), and Run module commands (string creation, the relational operations union, intersection, difference, selection, join and projection, and user routines).]
Fig. 4. Command menu of the POLD system.
The lowest layer consists of a network filing system utilized mainly for the transfer of sequential files belonging to any logical-physical level [fig. 3]. The second layer consists of the actual POLD subsystem and is subdivided into three modules:
(1) The Data Editor provides the user with the transfer of data from level 0 (virtual database) to level 1 (selected data relevant to the application).
(2) The Query Editor allows the user to define sessions, and, in particular, to assign the semantics to the data (i.e., to define .Test. files). That is, it is used to transform data of level 1 into data of level 2.
(3) The Run module is devoted to data manipulation and, in particular, to the (extended) relational manipulation of compounds.

The complete POLD command menu furnished to the user is shown in fig. 4.

3.1. The Data Editor
The Data Editor implemented performs the following functions:
- Maintenance of appropriate catalogues of available data. In fact, the raw data are usually stored on off-line memories, and files sometimes span hundreds of tapes.
- Transfer of catalogued files to direct access devices according to the selection/aggregation criteria defined by the user.
- Creation of new files and/or updating of data.
- Definition of the schemata of the selected data.
- Data recovery.

A more detailed description of the Data Editor, or its equivalent module, cannot profitably be given because it strongly depends on the host operating system in use.

3.2. The Query Editor
The Query Editor is thoroughly described in [16]. It permits the user to define sessions and procedures for performing the subsequent relational queries. At this stage, the user must only be aware of the semantics of the data stored in the .Data. file. The Query Editor performs the following functions:
- Maintenance of the session catalogue. The catalogue contains the session and compound definitions grouped by user.
- Creation and deletion of sessions.
- Opening of the current session.
In particular, the Query Editor is subdivided into two submodules [fig. 4]:
(1) The Session manager works on the session catalogue, and manages the definition of all user sessions and compounds. It contains a Filer option that permits the user to construct a session starting from his other sessions, or from sessions owned by other users (provided that the proper authorization exists).
(2) The Procedure manager allows the user to define the classification conditions and the catalogued procedures (the .Test. file) by:
(a) Defining new classifications and procedures. This facility is performed by the Test management block.
(b) Utilizing existing definitions and procedures to easily produce more complex conditions of classification. This facility is performed by the Work Area block.
3.3. The Run Module
The Run module supervises the manipulation of the data contained in the current session. Generally, data manipulation operations at the internal level work on the compound .String. files. The materialization of the .String. file for a given compound can be directly requested by the user by utilizing the command menu. However, if the proper .String. files are not currently present, then the POLD system will automatically materialize them on the last version of the .Data. and .Test. files, before executing the user's relational manipulation involving those compounds. We chose to also furnish the user with an explicit (manual) command because the materialization of a .String. file may be time-consuming. Therefore, it might be useful for the user himself to decide the proper time to (re-)materialize the file. The Run module performs the following functions:
- Creation of the .String. file of a given compound.
- Execution of relational operations on compounds.
- Execution of user's application procedures (e.g., statistics, graphics, reporting), and of final data processing.
These operations are handled by the following submodules, respectively:
(a) String Creation, which requires that the .Data. and .Test. files exist in the current compound.
(b) Relational Operations, which works on completely defined and materialized compounds.
(c) User Routines, which provides a standard interface for the user's applicative programs.

3.4. The POLD Distributed Environment
The main purpose of the POLD project was to design a tool for data analysis in a distributed environment. Therefore, the system has a multinode architecture, each node having the structure illustrated in fig. 3. The catalogued files may be stored in any node of the system network, and each user has access to the whole set of primary files. By using the Data Editor, the local or remote user can copy network files into his own .Data. files. He can also avoid
transferring and duplicating extremely large files by directly using the network files. The user may define sessions with compound files stored on one or more nodes. However, the session preparation will be facilitated if (copies of) the .Schema. and .Test. files are allocated on the local node. Every user's operation is stored sequentially in a command file, called Run Status. For each operation, the Run Status also contains the session and node identifiers, the operation type, the involved compounds, and a status flag. The aim of the Run Status command file is to process asynchronously each single relational or user-defined operation. In addition, it permits the recovery of the last defined operations in the case of system failure, time out, or any other interrupt cause. The prototype system includes a microprocessor version, called the POLT system [17] (see fig. 3). The microcomputer plays the role of an intelligent workstation linked to a master node by a local communication protocol. The intelligent terminal POLT allows the user to perform simple final processing of the retrieved data without engaging the master resources. The POLT system supports the relational operations (join is not implemented in the present version) and simple statistical and graphical functions.
4. The User Classification of Data
The user's data view is performed by creating a set of elementary conditions of classification, that is, by creating the .Test. file. Due to the unforeseeable evolution of the data analysis in scientific/decisional applications, the user classification (and, consequently, the .Test. file) must be easily modifiable at any time. In the POLD system, a test can be a simple condition (or simple test), a multiple condition (or multiple test), or a catalogued procedure. The syntax for the test definition is described in [16]. The syntax of the simple test requires the specification of:
- The test identifier;
- An attribute name;
- A comparison (≤, =, <, >, ≥, ...) or a set membership (∈, ∉) operator;
- An operand. This can be an attribute name, a constant, an aggregate function defined on a set of attributes, a value set, or a value range;
- An optional weight.

The multiple test corresponds semantically to a list of m simple tests. A multiple test permits the user to define multiple conditions in a simple and compact way, avoiding cumbersome sequences of simple tests. The possible syntactic forms for the multiple test are:
(a) Explicit list. The multiple conditions are specified by a list of simple test identifiers.
(b) Value list. The attribute (specified by its name) is compared to each item of an ordered value list (on the basis of a specified operator).
(c) Histogram list. This form represents a compact way of classification and may have several syntactic specifications. The value range of a specified attribute is subdivided into a given number of classes, and the truth value of the test is, for each class, whether or not the attribute value belongs to the class.

The conditions expressed by a multiple test can be mutually exclusive or not. For example, the age of a man can be partitioned into mutually exclusive classes of age. On the contrary, the classification per jobs of an employee cannot be a partition of the job set, if an employee can be currently assigned to more than one job. There are two important reasons for defining multiple tests:
(1) The use of mutually exclusive multiple tests can drastically reduce the response time to queries since they require a minimal amount of storage (the response time depends strictly on the bitstring length).
(2) The possibility of defining many tests at the same time reduces the user's work preliminary to the query process.

The catalogued procedure is defined as a boolean formula of existing simple tests. Therefore, the catalogued procedure can refer to one or many attributes.
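A hedged sketch of the three kinds of test, under assumed in-memory representations (the names `simple_test` and `histogram_tests`, and the example attributes, are illustrative, not POLD syntax):

```python
# Illustrative encodings of simple tests, a histogram-list multiple test,
# and a catalogued procedure as a boolean formula over simple tests.

def simple_test(attr, op, operand):
    """A simple test: attribute, comparison/membership operator, operand."""
    ops = {"=": lambda a, b: a == b, "<": lambda a, b: a < b,
           ">": lambda a, b: a > b, "in": lambda a, b: a in b}
    return lambda row: ops[op](row[attr], operand)

def histogram_tests(attr, bounds):
    """Histogram-list multiple test: the value range of `attr` is split into
    mutually exclusive classes [b0,b1), [b1,b2), ...; one condition per class."""
    return [(lambda lo, hi: lambda row: lo <= row[attr] < hi)(lo, hi)
            for lo, hi in zip(bounds, bounds[1:])]

# A catalogued procedure: a boolean formula of existing simple tests.
t_centre = simple_test("district", "=", "centre")
t_large  = simple_test("surface", ">", 200)
luxury   = lambda row: t_centre(row) and t_large(row)

flat = {"district": "centre", "surface": 248}
surface_classes = histogram_tests("surface", [0, 50, 100, 200, 1000])
print([t(flat) for t in surface_classes], luxury(flat))
# [False, False, False, True] True
```

A mutually exclusive multiple test like `surface_classes` yields exactly one true class per tuple, which is what allows its compact bit encoding.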
The definition of the elementary conditions of classification by the user himself guarantees a higher degree of semantic consistency of data coming from different sources and obtained with different criteria. In fact, the user can define equivalences or similarities between sets of values by adopting appropriate classifications.
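For instance, a sketch of how two sources measured in different units could be made comparable purely through classification, with no data conversion (attribute names, thresholds, and source labels are illustrative assumptions):

```python
# One user concept ("long" means length >= 100 cm) is expressed once per
# source, each in that source's own unit; the resulting bitstrings are
# directly comparable even though the raw values never were.
CM_PER_INCH = 2.54  # 1 cm = 0.3937 in

tests_alpha = [lambda r: r["length_cm"] >= 100.0]                # .Test. for source alpha
tests_beta  = [lambda r: r["length_in"] >= 100.0 / CM_PER_INCH]  # physically equivalent

data_alpha = [{"length_cm": 120.0}, {"length_cm": 80.0}]
data_beta  = [{"length_in": 50.0}, {"length_in": 30.0}]  # i.e. 127 cm, 76.2 cm

strings_alpha = [[int(t(r)) for t in tests_alpha] for r in data_alpha]
strings_beta  = [[int(t(r)) for t in tests_beta] for r in data_beta]
print(strings_alpha, strings_beta)  # [[1], [0]] [[1], [0]]
```

No tuple was rewritten into a common unit; only the classification thresholds were translated, so the .String. files stay union-compatible.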
4.1. Example

With reference to the surface attribute of the FLAT.Data. in fig. 2, all the instanced flats have different surface values. The four tests defined on this attribute (tests #6, #7, #8, and #9) specify only four classes of surface values significant for user investigation. For example, on the basis of test #6, the flats #3 (which has the original primary value of 45 m2), #4 (86 m2), #8 (23 m2), and #10 (100 m2) have been considered to produce equivalent surface information. (This is reflected in the FLAT.String.; the 6th bit has value 1 in the 3rd, 4th, 8th, and 10th bitstrings, and 0 otherwise.) As an example of data integration, we can consider the problem of measure unit conversion. If the user wishes to relate data values expressed in unit a (e.g., centimetres) from source α with data values referring to a homogeneous entity set expressed in unit b (e.g., inches) from source β, then the respective .Data.α and .Data.β are not directly comparable. The integration that reflects the desired classification of the user is obtainable by defining the proper classification in .Test.α (that is, with reference to 'centimetres') and the physically equivalent one (1 cm = 0.3937 inches) in .Test.β. It is to be observed that in this way, no real data conversion is performed (thus saving time and storage), and that the .String. files contain fully comparable and union-compatible data (information expressed by boolean values is independent of the measure unit).

5. The String File

The use of the .String. file in addition to the .Data. file must be carefully discussed, because this is a very important part of the system architecture. Efficiently answering queries in very large databases normally requires the use of several kinds of indexes, for example: B-trees [14,21,28]; lists, multilists, inverted/binary lists, and pointer chains [3,4,15,21,23]; multiattribute combined indexes [26]; precomputed query/procedure answers or view indexes [29,31,32]; and generalized access path structures [18]. However, there are three serious disadvantages in using indexes when processing very large data sets in a distributed environment:
(1) The structure of the indexes is often strongly dependent on the system configuration at the physical level. Therefore, it is very difficult to maintain or to create the index structures for data sets resulting from relational operations when data are transferred from one node to another.
(2) Some relational operations derive from unpredictable query strategies, particularly frequent in decisional applications. In these cases, time-consuming sequential processing procedures may occur.
(3) The virtual database approach, as described in section 2, does not require the maintenance of all the source data on direct access devices; therefore, casual/indexed access to files stored on sequential devices (e.g., tape devices) cannot be used.
To reduce query complexity, as an alternative to partitioning the data set by indexes, we propose to minimize the query response time by:
(a) Reducing the cardinality of data sets by appropriate selections at the session definition stage;
(b) Reducing the degree of the relations by appropriate projections, at the previous stage;
(c) Pre-defining all the elementary conditions of classification foreseen by the user at a given time for each relation, and applying and storing them in bitstring form.

In many real cases, it is possible to obtain a high compression factor of the original data by using these procedures. Even if operations (a) and (b) can be carried out in many relational systems by using indexes, the time required to create sets of indexes for a given user can be lengthy. On the contrary, the materialization of the bitstring corresponding to a given set of conditions for each tuple is a simple, often linear operation. Only one bit is required in the bitstring of the .String. file for each simple test. The multiple test, instead, requires round(log2(m + 1)) or m bits, according to whether the equivalent m simple conditions are mutually exclusive or not. A catalogued procedure can be in 'permanent' or 'stored' mode. A permanent catalogued procedure is recorded in executable form (e.g., see the matrix form in fig. 2). In this case, no additional storage is required in the .String. file. On the other hand, a stored catalogued procedure requires one bit to be allocated in the bitstring. Consequently, queries involving stored catalogued procedures are executed very fast. The time when the .String. file for given .Data. and .Test. files is materialized is very important in order to guarantee the semantic integrity of the session at query time. In fact, with time, the user may update both the .Data. and .Test. files. Consequently, it is important that the (re-)materialization of the .String. file be subsequent to the last modification of the .Data. or .Test. files.
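The storage cost of a multiple test can be sketched as follows, reading 'round(log2(m + 1))' as the ceiling (our assumption: the smallest number of bits able to encode the class index plus a no-class state):

```python
import math

def multiple_test_bits(m, mutually_exclusive):
    """Bits needed in the bitstring for a multiple test of m conditions:
    ceil(log2(m + 1)) when the conditions are mutually exclusive (a class
    index plus a 'no class' state), otherwise one bit per condition."""
    return math.ceil(math.log2(m + 1)) if mutually_exclusive else m

print(multiple_test_bits(10, True))   # 4
print(multiple_test_bits(10, False))  # 10
```

The gap grows quickly: ten mutually exclusive age classes cost four bits instead of ten, which directly shortens the bitstrings the query processor must scan.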
6. The Relational Operations
The basic data manipulation which is furnished to POLD users consists of the union, difference, intersection, selection, join, and projection operations. These are extensions of the corresponding relational operations [9,10]. As in relational algebra [11], the data manipulation is performed under the direct control of the user in order to avoid overloads in the network and to control semantic inconsistencies. To properly understand the extended relational operations, it is opportune to view the POLD relation as mainly constituted by the two following parts:
(a) The informative part, consisting of a set of tuples of certain property values (the .Data. file);
(b) The descriptive part, consisting of a set of boolean values of simple tests, multiple tests, or (stored) catalogued procedures (the .String. file).
There exists a 1-1 correspondence between the .Data. tuples and the .String. tuples in the compound. Although the information contained in the .String. file might sometimes be considered redundant with respect to that in the .Data. file, the .String. file has the following advantages:
- The .String. file represents the information in a more compact way than the .Data. file because only the properties relevant to the user are coded (in binary format);
- The .String. file is the classification made by the user. Therefore, the user can maintain better control of semantic ambiguities;
- The .String. file is internally processed as a sequence of bits. (The representation as a binary sequence is machine independent.)

The POLD system performs the relational operations on the basis of the descriptive part of a relation (i.e., on the basis of the .String. files).
6.1. The Semantics of the Relational Operations

Let X = (X.Data., X.Schema., X.Test., X.String.) be a compound. In what follows, we denote the generic tuple of X.Data. by Dx, and the bitstring of X.String. corresponding to Dx by Sx.
6.1.1. Union, Difference, and Intersection

Let A and B be two compounds such that
A.Schema. = B.Schema.
A.Test. = B.Test.
The compound resulting from the operations of union, difference, and intersection is the compound C whose .Schema. and .Test. files are so defined:
C.Schema. = A.Schema.
C.Test. = A.Test.
In order to specify the resulting C.String. and C.Data. files we introduce the following sets:
DA = {Da | ∄ Db : Sa = Sb}
SA = {Sa | ∄ Sb : Sa = Sb}

UNION. In the compound C = A ∪ B it results that:
C.String. = SA ∪ B.String.
C.Data. = DA ∪ B.Data.

DIFFERENCE. In the compound C = A − B it results that:
C.String. = SA
C.Data. = DA.

INTERSECTION. The compound C = A ∩ B is defined by:
C = A − (A − B).
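A minimal sketch of these definitions, assuming a compound is represented as a list of (data tuple, bitstring) pairs over identical .Schema. and .Test. files (the representation is ours, not POLD's):

```python
# Tuples are compared through their bitstrings, as in the definitions above:
# A - B keeps the pairs of A whose bitstring has no equal bitstring in B.

def difference(a, b):
    b_strings = {tuple(s) for _, s in b}
    return [(d, s) for d, s in a if tuple(s) not in b_strings]

def union(a, b):
    # (A - B) plus B: SA ∪ B.String., DA ∪ B.Data.
    return difference(a, b) + b

def intersection(a, b):
    # A ∩ B = A - (A - B)
    return difference(a, difference(a, b))

A = [("t1", [1, 0]), ("t2", [0, 1])]
B = [("u1", [0, 1])]
print(difference(A, B))    # [('t1', [1, 0])]
print(intersection(A, B))  # [('t2', [0, 1])]
```

Note that equality is decided on the classification bits only: two tuples with different raw values but the same user classification are treated as the same piece of information.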
6.1.2. Selection and Join

Let tX be a subset of tests in X.Test., β(tX) be a catalogued procedure on tX, and StX be the sub-string of SX corresponding to tX (we assume that β(tX) = β(StX)).

SELECTION. The selection of the compound A with respect to the condition β(tA) is the compound C defined by:
C.Schema. = A.Schema.
C.Test. = A.Test.
C.String. = {Sa | β(StA) = .true.}
C.Data. = {Da | Sa ∈ C.String.}

JOIN. The join of the two compounds A and B with respect to the join condition β(tA, tB) is the compound C such that:
C.Schema. = A.Schema. · B.Schema.
C.Test. = A.Test. ∪ B.Test.
C.String. = {Sa · Sb | β(StA, StB) = .true.}
C.Data. = {Da · Db | Sa · Sb ∈ C.String.},
where '·' represents the usual concatenation of tuples. In particular, the equijoin condition βE defined by
βE(tA, tB) = .true. ⇔ StA = StB
corresponds to the extension of the usual equijoin.

6.1.3. Projection

Let lX be a list of attributes in X.Schema., tX be the set of tests in X.Test. referring only to the attributes in lX, StX be the sub-string of SX corresponding to tX, and dX be the usual projection of DX on lX. The projection of the compound A on the attribute list lA is the compound C defined as follows:
C.Schema. = A.Schema. reduced to lA
C.Test. = A.Test. reduced to tA
C.String. = {StA}
C.Data. = {dA}
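Selection and equijoin can be sketched in the same assumed (data tuple, bitstring) representation, with β as any boolean predicate on a bitstring (names are illustrative):

```python
# Selection keeps the pairs whose bitstring satisfies the catalogued
# procedure; the .Data. tuples are never inspected.

def select(compound, beta):
    return [(d, s) for d, s in compound if beta(s)]

def equijoin(a, b):
    """Concatenate the pairs whose bitstrings are equal (StA = StB),
    joining both the data tuples and the bitstrings."""
    return [(da + db, sa + sb)
            for da, sa in a for db, sb in b if sa == sb]

A = [(("a1",), [1, 0]), (("a2",), [0, 1])]
B = [(("b1",), [0, 1])]
print(select(A, lambda s: s[0] == 1))  # [(('a1',), [1, 0])]
print(equijoin(A, B))                  # [(('a2', 'b1'), [0, 1, 0, 1])]
```

For simplicity this equijoin compares the whole bitstrings; in the definitions above only the sub-strings StA and StB corresponding to the chosen tests take part in the condition.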
6.2. Example of Relational Operations The first example considered refers to a selection operation on the FLAT compound in fig. 2. Suppose we are interested in selecting the 'luxury flats'. We may define the concept of 'luxury flat' in terms of a boolean form of the existing simple and multiple tests in the FLAT.Test., that is, by
means of a catalogued procedure cp. For example: cp Luxury-flat: (district = centre) and
(surface > 200) and (bathroom no. > 2) and (heating system ≠ none);
which is equivalent, in terms of tests, to:
#2 ∧ #8 ∧ (#30 ∨ #31) ∧ ¬#36.
The Luxury-flat condition may be defined as permanent or stored. If the permanent mode is chosen, then the query condition is recorded in the form of a query matrix. Otherwise, i.e., if the stored mode is chosen, the Luxury-flat condition is simply added to FLAT.Test. (e.g., as test #37), and the FLAT.String. has to be updated. The query matrix is obtained from the minimal canonical disjunctive form of the query condition, each row of the matrix corresponding uniquely to an OR minterm. The values in the row are defined as follows: the ith value is '1' if the ith test is requested 'true' in the OR term, '0' if it is requested 'false', or '*' if it is requested 'don't care' (that is, if the ith test is not involved by the OR term). In other words, the '*' value appears for all classification tests that are not relevant to the user's luxury flat concept. The query matrix form of the Luxury-flat catalogued procedure is shown in fig. 2. The selection procedure uses a fast algorithm, based on a three-valued (0, 1, *) logic [1,7]. The algorithm applies the query matrix, suitably coded in binary, to each tuple in FLAT.String., and produces the truth value without having to access the FLAT.Data. tuple. (In the example in fig. 2, only the 7th bitstring, and therefore the flat #7, satisfies the selection condition.) The .Data. file of the resulting compound will contain exactly the tuples of FLAT.Data. that correspond to the tuples selected in FLAT.String.. The selection operation can also be involved when applying statistical functions to compounds. For example, the query 'count flats where Luxury-flat' requires an internal selection operation on the FLAT compound. In these cases, the user may be interested in knowing only the count result and not in the actual compound resulting
C. dell'Aquila et al. / Architecture of a relational DSS
Fig. 5. The compound of flats for sale (FLAT.Data. and FLAT.String.; graphic lost in text extraction).
from the selection. Therefore, the POLD user is allowed to specify explicitly whether or not the resulting compound has to be actualized.
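The three-valued selection just described can be sketched in a few lines. This is a minimal reconstruction of the idea, not the actual POLD code, and all names are ours; test values are held as strings of '0'/'1', and query-matrix rows as strings over '0', '1', '*':

```python
# Hypothetical sketch of query-matrix selection over .String. bitstrings.
# A row of the query matrix is one OR minterm: '1' = test must be true,
# '0' = test must be false, '*' = don't care.

def row_matches(row: str, bits: str) -> bool:
    """True iff every non-'*' position of the row equals the tuple's bit."""
    return all(r == '*' or r == b for r, b in zip(row, bits))

def select(query_matrix: list[str], strings: list[str]) -> list[int]:
    """Indices of tuples whose bitstring satisfies some OR minterm,
    computed without ever touching the corresponding .Data. tuples."""
    return [i for i, bits in enumerate(strings)
            if any(row_matches(row, bits) for row in query_matrix)]

# Toy example with 4 tests: condition t0 AND t1 AND (t2 OR t3),
# i.e. two OR minterms in minimal disjunctive form.
matrix = ["111*", "11*1"]
strings = ["1110", "1000", "1101", "0111"]
print(select(matrix, strings))  # -> [0, 2]
```

Only the matching row indices are returned; materializing the corresponding .Data. tuples is the separate "actualization" step mentioned above.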
Fig. 6. The compound of possible buyers of flats (CLIENT.Data. lists the available funds: a 160, b 145, c 70, d 40, e 65, f 85; CLIENT.String. lost in text extraction).
Now, the join operation will be considered. Suppose that the proprietors of the flats in the example have instructed a real estate agency to sell their flats. By a suitable preliminary manipulation of (a copy of) the Flat compound, the agency classifies the flats into the following orthogonal (i.e., mutually exclusive) categories: de luxe flats, first-class flats, second-class flats, and economic ones, as shown in fig. 5. The actual criteria adopted by the agency are not relevant to the comprehension of the example, and therefore are not listed. Flats 1 and 7 have been classified as de luxe, flats 2, 5, and 6 as first class, flats 4 and 9 as second class, and the others as economic. Moreover, the agency has at its disposal data referring to potential buyers; the data include the code of the client and his available funds. In order to propose a selected list of adequate flats to each client, the agency states the following classification of the available funds (shown in the Client compound in fig. 6):
- Very High Availability (V.H.A.) is defined as available funds >= 140 (therefore, clients 'a' and 'b' have V.H.A.);
- High Availability (H.A.) is defined as available funds in [80, 150] (therefore, client 'b' is classified both V.H.A. and H.A.);
- Mean Availability (M.A.) is defined as available funds in [50, 90] (therefore, clients 'c', 'e', and 'f' have M.A.; note that client 'f', with funds 85, is also H.A.); and, finally,
- Economic Availability (E.A.) is defined as available funds in [30, 50] (client 'd' has E.A.).
Defining the join of compounds Client and Flat in order to associate each client with the flats he could buy, the agency has to establish the proper semantic correspondence between type of flat and type of client availability. If the association based on the strict semantic correspondence given in table 1 is adopted, then the corresponding join condition (a boolean form of the tests given in figs. 5 and 6) is: (f1 ∧ c1) ∨ (f2 ∧ c2) ∨ (f3 ∧ c3) ∨ (f4 ∧ c4).
With reference to only the client and flat codes, the resulting sale proposals are those reported in table 2. On the other hand, if an enlarged semantics of association of flats to clients is adopted, or if the former fails to give the desired results, then table 1
Table 1
Strict semantic correspondence between type of flat and fund availability.

  Flat category    Fund availability
  de luxe          V.H.A.
  1st class        H.A.
  2nd class        M.A.
  economy          E.A.

Table 2
Sale proposals of flats under the semantics in table 1.

  Client code    List of proposable flats
  a              (1, 7)
  b              (1, 2, 5, 6, 7)
  c              (4, 9)
  d              (3, 8, 10)
  e              (4, 9)
  f              (2, 4, 5, 6, 9)
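The strict-semantics join of the running example can be sketched as follows; the data are taken from figs. 5 and 6 and table 1, while the function and variable names are ours, not POLD's:

```python
# Sketch of the semantic join of Client and Flat under the strict
# correspondence of table 1 (flat classifications and client funds
# come from the running example).

flats = {1: "de luxe", 7: "de luxe", 2: "1st", 5: "1st", 6: "1st",
         4: "2nd", 9: "2nd", 3: "eco", 8: "eco", 10: "eco"}
funds = {"a": 160, "b": 145, "c": 70, "d": 40, "e": 65, "f": 85}

# Non-orthogonal fund-availability classes: a client may fall in several.
classes = {"V.H.A.": lambda x: x >= 140,
           "H.A.":   lambda x: 80 <= x <= 150,
           "M.A.":   lambda x: 50 <= x <= 90,
           "E.A.":   lambda x: 30 <= x <= 50}

# Strict correspondence of table 1: one availability class per category.
strict = {"V.H.A.": "de luxe", "H.A.": "1st", "M.A.": "2nd", "E.A.": "eco"}

def proposals(client: str) -> list[int]:
    """Flat codes proposable to a client under the strict semantics."""
    cats = {strict[c] for c, test in classes.items() if test(funds[client])}
    return sorted(f for f, cat in flats.items() if cat in cats)

print(proposals("f"))  # -> [2, 4, 5, 6, 9]
```

Running `proposals` over all clients reproduces the lists of table 2; swapping `strict` for a one-to-many mapping gives the enlarged semantics discussed next.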
must be properly extended. Table 3 shows a possible extension; it consists of relating each fund availability class to its immediate predecessor and successor.

Table 3
Extended semantic version of table 1.

  Flat category    Fund availability
  de luxe          V.H.A., H.A.
  1st class        V.H.A., H.A., M.A.
  2nd class        H.A., M.A., E.A.
  economy          M.A., E.A.

On the basis of table 3, the join condition (f1 ∧ (c1 ∨ c2)) ∨ (f2 ∧ (c1 ∨ c2 ∨ c3)) ∨ (f3 ∧ (c2 ∨ c3 ∨ c4)) ∨ (f4 ∧ (c3 ∨ c4)) can be derived. The resulting sale proposals are listed in table 4.

Table 4
Sale proposals of flats under the semantics in table 3.

  Client code    List of proposable flats
  a              (1, 2, 5, 6, 7)
  b              (1, 2, 4, 5, 6, 7, 9)
  c              (2, 3, 4, 5, 6, 8, 9, 10)
  d              (3, 4, 8, 9, 10)
  e              (2, 3, 4, 5, 6, 8, 9, 10)
  f              (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

It is interesting to notice that the semantics in table 1, or the corresponding join condition, expresses a nearly equi-join. The usual equi-join would be produced if the agency adopted a classification of the fund availability into orthogonal categories.

6.3. Performance Evaluation

We discuss the performance evaluation with regard to the selection operation on a very large data set, since selection is the most frequently used basic operation. Let us suppose that the .Data. file in the current compound has N tuples, each one stored in n bytes. We also suppose that the user has defined K_s simple conditions, K_m multiple tests, each one consisting of m_i orthogonal conditions (i = 1, 2, ..., K_m), and K_p catalogued stored procedures. Therefore, each tuple in the .String. file requires
S_b = K_s + Σ_{i=1..K_m} round(log2(m_i + 1)) + K_p

bits, or, equivalently, S_B = round(S_b / N_b) bytes, where 1 byte = N_b bits. The compression factor obtained by the .String. file with respect to the .Data. file is n / S_B. If we denote by C_a the track capacity in bytes of the disk, then each track can contain N_s = trunc(C_a / S_B) bitstrings, and the .String. file will require round(N / N_s) tracks. Since the .String. file is sequentially processed, we must consider the minimum seek time. Moreover, no seek operation is required to read all tracks in the same cylinder. Therefore, the mean seek time is negligible. If R denotes the disk revolution time, an entire track is transmitted in the mean time 1.5 × R. Therefore, each bitstring is transmitted in the mean time t_s = 1.5 × R / N_s.
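These size and timing formulas can be checked with a short sketch; the function names are ours, and round() is interpreted as rounding up for the byte count, which is what the worked example below implies:

```python
# Sketch of the .String. size and transmission-time formulas.
import math

def bitstring_bytes(Ks: int, ms: list[int], Kp: int, Nb: int = 8):
    """S_b = Ks + sum_i round(log2(m_i + 1)) + Kp bits, packed into bytes."""
    Sb = Ks + sum(round(math.log2(m + 1)) for m in ms) + Kp
    return Sb, math.ceil(Sb / Nb)

def transfer_time_ms(SB: int, Ca: int = 2500, R: float = 25.0) -> float:
    """Mean per-bitstring transmission time t_s = 1.5 * R / N_s,
    with N_s = trunc(Ca / SB) bitstrings per track."""
    Ns = Ca // SB
    return 1.5 * R / Ns

Sb, SB = bitstring_bytes(Ks=100, ms=[200] * 10, Kp=20)
print(Sb, SB, transfer_time_ms(SB))  # -> 200 25 0.375
```

With the paper's parameters this reproduces S_b = 200 bits, S_B = 25 bytes, and t_s = 0.375 msec.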
R/Ns. Example. We suppose that K s = 100, K m = 10, and m i -- 200, for each i = 1, 2 . . . . . Kin, Kp = 20, C a = 2500 bytes, and R = 25 msec. Then, assuming N b = 8, it results that S b = 200 bits, that is, S B = 25 bytes. Thus, each bitstring is
transmitted in the mean time t_s = 0.375 msec, and a transmission rate (1/t_s) of 10^3-10^4 strings per second can be reached.
The selection procedure consists in evaluating a boolean form of the test values in the bitstring. The problem is to evaluate the boolean form in a time comparable to the transmission time t_s. The algorithm used is based on a three-valued logic selection method [1,7], extensively tested in high-statistics physics experiments. The boolean form evaluation is reduced to a boolean matrix operation on the bitstring, where the boolean matrix is obtained directly from the boolean form. The time t_B required for testing one bitstring has the upper limit [1]:

t_B < round(S_b / w) × (t_AND + t_COMP) × N_OR,

where w is the capacity in bits of the machine word, t_AND and t_COMP are the times to execute the AND and the COMPARE logical instructions between words, respectively, and N_OR is the number of OR minterms in the minimal disjunctive form of the selection condition. With reference to the previous example, if we assume w = 32, t_AND = t_COMP = 10 µsec, and N_OR = 3, then we have t_B < round(200/32) × (10 + 10) µsec × 3 = 7 × 20 µsec × 3 = 0.42 msec; that is, the processing rate is of the order of magnitude of 10^3-10^4 bitstrings per second, comparable to the transmission rate. With adequate buffering techniques, the transmission and selection times can be partially overlapped, thus keeping the total processing time of the same order of magnitude as above.
Since the computing time derives strictly from the sequential processing, the ordering and any other physical allocation parameters of the bitstrings are not relevant. Consequently, in a multiprocessor environment, the parallel processing of a suitable partition of the bitstring set presents no problems, and improves the performance noticeably.
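The word-wise evaluation behind the t_B bound can be sketched as follows. The (mask, value) encoding is our reconstruction of the three-valued method of [1,7]: a bitstring b satisfies an OR minterm iff (b AND mask) = value, i.e., one AND and one COMPARE per machine word per minterm:

```python
# Sketch of word-wise three-valued evaluation: each OR minterm over
# '0'/'1'/'*' is compiled to a (mask, value) pair of machine words.

def compile_minterm(row: str) -> tuple[int, int]:
    """'1'/'0'/'*' row -> (mask, value); MSB corresponds to the first test."""
    mask = int(''.join('0' if c == '*' else '1' for c in row), 2)
    value = int(row.replace('*', '0'), 2)
    return mask, value

def satisfies(b: int, minterms: list[tuple[int, int]]) -> bool:
    """One AND plus one COMPARE per minterm, on whole words."""
    return any((b & mask) == value for mask, value in minterms)

q = [compile_minterm("111*"), compile_minterm("11*1")]
print(satisfies(0b1110, q), satisfies(0b1000, q))  # -> True False
```

For bitstrings longer than one machine word, the same AND/COMPARE step is simply repeated per word, which is exactly what the round(S_b / w) factor in the bound counts.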
7. Conclusions

The POLD system has been used in a few real cases (nuclear physics experiments, regional planning, meteorological and environmental data management). The experiments carried out have shown that the POLD structure can satisfy most of the requirements of a scientific user in a distributed environment. On the other hand, we wish to point out that the system was not designed to fulfill all the functional requirements of a distributed database system. For example, concurrent updating, and, consequently, lock and unlock techniques, have not been analyzed. Furthermore, the optimization of distributed transactions is left to the user.
In the system design we favoured efficacy criteria rather than efficiency criteria, on the basis of the following considerations:
(1) In decision support information systems, particularly if distributed, it is very important to provide the user with feasible strategies to answer complex queries, while also furnishing reasonable response times to common queries. The sequential scanning method provided by the system permits a response time which is acceptable and competitive in comparison with other methods. However, we deem that the response time to particular queries in scientific/decision support environments is not the most important need of the user (obviously, within certain limits).
(2) More sophisticated data architectures may require time and money for software maintenance, and additional time for recovery and restart procedures. Moreover, they are often unable to furnish acceptable reliability. On the contrary, POLD uses a serial organization of files (inverted lists, multilists, tree indexes, and so on, are not used). This lowers the performance only in some cases (for example, when queries qualify only a few tuples, which is, however, not frequent in statistical, scientific and decisional applications). The experimentally obtained selection rate of 10^3-10^4 elements per second is acceptable for the majority of research applications.

References
[1] N. Armenise, G. Zito, A. Silvestri, E. Lefons, M.T. Pazienza, and F. Tangorra, POL: an interactive system to analyze large data sets, Computer Physics Communications 16, Nr. 2 (1979) 147-157.
[2] M.M. Astrahan, M.W. Blasgen, D.D. Chamberlin, K.P. Eswaran, J.N. Gray, P.P. Griffiths, W.F. King, R.A. Lorie, P.R. McJones, J.W. Mehl, G.R. Putzolu, I.L. Traiger, B.W. Wade, and V. Watson, System R: relational approach to database management, ACM Transactions on Database Systems 1, Nr. 2 (1976) 97-137.
[3] A.T. Berztiss, Data Structures: Theory and Practice (Academic Press, New York, 1975).
[4] M.W. Blasgen and K.P. Eswaran, Storage and access in relational data bases, IBM Systems Journal 16, Nr. 4 (1977) 363-377.
[5] R.H. Bonczek, C.W. Holsapple, and A.B. Whinston, Foundations of Decision Support Systems (Academic Press, New York, 1981).
[6] M.L. Brodie, On the development of data models, in: M.L. Brodie, J. Mylopoulos, and J.W. Schmidt, Eds., On Conceptual Modelling (Springer-Verlag, New York, NY, 1984) 19-47.
[7] V. Capasso, A. Circella, and A. Silvestri, Una logica a tre valori per il calcolo del valore di verità di funzioni booleane complesse, CSATA Report, Bari (1974).
[8] D.D. Chamberlin, M.M. Astrahan, M.W. Blasgen, J.N. Gray, W.F. King, B.G. Lindsay, R.A. Lorie, J.W. Mehl, T.G. Price, F. Putzolu, P.G. Selinger, M. Schkolnick, D.R. Slutz, I.L. Traiger, B.W. Wade, and R.A. Yost, A history and evaluation of System R, Communications of the ACM 24, Nr. 10 (1981) 632-646.
[9] E.F. Codd, A relational model of data for large shared data banks, Communications of the ACM 13, Nr. 6 (1970) 377-387.
[10] E.F. Codd, Further normalization of the data base relational model, in: R. Rustin, Ed., Data Base Systems (Prentice-Hall, Englewood Cliffs, NJ, 1971) 33-64.
[11] E.F. Codd, Relational completeness of data base sublanguages, in: R. Rustin, Ed., Data Base Systems (Prentice-Hall, Englewood Cliffs, NJ, 1971) 65-78.
[12] L. Colazzo, C. dell'Aquila, E. Lefons, A. Silvestri, and F. Tangorra, POLD: un sistema relazionale distribuito di supporto alle decisioni, Rivista di Informatica XVII, Nr. 1 (1987) 17-37.
[13] L. Colazzo and E. Lefons, Analytic data base modelling, Proceedings of the IV International Conference on Statistical and Scientific Database Management (1988) Vol. 2, 101-126.
[14] D. Comer, The ubiquitous B-tree, ACM Computing Surveys 11, Nr. 2 (1979) 121-138.
[15] C.J. Date, An Introduction to Data Base Systems (Addison-Wesley, Reading, MA, 1981).
[16] C. dell'Aquila, E. Lefons, A. Silvestri, and F. Tangorra, POLD/2: specifiche di utilizzo del Query Editor per la creazione delle sessioni e delle procedure di interrogazione, CNR/PFI Report, DATANET 18 (1983).
[17] V. Di Gesù, A. Machì, and A. Alfano, Il sistema POLD per l'analisi di dati distribuiti: specifiche di utente del terminale intelligente ITERM, CNR/PFI Report, DATANET 16 (1983).
[18] T. Haerder, Implementing a generalized access path structure for a relational database system, ACM Transactions on Database Systems 3, Nr. 3 (1978) 285-298.
[19] G.D. Held, M.R. Stonebraker, and E. Wong, INGRES: a relational data base system, Proceedings NCC 44 (1975) 409-416.
[20] W. Kent, Data and Reality (North-Holland, Amsterdam, 1981).
[21] D.E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching (Addison-Wesley, Reading, MA, 1973).
[22] IEEE, Special section on statistical/scientific database management, IEEE Transactions on Software Engineering 11, Nr. 10 (1985) 1038-1091.
[23] D. Lefkovitz, File Structures for On-line Systems (Spartan Books, New York, 1969).
[24] E. Lefons and A. Silvestri, The use of multidatabase in decision support systems, in: F.A. Schreiber and W. Litwin, Eds., Distributed Data Sharing Systems (North-Holland, Amsterdam, 1985) 25-41.
[25] E. Lefons, Modello di oggetti funzionali per basi di dati analitiche, Rivista di Informatica XVIII, Nr. 3 (1988) 305-339.
[26] V.Y. Lum, Multi-attribute retrieval with combined indexes, Communications of the ACM 13, Nr. 11 (1970) 660-665.
[27] L.B. Methlie, Data management for decision support systems, Data Base 12, Nr. 1-2 (1980) 40-46.
[28] S. Rao Kosaraju, Insertions and deletions in one-sided height-balanced trees, Communications of the ACM 21, Nr. 3 (1978) 226-227.
[29] N. Roussopoulos, View indexing in relational databases, ACM Transactions on Database Systems 7, Nr. 2 (1982) 258-290.
[30] J.B. Rothnie Jr., P.A. Bernstein, S. Fox, N. Goodman, M. Hammer, T.A. Landers, C. Reeve, D.W. Shipman, and E. Wong, Introduction to a system for distributed databases (SDD-1), ACM Transactions on Database Systems 5, Nr. 1 (1980) 1-17.
[31] L.A. Rowe and M.R. Stonebraker, The POSTGRES data model, Proceedings of the 13th VLDB (1987) 83-96.
[32] T.K. Sellis, Intelligent caching and indexing techniques for relational database systems, Information Systems 13, Nr. 2 (1988) 175-185.
[33] A. Shoshani, Statistical databases: characteristics, problems and some solutions, Proceedings of the 8th VLDB (1982) 208-222.
[34] A. Shoshani, F. Olken, and H.K.T. Wong, Characteristics of scientific databases, Proceedings of the 10th VLDB (1984) 147-160.
[35] A. Silvestri, Scientific database modelling, Memorie Società Astrofisica Italiana 56, Nr. 2-3 (1985) 491-525.
[36] R.H. Sprague Jr. and E.D. Carlson, Building Effective Decision Support Systems (Prentice-Hall, Englewood Cliffs, NJ, 1982).
[37] M.R. Stonebraker and E.J. Neuhold, A distributed data base version of INGRES, Proceedings of the 2nd Berkeley Workshop on Distributed Data Management and Computer Networks, LBL (1977) 19-36.
[38] R. Williams et al., R*: an overview of the architecture, in: P. Scheuermann, Ed., Proceedings of the 2nd International Conference on Databases: Improving Database Usability and Responsiveness (Academic Press, New York, 1982) 1-27.