Framework for query optimization in distributed statistical databases

Framework for query optimization in distributed statistical databases

Framework for query optimization in distributed statistical databases M H Sadreddini, D A Bell and S McClean Recently, there has been a growing inter...

2MB Sizes 11 Downloads 123 Views

Framework for query optimization in distributed statistical databases M H Sadreddini, D A Bell and S McClean

Recently, there has been a growing interest in statistical database (SDB) research. When SDBs are dispersed among computing facilities at various sites (e.g., in health-care networks) an additional dimension is added to the already difficult problems faced by the SDB designer. A distributed statistical database management system ( DS-DBMS) consists o f micro data (i.e., raw data) and macro data (i.e., aggregated objects called summary tables), which can be considered essentially as aggregated views o f the raw data in a special format. The first part o f the paper gives an overview o f a model for the representation o f both raw data (micro data) and summary tables (macro data). The model is an extension of the relational model (so that existing distributed database systems can be exploited). Most o f the first part is devoted to defining operations on macro data sets. Based on these operations, a set o f equivalent relational operations is described, as one of the main objectives in defining the micro and macro data sets, and the operations on them, has been to use as much as possible the capabilities that are already offered by most relational DBMSs. The second part o f the paper deals with one o f the important aspects o f performance in a DS-DBMS, namely, the efficient processing o f queries. This is heavily influenced by the performance o f query optimizers. However, to provide query optimization in a DS-DBMS, special issues are raised that manifest themselves in different scenarios. Some of the important issues and problems raised are discussed and solutions proposed. In addition, a set of transformations on macro operations (similar to those in relational algebra) are introduced, which can be used for optimizing queries in DS-DBMSs. databases, statistical databases, distributed statistical databases, database management systems, query optimization

Recently there has been a growing interest in statistical database (SDB) research Hr. This is partly due to the inadequacy of database management systems (DBMSs) to support statistical applications and partly due to the inadequacy o f statistical packages to support data-management facilities. SDBs can be divided into two groups: micro SDBs and macro SDBs. Micro SDBs handle information about single events or individual entities, whereas macro SDBs contain mainly summary data of micro SDBs - - t h a t i s , the precomputed results of statistical operations such as SUM, C O U N T , and AVERAGE.

Department of Information Systems, Faculty of Informatics, University of Ulster, Newtownabbey,Co. Antrim BT370QB, N. Ireland Vol 34 No 6 June 1992

SDBs are widely used by a variety of users, such as scientists, economists, and politicians, in various application areas, such as health care, census data evaluation, and management decision making. When SDBs are dispersed among computing facilities at various sites (e.g., in health-care networks), however, an additional dimension is added to the already difficult problems faced by the SDB designer. A distributed statistical database management system (DS-DBMS) can be defined as a system that provides a means of modelling, storing, generating, and manipulating SDBs, which are distributed among different computing facilities on different sites. A DS-DBMS consists of micro data (i.e., raw data) and macro data (i.e., aggregated objects called summary tables), which can be considered essentially as aggregated views of the raw data in a special format. The architectural issues for DS-DBMSs and alternative architectural proposals for providing statistical analysis capability in a heterogeneous distributed environment have been discussed in other papers ~7,18. A configuration for a DS-DBMS has been proposed jg, to answer global statistical queries in situations where the statistical global query requires information from a number of local sites. The fact that data are distributed and heterogeneous can lead to special problems when conducting statistical analyses that are not encountered in centralized SDBs. Efficient methods of modelling, representing, and selecting of summary information for materialization, and alternative approaches to answering statistical queries, are interesting questions for researchers in this area and have a strong impact on performance and ease of use. The problems of distributed data have received substantial attention in the area of distributed database management systems (DDBMSs). A number of research projects has been undertaken in recent years to study and prototype heterogeneous DDBMSs, among them Multibase 2°, MRDSM2% and Multistar 22. The problems associated with integration of summary tables have been addressedS,23-30. The first part of the paper gives a summary of a model called the M I M A D model (which is an extension of the relational model) for representing both micro and macro data. (The detail of this model is described in another paper.) This model allows the use of existing DDBMSs' facilities (in particular the authors' own prototypes 3~) with macro data. A set of operations on macro data and

0950-5849/92/060363-15 © 1992 Butterworth-Heinemann Ltd

363

Framework for query optimization in distributed statistical databases

Table 1. Statistical operations and type of summary information required Statistical procedures

Count Average Standard deviation Range Median SIQR Correlation Regression T-test ANOVA Tables

Summary information N

X

S

,/ / •/

/ ,/

¢.

min

max

/

/

×2

t(sum)

t(N,X,S)

VCM

Raw data

,/ ,/ ,/ / / / ¢,

¢, ,/

Nonparametric Multivariate

/

,/ /

N = sample size; X = sample mean (average); S = standard deviation; t(sum) = contingency table; VCM = variance-covariance matrix; t(N,X,S) = tables containing sample size, sample mean, and standard deviation of different classes; SIQR = semi-interquartile range; ANOVA = analysis of variance. the relational equivalent o f these operators is then defined. One o f the main objectives in defining the micro data sets, the macro data sets, and the operations in the M I M A D model has been to use as much as possible the capabilities that are already offered by most relational DBMSs. Thus most of the operations defined in the M I M A D model can be supported by m a n y relational DBMSs. Efficient query processing in relational systems is heavily influenced by the performance of optimizers 32-34. The representation of macro data as well as micro data is basically relational, so it would be nice to use conventional relational optimizers to expedite statistical query processing. It is possible of course to take advantage of developments in conventional query optimization. However, there are enough differences to encourage a search for specialized distributed optimization o f statistical queries, where the use of both micro and macro data sets has to be considered. However, to provide query optimization in a D S - D B M S environment, special additional issues are raised, which are not addressed by conventional query optimizers. The second part of this paper discusses some of the important issues and problems that are raised by query optimizers in a D S - D B M S and proposes some solutions. A set o f transformations on macro operations (similar to those applied to relational algebra) is introduced, which can be used for optimizing queries in DS-DBMSs. The authors do not claim to have solved all problems in this area. They have, however, made some observations, clarified some issues, and contributed in some specific areas. The next section gives a summary of the constructs in the M I M A D model and some definitions. A set of operations on macro data sets is then defined. After that, the relational equivalents of these operations are described. A list of constraints in the M I M A D model is given, and the final section discusses some of the important issues 364

and problems that are raised by query optimization in a D S - D B M S and proposes some heuristics to solve them.

DATA MODELLING REPRESENTATION

AND

In statistical data analysis, the basic element for data analysis, after data have been collected, is a data matrix that contains observations on some variables. Although the data matrix forms the raw material of statistical analysis (e.g., 'population census of a state'), in m a n y situations the data matrix is never published, partly due to its size and partly for reasons of confidentiality. Thus it is often necessary as a first step to condense the detailed information in the data matrix to produce summary statistics before proceeding further. To reduce the size of the data matrix, summaries of various kinds can be produced. Table 1 summarizes a number of data-analysis techniques and the type of summaries that they require. Here a data model called the M i c r o / M A c r o Data ( M I M A D ) model is proposed, which is an extension of the relational model. The M I M A D model is based on two main types of objects at two different levels of abstraction. At the lower level of abstraction, Micro Objects (MIOBs) describe the elementary objects that are the individual entities. At the next level of abstraction, Macro objects (MAOBs) represent the result of some statistical/mathematical operations on the MIOBs. These micro and macro objects have relational representations called micro and macro relations, which represent data matrices and summary statistics of the data summarization process, respectively. A micro relation R is defined as a relation in the relational model, which represents a set of MIOBs that have similar properties. The properties of the MIOBs are represented as attributes of the relation, and the individual entities are represented by tuples in the relation. A micro relation can be denoted as R(ID, A~, A 2 , . . . ,A,), Information and Software Technology

M H SADREDDINI,D A BELLAND S MCCLEAN Table 3. Primitive summary functions and statistical analysis procedures

Table 2. Micro relation instance of survey data

ID

SEX

S-H

AGE

WEIGHT HEIGHT

e~ e_~ e~

male male male male male male female female female female female female

smoker smoker non-smoker non-smoker smoker smoker non-smoker smoker non-smoker non-smoker smoker smoker

17 22 32 41 45 59 20 30 43 48 55 62

73.2 71.0 77.3 79.8 77.6 75.9 60.2 61.9 66.4 64.0 67.9 63.7

e4

e5 e6 e7 e8 e9

e~0 e, e~2

1.73 1.75 i.75 1.74 1.70 1.72 1.62 1.60 1.63 1.57 1.59 1.57

where I D is the identifier attribute, and A~,/12 . . . . , A n are attributes, where each attribute A~ corresponds to a d o m a i n Di ( i = 1. . . . . n). Some o f these attributes can represent categorical variables. This is information o f descriptive nature, such as sex: [male, female], marital status: [single, married, widowed], and smoking-habits: [smoker, non-smoker]. Numerical variables hold numerical information, such as age, weight, and height. F o r example, the micro relation shown in Table 2 represents a set o f entities (MIOBs) I D = {et,e 2. . . . ,e~2} and their values for the attributes (variables) A~ = SEX, A 2 = S-H, A 3 = A G E , A4 = W E I G H T , and A5 = HEIGHT. Given a micro relation R (ID, A~ . . . . . A . . . . . . Ap), which describes a set o f M I O B s with category attributes AI . . . . . A,, and numerical attributes A,,+~,... ,Ap, and corresponding d o m a i n s D~ . . . . . Dp, a m a c r o relation R can be represented as a relation that describes a set o f m a c r o objects with similar properties, denoted as R < C I. . . . . C,,; PI,P2; S I. . . . . S k > , where C~. . . . ,C, are called the category attributes, P~ and P2 are called the N V attributes, and S~ . . . . . Sk are called the s u m m a r y attributes. The domains of the n category attributes C~. . . . . C,, are the classification X~. . . . . X, o f D~ . . . . . D, (i.e., Xi is a partition o f Dr). The d o m a i n s o f the two N V attributes Pt and P2 are the set o f numerical attributes described on R (i.e., A n . 1. . . . . Ap). The d o m a i n s o f the k s u m m a r y attributes S~. . . . . Sk with corresponding statistical functions fa~ . . . . . fak, are taken from the set o f real numbers: A m a c r o relation whose s u m m a r y attributes are additive is called a primitive m a c r o relation. The main advantage associated with using additive s u m m a r y attributes is that it produces a standard structure that is appealing for solving some data integration problems (discussed in another paper). Table 3 shows statistical analysis techniques that can be supported by using primitive m a c r o relations. In the following, for d e m o n s t r a t i o n purposes, the category attributes are represented by prefixing a "*'. In the case o f univariate and bivariate summaries, when no categories are involved "*-' represents the name o f the category attributes and '-" ( N U L L ) represents the values o f the category attribute. Also, in the case o f univariate summaries, "-' values are used for the values o f the P2 Vol 34 No 6 June 1992

Statistical procedures

Primitive summary functions N

Count Average SD Skewness Kurtosis Variance Covariance t(sum) t(groups) VCM Correlation Regression T-test ANOVA

¢ 4 4 ¢ / 4 4 ¢ / ,/ / / ,/ /

×2

/

E

E~

¢ / / / ,/ /

¢ ¢ ¢ / ,/

,/ / / / ,/

/ ,/ / / ,/

/

/

E'

E'

,/ ,/

/

EE

/ / / / /

SD = standard deviation; t(sum) = contingency table; VCM = variance-covariance matrix; t(groups) = tables containing sample size, sample mean, and standard deviation of different classes; ANOVA = analysis of variance; N = cardinality = (E,1), Y = (E,x,), E 2 = (E,x,2), E 3 = ()~,x,3), E 4 (E/Xt'), EE (E,x, × y,). =

=

Table 4. Instance of category macro relation

*Sex

*Socio-gp

Pt

P2 N

Male Male Male Male Female Female Female Female

Non-manual Manual Unemployed Other Non-manual Manual Unemployed Other

weight weight weight weightweightweight weightweight -

102 108 39 16 134 116 38 53

SD

Mean

10.85 11.88 10.57 11.21 8.89 11.97 11.37 10.32

77.2 75.0 71.7 73.3 62.0 65.0 61.2 61.7

attribute, as only one numerical variable is used, i.e., the values o f Pj. Table 4 shows an instance o f a m a c r o relation that represents s u m m a r y information on the weight o f groups o f individuals categorized by their Sex and Social groups. The d o m a i n o f the Sex category is {Male, Female} and the d o m a i n o f the Social g r o u p is {Non-manual, Manual, U n e m p l o y e d , Other}. The c o m b i n a t i o n o f the values o f these two category attributes produces eight categories, and for each category the s u m m a r y attributes N(COUNT), SD(STANDARD-DEVIATION), MEAN ( A V E R A G E ) are shown in Table 4. The c o m b i n a t i o n of the values from the d o m a i n o f the category attributes and the d o m a i n of the N V attributes can be used to identify the tuples o f the m a c r o relation (i.e., a composite key). Each tuple o f a m a c r o relation describes some statistical characteristics o f a category (subset/group) o f individuals. Let RI < C , . . . . . Cni; Pli, P 2 i ; S l i , . . . ,Ski > be an arbitrary m a c r o relation with cardinality (IRil)m and degree (n + 2 + k), where ti represents a tuple in R~, and t~[ ] is the 365

Framework for query optimization in distributed statistical databases projection of ti on some attribute(s) of Ri. Although it is sufficient to represent a macro relation by its name, the notation R~(CC~,PP~,SS~)is used to denote a macro relation with name i~ and macro scheme (CCi,PPi,SSi). CCi = {C~. . . . . C,i} is defined as the category-subscheme, PPi = {P~i,P2i} as the NV-subscheme, and SSi = {S~i. . . . . Ski} as the summary-subscheme. Two macro relations Rj(CC~,PP~,SS1) and R2(CC2,PP2,SS2) are category-compatible C G = CC2 if there is a one-to-one correspondence between the category attributes of the two relations (i.e., the Ci category attributes from C G and CC2 are drawn from the same classification). Two macro relations RI(CC~,PP~,SSO and R2(CCz,PP2,SS2) are summary-compatible SS~ = SS2 if there is a one-to-one correspondence between the summary attributes of the two relations (i.e., the Si summary attributes from SS~ and SS2 represent the same summary function). Two macro relations RI(CCI,PPt,SSI) and R2(CC2,PP2,SS2) are macro-compatible if they have the same degree and they are category- and summary-compatible (i.e., CCj = CC2, and SS~ = SS2). Note that there is always a one-to-one correspondence between the NV attributes of two macro relations (i.e., they consist of Pl and P2). OPERATIONS

IN MIMAD

MODEL

This section introduces a number of operations that are associated with the M I M A D model. This set of operations is rich enough to allow enough things to be expressed to make the system useful, but it is not complete. These operations facilitate access to data and manipulation of the contents. A high-level data language designed for statistical users can be translated into these operations. The operators shall be explained and their use illustrated by examples. The operands of the operations correspond to the constructs that were described in the previous section, namely, the micro and macro relations. The micro relations have purely relational extensions, therefore, the operations of the relational algebra may be applied to them directly. Many extensions to the relational model have been suggested to provide additional functionality, such as arithmetic computation, aggregate functions, sampling operations, and statistical operationsL In addition, other operators should be defined, which allow the creation of the macro relations from the underlying micro relations. As macro relations contain more semantics than traditional relations, however, some of the relational algebraic operators need to be redefined to cater for the structural and semantic differences, namely, the distinction made between the categorical, the NV, and the summary attributes. In the above, primitive macro relations were introduced. The difference between these relations and macro relations is that all summary attributes of primitive macro relations must have additive properties. This property contributes to solving integration 366

Table 5. Macro relation

*M-SMOKING

P,

P2

N S

weightweight-

N

Z

E2

227 132

782294 2762005400 423572 1402509116

problems as well as to defining some operations on macro relations that use aggregate functions. In the following definitions, only additive summary attributes are considered when aggregate functions are required. In addition to algebraic operators, special statistical operations have to be defined on the macro relations. Statistical operations produce the required analysis on the macro relations, for example, producing the correlation coefficient measurement from a macro relation would require a statistical operator. Generally, the set of operations on the primitive macro relations can be divided into two main types: (1) ~ or closed operators, which are mappings from primitive macro relations to other primitive macro relations, i.e.: • for unary operators ~ : Ri~llj • for binary operators ~ : R~.Rj~I~ where R is a primitive macro relation. Examples are given in the Appendix. A special case of the binary operator is the Macro-Reclassification (MREC) operation, which uses a macro relation and a binary relation as its operands. (2) [3 or open operators, which are mappings from primitive macro relations to some tables (which are not primitive) or real numbers: • for unary operators 13: R i ~ R E L 0 q) • for nary operators 13: {Ri. • • R j } ~ R E L U q0 where R, REL, and qo represent a macro relation, a set of real numbers, and a set of tables that includes summary tables which do not have additive properties, respectively. These include the statistical data analysis (SDA) procedures, which can be considered as operations on macro relations. These operations can be applied on macro relations to express various statistical data-analysis computations performed by statisticians (e.g., mean, standard deviation, correlation, and analysis of variance (ANOVA)). Example Suppose the macro relation in Table 5 represents children's weight at birth (in grams) by whether their mothers smoked in pregnancy. The statistical analysis procedure is group comparisons analysis, which produces the statistics shown in Figure l from the macro relation in Table 5. In Table 5 and Figure 1, M-SMOKI N G represents Mother-Smoking, N stands for NonSmokers, and S stands for Smokers; P~ represents the variable weight, which is the weight of child, N represents Information and Software Technology

M H SADREDDINI, D A BELL A N D S M C C L E A N

STD DEV SUM-OFSSSSSSSSSQUARES N 3446.2291 540.5711 66041072.1 S 3208.8788 575.0401 43317910.1 Within Groups Total 3358.9582 553.4688 109358982.2 M-SMOKING

MEAN

Criterion Variable weight SUM-OFsource SQUARES. DF MEAN-SQUARE 4702024.2245 15.34 Between GPs 4702024.2245 1 Within GPs 109358982.2 357 306327.6811 Eta = 0.2030 Eta Square = 0.0412 where GP stands for Groups.

SELECT CI,C2,PI,P2,N,SI FROM R, UNION SELECT CI,C2,PI,P2,N,SI FROM 112 The MINT operation can be simulated by using the SELECT and EXISTS quantifiers in SQL. For example, the result shown in Figure 2(e) can be produced by the following operation: SELECT CI,C2,Pt,P2,N,SI FROM RI WHERE EXISTS (SELECT Ct,C2,PI,P2,N,SI FROM R2 WHERE R2.C~=RI.G and R2.C2= R,.C2 and R2.P~= 11j.P~ and !i2./2 = R, .P2)

Figure 1. Result of group data analysis the number of cases, ~ represents the sum of the weights of children, and ~2 represents the sum of the squares of the weights of the children at birth in grams. USE OF RELATIONAL OPERATORS MACRO or-OPERATIONS

FOR

Data-manipulation languages of current relational DBMSs (e.g., QUEL, Query-By-Examples, and SQL) have capabilities beyond those of relational algebra and calculus. This means that these languages could be used to provide the m-operators already described. The following describes how these operators can be expressed using SQL, as SQL is the most commonly implemented relational query language. The SELECT operator in SQL can be used to provide the MPR operation on macro relations. However, a restriction is imposed on the SELECT statement, that is, it has to contain the names of all the category and NV attributes. For example, the result shown in Figure 2(a) can be produced by the following SELECT operator: SELECT G,C2,PI,P~.,N FROM 111 The SELECT operator and the WHERE clause in SQL can be used to provide the MSL operation on macro relations, For example, the result shown in Figure 2(b) can be produced by the following SELECT and WHERE operators:

G,C2,PI,P2,N, SI FROM R2 WHERE CI = cl3 SELECT

The MAG operation described above on macro relations can be simulated by the aggregate functions and the GROUP BY operators in SQL. For example, the result shown in Figure 2(c) can be produced by the following operator: SELECT C2,Pt,P2,SUM(S2) FROM R.~ GROUP BY C2,P,,P2 The MUN operation can be simulated by using the SELECT and UNION operators in SQL. For example, the result shown in Figure 2(d) can be produced by the following operation: Vol 34 No 6 June 1992

The MDF operation can also be simulated by using the SELECT and NOT EXISTS quantifiers in SQL. For example, the result shown in Figure 2(f) can be produced by the following operation: SELECT CI,C2,Pt,P:,N,SI FROM R1 WHERE NOT EXISTS (SELECT C~,C2,P,,P2,N,S~ FROM !12 WHERE 112.C,= RI.C~ and R2.C2= R,.C2 and 112.Pb= Rt.P, and 112.P:= R~.P2) The MJN operation on macro relations can be easily simulated by the join operation on all the category and NV attributes and some of the summary attributes. For example, the result shown in Figure 2(g) can be achieved by the following operation: SELECT FROM WHERE

R,.CI,RI.C2,R,.PI,RI.P2,R,.N,RI.SI,l;13.S2

111,113 R,.CI = R3.C, and Rt.C2= R 3 . C 2 and R,.P~= 113.P, and R,.P2= R3.P2

The MREC operation can be achieved by the relational equi-join and the aggregate and Group By operators. For example, the result shown in Figure 2(h) can be produced by the following operations (Note that CL is a binary relation that describes a mapping between two category attributes, see Figure 2.):

RL.Ct,CL.C3,111.PI,11t.P2, SUM(R,.N),SUM(R~.S0 FROM RI,CL WHERE Rt.C2= CL.C2 GROUP BY RI.C,,CL.C3,RI.P,RI.P2 SELECT

The MUNAG operation can be simulated by using the SELECT and UNION operators to concatenate the two tables, and then using the aggregate operator together with the Group By clause to produce the required results. For example, the result shown in Figure 2(i) can be produced by the following operations: SELECT

Cj,C2,P~,P2,N,SI FROM 111

UNION

SELECT

CI,C2,PI,P2,N,S, FROM !!2

and: SELECT

C,,PI,P2,SUM(N),SUM(SI) 367

Framework f o r query optimization in distributed statistical databases

C l C2 Pl /02 N Sl

Cl G P, P2 N S,

Cl G PI P2 $2

GG

Ci C2 P, /°2 N

Cll Cl2 ell el2 Cll Cl2

Cll cl2 {713 Cl3

Cll Cl2 Cll Cl2 {711 c12

C21 {73t C22 C31

Cl2 C21

C21 e21 {722 c22 C23 {723

Pl Pl Pl Pl p~ Pl

-----

nl n2 n3 n4 n5

-

/'16

Macro relation

Sll st2 Sl3 Sl4 Sis Sl6

C21 {721 c21 c22

Pl Pl Pl Pl

--

nt' n2 t n3' ha'

Sll' Sl2' Sl3' Sl4'

M a c r o r e l a t i o n 112

Pl Pl Pl Pl Pl

p~

--

$21 s22 s23 S24 s25 s26

C23 {732

CL relation

Macro relation !13

111

C~ C2 Pi P2 N

C21 {721 {722 {722 c23 c23

Sl

Cll {721 Pl Pt -

nj n2

{711 {722 P l Ct2 C22 PX {711 {723 Pl

----

n3 n4 n5

Ct2 C23 P l

--

n6

(a) MPRu 0R,)

C2 Pl P2 Sl

C1 (72 PI P2 N S,

CI C2 P, P2 N S,

C, (72 P~ P2 N St

CIl C21 P l Cl2 C21 e l

{711 {722 Pl Cl2 {722 Pl

{713 C21 P l

--

n3' s13'

C2I P l

--

Xl

Cll

C2I Pl

--

nl

all

Cl3 C22 P l

--

n4' sl4'

£'22 Pl.

-

X2

c23 Pl -

x3

{7t2 cH cl2 Cll Cl2 {711 {712 {713 Cl3

{721 c22 {722 C23 {723 {721 {721 C21 {722

Pl Pl P~ Pl el

--

n2 n3 n4

st2 sl3 s14

--

n6

Pl

hi'

Pl Pl

----

el

-

n4' Sl,'

(b) MSELcI=q3OR2) (c) MAGc2OR3)

---

nl n2

(e) MINT(11~,112)

Sll s12

--

c, c23 p~ c~2 c23 p~ -

n3 n4

si3 s14

n5 s~5 n6 s~6

n5 s~5

(f) MDF (R,,R2)

si6

SII' n2' SIS n3' s13 t

(d) MUN (11,,112)

Ci C2 Pl P2 N

S, $2

C, C3 P, P2 N

S1

Ci PI P2 N Si

c . c21 p~ -

nl s,l s21

ell c31 Pt

-

x,' yj'

tit

ct2 c2, Pt

-

n2 s,2

S22

---

n3

823

n4 Sl4 S24

---

X2' Y2' n5 s15 n6 Si6

c12 Pl - xz" Yz"

Cll C22 P] C12 C22 P l

Cl2 C31 P l ell c32 Pl el2 C32 Pl

Sl3

Cn

C23 Pl

-

n5

sis

cl2

c23 P l

-

n6

sl6 s26

Pl

cl3 Pt

-

-

n|" Yl"

x3" Y3"

w h e r e x~ = s2~ + &2, x2 = s23 + s24, x3 = s25 + s26, x ( = nt + n3, y t ' = sit + si3, x2' : n2 + n4, Y2' ----S~2 + Si4, Xl" ----nl + n3 + n5 + h i , y l ' = sil + s13 + sls + sll ", x z " - - n 2 + n 4 + n 6 + n 2 ' , Yz" = s~2 + sj4 + s~6 + s12', x 3 " = n3' + n4', Y3n : Sl3' + SI4'.

(i) MUNAGq (R,,112)

s25

(h) MREC (CL,110

(g) MJN (R,,113) Figure 2. M a c r o or-operations

FROM Rl2 GROUP BY C1,Pj,P2; where Rl2 is the result of the above U N I O N operation. Note that clearly some 'safety devices' must be incorporated into t h e system support to ensure that the requirements and restrictions listed in the definitions (see the Appendix for definitions) of the operators are met. The following summarizes some of the main points associated with the operations defined in the M I M A D model. • Micro objects have purely relational extensions. Therefore, any relational operation may be performed on them. • Numerical operations 2, which augment the relational algebra, can be applied to micro relations. . Special operators are needed to produce macro relations from the base micro relations. • Some operations (which are similar to algebraic operations) are defined on the macro relations. • Special operations have to be supported that provide the actual statistical analysis capabilities. 368

• Statistical data analysis (SDA) operations cannot be simulated by the capabilities of the current relational DBMS, thus they have to be added.

CONSTRAINTS

IN MIMAD

MODEL

This section states certain important properties of micro and macro relations. Most of these properties are consequences of the definition of the 'relation' in the relational model. Let X and Y be subsets of attributes in a relational scheme R(At . . . . . A,). As usual, the functional dependency (FD) X- > Y holds in R if, for any instance r of R, it is not possible for any two tuples to agree on X components, but they disagree on one or more Y components. The micro relations have purely relational extensions and thus the above F D can easily be adapted for these relations by allowing each tuple to be recognised by a key (identifier). For example, in Table 2 the attribute I D can be considered as the key o f the micro relation with no two values of the I D being the same, that is, each tuple of Information and Software Technology

M H SADREDDINI, D A BELL AND S MCCLEAN

the micro relation can be uniquely identificxi by the key. Note that the key of a micro relation cannot be null. In the case of the macro relations R, the combination of values from the category and NV attributes can satisfy the F D constraints, by representing the category and NV attributes as a composite key o f the macro relation. This constraint may not hold for some cases of the M U N operation, however, as the result may contain multiple tuples with the same values for the category and NV attributes. The constraints in the M I M A D model are now summarized: • Micro relations have purely relational extensions, which represent individual entities. • Each tuple in a micro relation describes some characteristic or observation on one entity (object or individual) from an entity set E. • Each micro object has a unique identification (key), usually appearing in the first column of the micro relation, which cannot be null. • Macro relations do not represent individual entities. • Each tuple in a macro relation describes some statistical characteristics about a set of entities (i.e., a subset of E). • Macro relations consist of three types of attributes, namely, category, NV, and summary attributes, and have composite keys that consist of the category and the NV attributes. • Composite keys of macro relations may contain some null values for some of their components, but not all the components. • Composite keys of macro relations do not contain null values '-' for the P~ attribute of the NV attribute. • Tuples of macro relations that have two non-nuU values for their NV attributes contain null entries '-' for all the summary attributes that have a corresponding summary function with degree 1 (i.e., univariate summaries such as COUNT(N), S U M ( ~ ) , and SQUARED-SUM(~2)). • Tuples o f macro relations that have only one null '-' value for their NV attributes contain null entries '-' for all the summary attributes that have a corresponding summary function with degree 2 (i.e., bivariate summaries such as S U M - O F - S Q U A R E D ( ~ ) ) . • Macro relations have pure summary attributes and primitive category attributes. A pure summary attribute does not mix with any component of category attributes. For example, A V E R A G E - A G E - O F F E M A L E - S T U D E N T S represents a nonpure summary attribute, because F E M A L E is a category and A V E R A G E - A G E represents a summary attribute. A primitive category attribute is a category that is not composed of two or more categories of the category attributes. For example, M A L E - S M O K E R S represents a nonprimitive category attribute, because M A L E and S M O K E R S belong to two different categories. • Primitive macro relations are macro relations whose summary attributes are additive. Vol 34 No 6 June 1992

• An integrity constraint is required for the MJN operator to make sure that, in the cases where there are two identical summary attributes in the operand macro relations, the corresponding values of these summary attributes are identical. • A special case of macro relation is defined with the null category attribute (*-), which contains null values (-). They represent macro relations where no category is specified. Most relational systems augment the relations by providing facilities to specify explicit constraints on the relations. For example, in System R, constraints are defined on the domains o f the attributes by providing facilities to specify: • scopes for attributes, that is, the scope limits the possible values that an attribute can have • comparability domains for attributes, which determine if it is semantically meaningful to compare two values from two different attributes • units for attributes, such as inches, centimetres, or dollars, to disallow some arithmetic operations that do not use the same units In addition to these constraints, more general types of constraints called assertions can be specified in DBMSs such as System R. Assertions specify constraints that must be satisfied by the database. The use of facilities to specify constraints (such as assertions and the above) can enhance the integrity of a database that stores micro and macro relations. QUERY

OPTIMIZATION

An important aspect of performance in relational systems is the efficient processing of queries, which is heavily influenced by the performance of optimizers. Because the relational database underpins the representation of macro data as well as micro data, it would be tempting to use conventional relational optimizers to expedite query processing. There are several differences, however, so the search for specialized distributed statistical query optimization is likely to be rewarding. These differences include: • The underlying system does not have to be relational. • Optimizing queries on micro and macro data sets adds another dimension to the optimization process, as in some queries either types of the data sets can be used to answer the queries. • Unlike conventional distributed query optimizers, semi-join methods cannot readily be applied to distributed macro data sets, due to the nature of macro operations. • Some of the operations on macro data sets cannot be translated into relational algebra. • The space of query optimization for statistical queries might be smaller than the conventional queries, because statistical queries have simpler and more predictable structures than conventional queries. 369

Framework for query optimization in distributed statistical databases

One of the main performance factors in a DDBMS is the query optimization method used. Some benefits of query optimization include that it can reduce volumes of data to be transmitted to other nodes, eliminate nonrelevant nodes from the processing of a query, and provide more efficient division of the processing tasks between the different nodes. Examples of such query optimizations are promotion of selection and projection operations, use of the semi-join operation, simplification, amelioration, and use of statistical information for cost estimation and selection of processing nodes. To provide query optimization in a DS-DBMS environment, however, special additional issues are raised, which manifest themselves in different scenarios. Three scenarios are possible:

can be defined in terms of other functions (called component functions)3°. For example, the statistical function AVERAGE is a computed function, as it can be computed from the component functions SUM and COUNT (i.e., N: cardinality), i.e., AVERAGE = SUM/COUNT. The use of computed functions for distributed computation of aggregate functions leads to reduction of transmitted data. That is, instead of transmitting all the data to the global site and then computing the aggregate functions, the partial results of the component functions can be transmitted to the global site, where the computed functions can be evaluated. For example, consider the following global query:

• The local databases contain only micro data. • The local databases contain only macro data. • The local databases contain both micro and macro data.

where the University of Ulster has four sites: (1) Coleraine, (2) Jordanstown, (3) Belfast, and (4) Magee. To answer the above query, the global query is broken down into four subqueries, where each subquery returns four aggregate data sets (i.e., results of the component functions). These are the sums of the ages of male and female students (SMi and SFi, respectively) and the number of male and female students (CMi and CF~, respectively) studying computer science at site i, where i = 1,2,3,4. In this situation the global query can be easily answered by simple arithmetic operations:

The following subsections deal with some of the important issues and problems that are raised by each of these scenarios. Some heuristics are then proposed to help solve them.

Micro data: query optimization implications In this scenario, a DS-DBMS is considered where the local databases contain only micro data (unaggregated data) and the local sites are assumed to be capable of providing some statistical functionality. As micro relations have purely relational extensions, then the optimization techniques that have been produced for relational algebra can be applied to most operations on micro relations. However, some important operations, such as the grouping operation and aggregate functions on micro relations, are not expressible in relational algebra. The relational algebra has been extended by the Group By (GB) operation 32, which caters for the grouping and the aggregate functions. Other researchers have studied aggregation in detaiP TM. The GB operation has a distributivity property, which leads to profitable evaluations of some global queries. The distributivity property states that the results of the GB operation on union of the operand relations is equivalent to the results of applying the union operation on the results of the GB operation on each operand relation. For example, to evaluate a global query that contains a GB operation on union of some underlying relations, the GB operation is performed on the underlying relations and then the union operation is performed on the results of the GB operation. However, this property only holds for some aggregate functions, that is, the aggregate functions MAX and MIN, and those aggregate functions that have additive properties. In addition to the above property, the concept of computed functions provides another important property that can be employed for evaluation of global queries that contain aggregate functions. A computed function 370

QI: What is the average age of male and female students in the University of Ulster?

(SMt + SM2+ SM~+ SM,)/(CMt + CM2+ CM3+ CM,) for male students (SF, + SF2+ SF3+ SF4)/(CF~+ CF2 + CF3 + CF4) for female students Therefore, in a DS-DBMS, when the underlying databases contain only micro data sets (micro SDBs), the above properties can be employed to enhance the evaluation of global queries by distributing the computation of aggregate functions to local sites when possible. That is, performing the required aggregate functions as closely as possible to the origin of the micro data sets to reduce the amount of data that has to be transmitted around the system. This concept has also been supported in the proposed architecture ~9.

Macro data: query optimization implications In this scenario the local databases contain only macro data sets. There are three possible situations that may arise in answering global queries. These are: (i) when the global query can be answered completely using the stored macro data (ii) when the global query cannot be answered completely, as the categories of the stored macro data do not exactly correspond to those of the global statistical query (iii) when the stored macro data cannot be used at all to answer the global query, which is a trivial case

Case (i) In the first case, the macro operations described previously (or the equivalent relational operation) can be Information and Software Technology

M H SADREDD1NI, D A BELL A N D S M C C L E A N

Table 7. Commutativity of operands and associativity of binary at-operations

Table 6. Commutativity of unary a-operations MSLn

MPRss2 MUN MINT MDF

MSLr,(*0R,))-> *(MSLr,0R)) MPRss,(*0R,))-> *(MPRss,(R,))

Y C2

C1 C3

MJN

*0a,,R2) - > *~2,R,)

Y

Y

N

Y

*(*0R,,R~),T)-> *(R,,*(Rz,T)

Y

Y

N

Y

where CI: Att(Fl) =_ C C U P P U SS2; C2: Att(F2) =_ CC tJ P P U S S I ; C3: S S I = SS2. Table 8. Idempotence of unary operations applied to macro relations to answer global queries. Oszu and Valduriez 32 describe several transformations of global queries into equivalent subqueries, which can be used for the query optimization process. Similar transformations of macro a-operations can be defined on macro relations, and these have yielded similar results when applied (there are some differences in some of the conditions due to the restrictions that are imposed by the macro a-operations). The following considers some o f the transformation issues associated with the macro a-operations. Macro a-operations have equivalent relational operations, which can be expressed using different languages of relational databases (the authors use SQL), and it is possible to transform most of the queries in these relational languages into equivalent expressions of relational algebra and vice versa 34. An expression of relational algebra specifies the semantics and the sequence o f operations o f a query. M a n y techniques have been developed for optimization using the relational algebra expressions. Most o f the macro a-operations on the macro relations can be transformed into equivalent expressions in relational algebra, so some of the optimization techniques o f relational algebra can be applied to macro a-operations. Oszu and Valduriez 32 gave a set of equivalence transformations denoted by ' < - > ' (for two-way transformations) or ' - > ' (for one-way transformations) for expressions of two or three operand relations. Different categories o f transformations are considered according to the types o f operations involved. The following states some of the laws that can be applied to unary, U, and binary B, algebraic operations, where R, S, and T are relations. *Commutating of unary operations: UI(U2(R))<-> U2(UI(R)) *Commtitating of operand of binary operations: B(R,S) < - > B(S,R) *Associativity of binary operations: B(R,B(S,T))<-> B(O(R,S),73 *ldempotence of unary operation: U(R) < - > UI(U2(R)) *Distributivity of unary operations with respect to binary operations: U(B(R,S))- > B(U(R),U(S)) *Factorization of unary operations (this is the inverse of distributivity): B(U(R),U(S))- > U(B(R,S)) N o w these are applied where possible to a-operations; Tables 6-10 summarize some of the possible combinations of a-operations to which the above laws can be applied. The tables contain the values 'Y', ' N ' , or 'C', where ' Y ' means that the property can always be applied, Vol 34 No 6 June 1992

MPRssL(R~)-> MPRssdMPRss~OIO), C:SSI isasubset ofSS3, SSI = SS2 MSELAR,) -> MSELr~(MSELn(R,)),C:F= FI and F2

Table 9. Distributivity of unary operations with respect to binary operations MUN MINT MDF MJN MREC MSELR*(R,,R2)) - > *(MSEL~(R,),MSELA'R2) MPRss(*(R,,R2)) - >

Y

Y

Y

Y

C1

*(MPRss(II,),MPRss(R2)

Y

Y

Y

Y

C2

where C i: one of the operand relations is a reclassification table; C2: both operand relations are reclassification tables.

Table I0. Factorization of unary operations from binary operations MUN MINT MDF MJN MREC *(M SELF(R,),MSELAR2)) -> MSEL~(*(R,,R2)) *(MPRss(R0,MPRss(R2) -> MPRss(*(I~,I;12))

Y

Y

Y

Y

N

Y

Y

Y

Y

N

' N ' means that it cannot be applied, and 'C' specifies a necessary and sufficient condition for the application of the property. The attributes that appear in a formula F are denoted as Att(F). Note that '*' stands for the operations defined by the column headings, and C C , P P , S S represent the category, NV, and summary subschemes, respectively. Although the macro a-operation M A G can be expressed using different relational database languages, it cannot be expressed directly using the relational algebra operators. The special features of these operations are the grouping and aggregate operations. As already mentioned, to cater for these operations, an extension of the relational algebra called the G r o u p By (GB) operator has been proposed 32. Application of the properties of the GB operator to the M A G operator yields the following distributivity property: MAGcc(MUN(R.R2)) - > MUN(MAGcc(R,),MAGcc(R2)) The M U N A G operator can be expressed in terms of the M A G and M U N operators: 371

Frameworkfor query optimization in distributedstatistical databases site a: RI(*AGE-GP: < 1%19, 20-24, 25-75 > , *SEX: < male, female > , COUNT-students) site b: R2(*AGE-GP: < 1%19, 20-24, 25-75 > , *COURSE: < maths, physics > , COUNT-students) Figure 3. Macro data from two different nodes MUNAGcc0R,,R2) = MAGcc(MUN((MAGcc(Rt)), MAGcc(Rx)) So the equivalent transformations on macro a-operations differ slightly from the equivalent transformations of the relational algebra.

Case (u) In the second case two special situations will be considered for answering global queries. The first situation arises when the global query is more detailed than the stored macro relations and the macro relations are on different populations. For example, consider the system that consists of two local macro SDBs, which contain two macro data sets Rt and R2 on two different populations, as shown in Figure 3. Now consider the global macro query. Q2: How many female students are between the ages of 1719 and study maths? The answer to this query requires input from both sites. However, the macro data at both sites are not detailed enough to answer the global query. That is, the level of detail on which the data sets can be compared or merged is somewhat limited and depends on the classification systems used. However, an approximate solution can be considered to this query, which only gives the overall number of students who are between 17-19, irrespective of their sex and their study (i.e., less detailed answers). In addition to less detailed answers to some global queries, partial results can also be given to some queries. For example, in the above query, the number of female students who are between 17-19, or the number of students between 17-19 who study maths, can be provided. Although in most situations less detailed answers and partial answers will not be acceptable, there may be other situations in which they will. The enquirer should be offered the possibility of having these less detailed answers, and clearly the responses would be much quicker here. The second situation arises when the global query is more detailed than the stored macro relations and the macro relations are on the same populations. Malvestuto et a/. 27-29refer to this problem as the derivation problem and propose several approaches to solving it. The problem of derivation is defined in the framework of the linear equation systems, and procedures for testing the 372

evaluability and evaluation of a category are shown that are based on solving an equation system29. In a distributed system, storing the 'intersection hypergraph' and solving the equation Systems at the global meta-data level improves the responsiveness of the system and provides an integrated use of the stored information. However, the sets of global queries that can be answered by these methods are limited.

Micro and macro data: query optimization implications In this scenario, the local databases contain both micro and macro data. Here it is assumed that there exist related macro data sets at the local sites that can be used to answer the global query. However, it is also assumed that the micro data sets from the local sites can be accessed and processed either locally (if possible) or at another node, to answer the global query. Therefore, the global query can be answered by using either the micro data or the macro data. One of the main fundamental issues that is raised in such an environment is to decide whether the global query can be answered from the already stored macro data. To answer global queries, several possibilities may exist. Three cases can be considered: (i) when the global query can be answered completely using the stored macro data (ii) when the global query cannot be answered completely using the stored macro data (iii) when the stored macro data cannot be used at all to answer the global query In the following, these three cases are discussed. The first case is interesting, for it must be decided whether to use the already available macro data or to evaluate it from the micro data, to answer the global queries.

Case 0) Consider a query where the user is interested in comparing or merging the data from two sites, where the data stored at the sites represent two different populations. Although in most situations it would be better to use the macro data sets to answer global queries, there might be some situations when it would be better to use the micro data sets• Examples of such situations can arise when answering a global query would require transformation; processing and merging of several macro data sets would take longer in comparison to processing of the micro data set, that is: T(E'0R,,R2. . . . )) > T(E(R,,R~. . . . )) where E" is an expression on the macro data sets 0llj,R:, •..), E is an equivalent expression on the micro data sets (R~,R2. . . . ), and T is the time to evaluate an expression. Note that a combination of operations using micro and macro data sets can also be considered (see next case). This issue is illustrated by means of an example here, and some heuristics are presented in the next section• Information and Software Technology

M H SADREDDINI, D A BELL A N D S M C C L E A N

site a: RI (id, name, age, sex, course . . . . ) R~(*AGE-GP: < 17-19, 20--24, 25--75 > , *SEX: < male, female>, COUNT-students)

Rl(id, name, age, sex, course, I Q - s c o r e . . . ) RI(*AGE-GP: < 17-19, 20-24, 25-75 > , *SEX: < male, female > , COUNT-students)

site b: R2(id, name, age, sex, course . . . . ) R2(*AGE-GP: < 17-19, 20-24, 25-75 > , *COURSE: < maths, physics > , COUNT-students)

R2(*AGE-GP: < 17-19, 20-24, 25-75 > , *COURSE: < maths, physics > , SUM-students-IQ-score) R3(*AGE-GP: < 17-19, 20-24, 25--75 > , SQUARED-SUM-students-IQ-score) CL1 (*AGE-GP:*N-AGE-GP) 17-19 17-24 20-24 17-24 25-75 25-75

Figure 5. Micro and macro data from two different sites

Figure 4. Macro data sets generated from micro data set and reclassification table Consider the macro data sets R~,R~,R3 shown in Figure 4, which have been generated from the R~ micro data set and the reclassification table CLI. Now consider the query: Q3: What is the average and standard deviation of the students-IQ-seores who are aged between 17-24? To answer this query from the stored macro data sets only requires several operations on the macro data sets. Note that the summary functions average and standard deviation require the component summary functions C O U N T , SUM, and S Q U A R E D - S U M . The following gives a set of steps that has to be taken on the stored macro data sets, before evaluating the summary functions average and standard deviation to answer the above query.

Step (1) Aggregate Rt and R2 on their A G E - G P category attribute, i.e., RI2= MAG.AcE.6P(R,) R2, = MAG.AGE.CpOR2)

(t 1) (t2)

Step (2) Join the result of the above aggregation with !i3, i.e., R22= MJNORtz,Rzt) R~ = MJNOR~,R22)

(t3) (t4)

Step (3) Use the reclassification operation of the above join with CL1, i.e., !~ = MREC(CLI,R3,),

(7"2)

It can be stated that if T2 > T1, then it would be better to use the macro data sets. It can be seen that this is not always necessarily the case, and that the specific details of the application determine the choice. Detailed cost equations are needed to do this properly, but some heuristics can help (see later).

Case 0i) This case is similar to case (ii) discussed previously. But here micro data sets are available that can be accessed to answer the global queries. In this case two approaches will be considered: (a) Evaluate partial answers from the micro data sets. (b) Answer the global statistical query at a different (less detailed) level of classification.

Approach (a) For the first approach, a solution may be provided by using macro data from one site and micro data from the other site. In this case, advantage can be taken of the already stored macro data. For example, consider the data sets shown in Figure 5, where RI and R2 are the micro data sets and R1 and R: are the macro data sets that have been generated from the micro data sets, respectively. Now consider the global query: Q4: How many students are between the ages of 17-19 and study maths?

(t6)

T o answer this query, a subquery is submitted to site b to retrieve the number of students who are between 17 and 19 and study maths from the R2 macro data set, and a micro subquery is submitted to site a to evaluate the

Select the tuple whose N - A G E - G P - - 17-24, i.e.,

Vol 34 No 6 June 1992

SELECT COUNT(id),SUM(IQ-score), SUM(IQ-score*IQ-seore) FROM Ri WHERE age> = 17 and age< =24.

(ts)

Step (4) R5 = MSEL.s.A6E.Cp=t7_2~(1~)

Here, tl . . . . ,t6 are the times taken to perform each of the steps. Therefore, the overall time is T1 = tl + t2 + t3 + t4 + t5 + t6, with the assumption that the above operations have been optimized. The following query in SQL expresses a query that would produce from micro data only an equivalent result (with time T2) to the result produced from the above steps.

373

Frameworkfor query optimization in distributed stat&tical databases number of students who are between l 7 and 19 and study maths from the R~ micro data set. In this way, only the micro data from one site needs to be processed to produce the required macro data. Approach (b) In the second approach, the answer provided to the global query would be at a lesser level of detail as required. For example, consider the Q4 global query, which requires the data sets from site a and site b, at the level of the A G E - G P and the C O U R S E classifications. Then, in this approach, the answer would be given only in terms of the A G E - G P category (i.e., the answer to the above global query gives the number of students whose age is 17-19, irrespective of their course o f study). This is similar to case (ii) discussed previously. Again the enquirer should be allowed to choose to have a response (more quickly) if acceptable, or even an interim response while a fuller result is pending.

Case (iii) In the case where no complete nor partial result can be evaluated using the stored macro data, access would be required to the micro data sets at the local sites (see the previous section). For example, consider the global query: Q5: How many students are between the ages of 20-22? which is to use the data sets in Figure 5. In this case the stored macro data sets cannot be used, as the macro data sets are not detailed enough. Thus the micro data at each site have to be processed to create the results of the global macro query in terms of 20--22 age groups.

Heuristics The query optimizer has the task of finding the trade-off between the possible approaches to answer a global query and to select the best approach. This should increase the performance of the system. In most cases, the delay cost was considered, as it is assumed that the transmission cost is minimal because of the aggregate functions (i.e., in most cases raw data are not transmitted around the network). In the following, some heuristics are proposed to help increase the performance of the system in answering global queries in a distributed environment. Most of the optimization heuristics that have been produced for relational algebra can be applied to most operations on micro relations, so these are taken for granted. • The distributivity property of the grouping operation and the distributed computation of additive aggregate functions should be used to enhance the optimization process on micro data sets, by reducing the amount of data that has to be shipped around the network and also enforcing a level of parallel computation. 374

• The processing of aggregate functions on micro and macro data sets should be performed at local sites if possible. • Similar transformation rules should be applied to the macro 0~-operations, therefore enabling most of the optimization techniques of the relational algebra to be applied to operations on macro relations. • In situations where the micro data sets cannot be accessed (or it would be too expensive or time consuming to do so), less detailed answers can be evaluated for the global queries, and these should be offered to the enquirer. • In situations where the macro data sets describe the same populations and the micro data sets cannot be accessed (or it would be too expensive or time consuming to do so), intersection dependencies of the classification systems (stored as graphs) should be used to answer a limited range of global queries. • In situations where a global query can be answered using either macro data sets or micro data sets, the delay cost of these two alternatives should be calculated and compared against each other, i.e., T ( E ' ( R I , R 2. . . . )) and T(E(R,R2 . . . . ), the lower cost alternative being chosen. • In some situations, partial answers evaluated from micro data sets should be combined with data from macro data sets, to answer the global queries. Similarly, in these cases, different alternatives for answering global queries should be considered for delay costs.

SUMMARY

AND FUTURE

WORK

This paper focuses on operational issues and performance implications related to distributed statistical databases (SDBs). Based on concepts of micro and macro objects, a model called the M I M A D (Micro/MAcro Data) model, which is an extension of the relational model, has been developed. A set of operations on the macro objects is defined, as well as the relational equivalent of these operations. A set of transformations (similar to those applied to relational algebra) has been applied to macro operations, which can be used for optimizing macro operations. One of the important aspects of performance in a DSDBMS is the efficient processing of queries, which is heavily influenced by the performance of optimizers. However, to provide query optimization in a DS-DBMS, special issues are raised, which manifest themselves in different scenarios. Some of the important issues and problems that are raised by these scenarios have been discussed and some heuristics to solve them have been proposed. The authors intend to implement these techniques in the next version of their DS-DBMS. Validation of the heuristics in typical application environments is required, and full cost models for the query optimizer have to be specified to support selection of an execution strategy. Information and Software Technology

M H SADREDDINI,D A BELL AND S MCCLEAN

REFERENCES 1 Catarci, T and Santueci, G 'Grasp: a graphical system for statistical databases' in Proc. Fifth Int. Conf. Statistical and Scientific Databases (1990) 2 Ghosh, S P 'Statistical relational model' in Proc. Fourth Int. Conf. Statistical and Scientific Databases (1988) 3 0 z s o y o g l u , G, Matos, V and Ozsoyoglu, Z M 'Query processing techniques in the summary-table-by-example database query language' A C M Trans. Database Syst. Vol 14 No 4 (1989) 4 Barcaroli, G, Di Battista, G, Fortunato, E and LeporeUi, C 'Design of statistical information media: time performances and storage constraints' in Proc. Fourth Int. Conf. Statistical and Scientific Databases (1988) 5 Sato, H 'Handling summary information in a database: derivability' in Proc. A C M SIGMOD Conf. (1981) 6 Bate, D, Boral, H and De Witt, D J 'A framework for research in database management for statistical analysis' in Proc. A C M SIGMOD Int. Conf. Management of Data (1982) 7 Lefons, E, Silvestri, A and Tangorra, F 'An analytic approach to statistical databases' in Proc. Int. Conf. VLDB Italy (1983) 8 0 l k e n , F 'How baroque should a statistical database management system be?' in Proc. Second Int. Workshop Statistical and Scientific Databases (1983) 9 Chen, M C and McNamee, R D 'A data model and access method for summary data management' in Proc. Fifth Int. Conf. Data Engineering (February 1989) 10 Hebrail, G 'A model of summaries for very large database' in Proc. Third Int. Workshop Statistical and Scientific Databases (1986) 11 Ikeda, H and Kobayashi, Y 'Additional facilities of a conventional DBMS to support interactive statistical analysis' in Proc. First Int. Workshop Statistical and Scientific Databases (1981) 12 Su, S Y W, Navathe, S B and Batory, D S 'Logical and physical modelling of statistical and scientific databases' in Proc. Second Int. Workshop Statbstical and Scientific Databases (1983) 13 Rowe, N C 'Rule-based statistical calculations on a database abstract' in Proc. First Int. Workshop Statistical and Scientific Databases (1981 ) 14 Ozsoyoglu, G and Ozsoyoglu, Z M 'An extension of relational algebra for summary tables' in Proc. Second Int. Workshop Statistical and Scientific Databases (1983) 15 Ozsoyoglu, G, Ozsoyoglu, Z M and Mata, F 'A language and physical organisation technique for summary tables' in Proc. A C M SIGMOD Conf. (1985) 16 Shoshani, A and Wong, H K T 'Statistical and scientific database issues' IEEE Trans. Soft. Eng. Vol 11 No l0 (October 1985) 17 Sadreddini, M H, Young, I R and Bell, D A 'A statistical capability of multidatabase management systems in health care' Medical Informat. Europe (1988) 18 Bell, D A and Sadreddini, M H 'SEDDS: a multidatabase system with statistical capabilities" Database Technol. Vol 2 No 2 (1988) 19 Sadreddini, M H, Bell, D A and McClean, S "Architectural considerations for providing statistical analysis of distributed data' Inf. S~?ft. Technol. Vol 32 No 7 (September 1990) 20 Landers, T and Rosenberg, R L 'An overview of Multibase' in Schneider (ed) Distributed databases North-Holland (1982) 21 Litwin, W e t aL 'Sirius systems for distributed data management' in Schneider (ed) Distributed databases North-Holland (1982) 22 Mostardi, T, Bell, D A et aL 'Multistar - a multidatabase management system (MDBMS) for organisational information systems' Database Technol. Vol 2 No 2 (1986) Vol 34 No 6 June 1992

23 Heiler, S and Manness, A T 'Connecting heterogeneous systems and data sources' Database Eng. Vol 3 (1984) 24 Hotaka, R 'Statistical database design method' in Proc. Database Syrup. (1984) 25 Hotaka, R 'Logical database design' Inf. Proc. (May 1983) 26 Lee, F and Hotaka, R ' A statistical database model: its uniqueness and the design procedure' J. Inf. Proc. Vol 12 No 2 (1989) 27 Malvestuto, F M 'The derivation problem of summary data' in Proc. SIGMOD Int. Conf. Management of Data Vol 17 No 3 (September 1988) 28 Malvestuto, F M and Zuffada, C 'The classification problem with semantically heterogeneous data' in Proc. Fourth Int. Conf. Statistical and Scientific Databases (1988) 29 Malvestuto, F M and Moscarini, M 'Aggregate evaluability in statistical databases' in Proc. Int. Conf. VLDB (1989) 30 Chen, M C, McNamee, R D and Melkanoff, M 'A model of summary data and its applications in statistical databases' in Proc. Fourth Int. Conf. Statistical and Scientific Databases (1988) 31 Bell, D A, Grimson, J B and Ling, D H 'EDDS - a system to harmonise access to heterogeneous databases on distributed micros and mainframes' Inf. Soft. Technol. Vol 29 (1987) 32 Oszu, M T and Valduriez, P Principles of distributed database systems Prentice Hall (1991) 33 Satoh, K, Tsuchida, M, Nakamura, F and Oomachi, K 'Local and global query optimization mechanisms for relational databases' in Proc. Int. Conf. VLDB (1985) 34 Ullman, J D Principles of database systems (2nd ed) Computer Science Press (1983) 35 Bultzingsloewen, G 'Translating and optimizing SQL queries having aggregates' in Proc, Int. Conf. VLDB (1987) 36 Dayal, U 'Of nests and trees: a unified approach to processing queries that contain nested subqueries, aggregates, and quantifiers' in Proc. Int. Conf. VLDB (1987)

APPENDIX: MACRO a-OPERATIONS Macro-Project

MPss(R)

Macro-Project MPss,(R), where R denotes the operand macro relation and SS" denotes a subset of the summary attributes of R, is more restricted than the project operation of the relational algebra. The result of this operation is a macro relation R', which is defined over all the categorical and NV attributes, and the summary attributes that appear in SS'. That is, only the summary attributes of the operand relation that do not appear in SS' are suppressed, thus the cardinality of the result is the same as the operand macro relation. MPs~(R) = {t'[CC,PP,SS']]t'[CC,PP,SS'] = t[CC,PP,SS'] with tcR and SS' ~_ SS} where R is the operand macro relation with macro scheme (CC,PP,SS) and the resultant relation is R' with macro scheme (CC,PP,SS').

Macro-Select

MSLAR)

Macro-Select MSLr(R), where R denotes the operand macro relation and F denotes a formula that expresses a selection predicate, results in a macro relation R', which includes tuples of the operand macro relation that satisfy F. This operation is similar to the select operations in the relational model. Complex Boolean expressions that consist of a number of predicates and logical operators (AND, OR, NOT) can be used for the selection predicate. The predicate is of the form (attributename O variable/constant), where attribute-name can be a category, NV, or summary attribute name and O is a compari375

Framework for query optimization in distributed statistical databases son operator, and variable/constant specifies the value to be compared with the value of the attribute-name, i.e.:

patible with macro scheme (CC, PP,SS) and (CC, PP, SS), respectively, and R" is the resultant macro relation with macro scheme ( CC, PP, SS).

MSL~(1R)= {t'[CC, PP, SS]It'dR and t' satisfies F} where R is the operand macro relation with macro scheme

(CC, PP, SS) and the resultant relation is R' with macro scheme (CC,PP,SS).

Macro-Aggregate MAGcc(R) In statistical analysis, Macro-Aggregate MAGcc(R), where R denotes the operand macro relation and CC' denotes a set of the category attributes of R, is extensively used for creating union of individuals belonging to different categories (groups or classes). Application of this operation on R will result in a new macro relation R' with the macro scheme ( C 1 ' , C 2 ' , . . . . Cit'~ Pa,P2; S/,$2', . . . . S,'), where C{, . . . . C/ is a subset of category attributes in R and the summary attributes S,', . . . . S / h a v e the same summary domains as Sa. . . . . Sin, respectively, except that the values of the summary attributes are computed by the aggregate functions or procedures defined for the domains of the summary attributes of R, which has the same values for the attributes C/,C2', . . . . C,',Pa,P2. For the additive summary attributes, the M A G operation is the arithmetic addition of the values of the summary domains. For the nonadditive summary attributes (such as Avg, Max, and Min), user-defined rules can be used. MAGcc0R) = {t'[CC',PP,SS] I t'[CC',PP] = t[CC',PP] and t'[SS] = ~ t~SS], with tdR, CC' ~_ CC}

,dr ~cc',ee] = tJcc',e~

where R is the operand macro relation with macro scheme (CC, PP, SS), and the resultant macro relation is R' with macro scheme ( CC',PP, SS).

Macro-Difference MDF(R,R') Macro-Difference M D F (R,ff), where R and R' are two operand macro relations with similar schemes, produces a macro relation R" with the same scheme as its operands and includes those tuples of R that have different values for the categorical and NV attributes from those of El'. That is, all the tuples appearing in R that do not have the same values for the categorical and NV attributes as R', i.e.: MDFOR,R') = {t"ICC, PP, SS] I t"dl, t"[CC, PP]= t[CC, PP] but t"[CC, PP] ~ t'[CC, PP] with tell, and t'eR'} where the operand macro relations R and R' are macro-compatible with macro scheme (CC, PP, SS) and (CC, PP, SS), respectively, and R" is the resultant macro relation with macro scheme (CC, PP,SS).

Macro-Join MJN (R,R') Macro-Join MJN (R,R'), where R and R' are the operand macro relations, can be used in a similar way to the relational equi-join; this is due to the semantic distinction between the category, NV, and summary domains. The domain of the joining attributes in both macro relations has to contain all the category and NV attributes. The two macro relations are then joined over the category and NV attributes; the result is a macro relation R" defined over the joining domains and all the summary domains of R and R'. MJN(R,R') = {t"[CC, PP,SS"] [ t"[CC, PP] = t ICC, PP] and

t"[SS"] = t[SS] U t'[SS'] with tell, t'dl' and t[CC,PP] = t'[CC, PP]}

Macro-Union MUN(R,R') Macro-Union M U N (R,R'), where R and R' are the operand macro relations, works on two macro relations that are defined on the same macro scheme. Its result is a macro relation R" with the same macro scheme as its operands, whose domain is the union of the tuples of the operand macro relations, which includes all the tuples of both R and R'. This operation is similar to the Union operation in relational algebra, i.e.: M U N 0R,R') = {t"[CC, PP, SS] I t"elR or t"dl'} where the operand macro relations R and R' are macro-compatible with macro scheme (CC, PP, SS) and (CC, PP, SS), respectively, and R" is the resultant macro relation with macro scheme ( CC, PP, SS).

Macro-Intersection MINT (R,R') Macro-Intersection MINT (R,R'), where R and R' are two operand macro relations that are macro-compatible, produces a macro relation with the same scheme as its operands and includes those tuples of R that have the same values for the categorical and NV attributes as those of R'. That is, all the tuples appearing in R that have the same values for the categorical and NV attributes as R', i.e.: MINT(R,R') = {t"[CC, PP,SS] ] t" ¢R,t"[CC, PP] = t[CC, PP] and t"[CC, PP] = t'[CC, PP], with tdR and t'dR'} where the operand macro relations R and R' are macro-com-

376

where R and El' are the operand macro relations that are category-compatible with macro scheme (CC, PP,SS) and (CC, PP, SS'). respectively, and R" is the resultant macro relation with macro scheme (CC,PP, SS") where SS" = SS U SS'.

Macro-Reclassification MREC (CL,R) Macro-Reclassification MREC (CL,R), where R is the operand macro relation, CL is a relation that defines mappings between values of category attributes, and macro relation El' is the result of this operation, can only be applied to the category attributes of the operand macro relation. It results in reduction of categories by computing new summary values using the aggregation functions or procedures defined for the summary domains. The MREC operation is similar to the MAG operation described earlier, except that it does not remove category attributes, rather it replaces the old category values with new ones and computes new summary values for the summary attributes. MREC(CL,R) = {t'[CC',PP,SS] I t'[CC',PP] = t[CC',PP] and t'[SS] = ~ t,[SS],

qa

t[c.eel = t~fc.ee]

with telR, ueCL, C e CC, t[C] = u[C], and CC' = ( C C - C) U NC}, where CL is a relation with scheme (C,NC), R is the operand macro relation with macro scheme (CC,PP,SS), and El' is the resultant macro relation with scheme (CC',PP, SS) where CeCC and CC' = ( C C - C) U NC. Note that '-' is the difference operation. Information and Software Technology

M H SADREDDINI, D A BELL A N D S M C C L E A N

Macro-Union-AggregateMUNAGcc,(R,R') Macro-Union-Aggregate MUNAGcc-(R,R'), where R and R' are the operand macro relations and CC" denotes a set of category attributes common to both R and R', is a combination of M A G and M U N operations. It works on two macro relations that have similar macro schemes, i.e., one or more common category attribute(s), and identical summary attributes. Note that this operation is useful when combining two macro relations that do not have identical category attributes. The result is a macro relation R" whose category attributes are those that are common to the operand relations; the domains of its NV attributes (P~ and P2) are the union of the domains of the NV attributes of the operand relations, and the values of the

Vol 34 No 6 June 1992

summary attributes are computed by the aggregation functions or procedures defined for the summary domains. MUNAGcc,(R,R') = {t"[CC',PP,SSl l t"[CC",PP] = t[CC",PP] or t'[CC",PP] = t'[CC",PP], and t"[SS] = ~, ticllortltR' eicc',eP] ~ t,~¢c',ee]

t,[SS], with tell, or t'eR', CC" c_ CC and CC" ~_ CC'} where R and R' are the operand macro relations with macro schemes (CC,PP,SS) and (CC',PP,SS), respectively, and R" is the resultant macro relation with scheme (CC",PP,SS) where R and R' are summary-compatible, CC" c_ CC and CC" ~_ CC'.

377