I~F~~~ZO~
SCIENCES 71,269-287 t 1993)
269
SQL Translation Using an Attribute Grammar MICHAEL
FRAME
and MEHDI OWRANG Department of~o~~~~terScience and In~omzutio~Syste~ns,The American University,4400 ~ussae~usett.~Avenue, NK ~ffs~in~ton, DC 20016
ABSTRACT SQL translation can be simplified if the target language is well defined. A welldefined target language allows the semantics of SQL to be specified in terms of an attribute grammar that maps SQL phrases to that target language. A translator that interprets the attribute grammar generates the target language as a by-product of parsing an SQL statement. The target language defined here is a variation of relational algebra notation. By augmenting SQL ~ammar with semantic rules in terms of relational algebra it is possible to simplify SQL translation. These concepts are illustrated with an attribute grammar for an SQL subset and examples that demonstrate the translation process.
1.
INTRODUCTION
The function of a database system is to maintain and retrieve data. These operations should be performed efficiently, but, at the same time, the user should be able to specify a request in the simplest form possible, This is a standard problem in computing. It is solved by providing a high-level interface to the user and low-level “instructions” to the system. A transiator converts the high-level specifications to the low-level instructions. SQL is an example of such a nonprocedural data language. This paper will demonstrate how SQL can be translated to an executable form that is a variation of relational algebra. Ceri and Gottlob [3] have addressed the problem of translating SQL to relational algebra for the purpose of optimization and to determine the equivalence of SQL statements. They note that translation to relational 8 Elsevier Science Publishing Co., Inc. 1993 655 Avenue of the Americas, New York, NY 10010
ooze-0255/93/$6.00
270
M. FRAME AND M. OWRANG
algebra is a way of defining the semantics of SQL. Their translation is a four-stage process and there are a number of aspects of SQL they do not address because standard relational algebra does not have the expressive power of SQL. The aspects of SQL they have not included are null values, arithmetic operators, control of duplicate elimination, data definition, and the ORDER BY clause. Previous work related to SQL translation has concentrated on transforming SQL queries from one form to another for the purpose of optimization or to improve the understanding of the semantics of SQL queries [2, 7, 8, 11, and 131. The algorithms provided for the transformation are ad hoc and are for specific features and specific cases of SQL statements. As Ganski [ll] has shown, some did not even produce accurate transformations. Dayal IS] unifies the results of a number of others [2,3, 7, 11, and 131 in extending relational algebra to be sufficiently expressive to define SQL. This includes control over duplicates, aggregate functions, and nested queries. The only significant feature not covered by Dayal would seem to be the ability to generate ordered (sorted) results. Dayal integrates translation with optimization and believes that it is necessary to translate SQL into a form that allows for the most flexible optimization strategies possible to be applied. A general model to describe the transformation of SQL statements in a unified form is needed. Recently there has been an increased interest in the use of attribute grammars 1151for query language translation [lo, 14, 213. This has motivated us to consider attribute grammars as a general model for transforming SQL queries in a uniform fashion, rather than via ad hoc algorithms. Kintala [14] discusses the use of attribute grammars for formal specification of systems performing structural translation because of the structural differences between the conceptual model and the underlying physical model. He describes attributed grammars for structural translations: (1) single-relational queries (SQL-like) on hierarchical databases, (2) nonprocedural hierarchical queries on hierarchical databases, and (3) relational queries on network databases. Attribute grammars are used to capture the intended semantics of the queries and perform the translation of the queries. Yellin and lWueckstein [21] use attribute grammars to define a formal mechanism for specifying translation between languages. In addition, given an attribute grammar specifying the translation from language L, to language L,,they describe the process for automatic generation of the inverse attribute grammar specifying the inverse translation from L, to L,. They relate their experience in defining an interface for a format
SQL TRANSLATION
271
database accessing language, SQL. In particular, they describe the use of the attribute grammar in the PERFORM (Paraphrase and ERror message for FORMal languages) system 1171to paraphrase SQL database queries in English. They also take a subset of the PERFORM attribute grammar (translating a subset of all SQL queries into an English paraphrase) and invert this subset. The inverted attribute grammar translates simple English queries (paraphrase) into SQL queries. This capability can serve SQL users as a feedback device to make sure their queries are semantically correct from their point of view and from the system’s point of view. It is also an aid for the novice user in learning SQL. Our own use of attribute grammars differs from Kintala in that we are not attempting a structural translation. Instead we are translating one relational language to another. Our use differs from Yellin and Mueckstein in that we are translating for the purpose of execution as a compiler would translate a higher level language to executable form. It would be straightforward to write a traditional translator from SQL to relational algebra. However, there are a number of reasons that an attribute grammar approach is preferred for the work described here. Firstly, an attribute grammar may actually be considered as a formal definition of the syntax and the semantics of the language that it describes. The current definition of SQL has a formalized syntactic definition, but it depends largely upon descriptive text for the semantic definition. One of our interests is to develop a formal definition for the semantics of data languages using SQL as an example. Secondly, data languages are constantly being enhanced, clarified, and improved. The system we are building will allow many types of changes to be made to the attribute grammar and to be immediately used and evaluated. Finally, it will be possible to use the same virtual database machine with a number of data languages. A relational database management system (DBMS) is being developed for use as a research and teaching tool. A major goal is to develop a system that is small and simple. This goal is being achieved by designing and implementing a virtual database machine (VDM) that has an “instruction set” that is a variation on relational algebra. We chose an algebra approach (versus relational calculus) because of its more operational nature. We present an approach to translation that is built upon the work of Ceri and Gottlob [31 and that can take advantage of the advances made in extending relational algebra as presented in 181.However, we differ from Ceri and Gottlob in that our translation is a single-stage process and it results in an extended relational algebra that can be input to a separate optimization stage where the techniques of 181 can be applied. We note that this structure is similar to that used in modern compilers. In addition,
272
M. FRAME AND M. OWRANG
the attribute-grammar-based SQL translation provides a framework for defining the semantics of SQL in a unified form rather than ad hoc, for defining a global data language in a heterogeneous distributed database, and for experimenting with different languages in a distributed system. We believe that, for the reasons just mentioned, basing the implementation of a relational database management system (RDBMS) on an attribute grammar-based translator and a relational algebra instruction set willlead to the development of RDBMSs that are not much more complex than a modern compiler. This paper describes the approach being taken in the implementation of an attribute-grammar-based translator for SQL. To simplify the explanation, the language used is a subset of SQL that can be translated to a slightly extended version of relational algebra. The remainder of the paper is organized as follows: Section 2 describes an SQL attribute grammar; Section 3 gives examples of SQL translation to relational algebra; Section 4 is an evaluation of the potential of attribute grammar-based translation; conclusions and planned future work are discussed in Section 5.
2.
SQL ATTRIBUTE
GRAMMAR
In recent years several authors have used attribute grammars for query language translation [lo, 14, and 211. Readers not familiar with attribute grammars should see [18], which contains an excellent introduction including a survey of their uses and an extensive bibliography. An attribute grammar is a context-free grammar with productions that are augmented by semantic rules that assign values to attributes. Attributes are associated with the nonterminals that appear in the productions. Attributes of the nonterminal that appears on the left-hand side of the production are called synthesized attributes. They get their values in a bottom-up fashion from the lower level nonterminals in the parse tree. Attributes of the nonterminals on the right-hand side of a production are called inherited attributes. They are defined in a top-down fashion in terms of attributes that were defined at a higher level in the parse tree. Semantic rules involve only those attributes that are associated with nonterminals in the corresponding production. That is, the attributes of a node in a parse tree are related either to the attributes of the parent node (inherited attributes), or the immediate child nodes (synthesized attributes). The actual structure of semantic rules is flexible. In our case, semantic rules are assignment statements involving expressions that are either set operations or list operations (almost always list concatenation.) A grammar augmented with semantic rules is needed to assign the
273
SQL TRANSLATION
desired “meaning” to SQL statements. In this section we give a description of a scaled-down attribute grammar for SQL. The “meaning” is in the form of a relational algebra expression corresponding to the original SQL statement. An algorithm has been developed that takes as input an SQL statement and an augmented grammar and generates as output the corresponding relational algebra expression. The algorithm is described in detail in [lo]. The SQL translation process works because we are able to map a general SQL SELECT statement of the form SELECT return-list FROM table-list [WHERE condition-l] [ORDER BY sort-list] [GROUP BY aggregation-list [HAVING condition-211 where square brackets enclose optional clauses, to an extended relational algebra expression of the form
SORT’$sort-list I
SEL$ condition-2
I AGG$
aggregation-list
I
SEL$ condition-l I TIMES$table-list
Note that conditions may include a predicate that corresponds to a nested-SELECT. The separate optimization phase removes nested selects and turns correlated subqueries into joins. Note that views are handled easily: a view has its own query tree representation that replaces the name of the view in the table-list. The primary purpose of our translation is to turn a non-procedural query into an internal form that can be procedurally interpreted and that can be optimized. The translator does not have to attempt any optimization if it can produce a program that can be easily optimized later. The translator does eliminate parts of the relational algebra expression that
274
M. FRAME AND M. OWRANG
pRoJ$“amesEL$sa, = (a) Standard PROJ$ (b) Relational
Relational
name SEL$ Algebra
(EMP))
20000
Algebra
Expression
= sal 20000
Expression
EJvlP
in Prefix Form
PROJ$ name SEL$ = sal 20000
E&P [c) Query Tree Representation
of Prefix Form
Fig. 1. Examples of relational algebra notation.
correspond to parts of an SQL statement that are optional and that are not present in the statement being translated. For example, if no GROUP BY clause is present, then the AGG$ and SEL$ (condition-2) operators will not be generated. In addition, if return-list corresponds to all columns of the table-list, then the PROJ$ can be eliminated. Finally, if table-list is a single table, then the TIMES$ operator is replaced by the single table name. The relational algebra expression that is generated is in prefix form. The expression is a list in the sense that the term is used in LISP. That is, it is an ordered set of elements and the elements may be atoms or other lists. An atom is typically an operator that acts upon a fixed number of the elements that follow it. For example, the SEL$ operator (corresponding to the select or restrict operator in relational algebra) expects two elements to follow: the first is the Boolean expression to be evaluated, the second is the relation upon which to operate. Each of these elements would probably be a list in its own right and might be quite complex. There will be no circulari~ or recursion in this list, so it actually represents a tree structure. This tree corresponds to what Ceri and Pelagatti [4] call a query tree. This term will be used in the remainder of the paper. Figure 1 provides an example of a standard relational algebra expression, the same expression in prefix form, and a query tree representation of the expression. Semantic rules are in the form of assignment statements. A semantic rule, V:=E, is interpreted as meaning that E is evaluated and the result
SQL TRANSLATION
275
becomes the value of V, where V is called an attribute. When the same nonterminal appears more than once in a syntactic production (e.g., A =AB), the semantic rules assume an implicit numbering for the different occurrences of the repeated nonterminals. So the semantic rule might refer to Al and A2. A 1 is the A on the left side of the equal sign, and A 2 is the A on the right side of the equal sign. These number suffixes are assigned strictly in left-to-right order in the equation. The major types of attributes used here are as follows: ERA-the Extended Relational Algebra interpretation of the nonterminal. It is a list. TEXT-a string of characters. This is a special case of a list. It is treated as a special case because of the possibility of a more efficient implementation. SYM-a set of ordered pairs of the form (table-name, table-alias). This set is used to determine the complete name of a field when it is not qualified by its table name. Some lists have a fixed format (for example, a list of table-name/tablealias pairs.) That is, each element of the list is an atom, or each entry is a list with a fixed number of elements (i.e., a list of n-tuples.) Such a list may be defined as having a name with the fixed elements referred to by their component numbers (positions). The following function applies to such lists: LOOKUP$(name, lookup-attribute, lookup-value, return-attribute)-this function will find a tuple in the named list where the lookup-attribute is equal to the lookup-value and will return the value of the returnattribute as a result. Formally, let T be a set of n-tuples {(a,, . . . , a,>}, LOOKUP$(T, j, U, k) = {Ukl(~i, . .., U,_ 1,U, Uj+ I,...,Uk, . ..,U,) E T) (this assumes, without loss of generality, j
276
M. FRAME AND M. OWRANG
operators (e.g., the selection operator SEL$) and relational algebra type designators (e.g., .EQ$, the comparison operator for equality, and TN$, the handle for a table-name/table-alias pair.) Table 1 provides a grammar for a simple subset of SQL. Despite its simplicity it supports nested queries, correlated subqueries, and aggregate function (GROUP BY/HAVING) queries. Only one value is returned as a result of a query. Not all of the semantic rules are explicitly specified in the attribute grammar. If there is no semantic rule defining the ERA attribute, then the ERA attribute is implicitly the concatenation of the ERA attributes of the nonterminals on the right side of the ‘ = ’ sign in the syntactic production. We will examine some productions in the attribute grammar for SQL and explain how they are interpreted. EXAMPLE 1: Production 1 (from Table 1). 1. statement = complex-select complex-select.INSYM := { 1 This syntactic rule states that an SQL statement may be a complex-select. The INSYM attribute of the complex-select nonterminal is inherited and is set to the empty set. Implicitly, the interpretation (ERA attribute) of the statement is the same as the interpretation of the complex-select. EXAMPLE 2: Production 4 (from Table 1). 8. simple-select = select-clause ‘from’ from-list simple-select.SYM := simple-select.INSYM U from-list.SYM select-clause.SYM := simple-select.SYM simple-select.SSC := select-clause.ERA simple-select.SFL := from-list.ERA simple-select.ERA := ‘PROJ$’ select-clause.ERA from-list.ERA This syntactic rule states that a simple-select statement in SQL is in the form of a select-clause followed by the word “from” followed by a list of table names (the from-list). The semantic rules state that the interpretation of a simple-select statement is a list composed of an atom, the project operator (PROJ$), followed by the interpretation of the select-clause, followed by the interpretation of the from-list. Notice that this semantic rule takes an SQL statement and converts it into a relational algebra expression in prefix form. In addition, the attributes SSC and SFL are defined. They exist so that conditional-select can build a correctly formed extended relational algebra expression. Finally, the symbols (SYM) known to simple-select are those synthesized from from-list and those known to select-clause are inherited from simple-select.
277
SQL TRANSLATION
TABLE 1 SQL Attribute
Grammar
1. statement = complex-select complex-select.insym := f } 2. complex-select = conditional-select complex-selectsym := conditional.selec&.sym conditional-select.insym := complex-select.insym complex-select.sfnlist := conditional-select.sfnlist 3. complex-select = partitioning-select complex-select.sym := partitioning-selectsym partitioning-select.insym := complex-selectinsym complex-select.sfnlist := partitioning-select.sfnlist complex-select.era := ‘agg$’ partitioning-select.fldspec partitioning-selectsfnlist partitioning-select.era 4. complex.select = partitioning-select ‘having’ bexpr complex-seiect.sym := partitioning-select.sym partitioning-select.insym := complex-selectinsym bexpt.sym := complex-setect.sym complex-select.era := ‘sel$’ bexpr.era ‘agg$’ partitioning-select.fldspec partitioning-seiect.sfnlist bexprsfnlist partitioning-seleckera complex-select.sfnlist := partitioning-select.sfnlist bexprsfnlist 5. partitioning-select = conditional-select ‘group’ ‘by’ field-spec partitioning-select.sym := conditional-select.sym conditional-select.insym := partitioning-selectinsym field-spec.sym := conditional-select.sym partitioning-select.fldspec := field-spec.era partitioning-seiect.era := conditional-select.era partitioning-select.sfniist := conditional~sele~t.sfnlist 6. condittonat-select = simple-select conditional-sele~t.sym := simple-select.sym simple-seiect.insym := conditional-select.insym conditional-select.ssc := simple-selectsse conditional-seiect.sfl := simple-selectsfl conditional-select.sfnIist .= “ 7. conditional-select = simple-select ‘where’ bexpr conditional-select.sym := simple-selectsym simple-selectinsym := conditional-select.insym bexpr.sym := conditional-select.sym conditional-select.era := ‘pjS’ simple-select.ssc ‘se@’ bexpr.era conditional-select.ssc := simple-select.ssc conditional-select.sfl := simple-selectsfl conditional-~lect.sfolist := bexprsfnlist
simple_select.sfl
278
M. FRAME AND M. OWRANG TAESLE I(Continue& SQL Attribute
Grammar
8. simple-seleet = select-clause ‘from’ from-list simple-select.sym := simple-select.insym U from-list.sym select-clause.sym := simple-select.sym simple-select.ssc := select-clause.era simple-select.sfl := from-list.era simple-select.era := ‘pj.T select-clause.era from-list.era 9. select-clause = ‘select’ r&am-list return-listsym := select-clause.insym 10. return-list = expr expr.sym := return-list.sym 11. expr = field-spec field-spec.sym := expr.sym expr.sfnlist := “ 12. expr = set-function set - function.sym := expr.sym expr.sfnlist := set-function.era 13. expr = constant 14. from-list = table-name from-list.era := ‘tn$’ table-name.era table-name.era from-listsym := {(table-name.era,table-name.era)} 15. bexpr = pred pred.sym := bexprsym bexpr.sfniist := pred.sfnlist 16. pred = expr comp-op expr pred.era := camp-op.era exprl.era exprZ.era exprl.sym := pred.sym
expr2.sym := pred.sym predsfnlist := exprlsfnlist exprZ.sfnlist 17. nested-select = complex-select complex-select.insym := nested-selectsym nested.select.sfnlist := complex-select.sfnlist 18. field-spec = table-name ‘.’ field-name field-spec.era := ‘ar$’ table-name.era field-name.era 19. field-spee = tield-name field-spec.era := ‘a& looku~field-spec.sym,l, lookup(relscheme$,2,field-name.era.l),2) field.name.era 20. eomp-op = ‘> ’ camp-op.era := ‘gt$ 21. camp-op = ‘ < > ’ camp-op.era := ‘ne$ 22. set-function = ‘sum’ ‘f’ field-spec ‘1’ set-function.era := ‘sfn$’ ‘sum$’ field-spec.era 23. constan( = :number: constant.era := ‘nm$’ number.text 24. table-name = :identifier: table-name.era := identifier.text 25. field-name = :ideatifier: field-name.era := identifier.text
SQL T~S~TIQN
279
EXAMPLE 3: Production 14 (from Table 1). 14. from-list = table-name from-list.ERA := ‘TN$’ table-name.ERA table-name.ERA f~m-1ist.S~ := This syntactic rule states that a from-list may simply be a table name. The corresponding semantic rules provide that the interpretation of the from-list is a table name operator (TN$) that takes two operands, the inte~retation of the table-name followed by the interpretation of the table-name. The reason for repeating is found in the explanation of the next rule. EXAMPLE 4: Production 19 (from Table 1). 19. field-spec = field-name meld-s~.E~ := ‘AI@’ lookup(fieid-spec.SYM,l, lookup(RELSCHEME$,Z,field-name,ERA,ti,2) field-name.ERA This syntactic production states that a field specification is simply a field name. The semantic rule is as follows: the inte~retation of field-spec is an attribute reference operator (AR$) that takes two operands, the name of the relation of which the field is a component and the field name. This production allows tracking of column names that are not fully qualified (i.e., of the form table-alias.column name). RELSCHEME$ is a list of binary relations maintained in the system’s data dictionary that associates a relation name with the names of each of its columns. For example, RELSCHEME$ for the scheme EMP (EMP#, Name, Dept#, Sal) DEPT (Dept#, Name, Lot, Mgr) would be {(EMP,EMP#),(EMP,NAME),(EMP,DEPT#),(EMP,SALJ, (DEPT,DEPT#),(DEPT,NAME),(DEPT,LOC),(DEpT,MGRl), 3.
SQL TRANSLATION
Now that relational algebra and a method for converting SQL to relational algebra have been developed, it is possible to show an example of the conversion of an SQL statement to relational algebra. Consider the
280
M. FRAME AND M. OWRANG
following statement previous section.
based on the EMP/DEPT
scheme
given in the
SELECT NAME FROM DEPT WHERE DEPT# IN ( SELECT DEPT# FROM EMP WHERE DEPT.M < > EMP# GROUP BY DEPT# HAVING SUM64L) > 2000 )
This statement includes a nested subquery with a correlated variable from the outer query (DEPT.MGR), and the inner query includes aggregation. Productions 1, 2, and 7 are the higher level syntactic productions used to recognize this statement. We will concentrate on the translation portion of the semantics, ignoring for the time being the tracking of symbols. Productions 1 and 2 have implicit ERA semantic rules so the ERA attribute value from production 7 is really the ERA attribute value for the statement nonterminal. The ERA attribute for production 7 is PROJ$ simple-select.SSC SEL$ bexpr.ERA simple-select.SFL That is, the projection operator followed by the simple-select.SSC attribute value (this is just the interpretation of the return-list), followed by the select operator (SEL$), followed by the interpretation of the Boolean expression of the WHERE clause, followed by the interpretation of the FROM clause list of tables. Simple-select.SSC is determined by following the semantic rules for productions 8, 9, 10, 11, 19, and 25. This yields AR$ DEPT NAME the attribute reference handle (AR$) followed by the table alias and the column name. Simple-select.SFL is determined by following the semantic rules for productions 8, 14, and 24. This yields TN$ DEPT DEPT
281
SQL TRANSLATION
the table name handle (TN$) followed by the table name followed by the table alias (in SQL a FROM list may include a table name followed by a name that can be used in place of the table name, here called the table alias.) The remaining part of the semantic rule for conditional-select.ERA is the development of the value for the bexpr.ERA attribute. This is the most interesting and complex part of the translation. Bexpr.ERA is derived by following the semantic rules for production 15. This states that the interpretation of pred is IN$ field-spec.ERA
nested-select .ERA
where IN$ is the ‘in’ operator for determining set membership, spec.ERA is derived through productions 19 and 25 and yields
field-
AR$ DEPT DEPT# and nested-select.ERA is determined by parsing the nested SQL select statement. Productions 17, 4, 5, and 7 provide the high-level breakdown of the nested select statement in the example. Analogously to the outer select, the interpretation of conditional-select in production 7 is PROJ$ AR$ EMP DEPT# SEL$ NE$ AR$ DEPT MGR AR$ EMP EMP#
The semantic select.ERA is
rules for production
5 indicate that the partitioning-
AGG$ field-spec.ERA conditional-select.SFNLIST conditional-select.ERA This is a straightforward transformation except for the use of the operator AGG$ and the attribute ‘conditional-select.SFNLIST’. AGG$ is an extended relational algebra operator that has the form AGG$ group-by-list aggregate-values table
282
M. FRAME AND M. OWRANG
Its function is to group the rows of ‘table’ that share values in the ‘group-by-list’ and to compute ‘aggregate-values’ for each group. It implements an operation that is comparable to the GROUP BY clause in SQL. The SFNLIST attribute is used in a number of productions. Its purpose is to keep track of all the aggregate functions that have been referenced in a query. This is necessary because the aggregate function references may appear in the HAVING clause as well as in the return-list. All these productions taken together yield for the nested-select the interpretation:
PROJ$ AR$ EMP DEPT# SEL$ GT$ SFN$ SUM$ AR$ EMP SAL NM$5000 AGG$ AR$ EMP DEPT# SFN$ SUM$ AR$ EMP SAL SEL$ NE$ AR$ DEPT MGR AR$ EMP EMP# TN$ EMP EMP
Finally, combining the interpretation for the outer select with that of the nested selected yields the complete interpretation, the value of attribute statement.E~ PROJ$ AR $ DEPT NAME SEL$ INS$ AR$ DEPT DEPT# PROJ$ AR$ EMP DEPT# SELS GT$ SFN$ SUM$ AR$ EMP SAL NM$5000 AGG$ AR$ EMP DEPT# SFN$ SUM$ AR$ EMP SAL SEL$ NE$ AR$ DEPT MGR AR$ EMP EMP# TN$ EMP EMP TN$ DEPT DEPT
Figures 2 and 3 show the parse tree for the example SQL statement and the grammar of Table 1. The interested reader can determine how attribute values are passed around by associating with each node of the parse tree the attributes of the production that corresponds with the node. In addition to the attributes used to produce the interpretation, there are attributes related to symbols used in the query. There are two types of these symbol attributes: inherited and synthesized. The inherited symbol attributes are passed in from higher level productions. This allows us to keep track of symbol names known in an outer select in case one of these variables needs to be used to correlate rows from the outer select with rows from the inner select. In fact, this occurs in the example with DEPT.MGR.
283
SQL TRANSLATION statement
I
complex-statement conditional-statement I
I ‘where’
simple-select
I
I
‘from’
select-clause
I ‘select’
bexpr
I
prkd
from-list
I
I
return-list
table-name
I
I
I
:identifier:
expr
I
I
I
‘(’ nested-select
I ‘)’
I
field-name
(1
I field-spec
I
field-spec’in,
(see Figure 5)
I :identifler:
[DEPT#]
I field-n:lme
I :identifier:
Fig. 2.
Nested-select
example-parse
tree.
The s ‘rthesized symbol attributes are used to keep track of tables referenced in a query so that we can determine that column references are valid and can generate complete AR$ expressions when a column is specified simply as ‘name’ rather than ‘table-alias.name’. Note that many types of information can be extracted from the parsing of a statement using an attribute grammar. In addition to the interpretat’on and symbols, we could check for semantic errors and generate error :odes and possibly even do a certain amount of error correction. An mtput that could be useful for our SQL application is to output the type of graph structure used in [81 to optimize relational algebra expressions. At this time, however, we feel that it should be possible to pass the resulting relational algebra expression through an optimizer analogous to the type used in compilers to generate an optimal result. Such an optimizer has been developed and is currently being evaluated [12]. 4.
EVALUATION
This paper demonstrates that an attribute grammar-based system can be used to translate a high-level language, a subset of SQL, to a relational algebra expression that is encoded so that it can be interpreted as the
M. FRAME AND M. OWRANG
284
nested-select complex-select
It
I
’
I
‘select’
I
t
:identifier: IDEPT#I
‘from’ from-list
2
return-list
I
b xpr field-name
‘where’
select-clause
field-spec
I
I
simple-select
b&or
7 ‘group’ ‘t!y’
I
I
I
‘havkg’
I
t conditional-select
I
I
partitioning-select
I table-name
pred
I
I
:identifier:
expr
I
expr Bpr
I
i’
1
field-spec
set-function
I field-name
I
I
I
I
‘sum”(’ field-spec
‘>’
I
‘)’
I
:ideniifier: (DEPT#]
I constant
I :number: 15ooo1
field-name
I
:identifier:
pred
I
I expr
camp-op
I I
‘0’
I
field-spec
1 table-name
I
Fig. 3.
“
I ex’pr
I field-spec
1 field-name
I
I field-name
Nested-select example-parse
instruction set of a VDM. This is expected advantages:
i
tree-s&query.
to lead to the following
A. Simplifying the engineering of RDBMS’s A DBMS is being constructed in three major parts: a translator, optimizer, and VDM. At present, each of these parts is about 3,000 lines of C code.
SQL TRANSLATION
285
This structure permits the use of a translator whose sole function is to convert nonprocedural statements to executable form with no regard for the efficiency of the generated code. Code efficiency is strictly the responsibility of the rule-based optimizer [12]. B. Describing the semantics of SQL When a high-level language is procedural it is relatively easy for the user to know the meaning of a specification written in the language. However, when the language is nonprocedural and supports a complex capability, the user may have difficulty in specifying certain requests or in understanding certain specifications. Such is the case with SQL. It is a nonprocedural data language used for expressing operations to be performed on a relational database. SQL operations are usually explained by resorting to a conceptual procedural description or by an informal mapping of an SQL statement into relational algebra [5, 201. Progress has been made on the formal definition of the syntax of SQL statements 111.However, a formal definition of the semantics of SQL is yet to be developed. SQL queries are often described by an algorithm that is intended as a purely conceptual explanation of how the SELECT statement is evaluated. Such an explanation serves as a definition of the semantics of the SELECT statement and can be taken as a definition of what the result should be [6]. An attribute grammar may be considered as a formal definition of the syntax and the semantics of the language that it describes. One of our interests is to develop a formal definition for the semantics of data languages using SQL as an example. This can be achieved if we have a formal mapping from SQL to a well-defined structure such as relational algebra. C. Defining global data language in heterogeneous database systems Since relational algebra is a language in its own right, it will be possible to implement an attribute grammar for relational algebra that can translate relational algebra to any data language that has sufficient expressive power. This would permit a system to be developed that can interconnect heterogeneous database systems. A statement could be entered in a “global data language,” translated to relational algebra, and decomposed according to which database systems are involved. Then each component could be translated on the basis of the relational algebra attribute grammar that is defined in terms of the corresponding DBMS’s data language. This approach would allow support for a new data language simply by adding a new attribute grammar for relational algebra in terms of that language. A number of systems (Multibase [161, Serius-Delta [9], and Mermaid [19]) have accomplished this interconnection; however, each has resorted to “hand-coded” translators to support a new query language.
286
M. FRAME AND M. OWRANG
D. Permitting experiments with different languages in a distributed system The use of attribute grammars should provide other benefits to the engineering of distributed database systems. Using the approach described here, it will be possible to use the same database machine with a number of data languages. As long as an attribute grammar can be specified for the language, and the grammar defines the language in terms of the instruction set, the new data language can replace the old data language. In fact, different users could conceivably use the language with which they are most comfortable. For our purposes, the approach will allow us to experiment with different data languages in an environment that is otherwise unchanged. 5.
CONCLUSION
This paper has demonstrated that attribute grammars can be used to define the translation of SQL to an executable form. A simple SQL grammar was defined and a fairly complex nested SQL query was mapped to its corresponding relational algebra form. The approach described in this paper must be extended to be of practical use in specifying SQL semantics and in implementing a fully functional RDBMS. The same issues must be resolved to solve both of these problems: relational algebra must be replaced with an extended form that supports sorted data, aggregation, duplicate rows, data manipulation operations, and data definition and data control statements. However, since a great deal of research is based upon relational algebra and since it is a well-understood way of describing database operations, the replacement should be built upon relational algebra. The authors wish to thank the referees for their comments. better as a result of their many helpful suggestions.
This article is significantly
REFERENCES 1. ANSI X3H2, American National Standard Database Language SQL, Feb. 198.5. 2. G. V. Bultzingsloewen, Translating and optimizing SQL queries having aggregates, in Proceedings of the 13th VLDB Conference, 1987, pp. 235-243. 3. S. Ceri and G. Gottlob, Translating SQL into relational algebra: optimization, semantics, and equivalence of SQL queries, IEEE Trans. Software Eng. SE-11(4), 324-345. (April 1985). 4. S. Ceri and Pelagatti, G. Distributed Databases Principles and Systems, McGraw-Hill, Inc., 1984.
SQL TRANSLATION
287
5. C. J. Date, An Introduction to Da&base Systems, 4th ed., Addison-Wesley, 1985. 6. C. J. Date, Relational Database: Selected Writings, Addison-Wesley, 1986. 7. U. Dayal, N. Goodman, and R. Katz, An extended relational algebra with control over duplicates, in Proceedings of the ACM PODS, 1982. 8. U. Dayal, Of nests and trees: a unified approach to processing queries that contain nested subqueries, aggregates, and quantifiers, in Proceedings of the 13th VLDB Conference, 1987, pp. 197-208. 9. C. Esculier, The SERIUS-DELTA architecture: a framework for co-operating database systems, Comput. Networks 8:43-48 (1984). 10. M. Frame, Automatic Translation of Query Languages in Heterogeneous Database Systems, Ph.D. thesis, Department of Electrical Engineering and Computer Science, George Washington University, Washington, D.C., March 1987. 11. R. Ganski and H. K. T. Wong, Optimization of Nested SQL Queries Revisited, in Proceedings of ACM 1987 SIGMOD Conference, pp. 23-33. 12. Bhalachandra Ghatate, A rule-based query optimizer, Master’s thesis, Department of Computer Science and Information Systems, American University, Washington, D.C., August 1990. 13. W. Kim, On optimizing an SQL-like nested query, ACM Trans. Database Syst. 7(3):443-469. 14. C. M. R. Kintala, Attributed grammars for query language translation (extended abstract), in Proceedings of the ACM 1983 SIGMOD Conference, pp. 137-148. 15. D. E. Knuth, Semantics of context-free languages, Math. Syst. Theory 2:(1968X
Corrections in 5( 1). 16. T. Landers and R. L. Rosenberg, An overview of MULTIBASE, Distributed Databases (H. J. Schneider, Ed.), North-Holland, 1982. 17. E. M. Mueckstein, A-TRANS: query translation into English, in Proceedings of the 8th International Joint Conference on Artificial Intelligence, August 1983, pp. 660-662. 18. T. W. Reps, Generating Language-Based Endronments, The MIT Press, 1984. 19. M. Templeton et al., An overview of the Mermaid System-a front-end to heterogeneous databases, Proceedings of the IEEE EASCON, Washington, D.C., September 1983. 20. J. D. Ullman, Principles of Database Systems. 2nd ed., Computer Science Press, Rockville, Maryland, 1982. 21. D. M. Yellin and E. M. Mueckstein, The automatic inversion of attribute grammars, IEEE Trans. Software Eng. SE-12(5):5YO-599 (May 1986). Receked
16 October 1989; rellised 11 January 1991