A UNIFORM INDEXING SCHEME FOR OBJECT-ORIENTED DATABASES†

Information Systems Vol. 22, No. 4, pp. 199-221, 1997. Pergamon. PII: S0306-4379(97)000124. © 1997 Elsevier Science Ltd. All rights reserved. Printed in Great Britain.



EHUD GUDES

Department of Mathematics and Computer Science, Ben-Gurion University, Beer-Sheva, Israel

(Received 23 July 1995; in final revised form 3 April 1997)

Abstract - Object-oriented databases (OODB) have received considerable attention in recent years. Their performance is a critical factor hindering their current use. Several indexing schemes have been proposed in the literature for enhancing OODB performance, and they are briefly reviewed here. In this paper a new and uniform indexing scheme is proposed. This scheme is based on a single B-tree and combines both the hierarchical and nested indexing schemes. The uniformity of this scheme enables compact and optimized code for dealing with a large range of queries on the one hand, and flexibility in adding and removing indexed paths on the other hand. The performance of the scheme is discussed and an extensive experimental analysis for the class-hierarchy case is presented. The results show the advantages of the scheme for small-range, clustered-set queries. © 1997 Elsevier Science Ltd

Keywords: Object-Oriented Databases, Indexing, Performance, B-Trees

1. INTRODUCTION

Object-oriented databases (OODB) have received considerable attention in recent years. One problem cited by practitioners using OODBs in industry is their performance. Indexing is a common technique to enhance performance. Two types of indexes special to OODBs were described by Bertino and Kim [1, 7, 9]: the Class-Hierarchy index and the Nested or Path index. The class-hierarchy index provides indexing on some attribute value for a class and its sub-classes along the "is-a" (sub/super type, generalization) hierarchy. The original class-hierarchy indexing scheme does not support multi-class queries well, since it clusters the index along the key dimension (key grouping). A modification of the class-hierarchy index called H-trees was described by Low et al. [8]. This scheme overcomes the drawback of the original scheme in that it clusters multiple keys per set, but the resulting structure is a complex chain of B-trees. Nested and path indexes provide indexing on some attribute value of a class to classes higher above, along the class-composition hierarchy (object-referencing hierarchy).
A nested index provides access to the top class only, while a path index provides access to classes along the path hierarchy. Path indexes can also be used to answer queries with predicates on in-path nodes. Such queries, however, may require the search of many index pages. Path indexes were described and analyzed in [1]. A variation of the path indexing method which can be used to support joins in relational databases is called Access support relations and is discussed in [5].

In this paper a uniform scheme combining both forms of indexing is suggested. This scheme, called the U-index, supports the class hierarchy with the advantage of directly accessing sub-classes (like H-trees), while retaining the good single-key retrieval performance of CH-trees. The same scheme provides path indexing with better retrieval performance than the original scheme of Kim. The scheme also provides, in one uniform index, a combined class-hierarchy/path index, and with that it is able to answer queries which are not answerable with the previous index schemes.

Recently, two index structures have appeared which have similar goals to the U-index structure. The first structure, called CG-trees [6], addresses only the class-hierarchy case, but it provides generally good performance for both key grouping and set grouping (see definitions below), thus providing a compromise between the CH-index and H-trees. A full comparison of the performance of the U-index scheme vs. CG-trees is presented in this paper. The second structure, called the Nested-inherited index (NIX) [3], is oriented to support both the nested attribute index and the class-hierarchy index, and can also answer complex queries on both. A qualitative comparison of the U-index scheme with this structure is also provided in this paper.

The rest of the paper is organized as follows. Section 2 briefly reviews the well-known indexing schemes for OODB systems, with special attention given to the CG-tree and Nested-inherited index structures. Section 3 describes our U-index scheme. Section 4 discusses some of the issues raised by this scheme, analyzes its performance qualitatively, and compares it with the NIX structure. Section 5 presents an extensive experimental comparison and evaluation of the U-index vs. the CG-tree index for the class-hierarchy case. Section 6 is the summary.

†Recommended by Patrick O'Neil

2. BACKGROUND AND RELATED WORK

Figure 1 shows a typical OODB schema based on a similar figure in [7]. To explain this figure and the main concepts, we use the terminology of [7] and a modified version of the schema presented there. The main concepts are:

● Classes and attributes - self-explanatory.

● Class hierarchy - a class may be related to other classes by an "is-a" or "generalization" relationship. This creates the super-class/sub-class relationships. We will denote them as SUP and SUB respectively. In a figure of a schema a solid arrow will always point from a super-class to a sub-class. Vehicle is an example of a class which is the root of a class hierarchy, and Vehicle is related to Automobile by a SUP relationship.

● Class composition hierarchy - a class may reference another class. This reference is made by an attribute having as its value(s) the object-id(s) of objects of another class. If a reference attribute contains only a single object-id value, then the relationship it represents is an m:1 relationship in the Entity/Relationship model [11]. If a single attribute contains a set of object-ids as values, it is called a multi-value attribute, and the relationship it represents may be m:1 or n:m, depending on whether two different attributes reference the same object. In the next section we denote an m:1 reference relationship as the REF relationship (see Section 4 for a discussion of multi-value attributes and m:n relationships). In a figure of a schema, an arrow representing a REF relationship will point from the "many" to the "one" direction. The database schema shown in Figure 1 contains several REF relationships: Vehicle is connected to Company by the REF relationship "manufactured-by", Company is connected to Employee by REF "president-of", Division is connected to Company by REF "belong", and Division is connected to City by REF "located-in". The sequence of classes Vehicle, Company, Employee represents a path in the database. This path is usually denoted in queries by syntax such as Vehicle.Company.Employee.Age, and its meaning is "the age of the employee who is the president of the company which manufactures the vehicle".

● Class-hierarchy Index (CH-tree) - a CH-tree indexes, on an attribute value of a root class, all objects of the class and its sub-classes which have this attribute value. For example, a class-hierarchy index may be built on the attribute Color of Vehicle. The CH-tree will associate with every attribute-value pair, e.g. (Color, Red), a list of the object-ids of each of the objects within the class hierarchy rooted at Vehicle with that value. For example, the following list:

(Color, Red) (Vehicle, {Oid1, Oid3, ...}, Automobile, {Oid4, Oid5, ...}, ...)

The index is built as a B-tree, and the leaf nodes contain a set directory which is used to keep track of set membership. The CH-tree is a key-grouping scheme, since it attempts to store all entries with the same key in one leaf page. A range query restricted to a few classes therefore scans pages which hold entries for classes that are not relevant to the query, so more pages than necessary are read. See [7, 9] for a detailed structure of this index.

†Note that in Figure 1, the m:n relationship Company-City is represented as two REF relationships.


● H-tree - the H-tree maintains a separate B+-tree for every set (class). These B+-trees are nested according to the set hierarchy by maintaining link pointers between the trees. The H-tree groups all members of a single set at the leaf page level according to their key values. This implies that retrieval costs are directly proportional to the number of sets queried. In general, key grouping is better for unique-match retrieval, while set grouping is better for partial-match retrieval and for range queries.

● Nested Attribute and Path Index - given a path such as A.B.C.Attr, where A, B, C are classes and Attr is an attribute, one can associate the objects of class A which are related by the path to the values of Attr. A Nested Attribute index only associates the objects of A, while a Path index associates the corresponding objects of the entire path (i.e. of classes B and C as well). An example is the path Vehicle/Company/Employee. Assume that we want to index on the attribute Age of Employee. A nested index on Vehicle can provide access to all vehicles manufactured by companies whose President's age is above 50 (this example appeared in [7]). If one is also interested in the companies and presidents along this path, one constructs a path index. Again, the index is built as a B-tree, with leaf nodes containing entries of the form:

(Age, 50), (List_of_Vehicle_oids, Path1, Path2, ...)

where Path_i is a pointer to a node containing entries of the form:

(Company-id1, Employee-id1, Company-id2, Employee-id2, ...)

Note that since pointers are used in the leaf nodes, predicates on internal nodes of the path (e.g. Company) require searching other nodes and potentially other index pages. See [1] for more details on this structure.

Fig. 1: An Object-Oriented Database Schema

Recently, two important index structures have appeared. The CG-tree index by Kilger and Moerkotte [6] provides a compromise between the key grouping of CH-trees and the set grouping of H-trees. It achieves this by having a set directory like CH-trees, while at the leaf level it provides two features which help set grouping: link pointers between leaf pages of the same set, and sharing of multiple key entries in one leaf page. The result is quite a complex structure with quite complex update and retrieval algorithms. However, the performance of CG-trees as demonstrated in [6] is good in most cases. The only case where CH-trees perform better is exact-match queries; even there the CG-tree is quite close to CH-trees, while the H-tree is much worse. For range queries and a small number of sets, the CG-tree is much better than CH-trees and very close to H-trees, and it is not much worse than H-trees for a large number of sets. Therefore, the CG-tree is a reasonably good structure for most queries except for exact-match queries. A full comparison of our scheme with CG-trees is presented in Section 5.

The Nested Inherited index (NIX) was suggested by Bertino and Foscoli [3]. This index structure associates with an attribute-value pair all the object instances belonging to a class or a sub-class along a path leading to that attribute. For example, for the schema in Figure 1 and the attribute value (Age, 50), it will index all vehicles (and automobiles, trucks, etc.) and all companies (and auto or truck companies, etc.) whose president's age is 50. The structure of the index is similar to the CH-index, in that leaf entries have a directory structure with pointers to all classes along the path, and for each class the entry contains the relevant object identifiers (OIDs); i.e. this is a key-grouping scheme. Also, for the case of multi-value attributes there is an entry for each such value in the corresponding class entry. In addition, there are auxiliary (B+-tree) structures for each class, which point from an object to its parents along the path. There are pointers from the primary structure to the auxiliary structure and vice versa. This auxiliary structure is used to speed up the update process. The main advantage of the NIX structure is that it can answer queries which contain both path and class-hierarchy components. As we will see below, the U-index scheme has the same goals. A qualitative comparison of the U-index with the NIX structure will be presented in Section 4.
(Two other variations of the NIX structure were discussed in [3], but since they usually perform worse than NIX, they will not be discussed here.) The fact that, except for the NIX structure, two separate indexing schemes need to be supported for OODBs, one for class-hierarchy and one for path indexes, makes their implementation difficult and the physical database design decisions cumbersome. Furthermore, the large differences in the performance of the various class-hierarchy schemes for different types of queries (e.g. exact vs. range) may cause difficulties for the database administrator. The uniform indexing scheme presented below attempts to overcome these difficulties with reasonable performance.

3. THE UNIFORM INDEX

In this section the uniform indexing scheme called the U-index will be presented. A basic element of this scheme is the encoding of class names in some type of lexicographic order. As an example, take the schema in Figure 1 and assume that each of the classes in the schema is encoded uniquely by the following naming scheme using the relation COD. This relation maps class names to codes in such a way that the lexicographic order of the codes matches the natural depth-first structure of the schema graph.

Vehicle              COD  C5
Division             COD  C4
City                 COD  C3
Company              COD  C2
Employee             COD  C1
Automobile           COD  C5A
Truck                COD  C5B
CompactAutomobile    COD  C5AA
AutoCompany          COD  C2A
TruckCompany         COD  C2B
JapaneseAutoCompany  COD  C2AA

The meaning of the above naming scheme will be explained later. Thus, the schema in Figure 1 can be represented as follows:

C2 REF C1
C4 REF C2
C4 REF C3
C5 REF C2
C5 SUP C5A
C5 SUP C5B
C5A SUP C5AA
C2 SUP C2A
C2 SUP C2B
C2A SUP C2AA

The graph representing the schema using the encoded names is shown in Figure 2. This graph has two important properties:

● It is acyclic.

● Its naming corresponds to some topological sort of it.

Fig. 2: Encoded Database Schema

Most practical schemas can be represented by such an acyclic graph, and can therefore be encoded in such a scheme. For the class-hierarchy case, even the existence of multiple inheritance usually leaves the graph acyclic. For the class-composition hierarchy there may be cycles, e.g. two classes connected by two opposite REF relationships. Such cycles can be removed, as will be discussed in Section 4. The lexicographic sorting property is central to our scheme. For example, scanning all classes beginning with C2 up to (not including) C3 yields exactly the class hierarchy of C2 in pre-order sequence. As will be explained below, this enables clustering and isolation of index entries for any complete sub-tree.
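As an illustration of the sorting property, the following Python sketch (not part of the original scheme description) retrieves a complete class-hierarchy sub-tree as one contiguous run of the sorted codes. The function name `subtree_codes` is hypothetical, and the prefix test assumes, as in the example schema, that no root code is a prefix of another root code.

```python
from bisect import bisect_left

# Class codes copied from the COD relation of Section 3.
COD = {
    "Employee": "C1", "Company": "C2", "City": "C3", "Division": "C4",
    "Vehicle": "C5", "Automobile": "C5A", "Truck": "C5B",
    "CompactAutomobile": "C5AA", "AutoCompany": "C2A",
    "TruckCompany": "C2B", "JapaneseAutoCompany": "C2AA",
}


def subtree_codes(sorted_codes, root_code):
    """Return the codes of the class hierarchy rooted at root_code.

    Because the codes sort lexicographically, the sub-tree is the
    contiguous run of codes that start with root_code (assumption: no
    root code is a prefix of another root code, e.g. no "C1" and "C10").
    """
    start = bisect_left(sorted_codes, root_code)
    end = start
    while end < len(sorted_codes) and sorted_codes[end].startswith(root_code):
        end += 1
    return sorted_codes[start:end]


codes = sorted(COD.values())
company_tree = subtree_codes(codes, COD["Company"])
vehicle_tree = subtree_codes(codes, COD["Vehicle"])
print(company_tree)
print(vehicle_tree)
```

The scan for Company stops at C3, which is the "up to (not including) C3" boundary mentioned above.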


3.1. Encoding Paths

As was explained in Section 2, there are two types of indexes. A class-hierarchy index is built on a sub-tree of the class hierarchy. A path index is built on a single path of the class-composition hierarchy. Now, the only meaningful path indexes are those along the REF relationships. That is, an index is built on an attribute of a class which is at the end of some REF path, i.e. at the "one" end of an m:1 relationship. It does not make sense to build the index the other way around, since that information is already available through a select operation and the referencing attribute. For example, it makes sense to build an index on the attribute Name of Company for the vehicles which are manufactured by that company, but not to build an index from the attribute Manufactured-by in Vehicle to the company manufacturing it, since that is exactly what the referencing attribute provides. Looking at Figure 2 we therefore see the following properties:

● Because path indexes are along REF edges, the order of class names in such a path is sorted lexicographically.

● If one traverses a class-hierarchy sub-tree in pre-order, one also gets the class names sorted lexicographically.

● One can construct a combined path/class-hierarchy index and still get all class names sorted lexicographically.

For example, the following indexes are represented by their respective sets of class names:

● The index Employee->Company->Vehicle is represented by the set: C1, C2, C5.

● The index on the class hierarchy of Company is represented by the set: C2, C2A, C2AA, C2B.

● The index on an attribute of Employee on the entire sub-tree rooted in Vehicle is represented by the set: C1, C2, C2A, C2AA, C2B, C5, C5A, C5AA, C5B.
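The three class sets listed above can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the helper names `hierarchy` and `combined_index_classes` are hypothetical, and the SUP edges are listed in the order of the encoded schema of Section 3 so that the traversal is a pre-order one.

```python
# Encoded schema edges, copied from the REF/SUP representation in Section 3.
REF = [("C2", "C1"), ("C4", "C2"), ("C4", "C3"), ("C5", "C2")]
SUP = [("C5", "C5A"), ("C5", "C5B"), ("C5A", "C5AA"),
       ("C2", "C2A"), ("C2", "C2B"), ("C2A", "C2AA")]


def hierarchy(root):
    """Pre-order list of root and all of its sub-classes along SUP edges."""
    result = [root]
    for sup, sub in SUP:
        if sup == root:
            result.extend(hierarchy(sub))
    return result


def combined_index_classes(path):
    """Class codes covered by a combined path/class-hierarchy index.

    `path` lists the class codes from the class holding the indexed
    attribute back to the root class, e.g. Employee, Company, Vehicle.
    """
    classes = []
    for code in path:
        classes.extend(hierarchy(code))
    return classes


path_only = ["C1", "C2", "C5"]            # Employee -> Company -> Vehicle
company_hierarchy = hierarchy("C2")
combined = combined_index_classes(path_only)
print(combined)
```

In each case the resulting list is already in lexicographic order, which is the property the index relies on.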

Note that the combined index on the sub-tree rooted in Vehicle can be used to answer queries such as: "find the domestic automobiles manufactured by a Japanese auto company whose president's age is above 50". It is not possible to answer such a query with either the class-hierarchy or the path index alone. From now on we assume that all indexes are built along paths whose encoding is ordered by the lexicographic sort order. (There is a problem with some m:n relationships such as Company-City. This is discussed in Section 4.) Now, let us describe the index structure.

3.2. The Indexing Structure

To better understand the index structure and follow the example, let us assume that the database contains the following object instances (similar to the ones in [1]).

Example 1 - An Example Database

● Vehicle(Oid, Name, Color, Manufactured-by)
  - vehicle(v1, Legacy, White, c1)
  - automobile(v2, Tipo, White, c2)
  - automobile(v3, Panda, Red, c2)
  - compact(v4, R5, Red, c3)
  - compact(v5, Justy, Blue, c1)
  - compact(v6, Uno, White, c2)

● Company(Oid, Name, President)
  - japanese-company(c1, Subaru, e3)
  - auto-company(c2, Fiat, e1)
  - auto-company(c3, Renault, e2)

● Employee(Oid, Age)
  - employee(e1, 50)
  - employee(e2, 60)
  - employee(e3, 45)

❑

3.2.1. Class-Hierarchy Case

The index is built with a B-tree with variable-length, front-compressed keys [11]. A non-leaf node contains keys of the form: (attribute-value, Class-name-code)

A leaf node contains (in addition to the key) entries of the form: (Class-name-code, list of object-ids)

where Class-name-code corresponds to each of the classes within the class hierarchy rooted at the class on which the index is defined. An example of a leaf node for the instances database in Example 1 is:

{(Color, White) C5$v1, C5A$v2, C5AA$v6, ..., (Color, Red) C5A$v3, C5AA$v4}

Now, because of the sorting order of the class codes, all these entries are sorted and all entries for an entire sub-tree are clustered together in the index. Furthermore, because of the key compression, the existence of the class code in the key takes very little space. Actually, one can use only single-value entries of the form (Class-code, object-id) and rely on the compression mechanism to eliminate the duplication of class codes.

3.2.2. Path Index Case

For the path-index case, a non-leaf node will contain the key: (attribute-value, (Class-name-code, object-id)^n, last-class-name-code), where each class-name-code corresponds to one class in the path. A leaf node will contain the entry: (Key, list of object-ids). As an example for the instances database in Example 1, the following entry represents the leaf node for the index entry for sub-classes rooted at Vehicle where the President's age of the company which manufactures them is 50:

(Age, 50) C1$e1, C2$c2, C5A$v2, v3, C5AA$v6, ...

Again, because of the naming structure the index entries are sorted (note that '$' is lexicographically lower than 'A'), and if one wishes to restrict the list to some ids in the middle of the path, one can do it easily (see the examples in Section 3.3). Furthermore, the key compression makes sure these long entries are not stored repeatedly. Again, the object-id list may be replaced by a single-value entry, and the key compression will make sure that no redundant space is taken.

3.3. Queries

Let us now give examples for the three index variations we have and show some sample queries on them. We will use the schema in Figure 2 and show leaf entries with their keys. The simplest and most intuitive retrieval algorithm consists basically of finding the first relevant index entry using the standard B-tree search, and then scanning the index forwards from that point on. A more efficient algorithm, especially suited for range queries and queries on "dispersed" classes (i.e. classes which are not clustered together), is presented in Section 3.4. We first discuss exact-match queries.
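The simple retrieval strategy can be sketched in Python on the vehicles of Example 1. This is a minimal illustration and not the paper's implementation: a sorted list of single-value entries stands in for the leaf level of the B-tree, a binary search stands in for the B-tree descent, and the function name `exact_match` is hypothetical.

```python
from bisect import bisect_left

# (oid, class code, color) for the vehicles of Example 1.
VEHICLES = [
    ("v1", "C5", "White"), ("v2", "C5A", "White"), ("v3", "C5A", "Red"),
    ("v4", "C5AA", "Red"), ("v5", "C5AA", "Blue"), ("v6", "C5AA", "White"),
]

# Single-value entries (attribute value, class code, oid), kept sorted in
# the same order in which the leaf level of the index would hold them.
INDEX = sorted((color, code, oid) for oid, code, color in VEHICLES)


def exact_match(index, value, class_prefix):
    """Find the first relevant entry, then scan forward while entries match.

    Returns the oids of all objects with the given attribute value in the
    class coded class_prefix or in any of its sub-classes.
    """
    pos = bisect_left(index, (value, class_prefix, ""))
    hits = []
    while pos < len(index):
        val, code, oid = index[pos]
        if val != value or not code.startswith(class_prefix):
            break
        hits.append(oid)
        pos += 1
    return hits


red_automobiles = exact_match(INDEX, "Red", "C5A")
white_vehicles = exact_match(INDEX, "White", "C5")
print(red_automobiles, white_vehicles)
```

Both calls touch only a contiguous run of entries, because a class and its sub-classes share a code prefix and are therefore adjacent for a given attribute value.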

● Class-Hierarchy Index

Assume we index on the attribute Color of the class hierarchy rooted at Vehicle. The leaf entries are structured as:

(Color-Red, C5$, id1, id2, id3, ...)
(Color-Red, C5A$, id6, id7, id8, ...)
(Color-Red, C5AA$, id11, id12, id13, ...)

The following queries can be answered easily using the above index:

1) find all vehicles (of all types) with red color
2) find all automobiles with red color
3) find all automobiles and their sub-classes with red color.

The above three queries are easily answered, since all the required information is clustered and just a sequential scan is needed. Notice that the lexicographic order of the class codes is the basis for the clustering of the index entries of a class and all its sub-classes. The following query is a bit more problematic:

4) find all vehicles which are not compact automobiles with red color.

This query requires skipping over one (or more) entries. This can still be done by a forward scan, eliminating the unnecessary entries with just as much key decompression as needed (i.e. only the class-name code of the key needs to be decompressed). Alternatively, one can maintain forward pointers from one class entry to the next one at the same level, thus avoiding scanning entire sub-trees. Another option, instead of using forward pointers, is to use the parent node in the index. That is, whenever some entries need to be skipped, look up the uncompressed part of the key in the parent node, and search for the first entry with a key equal to or larger than it. This algorithm can also be used for range queries. It is described in detail in Section 3.4.

5) find Automobiles or Trucks (with their sub-classes) with red color.

Here the encoding scheme is most effective. Because all classes in a sub-tree have the same prefix, it is enough to search the two sub-trees rooted at C5A and C5B. The query search algorithm establishes these two starting points, and from there on a parallel sequential scan can be performed.

● Path Index

Assume we have an index on the attribute Age of Employee with the path Vehicle/Company/Employee, and would like to find the vehicles manufactured by a company whose President's age is 50. The entries are structured as:

(Age-50, C1$id1, C2$id12, C5$id123, ..., id12n)
(Age-50, C1$id1, C2$id13, C5$id133, ..., id13n)
(Age-50, C1$id1, C2$id1m, C5$id1m3, ..., id1mn)
(Age-50, C1$id2, C2$id2m, C5$id2m3, ..., id2mn)
...
(Age-60, C1$idk, C2$idkm, C5$idkm3, ..., idkmn)

The notation idijk denotes the object with id i referenced by the object with id j referenced by the object with id k. As can be seen here, all entries for the same company are clustered, all entries for the same president are clustered, and all entries for the same age are clustered. The key compression eliminates much of the redundant storage. The following are sample queries on this index:

1) The same query as in the definition of the index: find the vehicles manufactured by a company whose President's age is 50.
2) find the vehicles manufactured by a company whose President's age is 50, and do it for a particular president or a particular company.

The entries for the above two queries are clustered and no redundant search is needed.

3) find the vehicles manufactured by companies with more than 50,000 employees whose President's age is 50.

The companies' object-ids must first be restricted by a select operation. Next, the entries for all companies are searched and "joined" with the select results.

4) find all companies whose President's age is 50.

Even though this path index was built to answer queries on Vehicles, the above query can be answered, since the information exists in the path index. However, the company entries are separated from each other by vehicle entries. As before, a forward scan using the partly uncompressed keys is the easiest way to search. Again, an algorithm which uses the uncompressed part of the key in the parent node will be more efficient, and that algorithm is described in Section 3.4.

●

Combined Index

The structure is the same as the path index, except that in addition to entries for the root class, entries corresponding to sub-classes may also appear. For example, the items:

(Age-50, C1$id1, C2AA$id12, C5$id123, ..., id12n)
(Age-50, C1$id1, C2AA$id12, C5A$id123, ..., id12n)
(Age-50, C1$id1, C2AA$id12, C5B$id123, ..., id12n)
etc.

represent entries used by the query: find the vehicles manufactured by Japanese auto companies whose President's age is 50,

while the entry:

(Age-50, C1$id1, C2AA$id12, C5B$id123, ..., id12n)

represents the first entry for the query: find the truck vehicles manufactured by Japanese auto companies whose President's age is 50.

● Range Queries

The situation for range queries is somewhat more complex. Let us discuss the class-hierarchy case, since the path-index case is similar. If the range query is on the root class (e.g. Vehicle) and all its sub-classes, then there is no problem: all relevant entries are clustered (and can be searched in parallel). If a single sub-class at a lower level of the hierarchy is needed, then a parallel search needs to be performed on all the keys belonging to the range. This parallel search can be performed sequentially by skipping entries at the higher level of the index. For example, assume we would like to retrieve all Trucks with colors Blue to Red. Then, at the highest possible point in the index we establish the searching points:

(Color-Blue, C5B)
(Color-Green, C5B)
(Color-Red, C5B)

and continue the search from there on in parallel, utilizing any page which is already in memory. If one issues range queries against few classes, then many entries will have to be skipped, and the algorithm in Section 3.4 exploits that. However, the advantage of the encoding scheme is clear when one issues range queries on complete sub-trees, since again, after establishing the starting point, all the required entries are clustered.

● Multiple Paths

It is often the case that we want to index on multiple paths which share part of the path between them. For example, assume we want to retrieve the divisions of companies whose president's age is 50. The index should be built on the path:

Division/Company/Employee - Age

Since the prefixes of this index and the previous one for vehicles are the same, the first levels of the index will not be repeated and the compression scheme is most effective. Furthermore, if one wishes to retrieve both the divisions and the vehicles of companies whose president's age is 50, the required index entries are clustered because of the encoding scheme (C1, C2, C4, C5).
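The effect of key compression on entries with shared prefixes can be illustrated with a small Python sketch. The paper does not spell out the compression format, so the (shared length, suffix) representation below is an assumption, as are the function names and the division oid d1; the other oids follow Example 1.

```python
def front_compress(sorted_keys):
    """Store each key as (length of prefix shared with previous key, suffix)."""
    compressed, previous = [], ""
    for key in sorted_keys:
        shared = 0
        limit = min(len(key), len(previous))
        while shared < limit and key[shared] == previous[shared]:
            shared += 1
        compressed.append((shared, key[shared:]))
        previous = key
    return compressed


def front_decompress(compressed):
    """Rebuild the full keys from their (shared length, suffix) pairs."""
    keys, previous = [], ""
    for shared, suffix in compressed:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


# Entries of two paths (Division and Vehicle) sharing the Employee/Company
# prefix. The division oid d1 is hypothetical.
KEYS = sorted([
    "Age-50,C1$e1,C2A$c2,C4$d1",
    "Age-50,C1$e1,C2A$c2,C5A$v2",
    "Age-50,C1$e1,C2A$c2,C5A$v3",
    "Age-50,C1$e1,C2A$c2,C5AA$v6",
])
packed = front_compress(KEYS)
for shared, suffix in packed:
    print(shared, suffix)
```

Only the first key is stored in full; every later key stores just the characters that differ from its predecessor, so the shared Employee and Company parts are not repeated.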


In a recent paper, Bertino [2] investigated the issue of breaking a path into multiple sub-paths, where each sub-path may be implemented as a nested index or a path index. We maintain that with the encoding scheme presented above and the range-query algorithm presented below such splitting is not necessary, and therefore both the retrieval code and the designer's task are much simpler.

3.4. The General Retrieval Algorithm

In this section we present the general retrieval algorithm, which is especially suited for range queries. We first discuss the translation of queries.

Translation of Queries

Queries are translated using exactly the same encoding scheme. For example, query 3 for the Class-Hierarchy Index is translated into:

(Color-Red, C5B$, ?)

Query 1 for the Path Index is translated into:

(Age=50, C1$?, C2$?, C5$?)

For queries involving non-leaf classes with all their descendants one may use a regular expression à la Unix. For example, for query 5 above:

(Color-Red, [C5A*$, C5B$], ?)

For range queries one can enumerate the values in the range or use a range expression à la Unix:

(Color-[Blue-Red], C5B, ?)

In all the above cases it is clear that the answers are directly available by searching the index while doing the minimal decompression required. The general format of a query is therefore:

(attr-value, Class-code1, Val1, Class-code2, Val2, ...)

where attr-value is the key we are searching for (it may be a range expression), Class-code is the code of the class we are searching (it may be a regular expression), and Val_i may be: 1) null, 2) an actual value, i.e. an object-id for some class, 3) a value to be found (?), or 4) a predicate. (Note that in most of the examples given above most values were null except for the last one, which was '?'.)

The Algorithm

The algorithm for exact-match queries is simple and was sketched for several queries before. The general algorithm described next, on the other hand, can answer any query of the syntax specified above, and in particular partial-match and range queries. The main characteristic of this algorithm is that, because of the encoding scheme, it decodes only as much of the key as is needed in order to perform the comparison with the B-tree entries. Since predicates and range expressions may create multiple key values to search for, the algorithm dynamically constructs a search tree which represents the relevant parts of the B-tree it searches. The algorithm scans the nodes of the search tree in order, but considers at each step all possible key values. Since it scans all the key values in parallel, it is able to scan relevant B-tree nodes only and utilize them for all possible key values. In this way it can search for range queries without scanning redundant nodes. Since the algorithm scans multiple key values in parallel, it is called later on the "parallel retrieval algorithm"†.

†This has nothing to do with parallelism in the traditional sense, i.e. executing multiple tasks in parallel.

The algorithm first constructs the top-level node of the search tree from the first query component. Then it scans the query components and the B-tree in parallel, constructing a new search-tree node as needed. As soon as a query component is completed, the B-tree node corresponding to the search-tree node is scanned and the relevant entries are extracted. Now the algorithm can be written as follows:


Algorithm 1 (Parallel Scanning of the Index)

begin Parscan
  /* query is assumed to be stored as a linked list of components of the
     format specified above. Also, we maintain a linked list or array of
     partial keys to search */
  set partial key array Key[] to null.
  set k, the number of partial keys, to 1.
  while there are more components in the query do
  begin
    extract next query component.
    if this component is simple or a '*' regular expression then
      set Key[k] = Key[k] || component
    else if this is a range expression then
      extract next j values for the range
      set the next j partial keys (i = 1..j): Key[k+i] = Key[k] || i'th range value
    else if this is a predicate, compute the relevant values and set keys
      as for a range expression.
  end
  /* We now do a parallel search maintaining a search tree. A partial key
     may add new nodes into the search tree */
  set root of search-tree to B-tree root.
  set current = search-tree root.
  while there are still non-searched partial key components and the search-tree is not empty do
  begin
    fetch next component of the next partial key
    if we arrive at the end of the key then
      search the current search-tree node for all relevant entries
      while there are still unsearched complete keys at this current node
        search them and retrieve relevant entries
      if this is the last node in the search tree then exit
      else set current to the next search-tree node in pre-order
    end
    else /* there are more unsearched components for partial keys */
      search current.
      if partial key is in current do nothing
      else search the B-tree forward at this level for the node containing
        the partial key and add the found node to the search tree
      if there is still a node at this level with unsearched keys
        then set current to this node
  end
end Parscan
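The central idea of Algorithm 1, serving several partial keys in one forward pass, can be illustrated with a much simplified Python sketch. It is a flat-list analogue and not the search-tree construction of Algorithm 1: a sorted list stands in for the B-tree leaf level, and the function name `parallel_scan`, the truck t1 and the vehicle v7 are hypothetical additions to Example 1.

```python
from bisect import bisect_left
from itertools import product


def parallel_scan(index, values, class_prefixes):
    """Serve all (value, class prefix) partial keys in one forward pass.

    index: sorted list of (attribute value, class code, oid) entries.
    values: the attribute values making up the range.
    class_prefixes: class codes whose sub-trees are wanted.
    """
    partial_keys = sorted(product(values, class_prefixes))
    hits, pos = [], 0
    for value, prefix in partial_keys:
        # Skip ahead to the first entry for this partial key. The search
        # never moves backwards because the partial keys are sorted.
        pos = bisect_left(index, (value, prefix, ""), lo=pos)
        while pos < len(index):
            val, code, oid = index[pos]
            if val != value or not code.startswith(prefix):
                break
            hits.append(oid)
            pos += 1
    return hits


# Example 1 vehicles plus two hypothetical objects (truck t1, vehicle v7).
INDEX = sorted([
    ("White", "C5", "v1"), ("White", "C5A", "v2"), ("Red", "C5A", "v3"),
    ("Red", "C5AA", "v4"), ("Blue", "C5AA", "v5"), ("White", "C5AA", "v6"),
    ("Red", "C5B", "t1"), ("Red", "C5", "v7"),
])

# "Find all Automobiles or Trucks with Blue or Red color."
result = parallel_scan(INDEX, ["Blue", "Red"], ["C5A", "C5B"])
print(result)
```

Entries for plain vehicles (class code C5) and for other colors are skipped by the binary search rather than read one by one, which mirrors the skipping behaviour that Algorithm 1 obtains from the upper levels of the B-tree.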

As an example, Figure 3 shows the search tree constructed for the query: Find all Automobiles or Trucks with Blue or Red color. From this search-tree it is evident that, although we have a range query, we do not search separately for each class or sub-class. On the contrary, in the left sub-path we find entries for both classes at the leaf level, while in the right sub-path we find entries for both classes at the second-to-leaf level†.

Fig. 3: An Example Search Tree

†The “search tree” may not be a tree, since it contains all the key values, no matter where they actually reside in the B-tree.

3.5. The Update Algorithm

Since the U-index structure is a single standard key-compressed B+-tree, its update procedure is quite standard and will not be elaborated here. We do want to emphasize the following points:

1. Before the update, the index entry must be encoded according to the encoding scheme explained above.

2. For the class-hierarchy case, an object insertion/deletion will cause the insertion/deletion of one index entry (for every separate index). This assumes that sub-classes do not overlap; if an object belongs to multiple sub-classes, it has to be inserted/deleted for each of the sub-class entries.

3. For the path-index case we have to distinguish between two cases. If the object at the end of the path is inserted/deleted (and assuming only REF relationships), then only one index entry has to be inserted/deleted (for each separate index; but see the comment on multi-value attributes in Section 4). If an object in the middle of the path is inserted/deleted, or an object changes its referenced attribute (e.g. a President switches companies), then several insertions/deletions will follow (i.e. all index entries under the old company are deleted and reinserted for the new company). However, each of these insertions/deletions is a simple B-tree insertion/deletion, and no associative structures need to be updated. Furthermore, in the example above, these updates may be done in “batch”, since all entries for a specific employee or company are clustered. A batch algorithm for updating B-trees was described in [4].
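The batch update of point 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the index is a sorted list of tuples rather than a B-tree, the entry layout (age, president, company, vehicle) is a hypothetical stand-in for the encoded path key, and the helper names `prefix_range` and `switch_company` are not taken from the paper.

```python
from bisect import bisect_left, insort


def prefix_range(index, prefix):
    """Return (lo, hi) so that index[lo:hi] holds every entry starting with prefix."""
    lo = bisect_left(index, prefix)
    hi = lo
    while hi < len(index) and index[hi][:len(prefix)] == prefix:
        hi += 1
    return lo, hi


def switch_company(index, age, president, old_company, new_company, new_vehicles):
    """Move a president to a new company: delete the old cluster, insert the new one."""
    lo, hi = prefix_range(index, (age, president, old_company))
    removed = index[lo:hi]
    del index[lo:hi]  # the old entries are adjacent, so one slice removes them all
    for vehicle in sorted(new_vehicles):
        insort(index, (age, president, new_company, vehicle))
    return removed
```

The point of the sketch is that all entries sharing the prefix (age, president, company) are contiguous in sorted order, so they can be located with one search and removed together.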

4. ANALYSIS AND DISCUSSION

In this section we discuss the U-index scheme from several aspects: a) generality and uniformity; b) performance analysis, including storage, retrieval and update cost; c) tackling complex schema relationships such as multi-value attributes and schema changes; and d) comparing the U-index structure with other index schemes, including the NIX index.

4.1. Generality and Uniformity

As was stated in the introduction, the scheme is general and provides support for both class-hierarchy and path indexes and their combination. Also, by encoding the attribute-value as part of the key, one can use a single B-tree for all these indexes. Furthermore, because the same structure is used for different types of indexes, a single compact code may be used and optimized in a low-level language.


Note that by using the name-encoding scheme above, schema information can be stored in the same index and retrieved easily. For example, the relations SUP or REF may be stored in the index, and that information is also clustered. See a similar idea in [10].

4.2. Performance Analysis

Here we discuss the performance qualitatively; an experimental evaluation is presented in the next section. We comment on several aspects of performance: retrieval cost, update cost, and storage cost.

Retrieval Cost

For a single-class retrieval, either for the class-hierarchy or for the path index, the U-index provides almost the same performance as a single-class index. As an example, assume that access to a sub-tree is required, and that the number of keys in that sub-tree is about half the total number in the class hierarchy. Then the retrieval cost for a separate index is approximately $O(\log_k(N/2)) = O(\log_k N - \log_k 2) = O(\log_k N)$ for large $N$, where $N$ is the number of keys in the B-tree and $k$ is the B-tree parameter, whereas our cost is also $O(\log_k N)$. For the range-queries case, the worst behavior of the general algorithm occurs when, for $r$ distinct values in the range and $m$ distinct classes or (dispersed) sub-classes, the algorithm searches $r \cdot m$ separate paths, which results in a cost of $O(r \cdot m \cdot \log_k N)$. However, the average cost is much lower because of the key compression, the parallel search of multiple classes, and the class clustering.

For the path-index case, an exact-match query will behave similarly to the class-hierarchy case, since the required entries are clustered. The same holds for the combined class-hierarchy/path-index case. For range queries or predicate queries the worst-case behavior is again not as good, but the general retrieval algorithm will again search multiple paths in parallel, so the average behavior is not as bad.

Storage Cost

At first glance it may seem that, because of the class encoding, the storage cost is high. However, because of the key compression this is not so. In the class-hierarchy case, class-name codes follow each other lexicographically and can therefore be compressed. The compression is even stronger in the path-index case, because complete paths up to the last class may be compressed.
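To make the storage argument concrete, the following Python sketch applies plain front (prefix) compression to a run of sorted keys. This is a generic textbook technique used here as a stand-in, since this section does not spell out the exact compression format of the U-index; the key layout and the function names `front_compress` and `front_decompress` are assumptions made for the illustration.

```python
def front_compress(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to its predecessor."""
    out, prev = [], ""
    for key in sorted_keys:
        n = 0
        while n < min(len(key), len(prev)) and key[n] == prev[n]:
            n += 1
        out.append((n, key[n:]))
        prev = key
    return out


def front_decompress(pairs):
    """Rebuild the original keys from the (shared_prefix_length, suffix) pairs."""
    keys, prev = [], ""
    for n, suffix in pairs:
        prev = prev[:n] + suffix
        keys.append(prev)
    return keys


# Hypothetical entries: attribute value | class code | object id.
entries = ["Red|C5CA|id1", "Red|C5CA|id2", "Red|C5CB|id3"]
packed = front_compress(entries)
stored = sum(len(suffix) for _, suffix in packed)
print(packed, stored, sum(len(e) for e in entries))
```

In this small example only 18 of the original 36 characters need to be stored, because consecutive entries share the attribute value and most of the class code.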
Key compression is particularly effective with many index entries per class.

Update Cost

The update cost for the U-index is at least as good as that of the two previous simple schemes: the CH-trees and the Nested index. Since there is no complex directory structure in the leaf node, a single update (for the class-hierarchy case) will affect a single leaf node only. In comparison to H-trees, there is no update of nested pointers, just a regular B-tree update. For the path-index case, the clustering again helps. For example, assume a company replaces its president (the same example as in [9]). This means that all the entries of the old company with the old president must be deleted, and new company/vehicle entries must be entered. However, because of the sorting and clustering this update can be done in “batches”, and is therefore more efficient.

One cost which is significant in the U-index scheme is the cost of key compression and decompression. However, this is mostly CPU cost (and not page accesses), and it can also be optimized by low-level language routines or even by special-purpose hardware. It is therefore not considered here.

4.3. Complex Schema Relationships and Graph Topology

A major assumption of the above structure is that there is a naming scheme with the desirable properties, which relies on the fact that the schema graph is acyclic. Let us investigate this assumption. First, the assumption of acyclicity of a class-hierarchy sub-graph is reasonable even in the presence of multiple inheritance; it is very unlikely to find a cycle in a graph of super/sub-type relationships. REF relationships, on the other hand, may create a cycle. A simple example using the schema in Figure 2 is to have the following two REF relationships: a relationship OWN from Employee to Vehicle, which means that an employee owns one or more vehicles, and a relationship USE from Vehicle to Employee, which means that a vehicle is used by several employees. However, since the indexes for these two REF relationships are completely separate, one can encode one (or both) of the classes under duplicate names. Thus, we break the cycle by replacing the original graph with two separate acyclic graphs, one corresponding to one REF index and the other to the rest of the graph. It is also not difficult to identify which of the graphs to use with a particular query, since a query usually contains the referencing attribute explicitly (OWN or USE in the example), and the mapping to the right sub-graph can easily be done.

Another problem with the scheme above is associated with schema modification. When a class is added and/or a REF or SUP relationship is added, one must update the naming scheme. This can easily be done by using changes such as those suggested in Figure 4, which use additional characters in the encoding scheme for class names. Obviously, the limit on the number of distinct letters in the alphabet in the figure is not a real problem.
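The naming idea can be illustrated with a small Python sketch in which the code of a sub-class is the code of its parent plus one extra character, as in the class codes of the experimental schema (Bus is C5C, PassengerBus is C5CC). The class `ClassCodes`, its methods, and the choice of C5 as the code of a root class Vehicle are hypothetical; the sketch only appends new sub-classes after existing siblings and does not reproduce the insertion cases of Figure 4.

```python
import string


class ClassCodes:
    """Assigns prefix codes so that a sub-class code extends its parent's code."""

    def __init__(self, root_name, root_code):
        self.code = {root_name: root_code}
        self.children = {root_name: []}

    def add_class(self, name, parent):
        siblings = self.children[parent]
        # Assumption: at most 26 direct sub-classes, one uppercase letter each.
        letter = string.ascii_uppercase[len(siblings)]
        self.code[name] = self.code[parent] + letter
        siblings.append(name)
        self.children[name] = []
        return self.code[name]

    def is_subclass(self, name, ancestor):
        """A class is in the sub-tree of `ancestor` iff its code has that prefix."""
        return self.code[name].startswith(self.code[ancestor])


codes = ClassCodes("Vehicle", "C5")
for cls in ("Automobile", "Truck", "Bus"):
    codes.add_class(cls, "Vehicle")
for cls in ("MilitaryBus", "TouristBus", "PassengerBus"):
    codes.add_class(cls, "Bus")
print(codes.code["PassengerBus"], codes.is_subclass("PassengerBus", "Bus"))
```

With such codes, a query on a class and all of its sub-classes becomes a prefix search on the class-code component of the key, and adding a class does not change the codes already assigned.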

Fig. 4: Changes in the Schema: (a) adding a new class within an existing hierarchy; (b) adding a new class within a new hierarchy

Finally, there is the problem of indexing on multi-value and complex attributes. It is only necessary to consider the case of a multi-value attribute in the direction opposite to that of the path index, since the REF relationship is always multi-valued in the path direction (i.e. a company manufactures multiple vehicles). If, however, a vehicle is manufactured by multiple companies, then the same vehicle object will appear in multiple index entries. This is no problem during retrieval, but will cause more update overhead. For example, a delete of a vehicle object will cause the deletion of an index entry for each of the companies that manufacture it, and these entries are not necessarily clustered. Therefore the U-index update performance is not particularly good for multi-value attributes.

4.4. Comparison with Other Schemes

In this subsection we compare our scheme qualitatively with some of the other indexing schemes; a full experimental comparison with the CG-trees scheme is presented in the next section. The schemes compared are: CH-trees, H-trees, Path index and NIX index.

1. CH-Trees. Since the U-index is very similar to CH-trees for single classes, it will exhibit the same retrieval performance for exact-match queries. For multiple classes, it is expected to behave better than CH-trees, since on the one hand it does not have a directory structure, and on the other hand the entries for sub-trees of classes are clustered.

2. H-Trees. For range queries, while CH-trees exhibit the worst performance and H-trees the best performance, it is expected that the U-index will perform somewhere in the middle, like CG-trees. This is further elaborated in the experiments in the next section.

3. Path Index. The U-index's path scheme is very simple. For a linear path with no sub-classes and single-value attributes, it will exhibit the same performance as the original path index of [1]. If we add the possibility of sub-classes, i.e. combining path and class-hierarchy indexes, the only comparable structure is the NIX scheme of [3], and we next compare the U-index to it.

4. NIX Index. The NIX scheme contains in the leaf node a separate entry for every class and sub-class along the indexed path. If one asks queries which involve only single classes, then the performance of the U-index and NIX is comparable. If one asks for multiple sub-classes rooted in one class of the path, then the result depends on the relative amount of clustering of these sub-classes: if a complete sub-tree is queried the U-index may be better, while if dispersed sub-classes are requested the NIX structure will be better. If, however, one also wants to restrict some classes along the path, e.g. “all the vehicles which are manufactured by companies whose president's age is above 50, but only for companies located in Italy...”, then the U-index scheme has an advantage, since it stores the entire (compressed) path. For range queries the NIX scheme will probably be better, since no redundant sub-classes will be retrieved. Another case in which the U-index has an advantage is that of multiple paths in one index (see Section 3). Because of the key compression, many index entries are common, and therefore both storage and retrieval time are saved. In terms of update, because of the existence of the second index structure in the NIX scheme, NIX is expected to have worse update performance for end-of-path objects. For the other cases, including multi-value attributes, it is hard to predict, and an experimental evaluation is needed.

It should be emphasized again that the U-index scheme is very simple, and that both its retrieval and its update algorithms are straightforward and can be optimized by taking advantage of existing B+-tree implementations.

5. EXPERIMENTAL EVALUATION

Two experimental evaluations were made. The first experiment used the original schema of Figure 1, which was enhanced with the following additions:

Class           COD
ForeignAuto     C5AB
ServiceAuto     C5AC
Truck           C5B
HeavyTruck      C5BA
LightTruck      C5BB
Bus             C5C
MilitaryBus     C5CA
TouristBus      C5CB
PassengerBus    C5CC

We randomly generated a database with 12,000 records. The number of nodes generated depends on the B-tree parameters as follows: if there are at most $m$ records per node and at least $m/2$ records, then the minimal number of internal nodes is $1 + \sum_{i=1}^{n-1} 2(m/2)^{i-1}$ and the number of leaves is $2(m/2)^{n-1}$. In our case we used a small node size, $m = 10$, and we got about 312 internal nodes and about 1250 leaves, for a total of 1562 nodes.

The main purpose of this experiment was to compare pure forward scanning with the parallel retrieval algorithm for the U-index scheme. We executed three classes of queries against this database: a) regular class-hierarchy queries, b) range queries, and c) path and combined index queries. The following queries were run against this database:

1. Class-Hierarchy - Simple and Range - Retrieve all Buses (C5C).
   (a) Same as 1, but just red buses.
   (b) Same as 1, but just red and blue buses.
   (c) Same as 1, but just red, blue and green buses.

2. Class-Hierarchy - Simple and Range - Retrieve all Passenger Buses (C5CC). As we can see, we must now search for the sub-tree of C5CC.
   (a) Same as 2, but just red passenger buses.
   (b) Same as 2, but just red and blue passenger buses.
   (c) Same as 2, but just red, blue and green passenger buses.

3. Comparing the Two Algorithms - Retrieve red automobiles. We executed two algorithms.
   (a) Forward scanning.
       (a) Same as 3a, but just red automobiles.
       (b) Same as 3a, but just red and blue automobiles.
       (c) Same as 3a, but just red, blue and green automobiles.
   (b) Parallel algorithm.
       (ba, bb, bc) Same queries as (a), (b), (c) under 3a.

4. Comparing the Two Algorithms - Retrieve either compact or service automobiles (C5AA or C5AC). Again, we executed the two algorithms.
   (a) Forward scanning.
       (a) Same as 4a, but just red domestic or service automobiles.
       (b) Same as 4a, but just red and blue domestic or service automobiles.
       (c) Same as 4a, but just red, blue and green domestic or service automobiles.
   (b) Parallel algorithm.
       (ba, bb, bc) Same queries as (a), (b), (c) under 4a.

5. Path Index - Retrieve companies whose president's age is 50 (a), or above 50 (b).

6. Combined Index - Retrieve automobiles manufactured by Autocompanies whose President's age is above 50 (a), or retrieve Trucks (b).

Table 1 presents the results of the above queries in terms of the number of nodes searched for each of the queries. The results clearly show the characteristics of the U-index and the retrieval algorithms, as follows:

1. From comparing queries 1 to 2, or 3 to 4, it is evident that retrieval of sub-trees rooted at sub-classes is more efficient than retrieval of full class trees, as can be expected because of the clustering scheme.

2. From comparing query 1 to 1a, 1b, 1c, or 2 to 2a, 2b, 2c, it is evident that for range queries the number of visited nodes grows slowly with the addition of range values, and does not multiply with each new range value as is the case with key grouping. Also, even with range queries it is more efficient to retrieve individual sub-classes than full classes.

3. From comparing the two columns for queries 3 and 4, it is evident that our parallel retrieval algorithm is much better (almost 100%) than simple forward scanning. This justifies its use, especially in the presence of range queries or multiple sub-class queries.

Query     Number of        Forward
Number    visited nodes    Scanning
1         35
1a        19
1b        24
1c        28
2         28
2a        15
2b        20
2c        24
3         33               51
3a        22               41
3b        25               44
3c        30               47
4         29               41
4a        16               32
4b        19               34
4c        24               37
5a        10
5b        20
6a        22
6b        21

Table 1: Queries Results

4. From comparing queries 5 and 6, it is evident that partial path queries are more efficient than full path queries, even if the index was originally built for the full path.

5. From comparing queries 6a and 6b, it is evident that the behavior exhibited earlier for sub-class retrieval holds also for complex path/class-hierarchy queries such as query 6.

The first experiment described above was used to test the U-index scheme on its own, and for that a real database schema and instances were used. However, the database was small and the page sizes were small too. We therefore conducted a much more thorough experiment with a larger database and larger page sizes. In particular, we wanted to compare the U-index with CG-trees for the class-hierarchy case. We therefore fully implemented the CG-trees scheme and generated experiments which compare the two schemes with almost the same parameters as those described by Kilger and Moerkotte [6]. The full comparison is presented in the next section.

5.1. Comparing the U-Index and the CG-Trees Schemes

Both the CG-tree and the U-index were implemented in C++, with general classes corresponding to high-level nodes in the B-tree, and specialized sub-classes corresponding to the leaf nodes, which have a different structure. Index files were stored in page files with pages of size 1024 bytes. All the CG-tree features described in [6], such as:

● leaf node sharing (for same sets),

● saving only non-NULL references in directory nodes,

● best splitting key search when splitting a leaf node,

were implemented. The only feature that was not implemented was the balancing of leaf pages.

For all the experiments reported in this paper the database contained 150,000 objects, which are referenced by 4-byte OIDs. The objects were distributed uniformly over class hierarchies of either 8 classes or 40 classes. (Since [6] uses the word “set” to refer to a “class” in the index, we use the same terminology.) The key size was 8 bytes, the page size 1024 bytes and the size of a page reference 4 bytes. Note that the total number of pages required for the indexes is similar to the total number used in [6]. However, there they used 600,000 records and a page size of 4K, so the number of inner nodes is smaller because of the larger fanout. The similar number of pages does give more credit to the comparison.

The number of different key values in the experiments was 100, 1000, and 150,000. In the case of 150,000 key values, there is exactly one index record for each key value (unique key). In the case of 100 and 1000 different key values, the key values were distributed uniformly over the total object space. The queried sets were chosen randomly in two different ways: as adjacent (i.e. clustered) in the class hierarchy, and as distant (i.e. separated) in the class hierarchy where this was possible (e.g. for 10 out of 40 sets it is possible to generate 10 distant sets, while for 30 out of 40 it is impossible). These two cases were separated because the adjacency of the queried sets influences the performance of the U-index structure. For tests on the CG-tree index structure the sets were generated randomly, because set adjacency does not influence its performance.

We measured the number of page reads of the index structures for exact-match queries and range queries. For range queries, the search range comprises 10%, 2%, 0.5%, and 0.2% of all keys. The 0.5% and 0.2% range queries were tested only for the case with 1000 different keys. All experiments were repeated 100 times with random inputs and their results were averaged.

5.2. Results Graphs

The following graphs display the results of the experiments. In all the graphs the term “B-tree” stands for the “U-index”. Also, in all the graphs the vertical axis displays the number of pages read and the horizontal axis the number of sets used. The following points can be made based on the displayed results:

1. In the unique-key queries case, the results for our CG-tree are very similar to those reported in [6]. This is expected because of the same number of pages. For range queries on CG-trees, our results are consistently worse (by 10%-80%) than those reported in [6]; this is because our B-tree has many more internal nodes than that of [6]. Both of these results further confirm the validity of our experiments.

2. For the unique-key exact-match case (see Figure 5), the U-index is much better than CG-trees, as expected. Also as expected, the number of sets in this case has very little effect on the performance of the U-index.

3. For the exact-match, non-unique-keys case (see Figure 5), the U-index is still better, although not by as much as for unique keys, and it improves considerably when the number of sets retrieved is increased.

4. For non-unique keys, the results for the U-index for 100 keys are quite different from the results for 1000 keys. The U-index is consistently better with the smaller number of keys. The explanation may be that for 100 distinct keys there are long lists of secondary object-ids; as a result, most of the nodes scanned in both structures store these object-ids, and the U-index is more efficient at that. These results are therefore not very indicative of the performance of the index structure itself. We therefore prefer to rely more on the results for 1000 keys (even though the U-index performs better for 100 keys).

5. For the range-queries case (see Figures 6, 7, 8), the CG-tree is better than the U-index, as expected. However, this advantage is reduced as we decrease the range. For example, for a 10% range and 1000 keys, the U-index starts to be better at the 20-sets point, while for a 0.5% range it starts to be better at the 10-sets point, and the advantage of the U-index increases with more sets queried.

Fig. 5: Exact Match Queries (panels for 40 sets and 8 sets, with unique keys, 100 different keys and 1000 different keys; curves: B-tree (near sets), B-tree (non-near sets), CG-tree)

Fig. 6: Range Queries (10% of Keyspace) (panels for 40 sets and 8 sets, with unique keys, 100 different keys and 1000 different keys; curves: B-tree (near sets), B-tree (non-near sets), CG-tree)

Fig. 7: Range Queries (2% of Keyspace) (panels for 40 sets and 8 sets, with unique keys, 100 different keys and 1000 different keys; curves: B-tree (near sets), B-tree (non-near sets), CG-tree)

Fig. 8: Results for 1000 Keys (panels for 40 sets and 8 sets: Range Query (0.5% of Keyspace); Range Query (0.2% of Keyspace); differences between the case of near and the case of non-near queried sets in the B-tree, for a range query over 10% of the keyspace)

6. For range queries and a small number of sets, CG-trees is better; however, as we increase the number of keys, the advantage of CG-trees decreases, as can be seen in the bottom right parts of the figures.

7. For the U-index scheme the clustering of sets is not always significant. It is most significant with a larger number of keys, as is shown in Figure 8. It is also more significant with a smaller number of sets, as expected. The difference between the smaller impact here and the larger impact in the previous experiment can be explained by the use of the much larger page size here (about 80 entries instead of 10), which causes many sets to reside on the same page anyway.

The overall conclusion from the above experiments is that, for the class-hierarchy case, the U-index scheme has a clear advantage when the majority of the sets are queried, or when a clustered sub-set of the sets is queried either with unique keys, or with a small-range query, or with a large number of distinct key values. These are very common cases in many OODB applications, and that, together with its generality and simplicity, makes the U-index scheme very useful.

6. SUMMARY

In this paper we reviewed several of the most common indexing schemes for object-oriented databases and suggested instead one uniform scheme, called the U-index. The main idea is the use of a class-encoding naming scheme which corresponds to the OO schema's graph topology, and the heavy use of key compression that results from it. Another idea is the use of a “parallel” retrieval algorithm which can scan multiple keys and multiple sub-classes in parallel, utilizing to the fullest the information contained in high-level nodes of the B-tree index. The implementation of the U-index was discussed and its performance analyzed. Examples of queries using it, as well as results of actual test runs, were given. An extensive experimental comparison with the CG-trees scheme was also presented. Future research will involve investigating the performance of the most general case, that of a combined class-hierarchy and path index, and comparing it to the NIX structure.

Acknowledgements - I am grateful to Alex R.Azman for his help in the implementation and the experimentation, and to the referees for their helpful comments, which have improved the paper considerably.

REFERENCES

[1] W. Kim and E. Bertino. Indexing techniques for queries on nested objects. IEEE Trans. on Knowledge and Data Engineering, 1(2):196-204 (1989).
[2] E. Bertino. Index configuration in object-oriented databases. The VLDB Journal, 3(3) (1994).
[3] P. Foscoli and E. Bertino. Index organizations for object-oriented database systems. IEEE Trans. on Knowledge and Data Engineering, 7(2) (1995).
[4] S. Tsur and E. Gudes. Experiments in B-trees reorganization. In Proc. of SIGMOD, pp. 200-206 (1980).
[5] G. Moerkotte and A. Kemper. Access support relations: an indexing method for object bases. Information Systems, 17(2):117-145 (1992).
[6] G. Moerkotte and C. Kilger. Indexing multiple sets. In Proc. 20th VLDB Conference, Santiago, Chile (1994).
[7] W. Kim. Introduction to Object-Oriented Databases. The MIT Press (1990).
[8] H. Lu, C.C. Low and B.C. Ooi. H-trees: a dynamic associative search index for OODB. In Proc. ACM SIGMOD, The MIT Press, pp. 134-143 (1992).
[9] A. Dale, W. Kim and K.C. Kim. Indexing techniques for object-oriented databases. In W. Kim and F. Lochovsky, editors, Object-Oriented Concepts, Databases, and Applications, ACM Press, pp. 371-394 (1989).
[10] N. Rishe. A file structure for semantic databases. Information Systems, 16(4):375-385 (1991).
[11] J. Ullman. Principles of Database and Knowledge-Base Systems. Computer Science Press (1988).
