Hash function performance on different biological databases




Computer Methods and Programs in Biomedicine, 28 (1989) 87-91


Elsevier CPB 00958

Section I. Methodology

Edmond J. Breen and Keith L. Williams

School of Biological Sciences, Macquarie University, Sydney, N.S.W. 2109, Australia

Open hashing is used to demonstrate the effectiveness of several hashing functions for the uniform distribution of biological records. The three types of database tested are (1) genetic nomenclature, mutation sites and strain names, (2) surnames extracted from literature files and (3) a set of 1000 numeric ASCII strings. Three hash functions (hashpjw, hashcrc and hashquad) showed considerable versatility on all data sets examined, while two hash functions, hashsum and hashsmc, performed poorly on the same databases.

Keywords: Hashing; Key-to-address transform; File organization

1. Introduction

Hashing, or scatter storage, is a well-documented software technique for fast, efficient storage and retrieval of single database records [1]. Hashing is generally considered the best-performing technique for single-record retrievals [2,3], with certain applications guaranteeing retrieval of a record in one disk access [4]. Compared with B-tree applications, which typically locate a record in 3-5 disk accesses [2,5], hashing seems ideally suited for real-time database applications. The main primitive behind a hash-based file organization method is the key-to-address transform [6]. If more than a desired number of records transform to the same address (storage location or bucket), a collision is said to occur. There are several methods for handling collisions [7], but these are beyond the scope of this report. Collisions effectively mean that retrieval and storage times will not on average be constant. An ideal key-to-address transform will distribute an arbitrary record set over some address space uniformly, ensuring constant retrieval times. Given the likelihood that a particular transform will not perform ideally on all types of record sets, guidance is needed on the selection of a transform for a particular application [3,6]. The motivation behind this study was to consider hashing as a practical means for real-time data collection of digital images taken directly from moving micro-organisms at a rate of up to 50 images/s. Since we wish to extract various types of information, any pre-order of the data may have no real significance. This also means that data are likely to be stored in various ways. Hence we needed to test how different hash functions perform with a diversity of databases. To model this process, we have constructed several data sets from our biological databases. We have chosen five key-to-address transforms (hash functions) to test for uniformity of record distribution.

2. Methods

2.1. Open hashing

Correspondence: Edmond J. Breen, School of Biological Sciences, Macquarie University, Sydney, N.S.W. 2109, Australia.

0169-2607/89/$03.50 © 1989 Elsevier Science Publishers B.V. (Biomedical Division)

The hashing method used here to test the uniformity of record distribution produced by a hash function is 'open hashing' [1,8,9]. Open hashing is a technique whereby records are mapped via a hash function into an array of 'm' buckets. Each bucket is effectively the header of a linear linked-list and is stored in primary memory. Collisions are resolved by appending injected records to the end of the nascent list of their assigned bucket. The locating of any record in the hash table is essentially a linear search of a linked-list. Since any record can only be mapped to one bucket, failed information requests are quickly discovered via a search of only one list. Values of 'm' are usually restricted to prime numbers [6,7] and we have chosen the value of 'm' to remain static at 211 for this study, regardless of the data set size. The databases used were considerably larger than our arbitrary set value of 211. This meant that the desired loading density in each linked-list would be 't/m', where 't' is the total number of records.

TABLE 1
Hash functions

hashsum [8,9]: Returns the sum of the integer values of each character in the key. hashsum is the most commonly used function for demonstrating hashing in elementary computer textbooks.

hashsmc [8,13]: Similar to hashsum, but before each character is added into an accumulator the bits in the accumulator are shifted left one position. hashsmc is used to maintain the symbol table in the Small C compiler V2.1.

hashpjw [8,9]: Similar to hashsmc, but before each character is added into a 32-bit accumulator, the bits in the accumulator are shifted left four positions. After a character has been added in, if any of the high 4 bits are set they are shifted right 24 positions, XORed into the accumulator and reset to zero. hashpjw is used in P.J. Weinberger's C compiler.

hashquad [8]: Groups every four consecutive characters into a 32-bit integer and then adds up the integers.

hashcrc [9]: The 16-bit cyclic redundancy check (CRC) used to detect errors in disk storage and data communications.

TABLE 2
Data sets

Dploid (a): 3669 D. discoideum diploid strain names; 734 prefixed with DP, 2935 prefixed with DU, all suffixed with a simple number.

Hploid (a): 2377 D. discoideum haploid strain names; 35 different prefixes of either one or two uppercase letters, all suffixed with a simple number.

Mut (a): 848 different D. discoideum mutations. Mutations are described by a three-letter lowercase prefix followed by either a single uppercase letter or '-' and suffixed with a simple number.

Combine: Consists of the Dploid, Hploid and Mut data sets; 6894 strings.

Names: 1996 surnames, all in uppercase, from our own literature database.

Numbers: 1000 numeric ASCII strings forming the set [000-999]. See Section 2, Methods.

(a) Strain names and mutation sites conform to the uniform nomenclature proposal of Demerec et al. [14].
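As a concrete illustration of the scheme in Section 2.1, the following C sketch implements open hashing with m = 211 buckets. It is our reconstruction, not the authors' code; for brevity, records are prepended rather than appended to their bucket's list (which does not affect membership tests), and the paper's hashsum stands in as the key-to-address transform.

```c
#include <stdlib.h>
#include <string.h>

#define M 211  /* number of buckets, fixed at the prime 211 as in this study */

/* Each bucket heads a linear linked-list of the records that hash to it. */
struct node {
    char *key;
    struct node *next;
};

static struct node *table[M];

/* Placeholder key-to-address transform: the paper's hashsum. */
static unsigned hash(const char *key)
{
    unsigned h = 0;
    while (*key)
        h += (unsigned char)*key++;
    return h;
}

/* Store a record: a collision simply extends the bucket's list. */
void ht_insert(const char *key)
{
    unsigned b = hash(key) % M;
    struct node *n = malloc(sizeof *n);
    n->key = malloc(strlen(key) + 1);
    strcpy(n->key, key);
    n->next = table[b];
    table[b] = n;
}

/* Locate a record: a linear search of one linked-list only, so a
 * failed request inspects just the single bucket the key maps to. */
int ht_contains(const char *key)
{
    const struct node *n;
    for (n = table[hash(key) % M]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}
```

Note that under hashsum, anagrams such as "ab" and "ba" collide into the same bucket; chaining keeps both retrievable.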

2.2. Hash functions

The hash functions used here (Table 1) examine every character in a primary-key field (key), as opposed to transforms that examine only a few characters at the ends or in the middle of a key [8], and return a pseudo-random integer (hash(key)). A record is assigned to a bucket by taking the remainder of the integer division of hash(key) by the number of buckets in the array (m), such that:

record → bucket[mod(hash(key), m)].   (1)
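The five transforms of Table 1, together with the bucket assignment of Eq. (1), can be sketched in C as follows. These are our reconstructions from the descriptions in Table 1, not the original implementations: in particular, the byte order of hashquad's four-character grouping and the CRC-16 polynomial used by hashcrc (the reflected ANSI polynomial 0xA001, i.e. CRC-16/ARC, is assumed here) are not specified by the table.

```c
#include <stdint.h>

#define M 211  /* bucket count used throughout the study */

/* hashsum: sum of the integer values of each character in the key. */
uint32_t hashsum(const char *s)
{
    uint32_t h = 0;
    while (*s) h += (unsigned char)*s++;
    return h;
}

/* hashsmc: like hashsum, but the accumulator is shifted left one bit
 * before each character is added (Small C compiler V2.1 symbol table). */
uint32_t hashsmc(const char *s)
{
    uint32_t h = 0;
    while (*s) h = (h << 1) + (unsigned char)*s++;
    return h;
}

/* hashpjw: shift the 32-bit accumulator left four bits before each add;
 * if any of the high 4 bits become set, XOR them in 24 bits lower and
 * reset them to zero. */
uint32_t hashpjw(const char *s)
{
    uint32_t h = 0, g;
    while (*s) {
        h = (h << 4) + (unsigned char)*s++;
        if ((g = h & 0xf0000000u) != 0) {
            h ^= g >> 24;
            h &= ~g;          /* clear the high 4 bits */
        }
    }
    return h;
}

/* hashquad: group every four consecutive characters into a 32-bit
 * integer and sum the groups (low-byte-first packing assumed). */
uint32_t hashquad(const char *s)
{
    uint32_t h = 0, word = 0;
    int i = 0;
    while (*s) {
        word |= (uint32_t)(unsigned char)*s++ << (8 * (i % 4));
        if (++i % 4 == 0) { h += word; word = 0; }
    }
    return h + word;          /* add any trailing partial group */
}

/* hashcrc: a 16-bit CRC computed bit-serially, LSB first
 * (CRC-16/ARC parameters assumed). */
uint32_t hashcrc(const char *s)
{
    uint32_t crc = 0;
    while (*s) {
        crc ^= (unsigned char)*s++;
        for (int i = 0; i < 8; i++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xA001u : crc >> 1;
    }
    return crc;
}

/* Eq. (1): record -> bucket[mod(hash(key), m)]. */
unsigned bucket_of(uint32_t (*hash)(const char *), const char *key)
{
    return hash(key) % M;
}
```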

2.3. Uniformity measure

An appropriate performance indicator for measuring the distribution of records within a hash table is the uniformity statistic U(h, t) described in detail by Aho et al. [8]. U(h, t) characterizes the uniformity of a hash function (h) in distributing 't' records by taking the ratio of the observed distribution of records to a determined random distribution. A U(h, t) of 1.00 indicates a perfectly random distribution. For this study a distribution will be considered random if it has a U(h, t) such that:

0.95 < U(h, t) ≤ 1.05.   (2)
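A C sketch of the statistic follows. The normalisation is restated here from our reading of Aho et al. [8] (sum of b_j(b_j + 1)/2 over the m bucket occupancies b_j, divided by its expected value for t randomly scattered records) and should be checked against that text before reuse.

```c
/* U(h, t) after Aho et al. [8]: the bucket "cost" sum(b_j (b_j + 1) / 2)
 * divided by the value expected when t records fall at random into m
 * buckets, (t / (2m)) (t + 2m - 1).  U close to 1.00 indicates an
 * essentially random (uniform) spread of records over the buckets. */
double uniformity(const unsigned *bucket_counts, unsigned m, unsigned t)
{
    double observed = 0.0;
    unsigned j;
    for (j = 0; j < m; j++)
        observed += bucket_counts[j] * (bucket_counts[j] + 1.0) / 2.0;
    return observed / (((double)t / (2.0 * m)) * (t + 2.0 * m - 1.0));
}
```

Piling every record into one bucket drives U above 1.00, while an overly even spread drives it below; hence the acceptance band of Eq. (2).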

2.4. Data sets

We used four databases compiled in our laboratory for genetic nomenclature and literature. Two further databases were constructed by combining the above databases and by writing a simple computer algorithm to produce 1000 sequential numeric ASCII strings from '000' to '999'. These six data sets are described in Table 2.
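The 'Numbers' data set can be regenerated by a loop like the following C sketch (our reconstruction of the simple algorithm mentioned above; the function name is ours):

```c
#include <stdio.h>
#include <string.h>

/* Fill keys[0..999] with the 1000 sequential numeric ASCII strings
 * "000" through "999" (zero-padded to three digits plus the NUL). */
void make_numbers(char keys[1000][4])
{
    int i;
    for (i = 0; i < 1000; i++)
        sprintf(keys[i], "%03d", i);
}
```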

3. R e s u l t s

Fig. 1 displays plots of bucket number (0-210) versus the number of records/bucket for three of the hash performance results given in Table 3. Fig. 1a plots hashsmc's distribution of the 'Mut' data set; this distribution is biased (U(h, t) = 1.09). It is observed in Fig. 1a that there is an overuse of buckets loaded with more than eight records and an underuse of buckets loaded with fewer than three records compared with a random distribution. For an unbiased distribution the desired bucket loading for Fig. 1a is calculated to be 4.02. Fig. 1b plots hashpjw's distribution of the 'Mut' data set; this distribution is measured to be random (U(h, t) = 1.00). The desired loading density for Fig. 1b is 4.02 and the majority of buckets contain between two and seven records. Fig. 1c shows hashpjw's distribution of the 'Numbers' data set; this distribution is measured to be more uniform than would be expected for a random distribution (U(h, t) = 0.92). The desired loading density for Fig. 1c is 4.74; note that the majority of buckets contain 3-7 records. The cycle which is apparent in Fig. 1c is not investigated here, but readers wishing to know why cycles may appear in a pseudo-random process are referred to Horowitz and Sahni [7].

Fig. 1. Plots of the bucket number versus the number of records in each bucket for three of the results shown in Table 3. (a) hashsmc's distribution of the data set 'Mut' (U(h, t) = 1.09). (b) hashpjw's distribution of the data set 'Mut' (U(h, t) = 1.00). (c) hashpjw's distribution of the data set 'Numbers' (U(h, t) = 0.92). See text for details.

Table 3 gives the U(h, t) statistics for all of the hashing functions (Table 1) used on the various data sets examined (Table 2). It is clear that hashpjw, hashquad and hashcrc perform consistently better than hashsmc and hashsum. The 'Dploid', 'Hploid' and 'Combine' data sets show similar U(h, t) values for each hash function. This is to be expected because of the similarity between these sets (Table 2). Note that hashpjw gives a more uniform than random distribution for the 'Numbers' data set while hashcrc and hashquad demonstrate a slight bucket bias for the same data set. With the other databases these three hash functions give random distributions. Although hashsmc and hashsum show a poor spectrum of uniformity, they demonstrate the versatility of the other three hash functions.

TABLE 3
Uniformity statistics: U(h, t) (a) for each hash function and data set

Function   Dploid  Hploid  Mut   Combine  Names  Numbers
hashsum    4.67    3.55    1.57  3.46     1.42   8.37
hashsmc    1.67    1.55    1.09  1.41     1.00   3.19
hashpjw    1.00    0.97    1.00  0.99     1.01   0.92
hashquad   0.99    0.96    0.99  1.01     0.99   1.07
hashcrc    0.99    1.00    0.98  1.00     0.99   1.07

(a) For derivation of this statistic, which is the measure of uniformity of allocation of records to buckets, see [8].

4. Discussion

Aho et al. [8] have tested hashsmc (named X2), hashsum (named X1), hashpjw and hashquad; Floyd [9] has tested hashpjw, hashcrc and hashsum. Both examined data sets based on keywords and identifiers found in high-level programming languages. They showed that hashpjw, hashcrc and hashquad performed consistently (i.e. U(h, t) ≤ 1.05) on all the data sets that they examined. Here, hashquad and hashcrc demonstrated a slight bucket bias (U(h, t) = 1.07) for the 'Numbers' data set (Table 3). The four hashing functions examined by Floyd [9], including hashsum, showed little difference in performance on the text materials he studied. The results presented here (Table 3) clearly demonstrate that the choice of hash function for a particular data set does affect the performance of the hashing application, since hashsum performed poorly on all our databases. Open hashing is extensively used in compiling high-level computer languages [8], but has not been accepted for more general applications dealing with database management systems. This is surprising since in theory at least open hashing should be appropriate for real-time analysis of large files. Attempts have been made to generalize hashing methods [2,5], but these are not without problems [10] (e.g. retrieval of the logical next record). Here we have resolved one of the initial problems of guaranteeing at worst a random record distribution for real data used in a biological laboratory. We show that three hash functions, hashpjw, hashquad and hashcrc, are suited for a variety of biological databases (Table 3). While we have a specific interest in the application of hashing to our time-lapse filming databases [11], we suspect that medical informatics researchers may find our studies of relevance. Open hashing is best suited for temporary record storage during data analysis. This is not only encountered when compiling high-level languages and analysing time-lapse films but also, for example, when a hospital compiles its daily list of critical and interesting patient data [12].

Acknowledgements

We would like to thank D. Leigh and N. Farrar for helpful suggestions during the course of this study. This work was supported by NH and MRC (Australia) grant 860245.

References

[1] R. Morris, Scatter storage techniques, Commun. ACM 11 (1968) 38-44.
[2] W. Litwin, Trie hashing, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 19-29 (1981).
[3] G.V. Cormack, R.N.S. Horspool and M. Kaiserswerth, Practical perfect hashing, Comput. J. 28 (1985) 54-58.
[4] P. Larson and A. Kajla, File organization: implementation of a method guaranteeing retrieval in one access, Commun. ACM 27 (1984) 670-677.
[5] D.A. Bell and S.M. Deen, Hash trees versus B-trees, Comput. J. 27 (1984) 218-224.
[6] V.Y. Lum, P.S.T. Yuen and M. Dodd, Key-to-address transform techniques: a fundamental performance study on large existing formatted files, Commun. ACM 14 (1971) 228-239.
[7] E. Horowitz and S. Sahni, Fundamentals of Computer Algorithms (Computer Science Press, Rockville, MD, 1978).
[8] A.V. Aho, R. Sethi and J.D. Ullman, Compilers: Principles, Techniques, and Tools (Addison-Wesley Computer Sciences Series, Addison-Wesley, Reading, MA, 1986).
[9] E.T. Floyd, Hashing for high-performance searching, D.D.J. Software Tools 12/2 (1987) 34-41.
[10] C.J. Date, An Introduction to Database Systems, Vol. 1, 4th edn. (Addison-Wesley Systems Programming Series, Addison-Wesley, Reading, MA, 1986).
[11] E.J. Breen, P.H. Vardy and K.L. Williams, A morphological study of the multicellular slug stage of Dictyostelium discoideum: an analytical approach, Development 101 (1987) 313-321.
[12] A.A. Alex, R.H. Gadsden, R.H. Gadsden and W.E. Groves, A computerized system for rapid retrieval and compilation of critical or interesting patient data, Comput. Methods Programs Biomed. 22 (1986) 267-273.
[13] J.E. Hendrix, The Small C Handbook (Reston Publishing, Reston, VA, 1984).
[14] M. Demerec, E.A. Adelberg, A.J. Clark and P.E. Hartman, A proposal for a uniform nomenclature in bacterial genetics, J. Gen. Microbiol. 50 (1968) 1-14.