Information and Software Technology 45 (2003) 109–112
www.elsevier.com/locate/infsof

Empirical studies of some hashing functions

Olumide Owolabi
Department of Mathematics and Computer Science, University of Port Harcourt, Port Harcourt, Nigeria
E-mail address: [email protected] (O. Owolabi)

Received 18 February 2002; revised 20 July 2002; accepted 14 September 2002
Abstract

The best hash function for a particular data set can often be found by empirical studies. The studies reported here are aimed at discovering the most appropriate function for hashing Nigerian names. Five common hash functions (the division, multiplication, midsquare, radix conversion and random methods), along with two collision-handling techniques (linear probing and chaining), were initially tried out on three data sets, each consisting of about 1000 words. The first data set consists of Nigerian names, the second of English names, and the third of words with computing associations. The major finding is that the performance of these functions on the Nigerian names is comparable to their performance on the other data sets. The superiority of the random and division methods over the others is also confirmed, although the division method will often be preferred for its ease of computation. It is further demonstrated that chaining is to be preferred as a collision-handling technique. The hash functions and collision-handling methods were then tested with much larger data sets and with long multi-word strings; these further tests confirmed the earlier findings.

© 2002 Elsevier Science B.V. All rights reserved.

Keywords: Hashing; Hash function efficiency; Collision resolution; Probing; Chaining
1. Introduction

Hashing constitutes a very important class of techniques for storing and accessing data in large information systems. Hash function methods, also known as scatter-storage techniques, locate a piece of data in a table by transforming the search key directly into a table address [5].

Computer-based information is usually organized as records with several fields and is identified by one or more of those fields, known as the key fields. In storing information about employees in a company, for example, the data may be organized into records with fields such as employee number, name, department, rank, duty and salary. We may choose to store and access the records based, for example, on the name field. This then becomes our key field.

Several methods exist for accomplishing such a task. The records could be stored using a sequential organization in which they are ordered according to the key field. To locate a desired record, the file is searched from the top until the desired record is encountered. This method is, of course, simplistic and inefficient. Methods that improve on this basic approach include the various indexed and tree organizations [4,8].
The great advantage of the hash approach over these other methods is that the table address for a given record can be computed directly from the key. Suppose we have a table with N locations and we wish to store M records (M < N) in the table; a hash function maps the M keys into the N table locations. The hash approach applies a function, h, to a key, k, to produce the table address: Address = h(k).

Ideally, a hash function should distribute the records so evenly over the available address space that no two records hash into the same table location. In such a case, every key can be located by looking into only one slot in the table. It frequently happens, however, that several records map into the same table location. This phenomenon is referred to as collision. Thus, mechanisms referred to as collision-handling techniques exist alongside hashing functions to resolve collision cases [9]. A hash function that ensures that no collisions occur is known as a perfect hash function [3]. If the function maps the M keys into exactly M locations, it is known as a minimal perfect hash function [2,10]. So far, such functions have only been found to work for rather small static data sets [1,11].

Given that hash functions generally suffer from collisions, it becomes necessary to find means of measuring their efficiencies.
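To make the idea of Address = h(k) concrete, the following minimal sketch (not taken from the paper; the toy hash function, the table size of 7 and the example key are illustrative assumptions) shows a record being stored at, and fetched from, an address computed directly from its key. When two different keys compute to the same address, a collision occurs and must be resolved by the techniques of Section 3.

```python
N = 7  # toy table size, for illustration only

def h(key: str) -> int:
    # Toy hash function: sum of character codes, reduced modulo the table size.
    return sum(ord(c) for c in key) % N

table = [None] * N

def store(key, record):
    address = h(key)
    if table[address] is not None and table[address][0] != key:
        # Two different keys mapped to the same slot: a collision (see Section 3).
        raise ValueError(f"collision at slot {address}")
    table[address] = (key, record)

def fetch(key):
    entry = table[h(key)]
    return entry[1] if entry is not None and entry[0] == key else None

store("Obi", {"dept": "Accounts"})
print(fetch("Obi"))
```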
The goodness of a hash function can be measured by the average number of table slots examined (probes) to locate any given record [12]. The number of probes required to find an empty slot or to locate a table entry depends, among other things, on how full the table is. The proportion of the hash table that is currently occupied is called the load factor; it is the ratio of the number of entries to the table size (M/N). If the table is relatively sparsely occupied, i.e. if the load factor is low, then we can expect to find an empty slot fairly soon. But if the load factor is high, then collisions are more likely. This suggests that it is desirable not to have the table too fully occupied. Normally, a load factor of between 0.6 and 0.75 is acceptable for most applications [7,9].

As an indication of the importance of hashing methods in information systems, a great deal of effort has been devoted to quantifying the efficiency of hashing functions. These studies indicate that the efficiency of a hashing function can depend to a large extent on the type of data involved, so empirical studies are essential to evaluate the goodness of hashing functions in any particular context [6,7,9,12].

The work reported in this paper arose out of our efforts at conducting research into large-scale information systems particularly adapted to local needs. In a prototype information retrieval system under development, hash functions are employed to map names, which are largely Nigerian, into table addresses. The aim of this project is to study some common hash functions as well as collision-handling methods to determine which combination gives the best performance for our particular data sets. Apart from Nigerian names, hash functions have to be employed for other data in the system. We therefore decided to test these functions on two other data sets: English names and words with computing associations. In addition to deciding which hash function/collision-handling combination is most suitable for the Nigerian names, this study will also tell us whether or not we need to employ a different combination for each data type in the system.

Having introduced the subject of our study in this section, we describe the hash functions and collision-handling techniques evaluated in this work in Sections 2 and 3. Section 4 discusses the experiments and results, and Section 5 concludes the paper.
2. Hash functions

In these studies we decided to examine the performances of five of the most common hash functions: the division, multiplication, midsquare, radix conversion and random methods. Brief descriptions of these methods are given below. It is required to map the set of keys into N table locations; hence, N is the size of the hash table. A key string, s, is first converted to a numerical value, k, which is then converted into an address between 0 and N - 1, inclusive. In each case the numerical value of a key string is computed using the function

    f(s) = C_1 W + C_2 W^2 + ... + C_n W^n,

where n = length(s) is the number of characters in the string, W is the number of characters in the base alphabet, and C_i is the ordinal position of character i in the alphabet.
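As a concrete illustration, here is a minimal sketch of this key-to-number conversion. The 26-letter English alphabet (so W = 26), lower-casing of the key and the example key are assumptions made for the sketch, not details given in the paper.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed base alphabet
W = len(ALPHABET)                        # number of characters in the alphabet

def key_value(s: str) -> int:
    """Convert a key string s to the numerical value f(s) = sum of C_i * W**i."""
    total = 0
    for i, ch in enumerate(s.lower(), start=1):
        c_i = ALPHABET.index(ch) + 1     # ordinal position of the i-th character
        total += c_i * W ** i
    return total

print(key_value("obi"))  # this value is then reduced to a table address by a hash method
```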
2.1. Division method

Given a key, k, the division method uses the remainder modulo N as the hash address:

    h(k) = k mod N

Selecting an appropriate hash table size is important to the success of this method. For example, a hash table whose size is divisible by two would yield even hash addresses for even keys and odd hash addresses for odd keys; if the keys happened to be predominantly even, they would all hash to even addresses. To obtain a good spread, the hash table size should be a prime number not too close to a power of two.

2.2. Multiplication method

With this method, a normalized hash address in the range 0 to 1 is first computed; the result is then scaled to a hash table of arbitrary size. The normalized hash address is computed by multiplying the key value, k, by a constant c (c = 0.618034 has been found to be a good choice [4]) and taking the fractional part of the product. Multiplying the resulting fraction by N gives a hash address between 0 and N - 1.

2.3. Midsquare method

To mix up the bits of a given key properly, this method squares the key value, k. The n middle bits are then extracted from the squared value to form the hash address. The size of the hash table determines the value of n; for example, 10 bits will address up to 1024 hash table locations.
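The following sketch (my own illustration, not code from the paper) implements the three methods just described. The table size N = 1537 and the assumption of 11 middle bits follow the experimental set-up described in Section 4; the final reduction modulo N in the midsquare sketch is an assumption made to keep addresses in range.

```python
N = 1537  # table size used in the paper's experiments

def h_division(k: int) -> int:
    # Division method: remainder of the key modulo the table size.
    return k % N

def h_multiplication(k: int, c: float = 0.618034) -> int:
    # Multiplication method: take the fractional part of k*c, then scale by N.
    frac = (k * c) % 1.0
    return int(frac * N)

def h_midsquare(k: int, n_bits: int = 11) -> int:
    # Midsquare method: square the key and extract the middle n_bits bits.
    squared = k * k
    shift = max((squared.bit_length() - n_bits) // 2, 0)  # position of the middle bits
    middle = (squared >> shift) & ((1 << n_bits) - 1)
    return middle % N  # assumed reduction so the address fits the table

k = 123456  # a key value such as produced by the key_value sketch above
print(h_division(k), h_multiplication(k), h_midsquare(k))
```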
2.4. Radix conversion method

This method is based on the idea that if the same number is expressed in two digital representations whose radices are relatively prime to each other, the respective digits will have very little correlation. Consider the key, for example, as a string of octal digits. Now regard this same string as digits in a different base, say 11. The resulting base-11 number is then converted to base 10, and the hash address is this number modulo N.
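A small sketch of the radix conversion method as just described (my illustration; the octal/base-11 pair comes from the example in the text, and the table size is assumed):

```python
N = 1537  # assumed table size, as in the later experiments

def h_radix(k: int) -> int:
    # Write the key in octal, then reinterpret the same digit string in base 11.
    octal_digits = oct(k)[2:]              # e.g. 1234 -> "2322"
    reinterpreted = int(octal_digits, 11)  # read those digits as a base-11 number
    return reinterpreted % N

print(h_radix(123456))
```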
2.5. The random method
This method uses a pseudo-random number generator to generate addresses that are more or less random. Pseudo-random number generators have the property that, when seeded with a particular value, the sequence they generate is deterministic; this property is sometimes exploited in collision handling. The pseudo-random number generator employed here is seeded with the key to generate a value between 0 and 1, and multiplying this value by the size of the hash table gives the hash address.
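A sketch of the random method along these lines, assuming Python's standard random module as the pseudo-random number generator (the paper does not say which generator was used) and the same assumed table size:

```python
import random

N = 1537  # assumed table size

def h_random(k: int) -> int:
    # Seed the generator with the key so the same key always yields the same address.
    rng = random.Random(k)
    return int(rng.random() * N)  # rng.random() is uniform in [0, 1)

print(h_random(123456))
```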
3. Collision-handling techniques

Collision-handling techniques fall into two general classes: open addressing and separate chaining. When a key hashes into a location already occupied by another record, open addressing techniques work by searching the table circularly until an empty slot is found. The search for an empty slot is conducted by adding a certain increment to the computed address each time. Thus the search for a slot visits the locations

    h(k), h(k) + f_1, h(k) + f_2, ..., h(k) + f_i, ...

Open addressing can be performed by linear probing, in which case f_i = i. It can also be done by quadratic probing, in which case f_i is a quadratic function of i. Alternatively, it can be done randomly, in which case f_i, i = 1, 2, ..., is a sequence of pseudo-random numbers. Searching for a key follows the same pattern. In all these cases the sequence f_i is deterministic for a given key; this ensures that the key will always be found if it is in the table.

Separate chaining techniques, as opposed to open addressing, utilize a storage area separate from the primary table to accommodate colliding entries. When a key hashes into a location that is already filled, the key is stored in the next reserved location, with a pointer to it stored in the primary table. There is therefore no need to visit other table locations when a collision occurs; the new entry is simply attached to the end of the list at the location where the collision has occurred.

Chaining can also be implemented using bucket addressing. In this case, locations in the reserved area are grouped into buckets of a reasonable fixed size. When there is a collision, an empty bucket is attached to the table
location where the collision has occurred. Colliding entries are inserted into this bucket until it is full. If a collision occurs at a location whose bucket is full, a new bucket is simply attached to the last one.
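To make the two classes concrete, here is a minimal sketch (my own illustration, not the paper's implementation) of insertion and search with linear probing and with separate chaining, using the division method as the underlying hash function; the small table size is an assumption for the example.

```python
N = 11  # small prime table size for illustration

def h(k: int) -> int:
    return k % N  # division method

# --- Open addressing with linear probing: f_i = i ---------------------------
probe_table = [None] * N

def lp_insert(key, value):
    for i in range(N):
        slot = (h(key) + i) % N          # h(k), h(k)+1, h(k)+2, ... circularly
        if probe_table[slot] is None or probe_table[slot][0] == key:
            probe_table[slot] = (key, value)
            return
    raise RuntimeError("table is full")

def lp_search(key):
    probes = 0
    for i in range(N):
        probes += 1
        slot = (h(key) + i) % N
        entry = probe_table[slot]
        if entry is None:                # empty slot: the key is not present
            return None, probes
        if entry[0] == key:
            return entry[1], probes
    return None, probes

# --- Separate chaining: colliding entries go on a list at the home slot -----
chain_table = [[] for _ in range(N)]

def sc_insert(key, value):
    chain_table[h(key)].append((key, value))

def sc_search(key):
    probes = 0
    for k, v in chain_table[h(key)]:
        probes += 1                      # one probe per list entry examined
        if k == key:
            return v, probes
    return None, probes
```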
4. Experiments and results

As stated earlier, three separate data sets were initially used in testing the five hash functions described in Section 2. For collision handling, separate chaining and linear probing were used. The first data set is a collection of Nigerian names, the second a collection of English names, while the third is made up of terms with computing associations. Each consists of about a thousand words. The Nigerian names range in length from 3 to 12 characters with an average of 6.15; the English names range in length from 3 to 13 characters with an average of 6.42; the computing terms have lengths from 12 to 17 characters with an average of 7.45.

For our 1000 words we chose a table size of 1537. This figure satisfies the peculiarities of the different hash functions. For example, with the midsquare function we compute the hash address from the middle 11 bits, which can address up to 2048 locations; multiplying this number by 3/4 and adding 1 gives a maximum of 1537 hash locations.

In our experiments, each hash function was tested in turn, and for each function the two collision-handling methods were separately employed. This process was repeated for each of the three data sets. For each data set, all the words were first inserted into the table. Each word was then retrieved in turn, and the total number of probes required to locate all the words was accumulated. From this total, the average number of probes was computed. The results are shown in Tables 1–3.
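A sketch of this measurement procedure for the chaining case (the function names, the word list and the use of the key_value sketch from Section 2 are my own illustrative assumptions, not the paper's code):

```python
def average_probes(words, hash_address, table_size=1537):
    """Insert every word into a separate-chaining table, then measure the
    average number of probes needed to retrieve each word again.
    hash_address(word) must return an address in range(table_size)."""
    table = [[] for _ in range(table_size)]
    for w in words:
        table[hash_address(w)].append(w)

    total_probes = 0
    for w in words:
        chain = table[hash_address(w)]
        total_probes += chain.index(w) + 1   # list entries examined to reach w
    return total_probes / len(words)

# Hypothetical usage: division method applied to the key_value sketch above.
# avg = average_probes(word_list, lambda w: key_value(w) % 1537)
```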
Table 1
Results for Nigerian names (average number of probes)

Hash function     Collision-handling method
                  Linear probing    Separate chaining
Division          2.50              1.36
Multiplication    1.81              1.33
Midsquare         2.84              1.77
Radix             2.00              1.32
Random            1.48              1.18

Table 2
Results for English names (average number of probes)

Hash function     Collision-handling method
                  Linear probing    Separate chaining
Division          1.87              1.31
Multiplication    1.88              1.31
Midsquare         3.98              2.64
Radix             1.95              1.32
Random            1.20              1.14

Table 3
Results for computing words (average number of probes)

Hash function     Collision-handling method
                  Linear probing    Separate chaining
Division          1.90              1.34
Multiplication    1.73              1.28
Midsquare         3.23              2.10
Radix             1.80              1.33
Random            1.20              1.14
Table 4
Results for multi-word strings (average number of probes)

Hash function     Collision-handling method
                  Linear probing    Separate chaining
Division          1.90              1.32
Multiplication    1.80              1.31
Midsquare         3.32              2.12
Radix             1.74              1.31
Random            1.21              1.15
From Tables 1–3 it is apparent that the random method of hashing requires the smallest number of probes, on average, to locate a given key. It can thus be inferred that a good pseudo-random number generator that can be seeded with a key will be efficient, although it requires more computational effort than some of the other hash functions. The performances of the division, multiplication and radix conversion methods are roughly on a par; the division method, however, has the advantage of being the easiest to compute. These findings are consistent with those of Lum [6] and Kohonen [5], who found that the division and random methods give the best performance.

It is significant that with the chaining method of collision handling the three data sets yield a more or less uniform set of figures, and that the hash functions perform markedly better with chaining. Kohonen [5] also found that separate chaining works best in most cases. With this method of collision handling, the characteristics of the Nigerian names do not appear to be significantly different from those of the other data types, which might be because all the data sets are drawn from the same alphabet. A major finding of these studies, therefore, is that we need not think of using special hash functions to handle Nigerian names.

Further tests were conducted on these hash functions and collision-handling methods using much larger dictionaries containing mixed types of strings. The hash table size was increased proportionately so as to keep the load factor at about 0.65 when the table is fully loaded, since this was the factor used in the initial tests. The results were very similar to those in Tables 1–3. A dictionary of long multi-word strings, with lengths varying from 14 to 48 characters and averaging 27 characters per string, was also used. Table 4 shows that the results for these strings are in line with those for short single-word strings, which indicates that string length is not necessarily a significant factor in the efficiency of hashing functions.
5. Conclusion

These preliminary studies have shown that it might not be necessary to devise special functions for hashing Nigerian names: the performance of the common hash functions on Nigerian names is comparable to their performance on other words drawn from the same alphabet. In addition, the superiority of hash functions such as the random and division methods over the others has been confirmed, and chaining has been shown to be a good choice for handling collisions. These findings will be of tremendous help in our work of designing information-processing systems.
References

[1] M.D. Brain, A.L. Tharp, Near-perfect hashing for large word sets, Software: Practice and Experience 19 (1989) 967–978.
[2] C.C. Chang, A scheme for constructing ordered minimal perfect hashing functions, Inf. Sci. 39 (1986) 187–195.
[3] G.V. Cormack, R.N.S. Horspool, M. Kaiserswerth, Practical perfect hashing, Comput. J. 28 (1985) 54–58.
[4] D.E. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, Addison-Wesley, Reading, MA, 1973, pp. 506–542.
[5] T. Kohonen, Content-Addressable Memories, Springer, Berlin, 1987, pp. 39–100.
[6] V.Y. Lum, General performance analysis of key-to-address transformation methods using an abstract file concept, Commun. ACM 16 (1973) 603–612.
[7] W.D. Maurer, T.G. Lewis, Hash table methods, Comput. Surv. 7 (1975) 5–19.
[8] N.E. Miller, File Structures Using Pascal, Benjamin/Cummings, California, 1987, pp. 209–260.
[9] R. Morris, Scatter storage techniques, Commun. ACM 11 (1968) 38–44.
[10] R.W. Sebesta, M.A. Taylor, Minimal perfect hash functions for reserved word lists, SIGPLAN Notices 20 (1985) 47–53.
[11] R. Sprugnoli, Perfect hashing functions: a single probe retrieving method for static sets, Commun. ACM 20 (1977) 841–850.
[12] J.D. Ullman, A note on the efficiency of hashing functions, J. ACM 19 (1972) 569–575.