The self-indexed search algorithm: A bit-level approach to minimal perfect hashing


Information Processing Letters 69 (1999) 253-258

Marco A. Torres *, Susumu Kuroyanagi, Akira Iwata
Department of Electrical and Computer Engineering, Nagoya Institute of Technology, Gokiso, Showa-ku, Nagoya 466-8555, Japan
* Corresponding author. Email: [email protected].

Received 2 April 1998
Communicated by F. Dehne

Abstract

A novel algorithm for generating minimal perfect hash tables for large amounts of data is presented. Functions considered are of the form h(key) = g(f1(key)) + f2(key), where g is retrieved from a lookup table, and f1 and f2 are bit strings contained in the key itself. Since keys are considered as binary strings, and only a few bits from the key itself are used to produce the address in the hash table, search time is independent of the length of the keys and of the size of the character set. © 1999 Published by Elsevier Science B.V. All rights reserved.

Keywords: Search algorithms; Data structures; Minimal perfect hashing

1. Introduction

Given a particular set of N records, searching consists in locating a specific record that is supposed to be contained in the collection. A piece of information called the key is used to distinguish among records. Like sorting, searching is one of the most time-consuming processes of many data processing systems, so there is a great need for efficient and fast search algorithms. In this context, hashing promises fast access. In hashing, keys are transformed into table addresses by doing arithmetic operations. A hash function is then a mapping from a set of keys into array indices; for every key of the sequence, a transformation function h(key) is computed and then taken as the location of the key in the table. A perfect hash function is a hash function h such that h(key_i) ≠ h(key_j) for all distinct i and j, i.e., it is a one-to-one mapping from the keys to the integers indexing the table. Since a perfect hash function transforms each key into a unique address in the hash table, each item can be retrieved from the table in a single probe. In general it is desirable to have a perfect hash function for a collection of N keys in a table of only N positions. Such a perfect hash function is called minimal. Perfect hash functions are used for fast retrieval of items from a static collection, such as reserved words in programming languages (compilers), commonly used words in natural languages (spell checking), lexical post-processing in word or text recognition systems, etc. [1].

In this paper, we present the Self-Indexed Search Algorithm for generating minimal perfect hash tables for large collections; by considering the keys as binary strings it is independent of the character set, and by using only a few bits from the key itself to produce the address in the hash table it is also independent of the length of the keys. Our approach is segmentation based [12]: the entire collection of words is divided into Sets by word length, and each Set is divided into small Groups by some bits (Index bits) taken from the key itself. For each group, a perfect hash function is obtained. Each set requires a single minimal perfect Hash table to allocate all the groups.


2. Previous work

Many different algorithms for generating perfect hash functions have been developed. Some approaches restrict input keys to already be integers [6,13]. The usual approach is to associate integer values with all or some of the characters in the string, and then to combine those values into a single number. Cichelli's method [3] uses the first and last characters of the key. Many different methods have been developed based on Cichelli's approach [2,10]; however, all of them work well only for relatively small collections (typically a few hundred keys). Sager [9] introduced the mincycle algorithm, which uses graphs to improve Cichelli's method, but it is also impractical for collections larger than 512 words. For larger collections, other solutions have been developed based on the mincycle algorithm. Czech et al. [4] used functions of the form

h(key) = (g(f1(key)) + g(f2(key))) mod N,

where f1 and f2 are functions that map strings of characters into integers, and g is a function that maps integers into [0, N − 1]. g is implemented as a table lookup, and the fj are auxiliary hash functions selected from a class of universal hash functions,

fj(key) = ( Σ_i Tj(i, key[i]) ) mod N,

where the Tj are tables of random integers modulo N for each character and for each position of a character in a word. The number of entries in the tables depends on the length of the keys and the size of the character set. Search time depends on how quickly the auxiliary functions can be computed (since all the characters in the key are considered, it actually depends on the length of the key). Fox et al. [5] used a function of the form

f(key) = (f0(key) + g(f1(key)) + g(f2(key))) mod N,

where f0, f1 and f2 are similar to those used by Czech et al.; g: {0, ..., 2^r − 1} → {0, ..., N − 1}; r is a parameter, typically N/2 or less. The larger r is, the greater the probability of finding a minimal perfect hash function.

It is a rule of thumb that hash functions should depend on every single bit of the key, so that two keys that differ in only one bit hash into different values; thus a hash function that simply extracts a portion of the key is not suitable [8, p. 506]. This has the consequence that the search time does depend on the length of the key, which can be a liability in practical applications with long keys [11, p. 574]. In a recent study, Jenkins [7] analyzed many different hash functions and data sets and concluded that key lengths in real data sets usually range from 8 bytes to 200 bytes. Therefore, methods that consider all the characters in the key could perform poorly under such conditions. On the other hand, methods that consider only a few characters are limited to small collections. An additional problem of the methods previously mentioned is that they are restricted to collections with small character sets, usually English words. Applying them to collections with large character sets, like Chinese or Japanese, could be quite complicated or even impossible. Therefore, a method for large static collections, independent of the length of the keys and independent of the size of the character set, is needed.
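For illustration, the table-lookup auxiliary functions of Czech et al. can be sketched as follows. The names, the 8-bit character assumption and the table sizes are ours, not the authors'; note that every character of the key is visited, which is why search time grows with key length.

    #include <stdlib.h>

    #define MAX_KEY_LEN 32   /* assumed maximum key length */
    #define CHARSET     256  /* assumed 8-bit character codes */

    /* T[j] is a table of random integers modulo N, one entry per
       (position in word, character) pair, as described above. */
    static unsigned T[2][MAX_KEY_LEN][CHARSET];

    static void init_tables(unsigned N)
    {
        for (int j = 0; j < 2; j++)
            for (int i = 0; i < MAX_KEY_LEN; i++)
                for (int c = 0; c < CHARSET; c++)
                    T[j][i][c] = (unsigned)rand() % N;
    }

    /* f_j(key) = (sum over character positions of T_j(i, key[i])) mod N */
    static unsigned f_aux(int j, const char *key, unsigned N)
    {
        unsigned sum = 0;
        for (int i = 0; i < MAX_KEY_LEN && key[i] != '\0'; i++)
            sum = (sum + T[j][i][(unsigned char)key[i]]) % N;
        return sum;
    }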

3. The Self-Indexed Algorithm

Keys are sequences of symbols chosen from a finite alphabet X formed by the M symbols {x1, x2, ..., xM}. To be processed by a digital computer, each symbol must be represented by a finite sequence of bits, called a code word. To assign a unique code word to each of the symbols in the alphabet, ⌈log2 M⌉ bits are required. The keys that we desire to process are sequences of such symbols. Enumerating the binary sequence to be assigned to each key is out of the question. A very practical approach consists of encoding the characters and then forming the code word for a sequence of characters by concatenating the code words of the characters. This has the consequence that keys are overrepresented; e.g., to represent N elements, n bits are needed (where n is the minimal integer such that n ≥ log2 N), but usually |key| > n.

M.A. Torres et al. /Information Processing Letters 69 (1999) 253-258

If the keys were optimally represented (|key| = n), they could be used by themselves as the index into the hash table. Since they are overrepresented, and the code words are already assigned, our goal is to find, inside the keys themselves, some substring (some combination of bits) that provides a unique pattern for each key in the group.

Example 1. Consider the words headman and headmen, coded in 8-bit ASCII. They differ only in the sixth character; actually, they differ only at bit 5 (counting from 0, left to right) of that character (a → 01100001, e → 01100101). Therefore, we can use bit 45 of the entire string to distinguish them.

The function considered is of the form

h(key) = get_bit(j, &key),    (1)

where get_bit returns the value of the j-th bit of string key. In our example, j = 45; then, in searching,

h(headman) = get_bit(45, &headman) = 0,
h(headmen) = get_bit(45, &headmen) = 1.
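A minimal C sketch of get_bit follows. The bit-numbering convention (bits counted from 0, starting at the leftmost bit of the first character) is our assumption; it is the convention that makes the bit-45 example work out.

    #include <stdio.h>

    /* Returns the value of the j-th bit of the string key,
       counting bits from 0 at the leftmost bit of the first character. */
    static int get_bit(int j, const unsigned char *key)
    {
        return (key[j / 8] >> (7 - (j % 8))) & 1;
    }

    int main(void)
    {
        /* Example 1: headman and headmen differ only at bit 45. */
        printf("%d\n", get_bit(45, (const unsigned char *)"headman")); /* 0 */
        printf("%d\n", get_bit(45, (const unsigned char *)"headmen")); /* 1 */
        return 0;
    }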

When several bits are required, Eq. (1) is expanded to:

h(key) = Σ_{i=0}^{n−1} get_bit(s_i, &key),    (2)

where Σ indicates string concatenation, not arithmetical sum, and n is the number of bits required to disjoint the group. The s_i are Select bits (Sel_bits = {s_0, s_1, ..., s_{n−1}}), bit positions selected from the key itself to provide a unique pattern for each member of the group. To get the Select bits, we must seek inside the keys themselves for those bit positions that provide a unique pattern for each key in the group. Evaluating all possible combinations would take a long time; however, a brief analysis of how the encoding process is done helps to speed up the seek, by considering only those bit positions that can provide a unique identifier for each word in the group.

How are the binary sequences for different values of n constructed? If n = 1, two sequences, 0 and 1, exist. The sequences for n = 2 can be constructed by concatenating first a 0 and then a 1 to each of the sequences of length 1. The sequences for n = 3 can be constructed from those of length 2 in a similar manner. The number of binary sequences of length n is then 2^n. At any bit position b_j, the number of zeros is equal to the number of ones, and both are equal to 2^(n−1). Therefore, for n Select bits to yield a unique code word for every key, the number of keys that hold the same value (zero or one) at any particular Select-bit position must be less than or equal to 2^(n−1).

3.1. Seek step

After choosing Candidate bits (i.e., after excluding those bit positions at which the number of zeros or the number of ones is larger than 2^(n−1)), an exhaustive search is done by taking n candidate bits at a time. At each iteration, the next candidate bit in the list is evaluated. Seek is repeated until an adequate combination of bits is found, or until it is concluded that the present number of bits, n, is not sufficient to disjoint the group; in such a case, n is increased by one and the process is repeated. In the worst case, the keys differ in only 1 bit; then the number of Select bits required will be equal to the number of elements in the group.
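The Seek step can be sketched as follows. This is a simplified brute-force reading of the procedure described above: the function and constant names are ours, and the exhaustive combination enumeration is an assumption rather than the authors' exact iteration order.

    #include <string.h>

    #define MAX_GROUP 64   /* assumed: at most 64 keys per group */
    #define MAX_BITS  64   /* assumed: at most 64 candidate bit positions */

    /* j-th bit of the key, bit 0 being the leftmost bit of the first character. */
    static int get_bit(int j, const unsigned char *key)
    {
        return (key[j / 8] >> (7 - (j % 8))) & 1;
    }

    /* Pattern obtained by concatenating the selected bits of one key (Eq. (2)). */
    static unsigned pattern(const unsigned char *key, const int *sel, int n)
    {
        unsigned p = 0;
        for (int i = 0; i < n; i++)
            p = (p << 1) | (unsigned)get_bit(sel[i], key);
        return p;
    }

    /* 1 if the n positions in sel give a distinct pattern for every key. */
    static int disjoints(const unsigned char **keys, int nkeys, const int *sel, int n)
    {
        unsigned seen[MAX_GROUP];
        for (int k = 0; k < nkeys; k++) {
            unsigned p = pattern(keys[k], sel, n);
            for (int m = 0; m < k; m++)
                if (seen[m] == p)
                    return 0;
            seen[k] = p;
        }
        return 1;
    }

    /* Seek: try combinations of n candidate bits; if none disjoints the group,
       increase n and repeat.  Returns 1 and fills sel_out/n_out on success. */
    static int seek(const unsigned char **keys, int nkeys,
                    const int *cand, int ncand, int *sel_out, int *n_out)
    {
        for (int n = 1; n <= ncand && n <= 32; n++) {  /* patterns kept in 32 bits */
            int idx[MAX_BITS];
            for (int i = 0; i < n; i++)
                idx[i] = i;                            /* first combination */
            for (;;) {
                int sel[MAX_BITS];
                for (int i = 0; i < n; i++)
                    sel[i] = cand[idx[i]];
                if (disjoints(keys, nkeys, sel, n)) {
                    memcpy(sel_out, sel, (size_t)n * sizeof(int));
                    *n_out = n;
                    return 1;
                }
                /* advance idx[] to the next combination of n out of ncand */
                int i = n - 1;
                while (i >= 0 && idx[i] == ncand - n + i)
                    i--;
                if (i < 0)
                    break;
                idx[i]++;
                for (int k = i + 1; k < n; k++)
                    idx[k] = idx[k - 1] + 1;
            }
        }
        return 0;
    }

A caller would pass the keys of one group together with its Candidate bit positions; on success, sel_out and n_out are the group's Sel_bits and n.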

3.2. Divide large sets into small groups

Sometimes it is not possible to find sufficient Select bits with the minimal n; in general, minimal perfect hashing can only be guaranteed when the group is complete, i.e., when 2^n = N. When the group is incomplete (2^n > N), some code words are not assigned; therefore hashing is perfect but not minimal. In general, it is better to divide the process into two levels:
(i) By using a few Index bits, divide the set into small groups.
(ii) For each small group, find a perfect function with parameters (n, Sel_bits).
The function that we are looking for is of the form

h(key) = g(f1(key)) + f2(key),    (3)

where g is retrieved from a lookup table and f1 and f2 are derived from Eq. (2).
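As an illustration of step (i), the following sketch (helper names are ours) groups keys by the pattern formed by a few Index-bit positions; with the "months" collection and the Index bits used later in Example 2, it reproduces the seven groups reported there.

    #include <stdio.h>

    /* j-th bit of the key, bit 0 being the leftmost bit of the first character. */
    static int get_bit(int j, const unsigned char *key)
    {
        return (key[j / 8] >> (7 - (j % 8))) & 1;
    }

    /* Pattern (group number) formed by concatenating the Index bits of a key. */
    static unsigned index_pattern(const unsigned char *key, const int *index_bits, int Ib)
    {
        unsigned v = 0;
        for (int i = 0; i < Ib; i++)
            v = (v << 1) | (unsigned)get_bit(index_bits[i], key);
        return v;
    }

    int main(void)
    {
        const char *keys[] = { "jan", "feb", "mar", "apr", "may", "jun",
                               "jul", "aug", "sep", "oct", "nov", "dec" };
        const int index_bits[] = { 21, 22, 23 };  /* Ib = 3, as in Example 2 */

        /* Prints 7 distinct group numbers; feb, mar and apr share group 2 (010). */
        for (int k = 0; k < 12; k++)
            printf("%s -> group %u\n", keys[k],
                   index_pattern((const unsigned char *)keys[k], index_bits, 3));
        return 0;
    }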


Table 1
"Months" collection used in Example 2. Bits in italics are not considered in the Seek step; underlined bits are Index bits; bold bits are the Select bits for each group. [Table body not reproduced.]

f1 = Σ_{i=0}^{Ib−1} get_bit(p_i, &key),    (4)

f2 = Σ_{i=0}^{n−1} get_bit(s_i, &key),    (5)

where Ib indicates the number of Index bits (Index_bits = {p_0, p_1, ..., p_{Ib−1}}); their values (bit positions) are chosen for the entire set, while Sel_bits = {s_0, s_1, ..., s_{n−1}} and n must be found for each group.

3.3. The assignment step

After the parameters for every group in the set have been obtained, the final step is to assign a location in the Hash table to each key, as follows:
(i) Beginning with the group with the largest n, find a space in the Hash table large enough (2^n contiguous locations) to allocate all the elements in the group. The address of the first location in that space is the value of g, and it must be stored in the Parameters table, at the location pointed to by f1, followed by the parameters of the group: {n, s_0, s_1, ..., s_{n−1}}.
(ii) For each key in the group, compute its address in the Hash table: h(key) = g + f2.
(iii) Continue the process for all groups in decreasing order of n.

3.4. Index table

As the number of Index bits increases, more single-element groups are produced, but also many locations in the Parameters table are left unused. To reduce this problem, the Parameters table is addressed by pointers stored in the Index table. In this way we can (almost) freely increase the number of Index bits (since the Index table size is 2^Ib) without wasting much memory space.
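Putting the pieces together, searching for a key touches only three small structures: the Index table, the Parameters table and the Hash table. The following sketch shows one possible layout; the struct and field names are assumptions, since the paper specifies only that each Parameters-table entry holds g followed by n and the Select bits.

    #define MAX_SEL 8   /* assumed upper bound on Select bits per group */

    static int get_bit(int j, const unsigned char *key)
    {
        return (key[j / 8] >> (7 - (j % 8))) & 1;
    }

    /* One Parameters-table entry: g, followed by n and the Select bits. */
    struct params {
        unsigned g;            /* base address of the group in the Hash table */
        int n;                 /* number of Select bits */
        int sel[MAX_SEL];      /* Select bit positions */
    };

    struct set_tables {
        int Ib;                     /* number of Index bits */
        const int *index_bits;      /* Index bit positions, chosen per set */
        const unsigned *index_tab;  /* Index table: 2^Ib pointers into ptab */
        const struct params *ptab;  /* Parameters table */
    };

    /* h(key) = g(f1(key)) + f2(key), Eq. (3). */
    static unsigned hash_lookup(const struct set_tables *t, const unsigned char *key)
    {
        unsigned f1 = 0, f2 = 0;

        for (int i = 0; i < t->Ib; i++)             /* Eq. (4) */
            f1 = (f1 << 1) | (unsigned)get_bit(t->index_bits[i], key);

        const struct params *p = &t->ptab[t->index_tab[f1]];

        for (int i = 0; i < p->n; i++)              /* Eq. (5) */
            f2 = (f2 << 1) | (unsigned)get_bit(p->sel[i], key);

        return p->g + f2;                           /* address in the Hash table */
    }

Only Ib + n bits of the key are examined, which is why the search time does not depend on the key length or on the character set.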

Example 2. Consider the collection: jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec (also used in [3,4,10]). By using 3 Index bits (21, 22, 23), 7 groups were obtained. The largest group, {apr, mar, feb}, is allocated at address 0000; then g(010) = 0000 is stored at location 00000 in the Parameters table, followed by the parameters of the group (n = 2, s_0 = 5, s_1 = 6). Then, pointer 0000 must be stored at location 010 in the Index table. The location for apr is obtained by h(apr) = g(010) + f2 = 0000 + 00 = 0000; h(mar) = g(010) + f2 = 0000 + 10 = 0010; h(feb) = g(010) + f2 = 0000 + 11 = 0011. Single-element groups (sep, may, dec, aug) were stored at locations not used by incomplete groups.

Table 2
Collections of words used to evaluate the proposed method: E5, E6, J4, J6, J8, J10, J12 and ZIPCODES. m+ = largest group in the set; n+ = largest value of n required for a group in the set. [Numeric table body not reproduced.]

On searching for apr:

Parameters: Ib = 3; Index_bits = {21, 22, 23}; then, by Eq. (4),

i = 0: get_bit(21, &apr) = 0; f1 = 0,
i = 1: get_bit(22, &apr) = 1; f1 = 01,
i = 2: get_bit(23, &apr) = 0; f1 = 010,

f1(apr) = 010 → g(f1) = g(010) = 0000.

For group "010": Parameters: n = 2; Sel_bits = {5, 6}; then, by Eq. (5),

i = 0: get_bit(5, &apr) = 0; f2 = 0,
i = 1: get_bit(6, &apr) = 0; f2 = 00,

f2(apr) = 00,

h(apr) = g(f1) + f2 = 0000 + 00 = 0000.
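The same walkthrough can be checked with a few lines of standalone code; the tables are reduced to the single group of interest, and the names are ours.

    #include <stdio.h>

    static int get_bit(int j, const unsigned char *key)
    {
        return (key[j / 8] >> (7 - (j % 8))) & 1;
    }

    int main(void)
    {
        const unsigned char *key = (const unsigned char *)"apr";
        const int index_bits[3] = { 21, 22, 23 };   /* Ib = 3, as in Example 2 */
        const int sel_bits[2]   = { 5, 6 };         /* group "010": n = 2 */
        unsigned f1 = 0, f2 = 0, g = 0;             /* g(010) = 0000 */

        for (int i = 0; i < 3; i++) f1 = (f1 << 1) | (unsigned)get_bit(index_bits[i], key);
        for (int i = 0; i < 2; i++) f2 = (f2 << 1) | (unsigned)get_bit(sel_bits[i], key);

        /* Prints f1 = 2 (binary 010), f2 = 0, h(apr) = 0. */
        printf("f1 = %u, f2 = %u, h(apr) = %u\n", f1, f2, g + f2);
        return 0;
    }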

3.5. Choosing the Index bits

The coding process is inherently inefficient; usually the MSB are not used over their full range. For example, for English characters coded in ASCII, bits 3, 4 and 5 range only from 000 (for a, b and c) to 110 (for x, y and z). Therefore, if we combine bits 3, 4 and 5 of any two characters, we can get no more than 49 code words, although with 6 bits, 64 combinations can be produced. But if we use bits 5, 6 and 7, it is possible to get 64 of the 64 combinations. On the other hand, words usually contain common patterns in their initial or ending parts (verb radicals, verb inflections, prefixes, suffixes, etc.), but not in the middle part. Therefore, the LSB of middle characters typically produce the best results (the largest number of groups).

4. Tests and results

To evaluate the proposed method we used three large collections (see Table 2). Two methods for generating perfect hash functions for large collections (Czech et al. [4] and Fox et al. [5]) were also implemented, and their search times were compared.


On looking for Index bits, we made several tests with different character positions, only to find that, as expected, middle characters produced the largest number of groups; in all cases, we used bits 5, 6 and 7. For the proposed method, search time is proportional to the number of bits that must be retrieved from the key (Index_bits + Sel_bits). It was almost constant for all the sets in the three collections tested, as opposed to the other methods, where search time increases with the length of the keys.
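The tests described above amount to counting, for each candidate list of Index-bit positions, how many groups the collection splits into. A minimal sketch of such a test (helper names are ours):

    #include <stdio.h>

    static int get_bit(int j, const unsigned char *key)
    {
        return (key[j / 8] >> (7 - (j % 8))) & 1;
    }

    /* Number of distinct groups produced by the given Index-bit positions. */
    static int count_groups(const char **words, int nwords,
                            const int *index_bits, int Ib)
    {
        int ngroups = 0;
        unsigned seen[1024];                 /* assumes at most 1024 groups */
        for (int w = 0; w < nwords; w++) {
            unsigned f1 = 0;
            for (int i = 0; i < Ib; i++)
                f1 = (f1 << 1) | (unsigned)get_bit(index_bits[i],
                                                   (const unsigned char *)words[w]);
            int found = 0;
            for (int k = 0; k < ngroups; k++)
                if (seen[k] == f1) { found = 1; break; }
            if (!found)
                seen[ngroups++] = f1;
        }
        return ngroups;
    }

    int main(void)
    {
        const char *months[] = { "jan", "feb", "mar", "apr", "may", "jun",
                                 "jul", "aug", "sep", "oct", "nov", "dec" };
        const int cand[] = { 21, 22, 23 };  /* bits 5-7 of the last character */
        printf("groups: %d\n", count_groups(months, 12, cand, 3));  /* 7 */
        return 0;
    }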

5. Discussion

Human beings represent information as words, i.e., sequences of characters; to distinguish between words, the deepest level that humans can use is the character level. Computers represent information as sequences of bits; they can go down to the bit level (deeper than the human level) to distinguish among strings. Hashing methods that consider words as strings of characters limit themselves to working at the human level; they are language dependent and must be adapted to the particular characteristics of each case, which sometimes is not affordable. On the other hand, the proposed Self-Indexed Search Algorithm, by working at the bit level, is independent of the language in which the information was originally represented and, by using only a few bits of the key to guide the search, it is also independent of the length of the keys.

6. Conclusions

An efficient and easy to implement algorithm for producing minimal perfect hash tables for large collections has been presented. Since the proposed method considers the keys as binary strings, it can be applied to any class of keys with the same efficiency. Experimental results have confirmed that it is independent of the length of the keys and of the size of the character set.

References

[1] G.H. Gonnet, R. Baeza-Yates, Handbook of Algorithms and Data Structures in Pascal and C, Addison-Wesley, Reading, MA, 1991.
[2] C.C. Chang, The study of an ordered minimal perfect hashing scheme, Comm. ACM 27 (4) (1984) 384-387.
[3] R.J. Cichelli, Minimal perfect hash functions made simple, Comm. ACM 23 (1) (1980) 17-19.
[4] Z.J. Czech, G. Havas, B.S. Majewski, An optimal algorithm for generating minimal perfect hash functions, Inform. Process. Lett. 43 (5) (1992) 257-264.
[5] E.A. Fox, L.S. Heath, Q.F. Chen, A.M. Daoud, Practical minimal perfect hash functions for large databases, Comm. ACM 35 (1) (1992) 105-121.
[6] G. Jaeschke, Reciprocal hashing: A method for generating minimal perfect hashing functions, Comm. ACM 24 (12) (1981) 829-833.
[7] B. Jenkins, Algorithm alley: Hash functions, Dr. Dobb's J. (1997) 107-109, 115-116.
[8] Y. Langsam, M.J. Augenstein, A.M. Tenenbaum, Data Structures Using C and C++, Prentice-Hall, Englewood Cliffs, NJ, 1996.
[9] T.J. Sager, A polynomial time generator for minimal perfect hash functions, Comm. ACM 28 (5) (1985) 523-532.
[10] D.C. Schmidt, Gperf: A perfect hash function generator, in: Proc. 2nd USENIX C++ Conference, San Francisco, CA, April 1990, pp. 87-101.
[11] R. Sedgewick, Algorithms in C, 3rd edn., Addison-Wesley, Reading, MA, 1997.
[12] R. Sprugnoli, Perfect hashing functions: A single probe retrieving method for static sets, Comm. ACM 20 (11) (1977) 841-850.
[13] V.G. Winters, Minimal perfect hashing for large sets of data, in: Proc. Internat. Conf. on Computing and Information (ICCI '90), Canada, May 1990, pp. 275-284.