
An Approach to Generalized Hashing

This presentation surveys existing hashing techniques, including separate chaining, cuckoo hashing, and FKS perfect hashing. It discusses the space cost of fixed-length keys and proposes a new approach to compression using variable-length keys. It also gives an overview of the implementation and suggests further optimizations.


Presentation Transcript


  1. An Approach to Generalized Hashing. Michael Klipper, with Dan Blandford and Guy Blelloch.

  2. Hashing techniques currently available
  • Many hashing algorithms exist, including separate chaining, cuckoo hashing, and FKS perfect hashing.
  • Many hash functions have been designed as well, including several universal families.
  • Good: O(1) expected amortized time for updates, and many offer O(1) worst-case time for searches.
  • Bad: they require fixed-length keys and fixed-length data (see the sketch below).
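For concreteness, here is a minimal separate-chaining sketch in C (all names are mine, not from any particular library) that makes the fixed-length restriction visible: every key and value slot has a compile-time size, no matter how small the actual item is.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS  1024
#define KEY_BYTES 8   /* every key slot is this big...          */
#define VAL_BYTES 8   /* ...even when the item is much smaller  */

typedef struct node {
    uint8_t key[KEY_BYTES];
    uint8_t val[VAL_BYTES];
    struct node *next;        /* chain of items in this bucket */
} node;

static node *buckets[NBUCKETS];

/* FNV-1a over the fixed-width key (a simple stand-in; the slide
   mentions universal families, which would be used in practice). */
static size_t hash(const uint8_t *key) {
    uint64_t h = 14695981039346656037ULL;
    for (int i = 0; i < KEY_BYTES; i++) {
        h ^= key[i];
        h *= 1099511628211ULL;
    }
    return (size_t)(h % NBUCKETS);
}

/* O(1) expected insert -- but always KEY_BYTES + VAL_BYTES of
   storage per item, which is the waste slide 3 complains about. */
static void insert(const uint8_t *key, const uint8_t *val) {
    node *n = malloc(sizeof *n);
    memcpy(n->key, key, KEY_BYTES);
    memcpy(n->val, val, VAL_BYTES);
    size_t b = hash(key);
    n->next = buckets[b];
    buckets[b] = n;
}
```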

  3. What’s so bad about fixed length?
  • It is easy to waste a lot of space: every hash bucket must be as large as the largest item to be stored in the table.
  • This is a serious problem for sparsely-filled tables, or for tables where large items occur infrequently.
  • Hash tables are often building blocks for more complicated structures, so optimizing them pays off in many places.

  4. Example: A Graph Layout Where We Store Edges in a Hash Table
  Let’s say u is a vertex of degree d and v1, …, vd are its neighbors, with v0 = vd+1 = u by convention. Then the entry representing the edge (u, vi) has key (u, vi) and data (vi-1, vi+1). An extra entry “starts” the list; a sketch of this layout follows.
  [Figure: a vertex u of degree 4 with neighbors v1–v4, and the corresponding hash-table entries keyed (u, v1) … (u, v4).]
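A hedged C sketch of this layout (the structs and the ht_lookup interface are illustrative assumptions, not the authors' code): because each edge entry stores its predecessor and successor, u's adjacency list can be walked edge by edge through the hash table.

```c
#include <stdbool.h>

typedef struct { int u, v; } edge_key;        /* key: the edge (u, v_i)    */
typedef struct { int prev, next; } edge_data; /* data: (v_{i-1}, v_{i+1})  */

/* Assumed interface to an ordinary fixed-length hash table. */
bool ht_lookup(edge_key k, edge_data *out);

/* Walk u's adjacency list.  The convention v_0 = v_{d+1} = u acts as
   a sentinel: once `next` comes back to u, the list is exhausted.
   The extra "starter" entry on the slide supplies `first`; here it
   is simply passed in. */
void visit_neighbors(int u, int first, void (*visit)(int)) {
    int v = first;
    while (v != u) {
        visit(v);
        edge_data d;
        if (!ht_lookup((edge_key){u, v}, &d)) break;  /* missing entry */
        v = d.next;
    }
}
```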

  5. An Idea for Compression
  • Instead of ((u, vi), (vi-1, vi+1)) in the table, we will store ((u, vi - u), (vi-1 - u, vi+1 - u)).
  • With this representation, we need O(kn) space where k = Σ_{(u,v)∈E} log |u - v|.
  • A good labeling of the vertices will make many of these differences small, but not all of them. The following paper has details: D. Blandford, G. E. Blelloch, and I. Kash. Compact Representations of Separable Graphs. In SODA, 2003, pages 342–351.
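A small C sketch of the difference transform (the helper names are hypothetical); bit_length approximates the log |u - v| term that the space bound sums.

```c
/* Difference-coded entry: store (u, v_i - u) as the key and
   (v_{i-1} - u, v_{i+1} - u) as the data.  With a good vertex
   labeling, most of these differences are small. */
typedef struct { int u, dv; } ckey;
typedef struct { int dprev, dnext; } cdata;

ckey  encode_key (int u, int v)          { return (ckey){u, v - u}; }
cdata encode_data(int u, int vp, int vn) { return (cdata){vp - u, vn - u}; }
int   decode_neighbor(ckey k)            { return k.u + k.dv; }

/* Bits needed for a difference -- a stand-in for the log|u - v|
   term in the space bound (a real implementation would store the
   difference with a variable-length code). */
int bit_length(int d) {
    unsigned x = d < 0 ? 0u - (unsigned)d : (unsigned)d;
    int b = 1;
    while (x >>= 1) b++;
    return b;
}
```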

  6. First, a simpler problem
  • Variable-length data stored in arrays: it’s like a hash table, except that the indices are now in the fixed range 0…n-1 for the n items in the array.
  We’ll use the following data for our example in these slides:
  (0, 10110) (1, 0110) (2, 11111) (3, 0101) (4, 1100) (5, 010) (6, 11011) (7, 00001111)
  We’ll assume that the word size of the machine is 2 bytes.
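To make the running example concrete, the eight strings could be represented like this in C (the bitstr type is my illustration, assuming the slide's 16-bit word):

```c
#include <stdint.h>

/* The slide's example: eight variable-length bit strings, indexed
   0..7, on a machine with 16-bit (2-byte) words. */
typedef struct {
    uint16_t bits;   /* the string's bits, right-aligned */
    int      len;    /* number of significant bits       */
} bitstr;

static const bitstr example[8] = {
    { 0x16, 5 },  /* s0 = 10110    */
    { 0x06, 4 },  /* s1 = 0110     */
    { 0x1F, 5 },  /* s2 = 11111    */
    { 0x05, 4 },  /* s3 = 0101     */
    { 0x0C, 4 },  /* s4 = 1100     */
    { 0x02, 3 },  /* s5 = 010      */
    { 0x1B, 5 },  /* s6 = 11011    */
    { 0x0F, 8 },  /* s7 = 00001111 */
};
```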

  7. Key Idea: BLOCKS
  • Multiple data items can be crammed into a word, so let’s take advantage of that.
  • A block is two words: one holding the data, and one marking off the separations between strings.
  • If the first index in a block is i, we’ll label the block bi.
  For example, b0, the block containing strings s0 through s2 from our example, looks like this:
  1st word (data):  10110 0110 11111 ..  (s0, s1, s2 concatenated; 2 bits unused)
  2nd word (marks): 10000 1000 10000 10  (a 1 at the start of each string, plus a terminating 1)
  A C sketch of this packing follows.
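Here is one way such a block might be packed in C, under the slide's 16-bit word assumption (block_append and its layout choices are mine, not the paper's): appending s0, s1, s2 reproduces the two words shown above.

```c
#include <stdint.h>

enum { WORDBITS = 16 };   /* the example's 2-byte machine word */

typedef struct {
    uint16_t data;    /* the strings' bits, concatenated from the top */
    uint16_t marks;   /* a 1 at the first bit of each string, plus a
                         terminating 1 just past the last one         */
} block;

/* Append a `len`-bit string to a block; returns 0 if it won't fit.
   `used` counts bits already occupied in the data word. */
int block_append(block *b, int *used, uint16_t bits, int len) {
    if (*used + len > WORDBITS) return 0;
    b->data  |= (uint16_t)(bits << (WORDBITS - *used - len));
    b->marks |= (uint16_t)(1u << (WORDBITS - 1 - *used));
    *used += len;
    if (*used < WORDBITS)   /* terminating mark, when there is room */
        b->marks |= (uint16_t)(1u << (WORDBITS - 1 - *used));
    return 1;
}

/* Appending s0 = 10110, s1 = 0110, s2 = 11111 yields
   data  = 1011001101111100 (last 2 bits unused) and
   marks = 1000010001000010, matching b0 above. */
```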

  8. Organization of Blocks
  • Index structure (a regular array): A[i] = 1 if and only if string #i starts a block.
  • Hash table (an ordinary one): if string #i starts a block, then H(i) = the address of bi.
  • Note that it is easy to split and merge blocks.
  • Key size invariant: two adjacent blocks (like b0 and b3 in the example) must have sizes summing to more than the word size of the machine.
  In the running example, A = 1 0 0 1 0 0 0 1 (blocks start at strings 0, 3, and 7), and H maps 0 → b0, 3 → b3, and 7 → b7.
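A sketch of the lookup path in C (A, ht_get, and find_block are illustrative names): to find string #i, scan back in A to the start of its block, then use H to reach the block itself.

```c
#include <stdint.h>

enum { NSTRINGS = 8 };

typedef struct { uint16_t data, marks; } block;

/* Assumed names: A is the index structure (one bit per string in a
   real implementation), and ht_get stands in for the hash table H. */
static uint8_t A[NSTRINGS] = {1,0,0,1,0,0,0,1};  /* blocks start at 0, 3, 7 */
block *ht_get(int block_start);

/* Find the block holding string #i and i's rank within that block.
   The backward scan here is a readable stand-in; word-wide lookup
   tables over A make this constant time, as slide 9 notes. */
block *find_block(int i, int *rank) {
    int j = i;
    while (!A[j]) j--;
    *rank = i - j;
    return ht_get(j);    /* H(j) = address of block b_j */
}
```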

  9. A Rough Look at Space and Time Bounds for this Array Structure
  Let’s say we have n items, and w is the word size in bits of the machine. WLOG all data strings are nonempty. Let m = Σi |si|.
  • Each block holds at most w strings, and consecutive block starts are at most w positions apart in the index.
  • Lookup tables cut the time down to constant time for finding a block and the string inside it, since they can operate on entire words at once.
  • The indexing structure and hash table use O(w) bits per block. Each block is on average half full due to the invariant, so there are O(m/w + 1) blocks.
  • O(m + w) bits used, and operations are O(1) time!
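Spelling out the arithmetic behind the claimed bound (my derivation, following the slide's counts):

```latex
% The size invariant forces adjacent blocks to sum to more than w
% bits, so on average a block carries at least w/2 bits of string
% data.  With m = \sum_i |s_i| total bits:
\[
   \#\text{blocks} = O\!\left(\tfrac{m}{w} + 1\right), \qquad
   \text{overhead per block} = O(w) \text{ bits},
\]
\[
   \text{total space} = O\!\left(\tfrac{m}{w} + 1\right) \cdot O(w)
                      = O(m + w) \text{ bits.}
\]
```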

  10. Briefly, how we proceed from there
  • We can finally implement our generalized hash table, using an array of the type we just described as the hash table.
  • There are more details; the following paper explains them: D. Blandford and G. E. Blelloch. Storing Variable-Length Keys in Arrays, Sets, and Dictionaries, with Applications. In Symposium on Discrete Algorithms (SODA), 2005 (hopefully).

  11. Great, but if there’s a paper written on the subject already, then what do I do?
  • A lot of this code isn’t written yet. We also haven’t verified that the code we do have meets the theoretical bounds, since we must make sure that any corner-cutting done in the programming is theoretically safe.
  • My job is to get a lot of this running and to look for optimizations.
  • Also, once this is running, we’ll want to run experiments to see how well it performs, especially for representing graphs.
