Wednesday, September 30, 2009

Simple perfect murmur hashing

A simple way of finding a perfect (collision free) murmur hash for a set of keys S is to simply iterate over the seed values until we find one that doesn't produce any collisions:

seed := 0
while true
    H[i] := murmur_hash(S[i], seed) for all i
    return seed if no_duplicates(H)
    seed := seed + 1

As long as the size of the key set S is not much bigger than the square root of the output range of the hash function, the algorithm above will terminate quickly. For example, for a 32 bit hash this algorithm works well for sets up to about 65 000 elements. (In fact we can go up to 100 000 elements and still find a good seed by just making a couple of extra iterations.)

With a perfect hash function we only need to compare the hash values to dermine if two keys are equal, we never have to compare (or even store) the original keys themselves. We just have to store the 32-bit seed and the hash values. This saves both memory and processing time.

In the BitSquid engine this simple perfect hashing scheme is used to generate 32-bit resource IDs from resource names and types.

6 comments:

  1. Sorry for being late in the discussion but it means IDs are computed in a preprocessing stage?

    ReplyDelete
  2. Yes, you would have to know the set of all possible key values in the preprocessing stage.

    ReplyDelete
  3. What if I just get all keys and map them to numbers from 0 to N one by one with increment? Would it cause problems?

    ReplyDelete
  4. In that case you would have to store the "map" somehow. How would you do that? In a hash table? Then you are using a non-perfect hashing scheme to get to your perfect hashing scheme, which means you have removed all advantages of it.

    ReplyDelete
  5. Nice blog, I really enjoy reading your articles !

    Unfortunately I am not sure to understand the purpose of those 32 bits resource IDs ...
    It speeds up comparisons for sure but do you really often compare resources in your game ? Have you practical examples in mind ?

    Do you plan to use those IDs for indexing ? In that case, either you need a 4 billion entry table or you use a smaller set of bits from the key. By doing the last option, you loose the benefits of perfect hashing and collisions come back. Even if it deals faster with collisions (complete hash as key), I admit.

    Otherwise, if comparisons are the only purpose of those IDs, iLych solution (above reply) may work too since we generate them at preprocessing stage. We do not need to store a map, we just want to give a unique ID to each resource and forget all about the rest (e.g. using alphanumeric or timestamp order as mapping rule). Then adding hash to resource data. Or maybe you only need the seed and processes hashes at runtime ?

    Sorry for bothering you with so much questions but I have the feeling it could help me in the future.

    ReplyDelete
  6. Yes, most often we use the IDs for indexing. As you say, we still have to deal with collision in the smaller set of bits, but the data structures become a lot smaller and simpler when you don't have to store a lot of arbitrary sized strings.

    But actually, we don't use perfect hashing any more... we just hash into a large enough keyspace that collisions are so unlikely that we don't have to worry about them.

    ReplyDelete