*Post by Michael S*
On Thu, 30 May 2024 19:27:48 -0700
*Post by Tim Rentsch*
*Post by Bonita Montero*
*Post by Tim Rentsch*
I say the output quality is poor because I have run tests that
show the poor output quality.
If you chose a prime whose double is beyond 64 bits there's an
equal distribution among the 64-bit modulos.
*Post by Tim Rentsch*
I've done that with a prime of my own choosing and also with
18446744073709551557, the value you suggested.
Show me your code.
Oh get real.  It's not my job to make up for your laziness.
So, what were your conclusions?

Ignoring the speed of computation, would something like a
cryptographic hash scaled to bucket size be the best hash for
this type of application?  Or is some sort of less thorough
grinding of the bits better?

Factors that matter:

    will hashing be done using rehashing or aggregates (lists
    or trees, etc) grouped under the first probe location?
    (or both?)

    can keys be compared for ordering or just for equality?

    how expensive is it to test for key equality?
        (a) when the keys are equal
        (b) when the keys are not equal

    32-bit hash values suffice for (a) re-hashing schemes with
    tables up to perhaps 2**30 entries, or (b) schemes that
    use direct hashing and bucket aggregation with tables up
    to 2**32 slots.  Anything bigger should use 64-bit hash
    values.

    hash functions should always produce a full-size hash value
    (either 32 bits or 64 bits).  Fitting the hash value to
    the size of the table is the responsibility of the hash
    table management code, not the hash function.
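As a sketch of what that reduction step in the table code might look
like, here are two common ways to map a full 64-bit hash value to a
slot index.  Note the multiply-shift version relies on the
non-standard unsigned __int128 type (available in GCC and Clang):

```c
/* Reduce a full 64-bit hash value to a table slot index.
   fit_mask requires a power-of-two table size (keeps the low bits);
   fit_range (Lemire's multiply-shift) handles any table size
   without a division, but needs unsigned __int128 (GCC/Clang). */
#include <stdint.h>
#include <stddef.h>

static size_t fit_mask(uint64_t h, size_t pow2_size)
{
    return (size_t)(h & (pow2_size - 1));   /* size must be 2**k */
}

static size_t fit_range(uint64_t h, size_t size)
{
    return (size_t)(((unsigned __int128)h * size) >> 64);
}
```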

The goal:

    An ideal hash function has the property that changing any
    single bit of the input key has a 50% chance of flipping
    any output bit, with no correlation between output bit
    m and output bit n (m != n) for changing a given input
    bit, and no correlation between changes to output bit n
    for changing input bit i and changing input bit j (i != j).

Evaluating hash functions:

    One way to evaluate a hash function is to measure changes
    in output bits as a function of changing various input
    bits, either exhaustively or statistically, looking for an
    ideal distribution as described above.  Usually doing this
    isn't very helpful, because most hash functions don't come
    close to the ideal, and it's hard to interpret results that
    are not close to the ideal, even though those hashes may be
    perfectly fine in practice.

    So basically we are left with comparing a candidate hash
    function, performance-wise, to other hash functions that
    are going to have good performance results.  Three obvious
    choices for those:  a slow but high quality hash such as a
    cryptographic hash; using a good random number generator
    to produce keys where the hash value is equal to the key
    value; comparing against theoretical ideals.  Looking at
    each in turn:

    High-quality-but-slow hash:  it isn't hard to make a hash
    function that approximates an ideal hash, as long as we
    are willing to have it run like a turtle.  Or simply use
    one of the many known cryptographic hashes and fold the
    output down to a reasonably sized hash value (most of
    these cryptographic hash functions produce large hash
    values, 128 bits and up, which isn't a great fit for
    ordinary hash table use).  In my recent testing I simply
    rolled my own high quality hash, which ran about 10 times
    slower than the slowest of the other hashes I looked at.
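One plausible way to do that folding is a simple XOR-fold; the digest
bytes here stand in for the output of a real cryptographic hash such
as SHA-256:

```c
/* XOR-fold a wide digest (e.g. the 32 bytes of SHA-256) down to a
   64-bit hash value: XOR the digest's consecutive 64-bit words
   together, then fold in any trailing bytes shorter than a word. */
#include <stdint.h>
#include <string.h>

static uint64_t fold64(const unsigned char *digest, size_t len)
{
    uint64_t h = 0;
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, digest + i, sizeof w);
        h ^= w;
    }
    for (; i < len; i++)                  /* leftover bytes, if any */
        h ^= (uint64_t)digest[i] << (8 * (i % 8));
    return h;
}
```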

    Using a good RNG to produce keys the same as the hash
    value:  I expect this method to be self-explanatory.

    Comparing against theoretical ideals:  it isn't hard to
    do a calculation that computes a statistical average for
    the two kinds of hashing methods.  More specifically,

        (a) for re-hashing, compute the average number of
            probes taken for a hash table with k out of n
            slots already filled, and use that to compute
            the average number of probes needed to fill the
            hash table up to, say, 85%

        (b) for direct hashing, compute the average number
            of slots occupied after inserting some number
            of values representing a fixed fraction of the
            table size.  For example, if we have a table
            with 100,000 slots, if we insert 100,000 values
            then on the average about 63% of the slots
            should be occupied (very close to 1 - 1/e).

    Use one or more of the above comparison values (they
    should all be very close) to compare against the
    performance of the candidate hash function for each
    of the possible workloads.  My workload set included:

        (a) results of malloc() with fixed sized blocks,
            with a size going from 1 to some modest value
            (50 or 100)

        (b) results of malloc() with randomly chosen sizes
            based on some distribution

        (c) sequential addresses in an array, varying the
            element size between 1 and some modest value
            (maybe 50)

        (d) results of malloc() with a deterministic set
            of sizes.  I tried varying ones of these, such
            as size(i) == i/17 + 1 (filling a table of
            65536 entries varying percentages full).
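Workload (d), for instance, can be generated directly from the size
formula given; the keys hashed are the addresses malloc() returns
(the helper names here are mine, not from the post):

```c
/* Workload (d): deterministic malloc() sizes, size(i) = i/17 + 1
   (integer division), so sizes step up by 1 every 17 insertions. */
#include <stdlib.h>

static size_t workload_d_size(size_t i)
{
    return i / 17 + 1;
}

/* key for the i-th insertion: the block address (caller frees) */
static void *workload_d_key(size_t i)
{
    return malloc(workload_d_size(i));
}
```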

What we would like to see:

    for every workload tested, results within 10 or 20%
    of the comparison metric used

What we don't want to see (this is important):

    any result far away from the comparison metric on any
    workload.  Importantly this includes results that are
    significantly _better_ than the comparison metric.
    The reason for this is that doing a lot better on one
    test inevitably means doing worse on a different test.
    The same is true of random number generators:  the
    best ones look "averagely random" all the time, never
    either "sub random" or "super random".

    Of course we always want high quality, but the standards
    are higher for (a) direct hashing, and/or (b) cases where
    comparing keys is expensive.  Note that if comparing keys
    is expensive we might choose to store the full hash value
    with each table entry, to make "trivial reject" on key
    comparison faster.
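A hypothetical entry layout sketching that trivial-reject idea (the
string key and strcmp() stand in for any expensive comparison; the
names are mine, not from the post):

```c
/* Trivial reject: cache the full hash value in each table entry
   and compare cached hashes before paying for a real key
   comparison.  The expensive comparator runs only when the 64-bit
   hashes already agree. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct entry {
    uint64_t hash;       /* full hash value, stored at insert time */
    const char *key;     /* expensive-to-compare key */
    void *value;
};

static bool entry_matches(const struct entry *e, uint64_t h,
                          const char *key)
{
    if (e->hash != h)                   /* cheap reject: no strcmp */
        return false;
    return strcmp(e->key, key) == 0;    /* full compare only on match */
}
```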

Probably much of the above was obvious.  I make no apologies for
that.  Also it may seem disjointed or unorganized.  If so then
sorry about that chief, it's a big topic and there's a lot of
ground to cover, and I tried to hit the most important
highlights.