Change Hashing from H3 to Siphash.

This commit mostly changes the hash function that is used for Internal
hashing of data < 36 bytes from H3 to Siphash. This change is motivated
by the fact that it turns out that H3 apparently does not deliver a very
good source of data uniqueness; running HLL with H3 as a hashing
function results in quite poor results (up to of 75% off in my tests).
In difference, running HLL with Siphash (or HMAC-MD5) changes this
factor to ~2%.

This also fixes a long-standing bug in Hash.h which truncated our hash
values to 32 bit on most machines.

Furthermore, it once again fixes a problem with the Rank function in
HLL.
This commit is contained in:
Johanna Amann 2016-07-13 06:35:32 -07:00
parent c15f48661d
commit e1218cc7fa
10 changed files with 257 additions and 25 deletions

View file

@ -165,6 +165,8 @@ private:
*/
uint8_t Rank(uint64_t hash_modified) const;
static int flsll(uint64_t mask);
/**
* This is the number of buckets that will be stored. The standard
* error is 1.04/sqrt(m), so the actual cardinality will be the