The call to Empty() was originally meant as an optimization in the
lookup phase. However, the performance implications are substantial:
this check operates in O(f(m/8)) where m is the number of bits in the
Bloom filters and f a function that looks for the first non-empty block
of bits.
As the Bloom filter fills up, the check for Empty() becomes no longer
negligible and can lead to serious performance degradations when Bloom
filters are used frequently.
This commit marks (hopefully) ever one-parameter constructor as explicit.
It also uses override in (hopefully) all circumstances where a virtual
method is overridden.
There are a very few other minor changes - most of them were necessary
to get everything to compile (like one additional constructor). In one
case I changed an implicit operation to an explicit string conversion -
I think the automatically chosen conversion was much more convoluted.
This took longer than I want to admit but not as long as I feared :)
Addig a new random seed for external tests.
I added a wrapper around the siphash() function to make calling it a
little bit safer at least.
BIT-1612 #merged
* origin/topic/johanna/bit-1612:
HLL: Fix missing typecast in test case.
Remove the -K/-J options for setting keys.
Add test checking the quality of HLL by adding a lot of elements.
Fix serializing probabilistic hashers.
Baseline updates after hash function change.
Also switch BloomFilters from H3 to siphash.
Change Hashing from H3 to Siphash.
HLL: Remove unnecessary comparison.
Hyperloglog: change calculation of Rho
This commit mostly changes the hash function that is used for Internal
hashing of data < 36 bytes from H3 to Siphash. This change is motivated
by the fact that it turns out that H3 apparently does not deliver a very
good source of data uniqueness; running HLL with H3 as a hashing
function results in quite poor results (up to of 75% off in my tests).
In difference, running HLL with Siphash (or HMAC-MD5) changes this
factor to ~2%.
This also fixes a long-standing bug in Hash.h which truncated our hash
values to 32 bit on most machines.
Furthermore, it once again fixes a problem with the Rank function in
HLL.
This commit changes the calculation of the rho-value to be in line with
the implementation of the original research paper, counting the number
of zero bits before the data.
This also fixes an infinite loop in case the hash value is 0.
I also cleaned up the code a bit, converting the raw pointers that were
used to a STL vector.
Addresses BIT-1612
When our universal hash function fell back to MD5 for inputs larger than
supported by H3, the computation only returned the first byte of the MD5 result
instead of as many bytes as needed to cover sizeof(Hasher::digest).
This patch fixes a bug that occurred when calling the BiF bloomfilter_clear,
which used to not only clear the underlying bit vector but also set its size to
zero. As a result, subsequent element access or computations using the bit
vector size caused erroneous behavior.
Reported by @colonelxc.
This changes the internal type that is used to signal that a vector
is unspecified from any to void.
I tried to verify that the behavior of Bro is still the same. After
a lot of playing around, I think everything still should worl as before.
However, it might be good for someone to take a look at this.
addresses BIT-1144
Fixed typos in documentation of hexstr_to_bytestring.
Also added documentation that was missing for function parameters
and return values of other BIFs.
* origin/topic/bernhard/ticket1072:
and const 2 more functions
update hll documentation, make a few functions private and create a new copy constructor.
fix case where hll_error_margin could be undefined (thanks John)
BIT-1072 #merged
This cleans up most of the warnings from sphinx (broken :doc: links,
broxygen role misuses, etc.). The remaining ones should be harmless,
but not quick to silence.
I found that the README for each component was a copy from the actual
repo, so I turned those in to symlinks so they don't get out of date.
They conflict with the "char" version, so that other classes would now
pick the wrong one. Added a bit of casting to HLL to use the "char"
versions instead.
* topic/robin/hyperloglog-merge: (35 commits)
Making the confidence configurable.
Renaming HyperLogLog->CardinalityCounter.
Fixing bug introduced during merging.
add clustered leak test for hll. No issues.
make gcc happy
(hopefully) fix refcounting problem in hll/bloom-filter opaque vals. Thanks Robin.
re-use same hash class for all add operations
get hll ready for merging
and forgot a file...
adapt to new structure
fix opaqueval-related memleak.
make it compile on case-sensitive file systems and fix warnings
make error rate configureable
add persistence test not using predetermined random seeds.
update cluster test to also use hll
persistence really works.
well, with this commit synchronizing the data structure should work.. ...if we had consistent hashing.
and also serialize the other things we need
ok, this bug was hard to find.
serialization compiles.
...
* origin/topic/bernhard/hyperloglog: (32 commits)
add clustered leak test for hll. No issues.
make gcc happy
(hopefully) fix refcounting problem in hll/bloom-filter opaque vals. Thanks Robin.
re-use same hash class for all add operations
get hll ready for merging
and forgot a file...
adapt to new structure
fix opaqueval-related memleak.
make it compile on case-sensitive file systems and fix warnings
make error rate configureable
add persistence test not using predetermined random seeds.
update cluster test to also use hll
persistence really works.
well, with this commit synchronizing the data structure should work.. ...if we had consistent hashing.
and also serialize the other things we need
ok, this bug was hard to find.
serialization compiles.
change plugin after feedback of seth
Forgot a file. Again. Like always. Basically.
do away with old file.
...
BIT-1048 #merged
I'm reverting the serializer version update for now as that breaks
Broccoli. Let's do that later for 2.2.
* topic/robin/topk-merge:
update documentation, rename get* to Get* and make hasher persistent
adapt to new folder structure
fix opaqueval-related memleak
synchronize pruned attribute
potentially found wrong Ref.
add sum function that can be used to get the number of total observed elements.
in cluster settings, the resultvals can apparently been uninitialized in some special cases
fix memory leaks
fix warnings
add topk cluster test
make size of topk-list configureable when using sumstats
implement merging for top-k.
add serialization for topk
make the get function const
topk for sumstats
well, a test that works..
implement topk.
* topic/robin/bloom-filter-merge:
Using a real hash function for hashing a BitVector's internal state.
Support UHF hashing for >= UHASH_KEY_SIZE bytes.
Changing the Bloom filter hashing so that it's independent of CompositeHash.
Add new BiF for low-level Bloom filter initialization.
Introduce global_hash_seed script variable.
Conflicts:
testing/btest/Baseline/bifs.bloomfilter/output
* origin/topic/bernhard/topk:
adapt to new folder structure
fix opaqueval-related memleak
synchronize pruned attribute
potentially found wrong Ref.
add sum function that can be used to get the number of total observed elements.
in cluster settings, the resultvals can apparently been uninitialized in some special cases
fix memory leaks
fix warnings
add topk cluster test
make size of topk-list configureable when using sumstats
implement merging for top-k.
add serialization for topk
make the get function const
topk for sumstats
well, a test that works..
implement topk.