* origin/topic/jsiwek/file-signatures:
File type detection changes and fix http.log {orig,resp}_fuids fields.
Various minor changes related to file mime type detection.
Refactor common MIME magic matching code.
Replace libmagic w/ Bro signatures for file MIME type identification.
Conflicts:
scripts/base/init-default.bro
testing/btest/Baseline/coverage.bare-load-baseline/canonified_loaded_scripts.log
testing/btest/Baseline/coverage.default-load-baseline/canonified_loaded_scripts.log
BIT-1143 #merged
Notable changes:
- libmagic is no longer used at all. All MIME type detection is
done through new Bro signatures, and there's no longer a means to get
verbose file type descriptions (e.g. "PNG image data, 1435 x 170").
The majority of the default file magic signatures are derived
from the default magic database of libmagic ~5.17.
- File magic signatures consist of two new constructs in the
signature rule parsing grammar: "file-magic" gives a regular
expression to match against, and "file-mime" gives the MIME type
string of content that matches the magic and an optional strength
value for the match (see the sketch at the end of this entry).
- Modified signature/rule syntax for identifiers: they can no longer
start with a '-', which made for ambiguous syntax when doing negative
strength values in "file-mime". Also brought syntax for Bro script
identifiers in line with reality (they can't start with numbers or
include '-' at all).
- A new Built-In Function, "file_magic", can be used to get all
file magic matches and their corresponding strength against a given
chunk of data.
- The second parameter of the "identify_data" Built-In Function
can no longer be used to get verbose file type descriptions, though it
can still be used to get the strongest matching file magic signature.
- The "file_transferred" event's "descr" parameter no longer
contains verbose file type descriptions.
- The BROMAGIC environment variable no longer changes any behavior
in Bro as magic databases are no longer used/installed.
- Reverted the minimum CMake requirement from 2.8.0 back to 2.6.3
(the same requirement as the Bro v2.2 release). The bump to 2.8.0
was made to accommodate building libmagic as an external project,
which is no longer needed.
Addresses BIT-1143.
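As a sketch of the new constructs (the signature name, magic regex, and
strength value below are illustrative, not copied from the shipped signature
set), a file magic signature in a .sig file might look like:

    signature file-magic-png-example {
        file-magic /^\x89PNG\x0d\x0a\x1a\x0a/
        file-mime "image/png", 100
    }

and the new BiF can then be exercised from script land roughly like:

    event bro_init()
        {
        # file_magic() returns all matching signatures with their strengths.
        local data = "\x89PNG\x0d\x0a\x1a\x0a";
        print file_magic(data);
        }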
This changes the internal type that is used to signal that a vector
is unspecified from any to void.
I tried to verify that the behavior of Bro is still the same. After
a lot of playing around, I think everything should still work as before.
However, it might be good for someone to take a look at this.
Addresses BIT-1144.
* origin/topic/bernhard/hyperloglog: (32 commits)
add clustered leak test for hll. No issues.
make gcc happy
(hopefully) fix refcounting problem in hll/bloom-filter opaque vals. Thanks Robin.
re-use same hash class for all add operations
get hll ready for merging
and forgot a file...
adapt to new structure
fix opaqueval-related memleak.
make it compile on case-sensitive file systems and fix warnings
make error rate configurable
add persistence test not using predetermined random seeds.
update cluster test to also use hll
persistence really works.
well, with this commit synchronizing the data structure should work.. ...if we had consistent hashing.
and also serialize the other things we need
ok, this bug was hard to find.
serialization compiles.
change plugin after feedback of seth
Forgot a file. Again. Like always. Basically.
do away with old file.
...
BIT-1048 #merged
I'm reverting the serializer version update for now as that breaks
Broccoli. Let's do that later for 2.2.
* topic/robin/topk-merge:
update documentation, rename get* to Get* and make hasher persistent
adapt to new folder structure
fix opaqueval-related memleak
synchronize pruned attribute
potentially found wrong Ref.
add sum function that can be used to get the number of total observed elements.
in cluster settings, the resultvals can apparently be uninitialized in some special cases
fix memory leaks
fix warnings
add topk cluster test
make size of topk-list configurable when using sumstats
implement merging for top-k.
add serialization for topk
make the get function const
topk for sumstats
well, a test that works..
implement topk.
* topic/robin/bloom-filter-merge:
Using a real hash function for hashing a BitVector's internal state.
Support UHF hashing for >= UHASH_KEY_SIZE bytes.
Changing the Bloom filter hashing so that it's independent of CompositeHash.
Add new BiF for low-level Bloom filter initialization.
Introduce global_hash_seed script variable.
Conflicts:
testing/btest/Baseline/bifs.bloomfilter/output
* origin/topic/bernhard/topk:
adapt to new folder structure
fix opaqueval-related memleak
synchronize pruned attribute
potentially found wrong Ref.
add sum function that can be used to get the number of total observed elements.
in cluster settings, the resultvals can apparently be uninitialized in some special cases
fix memory leaks
fix warnings
add topk cluster test
make size of topk-list configurable when using sumstats
implement merging for top-k.
add serialization for topk
make the get function const
topk for sumstats
well, a test that works..
implement topk.
CompositeHash.
We do this by hashing values added to a BloomFilter a second time
with a stable hash seeded only by either the filter's name or the
global_hash_seed (or Bro's random() seed if neither is defined).
I'm also adding a new bif bloomfilter_internal_state() that returns a
string representation of a Bloom filter's current internal state. This
is solely for writing tests that check that the filters end up
consistent when seeded with the same value.
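A rough sketch of such a test (it assumes the unchanged add BiF is named
bloomfilter_add; the initialization BiFs are described further below):

    event bro_init()
        {
        # Two filters constructed with the same name derive the same hasher
        # seed, so identical inserts should leave identical internal state.
        local bf1 = bloomfilter_basic_init(0.1, 1000, "same-name");
        local bf2 = bloomfilter_basic_init(0.1, 1000, "same-name");
        bloomfilter_add(bf1, "foo");
        bloomfilter_add(bf2, "foo");
        print bloomfilter_internal_state(bf1) == bloomfilter_internal_state(bf2);
        }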
This commit adds support for script-level specification of a seed to be used by
hashers. For example, if the given name of a Bloom filter is not empty, then
the seed used by the underlying hasher only depends on the Bloom filter name.
If the name is empty, we check whether the user defined a non-empty
global_hash_seed string variable at script land and use it instead. If that
script variable does not exist, then we fall back to the initial seed computed
at Bro startup (which is ultimately affected by $BRO_SEED).
See Hasher::MakeSeed for details.
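For example, to make unnamed Bloom filters hash consistently across a
cluster, a script might set the seed globally (a sketch; it assumes
global_hash_seed is a &redef-able constant):

    redef global_hash_seed = "my-cluster-wide-seed";

    event bro_init()
        {
        # No name given, so the hasher seed comes from global_hash_seed.
        local bf = bloomfilter_basic_init(0.01, 100000);
        }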
with a bloom-filter already containing values.
I assume that it is OK to merge an empty bloom-filter with any bloom-filter;
if not, we have to change the patch to return an error in this case.
When constructing a Bloom filter, one now has to pass a HashPolicy instance to
it. This separates more clearly the concerns of hashing and Bloom filter
management.
This commit also changes the interface to initialize Bloom filters: there exist
now two initialization functions, one for each type:
(1) bloomfilter_basic_init(fp: double,
capacity: count,
name: string &default=""): opaque of bloomfilter
(2) bloomfilter_counting_init(k: count,
cells: count,
max: count,
name: string &default=""): opaque of bloomfilter
The BiFs for adding elements and performing lookups remain the same. This
essentially gives us "BiF polymorphism" at script land, where the
initialization BiF constructs the most derived type while subsequent BiFs
adhere to the same interface.
The reason we split up the constructor in this case is that we have not yet
derived the math that computes the optimal number of hash functions for
counting Bloom filters, so users have to explicitly parameterize them for now.
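A short usage sketch against these interfaces (bloomfilter_add and
bloomfilter_lookup are assumed to be the unchanged add/lookup BiFs
mentioned above):

    event bro_init()
        {
        # Basic filter: target false-positive rate and expected capacity.
        local bf = bloomfilter_basic_init(0.01, 10000);
        bloomfilter_add(bf, "example.com");
        print bloomfilter_lookup(bf, "example.com");

        # Counting filter: number of hash functions, cells, and maximum
        # counter value are all given explicitly, since the optimal
        # parameters are not derived automatically yet.
        local cbf = bloomfilter_counting_init(3, 65536, 255, "my-counter");
        bloomfilter_add(cbf, 42);
        print bloomfilter_lookup(cbf, 42);
        }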
Thanks to git, this merge was less troublesome than I was afraid it
would be. Not all tests pass yet, though (and file hashes have changed,
unfortunately).
Conflicts:
cmake
doc/scripts/DocSourcesList.cmake
scripts/base/init-bare.bro
scripts/base/protocols/ftp/main.bro
scripts/base/protocols/irc/dcc-send.bro
scripts/test-all-policy.bro
src/AnalyzerTags.h
src/CMakeLists.txt
src/analyzer/Analyzer.cc
src/analyzer/protocol/file/File.cc
src/analyzer/protocol/file/File.h
src/analyzer/protocol/http/HTTP.cc
src/analyzer/protocol/http/HTTP.h
src/analyzer/protocol/mime/MIME.cc
src/event.bif
src/main.cc
src/util-config.h.in
testing/btest/Baseline/coverage.bare-load-baseline/canonified_loaded_scripts.log
testing/btest/Baseline/coverage.default-load-baseline/canonified_loaded_scripts.log
testing/btest/Baseline/istate.events-ssl/receiver.http.log
testing/btest/Baseline/istate.events-ssl/sender.http.log
testing/btest/Baseline/istate.events/receiver.http.log
testing/btest/Baseline/istate.events/sender.http.log
And changed the endianness parameter of the bytestring_to_count() BiF to
default to false (big endian), mostly just to prove that the BiF parser
doesn't choke on default parameters.
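For example (assuming a true second argument selects little-endian
interpretation):

    event bro_init()
        {
        print bytestring_to_count("\x00\x00\x01\x00");     # 256: big endian by default
        print bytestring_to_count("\x00\x00\x01\x00", T);  # 65536 when read little endian
        }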
observed elements.
Add methods to merge with and without pruning (before, the only merge
method was with pruning, which invalidates the number of total
observed elements).
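A sketch of how the two merge flavors might be used from script land (the
BiF names topk_init, topk_add, topk_merge, topk_merge_prune, and topk_sum
are assumptions based on the commit subjects above):

    event bro_init()
        {
        local t1 = topk_init(5);
        local t2 = topk_init(5);
        topk_add(t1, "a.example.com");
        topk_add(t2, "b.example.com");

        # Merge without pruning: the total number of observed elements
        # reported by topk_sum() stays valid.
        topk_merge(t1, t2);
        print topk_sum(t1);

        # Merge with pruning: bounds the structure's size again, but the
        # total-element count is no longer meaningful afterwards.
        topk_merge_prune(t1, t2);
        }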
I am not (entirely) sure that this is mathematically correct, but
I am (more and more) getting the feeling that it... might be.
In any case - this was the last step and now it should work
in cluster settings.
Note: merging top-k data structures is not yet possible (and is
actually quite awkward/expensive). I will have to think about
how to do that for a bit...