=============================
Binary Output with DataSeries
=============================

.. rst-class:: opening

    Bro's default ASCII log format is not exactly the most efficient
    way for storing and searching large volumes of data. As an
    alternative, Bro comes with experimental support for `DataSeries
    <http://www.hpl.hp.com/techreports/2009/HPL-2009-323.html>`_
    output, an efficient binary format for recording structured bulk
    data. DataSeries is developed and maintained at HP Labs.

.. contents::

Installing DataSeries
---------------------

To use DataSeries, its libraries must be available at compile-time,
along with the supporting *Lintel* package. Generally, both are
distributed on `HP Labs' web site
<http://tesla.hpl.hp.com/opensource/>`_. Currently, however, you need
to use recent development versions of both packages, which you can
download from github like this::

    git clone http://github.com/dataseries/Lintel
    git clone http://github.com/dataseries/DataSeries

To build and install the two into ``<prefix>``, do::

    ( cd Lintel && mkdir build && cd build && cmake -DCMAKE_INSTALL_PREFIX=<prefix> .. && make && make install )
    ( cd DataSeries && mkdir build && cd build && cmake -DCMAKE_INSTALL_PREFIX=<prefix> .. && make && make install )

Please refer to the packages' documentation for more information about
the installation process. In particular, there's more information on
required and optional `dependencies for Lintel
<https://raw.github.com/dataseries/Lintel/master/doc/dependencies.txt>`_
and `dependencies for DataSeries
<https://raw.github.com/dataseries/DataSeries/master/doc/dependencies.txt>`_.
For users on RedHat-style systems, you'll need the following::

    yum install libxml2-devel boost-devel

Compiling Bro with DataSeries Support
-------------------------------------

Once you have installed DataSeries, Bro's ``configure`` should pick it
up automatically as long as it finds it in a standard system location.
Alternatively, you can specify the DataSeries installation prefix
manually with ``--with-dataseries=<prefix>``. Keep an eye on
``configure``'s summary output; if it looks like the following, Bro
found DataSeries and will compile in the support::

    # ./configure --with-dataseries=/usr/local
    [...]
    ====================| Bro Build Summary |=====================
    [...]
    DataSeries: true
    [...]
    ================================================================

Activating DataSeries
---------------------

The direct way to use DataSeries is to switch *all* log files over to
the binary format. To do that, just add ``redef
Log::default_writer=Log::WRITER_DATASERIES;`` to your ``local.bro``.
For testing, you can also just pass that on the command line::

    bro -r trace.pcap Log::default_writer=Log::WRITER_DATASERIES
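
The same setting as a standalone ``local.bro`` snippet (this simply
restates the ``redef`` from the text above as a complete block):

.. code:: bro

    # Send *all* log streams to the DataSeries writer by default.
    redef Log::default_writer = Log::WRITER_DATASERIES;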

With that, Bro will now write all its output into DataSeries files
``*.ds``. You can inspect these using the DataSeries command line
tools, which the installation process places into ``<prefix>/bin``.
For example, to convert a file back into an ASCII representation::

    $ ds2txt conn.ds
    [... We skip a bunch of metadata here ...]
    ts uid id.orig_h id.orig_p id.resp_h id.resp_p proto service duration orig_bytes resp_bytes conn_state local_orig missed_bytes history orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes
    1300475167.096535 CRCC5OdDlXe 141.142.220.202 5353 224.0.0.251 5353 udp dns 0.000000 0 0 S0 F 0 D 1 73 0 0
    1300475167.097012 o7XBsfvo3U1 fe80::217:f2ff:fed7:cf65 5353 ff02::fb 5353 udp 0.000000 0 0 S0 F 0 D 1 199 0 0
    1300475167.099816 pXPi1kPMgxb 141.142.220.50 5353 224.0.0.251 5353 udp 0.000000 0 0 S0 F 0 D 1 179 0 0
    1300475168.853899 R7sOc16woCj 141.142.220.118 43927 141.142.2.2 53 udp dns 0.000435 38 89 SF F 0 Dd 1 66 1 117
    1300475168.854378 Z6dfHVmt0X7 141.142.220.118 37676 141.142.2.2 53 udp dns 0.000420 52 99 SF F 0 Dd 1 80 1 127
    1300475168.854837 k6T92WxgNAh 141.142.220.118 40526 141.142.2.2 53 udp dns 0.000392 38 183 SF F 0 Dd 1 66 1 211
    [...]

(``--skip-all`` suppresses the metadata.)

Note that the ASCII conversion is *not* equivalent to Bro's default
output format.

You can also switch only individual files over to DataSeries by adding
code like this to your ``local.bro``:

.. code:: bro

    event bro_init()
        {
        local f = Log::get_filter(Conn::LOG, "default"); # Get default filter for connection log.
        f$writer = Log::WRITER_DATASERIES;               # Change writer type.
        Log::add_filter(Conn::LOG, f);                   # Replace filter with adapted version.
        }
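
The same pattern applies to any other log stream; as a sketch (the
``HTTP::LOG`` stream ID is an assumption here, not taken from the text
above):

.. code:: bro

    event bro_init()
        {
        # Apply the recipe above to the HTTP log instead.
        local f = Log::get_filter(HTTP::LOG, "default");
        f$writer = Log::WRITER_DATASERIES;
        Log::add_filter(HTTP::LOG, f);
        }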

Bro's DataSeries writer comes with a few tuning options; see
:doc:`/scripts/base/frameworks/logging/writers/dataseries.bro`.
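
As a rough illustration, a tuned setup might look like the following
sketch. The option names (``LogDataSeries::extent_size``,
``LogDataSeries::compression``) and values used here are assumptions;
check the script linked above for the authoritative names and defaults:

.. code:: bro

    # Assumed option names; see dataseries.bro (referenced above) for
    # what the writer actually exposes.
    redef LogDataSeries::extent_size = 65536;  # Target size of a DataSeries extent, in bytes.
    redef LogDataSeries::compression = "lzf";  # Compression to apply to extents.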

Working with DataSeries
-----------------------

Here are a few examples of using DataSeries command line tools to work
with the output files.

* Printing CSV::

    $ ds2txt --csv conn.ds
    ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,service,duration,orig_bytes,resp_bytes,conn_state,local_orig,missed_bytes,history,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes
    1258790493.773208,ZTtgbHvf4s3,192.168.1.104,137,192.168.1.255,137,udp,dns,3.748891,350,0,S0,F,0,D,7,546,0,0
    1258790451.402091,pOY6Rw7lhUd,192.168.1.106,138,192.168.1.255,138,udp,,0.000000,0,0,S0,F,0,D,1,229,0,0
    1258790493.787448,pn5IiEslca9,192.168.1.104,138,192.168.1.255,138,udp,,2.243339,348,0,S0,F,0,D,2,404,0,0
    1258790615.268111,D9slyIu3hFj,192.168.1.106,137,192.168.1.255,137,udp,dns,3.764626,350,0,S0,F,0,D,7,546,0,0
    [...]

  Add ``--separator=X`` to set a different separator.

* Extracting a subset of columns::

    $ ds2txt --select '*' ts,id.resp_h,id.resp_p --skip-all conn.ds
    1258790493.773208 192.168.1.255 137
    1258790451.402091 192.168.1.255 138
    1258790493.787448 192.168.1.255 138
    1258790615.268111 192.168.1.255 137
    1258790615.289842 192.168.1.255 138
    [...]

* Filtering rows::

    $ ds2txt --where '*' 'duration > 5 && id.resp_p > 1024' --skip-all conn.ds
    1258790631.532888 V8mV5WLITu5 192.168.1.105 55890 239.255.255.250 1900 udp 15.004568 798 0 S0 F 0 D 6 966 0 0
    1258792413.439596 tMcWVWQptvd 192.168.1.105 55890 239.255.255.250 1900 udp 15.004581 798 0 S0 F 0 D 6 966 0 0
    1258794195.346127 cQwQMRdBrKa 192.168.1.105 55890 239.255.255.250 1900 udp 15.005071 798 0 S0 F 0 D 6 966 0 0
    1258795977.253200 i8TEjhWd2W8 192.168.1.105 55890 239.255.255.250 1900 udp 15.004824 798 0 S0 F 0 D 6 966 0 0
    1258797759.160217 MsLsBA8Ia49 192.168.1.105 55890 239.255.255.250 1900 udp 15.005078 798 0 S0 F 0 D 6 966 0 0
    1258799541.068452 TsOxRWJRGwf 192.168.1.105 55890 239.255.255.250 1900 udp 15.004082 798 0 S0 F 0 D 6 966 0 0
    [...]

* Calculating some statistics:

  Mean/stddev/min/max over a column::

    $ dsstatgroupby '*' basic duration from conn.ds
    # Begin DSStatGroupByModule
    # processed 2159 rows, where clause eliminated 0 rows
    # count(*), mean(duration), stddev, min, max
    2159, 42.7938, 1858.34, 0, 86370
    [...]

  Quantiles of total connection volume::

    $ dsstatgroupby '*' quantile 'orig_bytes + resp_bytes' from conn.ds
    [...]
    2159 data points, mean 24616 +- 343295 [0,1.26615e+07]
    quantiles about every 216 data points:
    10%: 0, 124, 317, 348, 350, 350, 601, 798, 1469
    tails: 90%: 1469, 95%: 7302, 99%: 242629, 99.5%: 1226262
    [...]

The ``man`` pages for these tools show further options, and their
``-h`` option gives some more information (though both can
unfortunately be a bit cryptic).

Deficiencies
------------

Due to limitations of the DataSeries format, one cannot inspect its
files before they have been fully written. In other words, when using
DataSeries, it's currently not possible to inspect the live log
files inside the spool directory before they are rotated to their
final location. It seems that this could be fixed with some effort,
and we will work with the DataSeries development team on that if the
format gains traction among Bro users.

Likewise, we're considering writing custom command line tools for
interacting with DataSeries files, making that a bit more convenient
than what the standard utilities provide.