mirror of
https://github.com/zeek/zeek.git
synced 2025-10-05 16:18:19 +00:00
README documentation
This commit is contained in:
parent
f6c841c737
commit
605d636d94
1 changed files with 232 additions and 0 deletions
232
src/script_opt/CPP/README.md
Normal file
@ -0,0 +1,232 @@
<h1 align="center">

Compiling Zeek Scripts To C++: User's Guide

</h1><h4 align="center">

[_Overview_](#overview) -
[_Workflows_](#workflows) -
[_Known Issues_](#known-issues)

</h4>


<br>

Overview
--------

Zeek's _script compiler_ is an experimental feature that translates Zeek
scripts into C++, which is then compiled directly into the `zeek` binary.
This improves performance by removing the need for Zeek to use an
interpreter to execute the scripts. Using this feature requires a
somewhat complex [workflow](#workflows).

How much faster will your scripts run? There's no simple answer to that.
It depends heavily on two factors:

* What proportion of the processing during execution is spent in Zeek's
_Event Engine_ rather than executing scripts.

* What proportion of the script's processing is spent executing built-in
functions (BiFs).
It might well be that most of your script processing actually occurs inside
the _Logging Framework_, for example, and thus you won't see much improvement.

Together, those two factors mean that gains are often on the order of only
10-15%, rather than something a lot more dramatic. On the other hand, using
this feature you can afford to put significantly more functionality in
Zeek scripts without worrying as much about introducing performance
bottlenecks.

That said, I'm very interested in situations where the performance
gains appear unsatisfying. Also note that when using the compiler, you
can analyze the performance of your scripts using C++-oriented tools,
since the translated C++ code generally bears a clear relationship
to the original Zeek script.

If you want to know how the compiler itself works, see the sketch
at the beginning of `Compile.h`.

<br>


Workflows
---------

The main code generated by the compiler resides in
`src/script_opt/CPP/CPP-gen.cc`. That file does not initially exist, but
`src/CMakeLists.txt` lists it, so for `./configure` to succeed and the
Zeek build to start, you first need to issue
`touch src/script_opt/CPP/CPP-gen.cc`. (The branch deliberately does not
include an empty version in `git`, to prevent the common error of checking in
a non-empty version after using the compiler.)

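As a minimal sketch of that preparatory step, the following shell fragment creates the empty placeholder only if it is missing. `ZEEK_SRC` is a hypothetical variable (not part of Zeek) naming the top of the source tree; it defaults to the current directory.

```shell
#!/bin/sh
# Sketch: make sure the placeholder exists before running ./configure.
# ZEEK_SRC is a hypothetical variable naming the top of the Zeek source tree.
ZEEK_SRC="${ZEEK_SRC:-.}"
gen="$ZEEK_SRC/src/script_opt/CPP/CPP-gen.cc"

# Create the file only if it is absent, so previously generated code
# (if any) is left alone.
if [ ! -e "$gen" ]; then
    mkdir -p "$(dirname "$gen")"   # no-op in a real checkout
    touch "$gen"
fi
```

After this, `./configure` and the usual build can proceed.
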
As a user, the most common workflow is to build a version of Zeek that
has a given target script (`target.zeek`) compiled into it. This means
_all of the code pulled in by `target.zeek`_, including the base scripts
(or the "bare" subset if you invoke the compiler when running `zeek -b`).
The following workflow assumes you are in the `build/` subdirectory:

1. `./src/zeek -O gen-C++ target.zeek`
   The generated code is written to
   `CPP-gen-addl.h`. (This name is a reflection of some more complicated
   features and probably should be changed.) The compiler will also produce
   a file `CPP-hashes.dat`, for use by an advanced feature.
2. `mv CPP-gen-addl.h ../src/script_opt/CPP/CPP-gen.cc`
3. `touch ../src/script_opt/CPP/CPP-gen-addl.h`
   (Needed because `CPP-gen.cc`
   expects the file to exist, again in support of more complicated features.)
4. `ninja` or `make` to recompile Zeek
5. `./src/zeek -O use-C++ target.zeek`
   Executes with each function/hook/event handler
   pulled in by `target.zeek` replaced with its compiled version.

Instead of the last step above, you can use the following variants:

5. `./src/zeek -O force-use-C++ target.zeek`
   Same as `use-C++`, but also
   warns about any `target.zeek` functions that didn't have corresponding
   compiled-to-C++ versions.

Or:

5. `./src/zeek -O report-C++ target.zeek`
   For each function body in
   `target.zeek`, reports whether it has a compiled-to-C++ body available,
   and also reports any compiled-to-C++ bodies present in the `zeek` binary
   that `target.zeek` does not use.

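The steps above can be strung together roughly as follows. This is only a sketch, meant to be run from `build/`: the variables `TARGET`, `MODE`, `BUILD_CMD` and `DRY_RUN`, and the helper functions `run` and `compile_workflow`, are hypothetical names introduced for this example and are not part of Zeek. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch of the five-step workflow, run from the build/ subdirectory.
# All variable and function names here are hypothetical.
TARGET="${TARGET:-target.zeek}"
MODE="${MODE:-use-C++}"          # or force-use-C++ / report-C++
BUILD_CMD="${BUILD_CMD:-ninja}"  # or make
DRY_RUN="${DRY_RUN:-1}"          # 1: only print the commands; 0: run them

# Print a command, and execute it unless this is a dry run.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Steps 1-5 in order; the chain stops at the first failing command.
compile_workflow() {
    run ./src/zeek -O gen-C++ "$TARGET" &&
    run mv CPP-gen-addl.h ../src/script_opt/CPP/CPP-gen.cc &&
    run touch ../src/script_opt/CPP/CPP-gen-addl.h &&
    run $BUILD_CMD &&
    run ./src/zeek -O "$MODE" "$TARGET"
}

compile_workflow
```

Chaining with `&&` matters here because a failed code-generation step would otherwise leave a stale or missing `CPP-gen-addl.h` for the `mv` that follows.
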
The above workflows require the subsequent `zeek` execution to include
the `target.zeek` script. You can avoid this by replacing the first step with:

1. `./src/zeek -O gen-standalone-C++ target.zeek >target-stand-in.zeek`

and then continuing with the next three steps. This option prints to _stdout_ a
(very short) "stand-in" Zeek script that you can load using
`-O use-C++ target-stand-in.zeek` to activate the compiled `target.zeek`
without needing to include `target.zeek` in the invocation.

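A sketch of the standalone variant, under the same assumptions as the commands above (run from `build/`). The names `TARGET`, `STAND_IN`, `DRY_RUN`, `run` and `standalone_workflow` are hypothetical and exist only for this example. Commands are traced to _stderr_ so that the trace does not mix with the stand-in script that step 1 writes to _stdout_.

```shell
#!/bin/sh
# Sketch of the gen-standalone-C++ variant, run from build/.
# All variable and function names here are hypothetical.
TARGET="${TARGET:-target.zeek}"
STAND_IN="${STAND_IN:-target-stand-in.zeek}"
DRY_RUN="${DRY_RUN:-1}"          # 1: only print the commands; 0: run them

# Trace a command on stderr, and execute it unless this is a dry run.
run() {
    echo "+ $*" >&2
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Step 1 captures stdout (the stand-in script) in a file, steps 2-4 are
# unchanged, and step 5 loads the stand-in instead of the original script.
# Note: even in a dry run, the redirection creates an empty stand-in file.
standalone_workflow() {
    run ./src/zeek -O gen-standalone-C++ "$TARGET" >"$STAND_IN" &&
    run mv CPP-gen-addl.h ../src/script_opt/CPP/CPP-gen.cc &&
    run touch ../src/script_opt/CPP/CPP-gen-addl.h &&
    run ninja &&
    run ./src/zeek -O use-C++ "$STAND_IN"
}

standalone_workflow
```
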
Note: the implementation differences between `gen-C++` and `gen-standalone-C++`
wound up being modest enough that it might make sense to always provide
the latter functionality, which, it turns out, does not introduce any
additional constraints compared to the current `gen-C++` functionality.

There are additional workflows relating to running the test suite, which
we document only briefly here, as they're likely to change or go away:
it's not clear they're actually needed.

First, `-O update-C++` runs using a Zeek instance that already includes
compiled scripts. For any functions pulled in by the command-line scripts
that are not already compiled, it generates additional C++ code
that can be combined with the already-compiled code. The
additionally compiled code leverages the existing compiled-in functions
(and globals), which it learns about via the `CPP-hashes.dat` file mentioned
above. Any code compiled in this fashion must be _consistent_ with the
previously compiled code, meaning that globals and extensible types (enums,
records) have definitions that align with those previously used, and any
other code later compiled must also be consistent.

In a similar vein, `-O add-C++` likewise uses a Zeek instance that already
includes compiled scripts. It generates additional C++ code that leverages
that existing compilation. However, this code is _not_ meant for use with
subsequently compiled code; later code also built with `add-C++` can have
inconsistencies with this code. (The utility of this mode is to support
compiling the entire test suite as one large incremental compilation,
rather than as hundreds of pointwise compilations.)

Both of these options _append_ to any existing `CPP-gen-addl.h` file, providing
a means of building it up to reflect a number of compilations.

The `update-C++` and `add-C++` options help support different
ways of building the `btest` test suite. They were meant to enable doing so
without requiring per-test-suite-element recompilations. However, experience
to date has found that trying to avoid pointwise compilations incurs
additional headaches, so it's better to just accept the cost of a large
number of recompilations. Given that, it might make sense to remove these
options.

Finally, with respect to workflow, there are a number of simple scripts in
`src/script_opt/CPP/` (which should ultimately be replaced) in support of
compiler maintenance:

* `non-embedded-build`
  Builds `zeek` without any embedded compiled-to-C++ scripts.
* `bare-embedded-build`
  Builds `zeek` with the `-b` "bare-mode" scripts compiled in.
* `full-embedded-build`
  Builds `zeek` with the default scripts compiled in.

<br>

* `eval-test-suite`
  Runs the test suite using the `cpp` alternative over the given set of tests.
* `test-suite-build`
  Incrementally compiles to `CPP-gen-addl.h` code for the given test suite elements.

<br>

* `single-test.sh`
  Builds the given btest test as a single `add-C++` add-on and then runs it.
* `single-full-test.sh`
  Builds the given btest test from scratch as a self-contained `zeek`, and runs it.
* `update-single-test.sh`
  Given an already-compiled `zeek` for the given test, updates its `cpp` test suite alternative.

Some of these scripts could be made less messy if `btest` supported
a "dry run" option that reported the executions it would do for a given
test without actually undertaking them.

<br>

Known Issues
------------

Here we list various known issues with using the compiler:
<br>

* Compiling the generated C++ code can be noticeably slow (if built using
`./configure --enable-debug`) or hugely slow (if not), with the latter
taking on the order of an hour on a beefy laptop. This slowness complicates
CI/CD approaches for always running compiled code against the test suite
when merging changes. It's not presently clear how feasible it is to
speed this up.

* Subtle bugs can arise when compiling code that uses `@if` conditional
compilation. The compiled code will not directly use the wrong instance
of a script body (one that differs due to the `@if` conditional having a
different resolution at compile time versus later at run time). However, if
compiled code itself calls a function that has conditional code, the
compiled code will always call the version of the function present during
compilation, rather than the run-time version. This problem can be fixed
at the cost of making all function calls more expensive (perhaps a measure
that requires an explicit flag to activate); or, when possible, by modifying
the conditional code to check the condition at run time rather than at
compile time.

* Code compiled with `-O gen-standalone-C++` will not execute any global
statements when invoked using the "stand-in" script. The right fix for
this is to shift from encapsulating global statements in a pseudo-function,
as currently done, to encapsulating them in a pseudo-event handler.

* Code compiled with `-O gen-standalone-C++` likely has bugs if that
code requires initializing a global variable that specifies extended fields in
an extensible record (i.e., fields added using `redef`).

* The compiler will not compile bodies that include "when" statements.
This is fairly involved to fix.

* The compiler will not compile bodies that include "type" switches.
This is not hard to fix.

* If a lambda generates an event that is not otherwise referred to, that
event will not be registered upon instantiating the lambda. This is not
particularly difficult to fix.

* A number of steps could be taken to increase the performance of
the optimized code. These include:
    1. Switching the generated code to use the new ZVal-related interfaces.
    2. Directly calling BiFs rather than using the `Invoke()` method to do so. This relates to the broader question of switching BiFs to be based on a notion of "inlined C++" code in Zeek functions, rather than using the standalone `bifcl` BiF compiler.
    3. Switching the Event Engine over to queuing events with `ZVal` arguments rather than `ValPtr` arguments.
    4. Making the compiler aware of certain BiFs that can be directly inlined (e.g., `network_time()`), a technique employed effectively by the ZAM compiler.
    5. Inspecting the generated code for inefficiencies that the compiler could avoid.