README documentation

2025-10-05 16:18:19 +00:00 · 2021-04-19 23:09:38 -07:00 · 2021-04-19 23:09:38 -07:00 · 605d636d94
commit 605d636d94
parent f6c841c737
1 changed files with 232 additions and 0 deletions
--- a/src/script_opt/CPP/README.md
+++ b/src/script_opt/CPP/README.md
@@ -0,0 +1,232 @@
 <h1 align="center">
 Compiling Zeek Scripts To C++: User's Guide
 </h1><h4 align="center">
 [_Overview_](#overview) -
 [_Workflows_](#workflows) -
 [_Known Issues_](#known-issues)
 </h4>
 <br>
 Overview
 --------
 Zeek's _script compiler_ is an experimental feature that translates Zeek
 scripts into C++, which is then compiled directly into the `zeek` binary in
 order to gain higher performance by removing the need for Zeek to use an
 interpreter to execute the scripts.  Using this feature requires a
 somewhat complex [workflow](#workflows).

 How much faster will your scripts run?  There's no simple answer to that.
 It depends heavily on two factors:

 * What proportion of the processing during execution is spent in Zeek's
 _Event Engine_ rather than executing scripts.
 * What proportion of the script's processing is spent executing built-in
 functions (BiFs).
 It might well be that most of your script processing actually occurs inside
 the _Logging Framework_, for example, and thus you won't see much improvement.

 Together, those two factors often limit gains to the order of only 10-15%,
 rather than something a lot more dramatic.  On the other hand, using
 this feature you can afford to put significantly more functionality in
 Zeek scripts without worrying as much about introducing performance
 bottlenecks.

 That said, I'm very interested in situations where the performance
 gains appear unsatisfying.  Also note that when using the compiler, you
 can analyze the performance of your scripts using C++-oriented tools,
 because the translated C++ code generally bears a clear relationship
 to the original Zeek script.
 If you want to know how the compiler itself works, see the sketch
 at the beginning of `Compile.h`.
 <br>
 Workflows
 ---------
 The main code generated by the compiler resides in
 `src/script_opt/CPP/CPP-gen.cc`.  That file does not initially exist, but
 `src/CMakeLists.txt` specifies it, so before `./configure` can succeed
 you first need to run `touch src/script_opt/CPP/CPP-gen.cc`.  (The branch
 deliberately does not include an empty version in `git`, to prevent the
 common error of checking in a non-empty version after using the compiler.)
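
As a minimal sketch of that bootstrap step, assuming you start from the top of a fresh checkout: the `ensure_placeholder` helper and its argument are illustrative names, not something shipped with Zeek.

```shell
#!/bin/sh
# Hypothetical helper: make sure the placeholder for generated code exists.
# $1 is the top of the Zeek source tree (defaults to the current directory).
ensure_placeholder() {
    src_root="${1:-.}"
    # touch creates an empty file if none exists and leaves the contents
    # of an existing file untouched.
    touch "$src_root/src/script_opt/CPP/CPP-gen.cc"
}

# Typical use from the top of the tree:
#   ensure_placeholder . && ./configure
```

Because `touch` never truncates, running the helper again after you have generated code does not discard that code.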
 As a user, the most common workflow is to build a version of Zeek that
 has a given target script (`target.zeek`) compiled into it.  This means
 _all of the code pulled in by `target.zeek`_, including the base scripts
 (or the "bare" subset if you invoke the compiler when running `zeek -b`).
 The following workflow assumes you are in the `build/` subdirectory:
 1. `./src/zeek -O gen-C++ target.zeek`  
 The generated code is written to
 `CPP-gen-addl.h`.  (This name is a reflection of some more complicated
 features and probably should be changed.)  The compiler will also produce
 a file `CPP-hashes.dat`, for use by an advanced feature.
 2. `mv CPP-gen-addl.h ../src/script_opt/CPP/CPP-gen.cc`
 3. `touch ../src/script_opt/CPP/CPP-gen-addl.h`  
 (Needed because `CPP-gen.cc`
 expects the file to exist, again in support of more complicated features.)
 4. `ninja` or `make` to recompile Zeek
 5. `./src/zeek -O use-C++ target.zeek`  
 Executes with each function/hook/
 event handler pulled in by `target.zeek` replaced with its compiled version.
 Instead of step 5 above, you can use one of the following variants:
 5. `./src/zeek -O force-use-C++ target.zeek`  
 Same as `use-C++`, but also
 warns about any `target.zeek` functions that didn't have corresponding
 compiled-to-C++ versions.
 Or:
 5. `./src/zeek -O report-C++ target.zeek`  
 For each function body in
 `target.zeek`, reports which ones have compiled-to-C++ bodies available,
 and also any compiled-to-C++ bodies present in the `zeek` binary that
 `target.zeek` does not use.
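
The five steps, including the choice of final option, can be sketched as a small shell function. This is only a sketch under stated assumptions: it expects to run from `build/`, and `run`, `compile_into_zeek`, `DRY_RUN`, and `BUILD_CMD` are names invented here rather than anything Zeek provides.

```shell
#!/bin/sh
# Sketch of the five-step workflow; run from the build/ subdirectory.
# run(), compile_into_zeek(), DRY_RUN and BUILD_CMD are made-up names.

run() {
    # Echo the command, then execute it unless DRY_RUN=1.
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

compile_into_zeek() {
    target="$1"
    mode="${2:-use-C++}"    # or force-use-C++ / report-C++

    # Step 1: generate C++ for everything the target script pulls in.
    run ./src/zeek -O gen-C++ "$target" || return 1
    # Step 2: move the generated code to where the build expects it.
    run mv CPP-gen-addl.h ../src/script_opt/CPP/CPP-gen.cc || return 1
    # Step 3: recreate the (empty) file that CPP-gen.cc expects to exist.
    run touch ../src/script_opt/CPP/CPP-gen-addl.h || return 1
    # Step 4: rebuild Zeek (ninja by default; set BUILD_CMD=make otherwise).
    run ${BUILD_CMD:-ninja} || return 1
    # Step 5: run with the compiled bodies, using the chosen option.
    run ./src/zeek -O "$mode" "$target"
}
```

For example, `DRY_RUN=1 compile_into_zeek target.zeek report-C++` only prints the five commands, which is a cheap way to check paths before committing to a long rebuild.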
 The above workflows require the subsequent `zeek` execution to include
 the `target.zeek` script.  You can avoid this by replacing the first step with:
 1. `./src/zeek -O gen-standalone-C++ target.zeek >target-stand-in.zeek`

 and then continuing with the next three steps.  This option prints to _stdout_ a
 (very short) "stand-in" Zeek script that you can load using
 `-O use-C++ target-stand-in.zeek` to activate the compiled `target.zeek`
 without needing to include `target.zeek` in the invocation.
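
Since the stand-in script arrives on stdout, it is worth making sure a failed generation does not leave a partial file behind. A minimal sketch of that first step, where the `ZEEK` variable and the `gen_stand_in` function are names invented for illustration:

```shell
#!/bin/sh
# Sketch: replace step 1 with the standalone variant and keep its stdout.
# ZEEK and gen_stand_in() are made-up names; ZEEK defaults to the freshly
# built binary as seen from the build/ subdirectory.
ZEEK="${ZEEK:-./src/zeek}"

gen_stand_in() {
    target="$1"
    stand_in="$2"
    # The stand-in script is printed to stdout, so capture it in a file.
    if ! "$ZEEK" -O gen-standalone-C++ "$target" >"$stand_in"; then
        # Do not leave a partial stand-in behind if generation failed.
        rm -f "$stand_in"
        return 1
    fi
}

# After steps 2-4 (mv, touch, rebuild), load the stand-in instead:
#   ./src/zeek -O use-C++ target-stand-in.zeek
```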
 Note: the implementation differences between `gen-C++` and `gen-standalone-C++`
 wound up being modest enough that it might make sense to just always provide
 the latter functionality, which it turns out does not introduce any
 additional constraints compared to the current `gen-C++` functionality.
 There are additional workflows relating to running the test suite, which
 we document only briefly here because they're likely to change or go away;
 it's not clear they're actually needed.
 First, `-O update-C++` will run using a Zeek instance that already includes
 compiled scripts and, for any functions pulled in by the command-line scripts,
 if they're not already compiled, will generate additional C++ code for
 those that can be combined with the already-compiled code.  The
 additionally compiled code leverages the existing compiled-in functions
 (and globals), which it learns about via the `CPP-hashes.dat` file mentioned
 above.  Any code compiled in this fashion must be _consistent_ with the
 previously compiled code, meaning that globals and extensible types (enums,
 records) have definitions that align with those previously used, and any
 other code later compiled must also be consistent.
 In a similar vein, `-O add-C++` likewise uses a Zeek instance that already
 includes compiled scripts.  It generates additional C++ code that leverages
 that existing compilation.  However, this code is _not_ meant for use with
 subsequently compiled code; later code also built with `add-C++` can have
 inconsistencies with this code.  (The utility of this mode is to support
 compiling the entire test suite as one large incremental compilation,
 rather than as hundreds of pointwise compilations.)
 Both of these _append_ to any existing `CPP-gen-addl.h` file, providing
 a means for building it up to reflect a number of compilations.
 The `update-C++` and `add-C++` options help support different
 ways of building the `btest` test suite.  They were meant to enable doing so
 without requiring a recompilation for each test suite element.  However,
 experience to date has shown that trying to avoid pointwise compilations
 incurs additional headaches, so it's better to just accept the cost of a
 large number of recompilations.  Given that, it might make sense to remove
 these options.
 Finally, with respect to workflow, there are a number of simple scripts in
 `src/script_opt/CPP/` (which should ultimately be replaced) in support of
 compiler maintenance:
 * `non-embedded-build`  
 Builds `zeek` without any embedded compiled-to-C++ scripts.
 * `bare-embedded-build`  
 Builds `zeek` with the `-b` "bare-mode" scripts compiled in.
 * `full-embedded-build`  
 Builds `zeek` with the default scripts compiled in.
 <br>
 * `eval-test-suite`  
 Runs the test suite using the `cpp` alternative over the given set of tests.
 * `test-suite-build`  
 Incrementally compiles to `CPP-gen-addl.h` code for the given test suite elements.
 <br>
 * `single-test.sh`  
 Builds the given btest test as a single `add-C++` add-on and then runs it.
 * `single-full-test.sh`  
 Builds the given btest test from scratch as a self-contained `zeek`, and runs it.
 * `update-single-test.sh`  
 Given an already-compiled `zeek` for the given test, updates its `cpp` test suite alternative.
 Some of these scripts could be made less messy if `btest` supported
 a "dry run" option that reported the executions it would do for a given
 test without actually undertaking them.
 <br>
 Known Issues
 ------------
 Here we list various known issues with using the compiler:
 <br>
 * Compiling the generated C++ code can be noticeably slow (if built using
 `./configure --enable-debug`) or hugely slow (if not), with the latter
 taking on the order of an hour on a beefy laptop.  This slowness complicates
 CI/CD approaches for always running compiled code against the test suite
 when merging changes.  It's not presently clear how feasible it is to
 speed this up.
 * Subtle bugs can arise when compiling code that uses `@if` conditional
 compilation.  The compiled code will not directly use the wrong instance
 of a script body (one that differs due to the `@if` conditional having a
 different resolution at compile time versus later run-time).  However, if
 compiled code itself calls a function that has conditional code, the
 compiled code will always call the version of the function present during
 compilation, rather than the run-time version.  This problem can be fixed
 at the cost of making all function calls more expensive (perhaps a measure
 that requires an explicit flag to activate); or, when possible, by modifying
 the conditional code to check the condition at run-time rather than at
 compile-time.
 * Code compiled with `-O gen-standalone-C++` will not execute any global
 statements when invoked using the "stand-in" script.  The right fix for
 this is to shift from encapsulating global statements in a pseudo-function,
 as currently done, to instead be in a pseudo-event handler.
 * Code compiled with `-O gen-standalone-C++` likely has bugs if that
 code requires initializing a global variable that specifies extension fields in
 an extensible record (i.e., fields added using `redef`).
 * The compiler will not compile bodies that include "when" statements.
 This is fairly involved to fix.
 * The compiler will not compile bodies that include "type" switches.
 This is not hard to fix.
 * If a lambda generates an event that is not otherwise referred to, that
 event will not be registered upon instantiating the lambda.  This is not
 particularly difficult to fix.
 * A number of steps could be taken to increase the performance of
 the optimized code.  These include:
 	1. Switching the generated code to use the new ZVal-related interfaces.
 	2. Directly calling BiFs rather than using the `Invoke()` method to do so.  This relates to the broader question of switching BiFs to be based on a notion of "inlined C++" code in Zeek functions, rather than using the standalone `bifcl` BiF compiler.
 	3. Switching the Event Engine over to queuing events with `ZVal` arguments rather than `ValPtr` arguments.
 	4. Making the compiler aware of certain BiFs that can be directly inlined (e.g., `network_time()`), a technique employed effectively by the ZAM compiler.
 	5. Inspecting the generated code for inefficiencies that the compiler could avoid.