
<h1 align="center">

Compiling Zeek Scripts To C++: User's Guide

</h1><h4 align="center">

[_Overview_](#overview) -
[_Workflows_](#workflows) -
[_Known Issues_](#known-issues)

</h4>


<br>

Overview
--------

Zeek's _script compiler_ is an experimental feature that translates Zeek
scripts into C++, which is then compiled directly into the `zeek` binary.
The aim is higher performance: the compiled scripts run without Zeek
needing to use an interpreter to execute them.  Using this feature
requires a somewhat complex [workflow](#workflows).

How much faster will your scripts run?  There's no simple answer to that.
It depends heavily on several factors:

* What proportion of the processing during execution is spent in Zeek's
_Event Engine_ rather than executing scripts.

* What proportion of the script's processing is spent executing built-in
functions (BiFs).
It might well be that most of your script processing actually occurs inside
the _Logging Framework_, for example, and thus you won't see much improvement.

Those two factors add up to gains that are often on the order of only
10-15%, rather than something a lot more dramatic.  On the other hand,
using this feature you can afford to put significantly more functionality
in Zeek scripts without worrying as much about introducing performance
bottlenecks.

That said, I'm very interested in situations where the performance
gains appear unsatisfying.  Also note that when using the compiler, you
can analyze the performance of your scripts using C++-oriented tools,
since the translated C++ code generally bears a clear relationship
to the original Zeek script.

If you want to know how the compiler itself works, see the sketch
at the beginning of `Compile.h`.

<br>


Workflows
---------

_Before building Zeek_, see the first of the [_Known Issues_](#known-issues)
below regarding compilation times.  If your aim is to explore the
functionality rather than to use it in production, you might want to build
Zeek using `./configure --enable-debug`, which can reduce compilation times
by 50x (!).  Once you've built it, the following sketches how to create
and use compiled scripts.

The main code generated by the compiler is taken from
`build/CPP-gen.cc`.  An empty version of this is generated when
first building Zeek.

As a user, the most common workflow is to build a version of Zeek that
has a given target script (`target.zeek`) compiled into it.  This means
_all of the code pulled in by `target.zeek`_, including the base scripts
(or the "bare" subset if you invoke the compiler when running `zeek -b`).
The following workflow assumes you are in the `build/` subdirectory:

1. `./src/zeek -O gen-C++ target.zeek`  
The generated code is written to
`CPP-gen.cc`.
2. `ninja` or `make` to recompile Zeek
3. `./src/zeek -O use-C++ target.zeek`  
Executes with each function/hook/event
handler pulled in by `target.zeek` replaced with its compiled version.

Instead of the last step above, you can use the following variant:

3. `./src/zeek -O report-C++ target.zeek`  
For each function body in
`target.zeek`, reports which ones have compiled-to-C++ bodies available,
and also any compiled-to-C++ bodies present in the `zeek` binary that
`target.zeek` does not use.  Useful for debugging.
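The basic generate / rebuild / run cycle above can be strung together in a small wrapper.  The following is only a sketch, not part of Zeek: the `run` and `compile_and_run` helpers and the `DRY_RUN` variable are hypothetical names, and it assumes it is invoked from `build/` and that `ninja` (rather than `make`) is the build tool.  By default it just prints the commands; setting `DRY_RUN=0` would execute them.

```shell
#!/bin/sh
# Hypothetical wrapper around the three workflow steps; run from build/.
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to really execute.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Echo the command, then execute it only when not in dry-run mode.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

compile_and_run() {
    target="$1"
    run ./src/zeek -O gen-C++ "$target" || return 1   # step 1: writes CPP-gen.cc
    run ninja || return 1                             # step 2: or "make"
    run ./src/zeek -O use-C++ "$target"               # step 3: run compiled bodies
}

compile_and_run target.zeek
```

Substituting `report-C++` for `use-C++` in the last step would give the reporting variant instead.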

The above workflows require the subsequent `zeek` execution to include
the `target.zeek` script.  You can avoid this by replacing the first step with:

1. `./src/zeek -O gen-standalone-C++ target.zeek >target-stand-in.zeek`

(and then building as in the 2nd step above).
This option prints to _stdout_ a (very short) "stand-in" Zeek script,
captured here as `target-stand-in.zeek`.  Loading that script activates
the compiled `target.zeek` without needing to include `target.zeek` in
the invocation (nor the `-O use-C++` option).  After loading the stand-in
script, you can still access types and functions declared in `target.zeek`.
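As a rough illustration of the standalone variant, the sketch below prints the command sequence for a given script.  The `standalone_steps` function is a hypothetical helper, not something Zeek provides, and it assumes the `build/` directory, `ninja` as the build tool, and file names without spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the standalone workflow for one script.
standalone_steps() {
    target="$1"
    # Derive the stand-in name, e.g. target.zeek -> target-stand-in.zeek
    stand_in="${target%.zeek}-stand-in.zeek"
    printf '%s\n' \
        "./src/zeek -O gen-standalone-C++ $target >$stand_in" \
        "ninja" \
        "./src/zeek $stand_in"
}

standalone_steps target.zeek
```

Piping the output to `sh` from within `build/` would run the steps.  Note that the final invocation names only the stand-in script and passes no `-O` option.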

Note: the implementation differences between `gen-C++` and `gen-standalone-C++`
wound up being modest enough that it might make sense to just always provide
the latter functionality, which it turns out does not introduce any
additional constraints compared to the current `gen-C++` functionality.
On the other hand, it's possible (not yet established) that code created
using `gen-C++` can be made to compile significantly faster than
standalone code.

Another option, `-O add-C++`, instead _appends_ the generated code to existing C++ in `CPP-gen.cc`.
You can use this option repeatedly for different scripts and then
compile the collection _en masse_.
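A sketch of that repeated-append pattern follows.  The `add_all` function is a hypothetical helper that only prints the commands; the script names are placeholders, and `ninja` is assumed as the build tool.

```shell
#!/bin/sh
# Hypothetical helper: one add-C++ pass per script, then a single rebuild.
add_all() {
    for script in "$@"; do
        # Each invocation appends to the existing CPP-gen.cc
        echo "./src/zeek -O add-C++ $script"
    done
    # Compile the accumulated code once at the end (or "make")
    echo "ninja"
}

add_all first.zeek second.zeek
```

The point of the design is that the expensive C++ compilation happens once for the whole collection rather than once per script.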

There are additional workflows relating to running the test suite, which
we document only briefly here, as they're likely to change or go away:
it's not clear they're actually needed.

* `non-embedded-build`  
Builds `zeek` without any embedded compiled-to-C++ scripts.
* `bare-embedded-build`  
Builds `zeek` with the `-b` "bare-mode" scripts compiled in.
* `full-embedded-build`  
Builds `zeek` with the default scripts compiled in.

<br>

* `eval-test-suite`  
Runs the test suite using the `cpp` alternative over the given set of tests.
* `test-suite-build`  
Incrementally compiles the code for the given test suite elements into `CPP-gen-addl.h`.

<br>

* `single-test.sh`  
Builds the given btest test as a single `add-C++` add-on and then runs it.
* `single-full-test.sh`  
Builds the given btest test from scratch as a self-contained `zeek`, and runs it.
* `update-single-test.sh`  
Given an already-compiled `zeek` for the given test, updates its `cpp` test suite alternative.

Some of these scripts could be made less messy if `btest` supported
a "dry run" option that reported the executions it would do for a given
test without actually undertaking them.

<br>

Known Issues
------------

Here we list various known issues with using the compiler:
<br>

* Compiling the generated C++ code can be quite slow when the C++ compilation
includes optimization,
taking many minutes on a beefy laptop.  This slowness complicates
CI/CD approaches for always running compiled code against the test suite
when merging changes.

* Run-time error messages generally lack location information and information
about associated expressions/statements, making them hard to puzzle out.
This could be fixed, but would add execution overhead in passing around
the necessary strings / `Location` objects.

* To avoid subtle bugs, the compiler will refrain from compiling script elements (functions, hooks, event handlers) that include conditional code.  In addition, when using `--optimize-files` it will not compile any functions appearing in a source file that includes conditional code (even if it's not in a function body).

* Code compiled with `-O gen-standalone-C++` will not execute any global
statements when invoked using the "stand-in" script.  The right fix for
this is to shift from encapsulating global statements in a pseudo-function,
as currently done, to instead be in a pseudo-event handler.

* Code compiled with `-O gen-standalone-C++` likely has bugs if that
code requires initializing a global variable that specifies extension fields in
an extensible record (i.e., fields added using `redef`).

* The compiler will not compile bodies that include "when" statements.
This is fairly involved to fix.

* The compiler will not compile bodies that include "type" switches.
This is not hard to fix.

* If a lambda generates an event that is not otherwise referred to, that
event will not be registered upon instantiating the lambda.  This is not
particularly difficult to fix.

* A number of steps could be taken to increase the performance of
the optimized code.  These include:
	1. Switching the generated code to use the new ZVal-related interfaces.
	2. Directly calling BiFs rather than using the `Invoke()` method to do so.  This relates to the broader question of switching BiFs to be based on a notion of "inlined C++" code in Zeek functions, rather than using the standalone `bifcl` BiF compiler.
	3. Switching the Event Engine over to queuing events with `ZVal` arguments rather than `ValPtr` arguments.
	4. Making the compiler aware of certain BiFs that can be directly inlined (e.g., `network_time()`), a technique employed effectively by the ZAM compiler.
	5. Inspecting the generated code for inefficiencies that the compiler could avoid.