mirror of https://github.com/zeek/zeek.git synced 2025-10-05 08:08:19 +00:00

Tim Wojtulewicz 64748edab1 Replace most uses of typedef with using for type aliasing		2021-10-11 14:51:10 -07:00
Attrs.cc	Reformat the world	2021-09-16 15:35:39 -07:00
bare-embedded-build	updates to development helper scripts to support new workflow	2021-06-04 17:02:43 -07:00
Compile.h	Reformat the world	2021-09-16 15:35:39 -07:00
Consts.cc	clang-format: Set IndentCaseBlocks to false	2021-09-27 10:49:48 -07:00
CPP-load.bif	fixes for standalone C++ scripts making types & variables/functions available	2021-06-04 17:14:46 -07:00
DeclFunc.cc	Reformat the world	2021-09-16 15:35:39 -07:00
Driver.cc	Reformat the world	2021-09-16 15:35:39 -07:00
Emit.cc	Reformat the world	2021-09-16 15:35:39 -07:00
eval-test-suite	minor tweaks tidyness tweaks	2021-05-05 16:55:04 -07:00
Exprs.cc	clang-format: Set penalty for breaking after assignment operator	2021-09-27 10:49:48 -07:00
full-embedded-build	updates to development helper scripts to support new workflow	2021-06-04 17:02:43 -07:00
Func.cc	Reformat the world	2021-09-16 15:35:39 -07:00
Func.h	Reformat the world	2021-09-16 15:35:39 -07:00
GenFunc.cc	Reformat the world	2021-09-16 15:35:39 -07:00
HashMgr.cc	Reformat the world	2021-09-16 15:35:39 -07:00
HashMgr.h	Reformat the world	2021-09-16 15:35:39 -07:00
Inits.cc	clang-format: Set penalty for breaking after assignment operator	2021-09-27 10:49:48 -07:00
ISSUES	the bulk of the compiler	2021-05-05 16:55:04 -07:00
non-embedded-build	updates to development helper scripts to support new workflow	2021-06-04 17:02:43 -07:00
README.md	low-level tidying/nits - no semantic changes	2021-09-08 10:23:38 -07:00
Runtime.h	Reformat the world	2021-09-16 15:35:39 -07:00
RuntimeInit.cc	Reformat the world	2021-09-16 15:35:39 -07:00
RuntimeInit.h	Replace most uses of typedef with using for type aliasing	2021-10-11 14:51:10 -07:00
RuntimeOps.cc	Reformat the world	2021-09-16 15:35:39 -07:00
RuntimeOps.h	Reformat the world	2021-09-16 15:35:39 -07:00
RuntimeVec.cc	clang-format: Set IndentCaseBlocks to false	2021-09-27 10:49:48 -07:00
RuntimeVec.h	Reformat the world	2021-09-16 15:35:39 -07:00
single-full-test.sh	updates to development helper scripts to support new workflow	2021-06-04 17:02:43 -07:00
single-test.sh	updates to development helper scripts to support new workflow	2021-06-04 17:02:43 -07:00
Stmts.cc	clang-format: Set IndentCaseBlocks to false	2021-09-27 10:49:48 -07:00
test-suite-build	the bulk of the compiler	2021-05-05 16:55:04 -07:00
Tracker.cc	Reformat the world	2021-09-16 15:35:39 -07:00
Tracker.h	Reformat the world	2021-09-16 15:35:39 -07:00
Types.cc	clang-format: Set IndentCaseBlocks to false	2021-09-27 10:49:48 -07:00
update-single-test.sh	updates to development helper scripts to support new workflow	2021-06-04 17:02:43 -07:00
Util.cc	Reformat the world	2021-09-16 15:35:39 -07:00
Util.h	Reformat the world	2021-09-16 15:35:39 -07:00
Vars.cc	Reformat the world	2021-09-16 15:35:39 -07:00

README.md

Compiling Zeek Scripts To C++: User's Guide

Overview - Workflows - Known Issues

Overview

Zeek's script compiler is an experimental feature that translates Zeek scripts into C++, which is then compiled directly into the zeek binary in order to gain higher performance by removing the need for Zeek to use an interpreter to execute the scripts. Using this feature requires a somewhat complex workflow.

How much faster will your scripts run? There's no simple answer to that. It depends heavily on several factors:

What proportion of the processing during execution is spent in Zeek's Event Engine rather than executing scripts.
What proportion of the script's processing is spent executing built-in functions (BiFs). It might well be that most of your script processing actually occurs inside the Logging Framework, for example, and thus you won't see much improvement.
Those two factors add up to gains often on the order of only 10-15%, rather than something a lot more dramatic. On the other hand, using this feature you can afford to put significantly more functionality in Zeek scripts without worrying as much about introducing performance bottlenecks.

That said, I'm very interested in situations where the performance gains appear unsatisfying. Also note that when using the compiler, you can analyze the performance of your scripts using C++-oriented tools - the translated C++ code generally bears a clear relationship with the original Zeek script.
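As a rough illustration of why the two factors above cap the achievable gain, the following sketch applies Amdahl's law. The function name `estimate_gain` and both input numbers are hypothetical, chosen only to show the arithmetic; they are not measurements of the compiler.

```shell
#!/bin/sh
# Back-of-the-envelope estimate using Amdahl's law. All inputs are hypothetical.
#   $1: fraction of total run time spent interpreting script code
#       (i.e., time NOT spent in the Event Engine and NOT inside BiFs)
#   $2: assumed speedup of that fraction once it is compiled to C++
# Prints the resulting reduction in total run time as a percentage.
estimate_gain() {
    awk -v f="$1" -v s="$2" 'BEGIN {
        # Only the interpreted fraction f gets faster; the rest is unchanged.
        printf "%.1f%%\n", f * (1 - 1 / s) * 100
    }'
}

estimate_gain 0.25 2    # prints 12.5%
```

Under these assumed inputs, even doubling the speed of script execution trims total run time by only one eighth, because three quarters of the time is spent elsewhere.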

If you want to know how the compiler itself works, see the sketch at the beginning of Compile.h.

Workflows

Before building Zeek, see the first of the Known Issues below regarding compilation times. If your aim is exploration of the functionality rather than production use, you might want to build Zeek using ./configure --enable-debug, which can reduce compilation times by 50x (!). Once you've built it, the following sketches how to create and use compiled scripts.

The main code generated by the compiler is taken from build/CPP-gen.cc. An empty version of this is generated when first building Zeek.

As a user, the most common workflow is to build a version of Zeek that has a given target script (target.zeek) compiled into it. This means all of the code pulled in by target.zeek, including the base scripts (or the "bare" subset if you invoke the compiler when running zeek -b). The following workflow assumes you are in the build/ subdirectory:

1. ./src/zeek -O gen-C++ target.zeek
   The generated code is written to CPP-gen.cc. The compiler will also produce a file CPP-hashes.dat, for use by an advanced feature, and an empty CPP-gen-addl.h file (same).
2. ninja or make to recompile Zeek
3. ./src/zeek -O use-C++ target.zeek
   Executes with each function/hook/event handler pulled in by target.zeek replaced with its compiled version.
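
The three steps above can be strung together as a small shell function. This is only a sketch: the helper names `run` and `compile_into_zeek` and the `DRY_RUN` variable are illustrative and not part of Zeek, and it assumes you are in the build/ subdirectory and build with ninja.

```shell
#!/bin/sh
# Sketch of the gen-C++ / rebuild / use-C++ workflow. Run from build/.
# Set DRY_RUN=1 to print each command instead of executing it.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

compile_into_zeek() {
    target=$1
    # Each step runs only if the previous one succeeded.
    run ./src/zeek -O gen-C++ "$target" &&   # step 1: writes CPP-gen.cc
    run ninja &&                             # step 2: or "make"
    run ./src/zeek -O use-C++ "$target"      # step 3: run with compiled bodies
}
```

For example, `DRY_RUN=1 compile_into_zeek target.zeek` lists the three commands without running them. Keep in mind that the rebuild step can take a long time (see Known Issues).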

Instead of the last step above, you can use the following variant:

./src/zeek -O report-C++ target.zeek
For each function body in target.zeek, reports which ones have compiled-to-C++ bodies available, and also any compiled-to-C++ bodies present in the zeek binary that target.zeek does not use. Useful for debugging.

The above workflows require the subsequent zeek execution to include the target.zeek script. You can avoid this by replacing the first step with:

./src/zeek -O gen-standalone-C++ target.zeek >target-stand-in.zeek

(and then building as in the 2nd step above). This option prints to stdout a (very short) "stand-in" Zeek script that you can load using target-stand-in.zeek to activate the compiled target.zeek without needing to include target.zeek in the invocation (nor the -O use-C++ option). After loading the stand-in script, you can still access types and functions declared in target.zeek.
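
A sketch of the standalone variant follows, under the same assumptions as the main workflow (run from build/, rebuild with ninja). The function name `standalone_workflow`, the `DRY_RUN` variable, and the way the stand-in file name is derived from the target are illustrative choices, not something the compiler requires.

```shell
#!/bin/sh
# Sketch of the gen-standalone-C++ workflow. Run from build/.
# Set DRY_RUN=1 to print the commands instead of executing them.

standalone_workflow() {
    target=$1
    # Illustrative naming: target.zeek -> target-stand-in.zeek
    standin="${target%.zeek}-stand-in.zeek"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "./src/zeek -O gen-standalone-C++ $target >$standin"
        echo "ninja"
        echo "./src/zeek $standin"
        return 0
    fi

    # The stand-in script is printed to stdout, so capture it in a file.
    ./src/zeek -O gen-standalone-C++ "$target" >"$standin" &&
    ninja &&                  # or "make"
    ./src/zeek "$standin"     # no target.zeek and no -O use-C++ needed
}
```

The final invocation loads only the stand-in script, which is the point of this mode: the original target.zeek does not need to be on the command line.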

Note: the implementation differences between gen-C++ and gen-standalone-C++ wound up being modest enough that it might make sense to just always provide the latter functionality, which it turns out does not introduce any additional constraints compared to the current gen-C++ functionality. On the other hand, it's possible (not yet established) that code created using gen-C++ can be made to compile significantly faster than standalone code.

There are additional workflows relating to running the test suite, which we document only briefly here because they're likely to change or go away: it's not clear they're actually needed.

First, -O update-C++ will run using a Zeek instance that already includes compiled scripts and, for any functions pulled in by the command-line scripts, if they're not already compiled, will generate additional C++ code for those that can be combined with the already-compiled code. The additionally compiled code leverages the existing compiled-in functions (and globals), which it learns about via the CPP-hashes.dat file mentioned above. Any code compiled in this fashion must be consistent with the previously compiled code, meaning that globals and extensible types (enums, records) have definitions that align with those previously used, and any other code later compiled must also be consistent.

In a similar vein, -O add-C++ likewise uses a Zeek instance that already includes compiled scripts. It generates additional C++ code that leverages that existing compilation. However, this code is not meant for use with subsequently compiled code; later code also built with add-C++ can have inconsistencies with this code. (The utility of this mode is to support compiling the entire test suite as one large incremental compilation, rather than as hundreds of pointwise compilations.)

Both of these append to any existing CPP-gen-addl.h file, providing a means for building it up to reflect a number of compilations.

The update-C++ and add-C++ options help support different ways of building the btest test suite. They were meant to enable doing so without requiring per-test-suite-element recompilations. However, experiences to date have found that trying to avoid pointwise compilations incurs additional headaches, so it's better to just bite off the cost of a large number of recompilations. Given that, it might make sense to remove these options.

Finally, with respect to workflow, there are a number of simple scripts in src/script_opt/CPP/ (which should ultimately be replaced) in support of compiler maintenance:

non-embedded-build
Builds zeek without any embedded compiled-to-C++ scripts.
bare-embedded-build
Builds zeek with the -b "bare-mode" scripts compiled in.
full-embedded-build
Builds zeek with the default scripts compiled in.

eval-test-suite
Runs the test suite using the cpp alternative over the given set of tests.
test-suite-build
Incrementally compiles to CPP-gen-addl.h code for the given test suite elements.

single-test.sh
Builds the given btest test as a single add-C++ add-on and then runs it.
single-full-test.sh
Builds the given btest test from scratch as a self-contained zeek, and runs it.
update-single-test.sh
Given an already-compiled zeek for the given test, updates its cpp test suite alternative.

Some of these scripts could be made less messy if btest supported a "dry run" option that reported the executions it would do for a given test without actually undertaking them.

Known Issues

Here we list various known issues with using the compiler:

Compilation of compiled code can be noticeably slow (if built using ./configure --enable-debug) or hugely slow (if not), with the latter taking on the order of an hour on a beefy laptop. This slowness complicates CI/CD approaches for always running compiled code against the test suite when merging changes. It's not presently clear how feasible it is to speed this up.
Run-time error messages generally lack location information and information about associated expressions/statements, making them hard to puzzle out. This could be fixed, but would add execution overhead in passing around the necessary strings / Location objects.
Subtle bugs can arise when compiling code that uses @if conditional compilation. The compiled code will not directly use the wrong instance of a script body (one that differs due to the @if conditional having a different resolution at compile time versus later run-time). However, if compiled code itself calls a function that has conditional code, the compiled code will always call the version of the function present during compilation, rather than the run-time version. This problem can be fixed at the cost of making all function calls more expensive (perhaps a measure that requires an explicit flag to activate); or, when possible, by modifying the conditional code to check the condition at run-time rather than at compile-time.
Code compiled with -O gen-standalone-C++ will not execute any global statements when invoked using the "stand-in" script. The right fix for this is to shift from encapsulating global statements in a pseudo-function, as currently done, to instead be in a pseudo-event handler.
Code compiled with -O gen-standalone-C++ likely has bugs if that code requires initializing a global variable that specifies extended fields in an extensible record (i.e., fields added using redef).
The compiler will not compile bodies that include "when" statements. This is fairly involved to fix.
The compiler will not compile bodies that include "type" switches. This is not hard to fix.
If a lambda generates an event that is not otherwise referred to, that event will not be registered upon instantiating the lambda. This is not particularly difficult to fix.
A number of steps could be taken to increase the performance of the optimized code. These include:
1. Switching the generated code to use the new ZVal-related interfaces.
2. Directly calling BiFs rather than using the Invoke() method to do so. This relates to the broader question of switching BiFs to be based on a notion of "inlined C++" code in Zeek functions, rather than using the standalone bifcl BiF compiler.
3. Switching the Event Engine over to queuing events with ZVal arguments rather than ValPtr arguments.
4. Making the compiler aware of certain BiFs that can be directly inlined (e.g., network_time()), a technique employed effectively by the ZAM compiler.
5. Inspecting the generated code for inefficiencies that the compiler could avoid.