util::safe_write() calls abort() in case of EAGAIN errors. This is
easily observed when starting clusters with 32 workers or more.
Add a custom write_message() function handling EAGAIN by retrying
after a small sleep. It's not clear a more complicated poll() would be
much better: The pipe might be ready for writing, but then our message
might not actually fit in, resulting in another EAGAIN error. And even
poll() would introduce blocking/sleeping code.
Take some precautions against the stem and the supervisor dead-locking
when both pipes are full by draining the other end on EAGAIN errors.
Closes#3043
When multiple loggers are configured in a Supervisor controlled cluster
configuration, encode extra information into the rotated filename to
identify which logger produced the log.
This is similar to the approach taken for ZeekControl, re-using the
log_suffix terminology, but as there's only a single zeek-archiver
process and no postprocessors and no other side-channel for additional
information, we encode extra metadata into the filename. zeek-archiver
is extended to recognize the special metadata part of the filename.
This also solves the issue that multiple loggers in a supervisor setup
overwrite each others log files within a single log-queue directory.
In supervised nodes, the Supervisor's NodeConfig$scripts vector adds scripts to
the end of the user-provided scripts (options.scripts_to_load), so they load
_after_ any user-provided ones. This can cause confusing redef pitfalls when
users expect their customizations to run last, as they normally do.
This adds two members in Supervisor::NodeConfig, `addl_base_scripts` and
`addl_user_scripts`, to store scripts to load before and after the user scripts,
respectively. The latter serves the same purpose as the old `scripts` member,
which is still there but deprecated (in scriptland only). It functions as
before, after any scripts added via `addl_user_scripts`.
The Supervisor generates this event every time it receives a status update from
the stem, meaning a node got created or re-created. A corresponding
SupervisorControl::node_status event relays the same information for users
interacting with the Supervisor over Broker.
When omitted, the node inherits the Supervisor's bare-mode
status. When true/false, the new Zeek node will enable/disable bare
mode, respectively. It continues to load any scripts passed at the
command line and in the additional scripts list already provided in
the node configuration.
Includes testcase.
The NodeConfig record now has a table for specifying environment variable names
and values, which the supervisor sets in the created node.
This also repositions the cpu_affinity member to keep the order the same in
the corresponding script-layer and in-core types.
Includes testcase.
- Use `-b` most everywhere, it will save time.
- Start some intel tests upon the input file being fully read instead of
at an arbitrary time.
- Improve termination condition for some sumstats/cluster tests.
- Filter uninteresting output from some supervisor tests.
- Test for `notice_policy.log` is no longer needed.
These may be redefined to customize log rotation path prefixes,
including use of a directory. File extensions are still up to
individual log writers to add themselves during the actual rotation.
These new also allow for some simplication to the default
ASCII postprocessor function: it eliminates the need for it doing an
extra/awkward rename() operation that only changes the timestamp format.
This also teaches the supervisor framework to use these new options
to rotate ascii logs into a log-queue/ directory with a specific
file name format (intended for an external archiver process to
monitor separately).
This helps prevent a node from being killed/crashing in the middle
of writing a log, restarting, and eventually clobbering that log
file that never underwent the rotation/archival process.
The old `archive-log` and `post-terminate` scripts as used by
ZeekControl previously implemented this behavior, but the new logic is
entirely in the ASCII writer. It uses ".shadow" log files stored
alongside the real log to help detect such scenarios and rotate them
correctly upon the next startup of the Zeek process.
The stdout/stderr of child processes is now redirected over a pipe back
to the supervisor process so that it can prefix the output with
the name of the emitting node.
* Generally increase timeouts for tests that have recent transient
failures
* Change any test that relied on `btest-bg-wait -k` since that's never
going to play with with CI systems. Instead, we always need to have
a well-defined termination condition in the test itself (and most
already did, so didn't really need the `-k` flag anyway).