When an agent is already running the configuration it's asked to deploy,
it will now recognize this and by default do nothing. The requester can force
it if needed, via a new argument to the deploy_request event.
The agent's Broker::peer_added handler now recognizes the Supervisor and does
not trigger a notify_agent_hello event upon it. It might still send such events
repeatedly as other things peer with the agent.
The controller now knows three states that a cluster configuration can be in:
- STAGED: as uploaded by the client
- READY: with needed tweaks applied, e.g. to fill in ports
- DEPLOYED: as sent off to agents for deployment
These states aren't exclusive, they represent checkpoints that a config goes
through from upload through deployment. A deployed configuration will also exist
in its STAGED and READY versions, unless a client has uploaded a new
configuration, which will overwrite the STAGED and READY ones.
The controller saves all of these in a table, which lets us use Broker to
persist all states to disk. We use &broker_allow_complex_type, since we only
ever store entire configurations.
This separates uploading a configuration from deploying it to the instances into
separate event transactions. set_configuration_request/response remains, but now
only conducts validation and storage of the new configuration (upon validation
success, and not yet persisted to disk). The response event indicates success or
the list of validation errors. Successful upload now returns the configuration's
ID in the result record's data struct.
The new deploy_request/response event takes a previously uploaded configuration
and deploys it to the agents.
The controller now tracks uploaded and deployed configurations
separately. Uploading assigns g_config_staged; deployment assigns
g_config_deployed. Deployment does not affect g_config_staged.
The get_config_request/response event pair now allows selecting the
configuration the caller would like to retrieve.
This renames the agent's functionality for setting a configuration to reflect
the controller's upcoming separation of set_configuration and deployment.
The instance and error fields are now optional instead of defaulting to empty
strings, which caused minor output deviations in the client.
Agents now ensure that any Result record they create has the instance field
filled in.
During `set_configuration_request` handling the controller now validates
received configurations, checking for a few common gotchas around naming and
port use. Validation continues once it finds a problem, resulting in a list
summarizing all identified problems.
The numbering process now accounts for the possibility of colliding with the
agent port, as well as with ports explicitly assigned in the configuration. It
also avoids nondeterminism that could result from traversal of sets.
It helps during testing to be able to control whether the Supervisor process
also routs node output to the console, in addition to writing to output
files. Since the Supervisor runs as the main process in Docker containers, its
output becomes visible in "docker logs" that way, simplifying diagnostics.
When the controller receives a configuration with no instances (and thus no
nodes), it needs to roundtrip to agents and can send the response right away.
This makes agents handle log archival automatically. By default, they invoke
zeek-archiver once every log rotation interval to archive rotated files from the
log-queue spool directory into the installation's log directory. The user can
disable the feature, customize the command to invoke, and adjust the rotation
interval.
Up to now, agents and controllers listened locally only, and the Supervisor
(which listens when we run an agent) listened globally. It's now the other way
around: controllers and agents listen globally and the Supervisor, when
listening, does so locally.
This enables the controller to assign listening ports to managers, loggers, and
proxies. (We don't currently make the workers listen.) The feature is controlled
by the Management::Controller::auto_assign_ports flag. When enabled (the
default), enumeration starts from Management::Controller::auto_assign_start_port,
beginning with the manager, then the logger(s), then proxy(s). When the feature
is disabled and nodes that require a port lack it, the controller rejects the
configuration.
The get-nodes command also benefits from showing the state on connected agents
more broadly (as opposed to just the one for the current configuration).
Also a bugfix: ensure we use an agent's IP address as seen by the
controller. This avoids reporting "0.0.0.0" in some cases.
This response so far contained only the connected instances that are relevant to
the current configuration, but this isn't very helpful when troubleshooting
instance connectivity. It now reports all currently connected instances, with
network addresses & ports as known to Broker.
This swaps the host event argument for the Broker ID. The latter is more useful,
since the sending agent doesn't necessarily know its IP address as visible to
the controller, and the controller can pull up the full Broker context via the
ID.
It also adds an explicit argument to the event to indicate whether the agent
connected to the controller or vice versa. This simplifies the controller's
internal logic.
Also minor tweaks to logging to show Broker IDs.
This uses the new frameworks/management/supervisor functionality to maintain
stdout/stderr files, and hooks output context into set_configuration error
results.
This improves the framework's handling of Zeek node stdout and stderr by
extending the (script-layer) Supervisor functionality.
- The Supervisor _either_ directs Zeek nodes' stdout/stderr to files _or_ lets
you hook into it at the script level. We'd like both: files make sense to allow
inspection outside of the framework, and the framework would benefit from
tapping into the streams e.g. for error context. We now provide the file
redirection functionality in the Supervisor, in addition to the hook
mechanism. The hook mechanism also builds up rolling windows of up to
100 lines (configurable) into stdout/stderr.
- The new Mangement::Supervisor::API::notify_node_exit event notifies
subscribers (agents, really) that a particular node has exited (and is possibly
being restarted by the Supervisor). The event includes the name of the node,
plus its recent stdout/stderr context.
During Zeekygen's doc generation both the agent's and controller's main.zeek get
loaded. This just happened to not throw errors so far because the redefs either
matched perfectly or used different field names.
We so far reported one result record per agent, which made it hard to report
per-node outcomes for the new configuration. Agents now report one result record
per node they're responsible for.
When the controller relays requests to agents, we want agents to time out more
quickly than the corresponding controller requests. This allows agents to
respond with more meaningful errors, while the controller's timeout acts mostly
as a last resort to ensure a response to the client actually happens.
This dials down the table_expire_interval to 2 seconds in both agent and
controller, for more predictable timeout behavior. It also dials the agent-side
request expiration interval down to 5 seconds, compared to the agent's 10
seconds.
We may have to revisit this to allow custom expiration intervals per
request/response message type.
We so far hoped for the best when an agent asked the Supervisor to launch a
node. Since the Management::Node::API::notify_node_hello events arriving from
new nodes signal when such nodes are up and running, we can use those events to
track once/whether all launched nodes have checked in, and respond accordingly.
This delays the set_configuration_response event until these checkins have
occurred, or a timeout kicks in. In case of error, the agent's response to the
controller is in error state and has the remaining, unresponsive/failed set of
nodes as its data member.
This establishes a directory "nodes" in Management::state_dir and places each
Zeek process into a subdirectory in it, named after the Zeek process. For
example, node "worker-01" runs with cwd <state_dir>/nodes/worker-01/.
Explicitly configured directories can override the naming logic, and also ignore
the state directory if they're absolute paths. One exception remains: the
Supervisor itself -- we'd have to use LogAscii::logdir to automatically place it
too in its own directory, but that feature currently does not interoperate with
log rotation.
This adds management/persistence.zeek to establish common configuration for log
rotation and persistent variable state. Log-writing Zeek processes initially
write locally in their working directory, and rotate into subdirectory
"log-queue" of the spool. Since agent and controller have no logger,
persistence.zeek puts in place compatible configurations for them.
Storage folders for Broker-backed tables and clusterized stores default to
subdirectories of the new Zeek-level state folder.
When setting the ZEEK_MANAGEMENT_TESTING environment variable, persistent state
is kept in the local directory, and log rotation remains disabled.
This also tweaks @loads a bit in favor of simply loading frameworks/management,
which is easier to keep track of.
This allows specifying spool and variable-state directories specifically for the
management framework. They default to the corresponding installation-level
folders.
Load the agent/controller bootstrapping code only from the Supervisor, and the
basic config only from a supervisee. When we're neither (which is likely a
mistake), we do nothing.
The fallback mechanism when no explicit agent/controller names are configured
didn't work properly, because many places in the code relied on accessing the
name via the variables meant for explicit configuration, such as
Management::Agent::name. Agent and controller now offer functions for computing
the correct effective name, and we use that throughout.
Includes submodule bumps for Broker (to pull in better handling of data
structures that are difficult to unserialize in Python), zeek-client (for the
get-config command), and a commit hash update for the external testsuite.
This adds an optional set of cluster node names to narrow the querying to. It
similarly expands the dispatch mechanism, since it likely most sense for any
such request to apply only to a subset of nodes.
Requests for invalid nodes trigger Response records in error state.
When agents receive a configuration, we don't currently honor requested run
states (there's no such thing as registering a node but not running it, for
example). To reflect this, we now start off nodes in state PENDING as we
launch them via the Supervisor, and move them to RUNNING when they check
in with us via Management::Node::API::notify_node_hello.
This adds support for retrieving the value of a global identifier from any
subset of cluster nodes. It relies on the lookup_ID() BiF to retrieve the val,
and to_json() to render the value to an easily parsed string. Ideally we'd send
the val directly, but this hits several roadblocks, including the fact that
Broker won't serialize arbitrary values.
This adds request/response event pairs to enable the controller to dispatch
"actions" (pre-implemented Zeek script actions) on subsets of Zeek cluster nodes
and collect the results. Using generic events to carry multiple such "run X on
the nodes" scenarios simplifies adding these in the future.
This provides Broker-level plumbing that allows agents to reach out to their
managed Zeek nodes and collect responses.
As a first event, it establishes Management::Node::API::notify_agent_hello,
to notify the agent when the cluster node is ready to communicate.
Also a bit of comment rewording to replace use of "data cluster" with simply
"cluster", to avoid ambiguity with data nodes in SumStats, and expansion of
test-all-policy.zeek and related/dependent tests, since we're introducing new
scripts.