This PR changes the way in which the SSL analyzer tracks the direction
of connections. So far, the SSL analyzer assumed that the originator of
a connection would send the client hello (and other associated
client-side events), and that the responder would be the SSL server.
In some circumstances this is not true, and the initiator of a
connection is the server, with the responder being the client. So far
this confused some of the internal statekeeping logic and could lead to
mis-parsing of extensions.
This reversal of roles can happen in DTLS, if a connection uses STUN -
and potentially in some StartTLS protocols.
This PR tracks the direction of a TLS connection using the hello
request, client hello and server hello handshake messages. Furthermore,
it changes the SSL events from providing is_orig to providing is_client,
where is_client is true for the client side of a connection. Since the
argument positioning in the event has not changed, old scripts will
continue to work seamlessly - the new semantics are what everyone
writing SSL scripts will have expected in any case.
There is a new event that is raised when a connection is flipped. A
weird is raised if a flip happens repeatedly.
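
To illustrate the direction tracking described above, here is a minimal Python
sketch. It is a simplified model, not the analyzer's code: the class name, the
message-type constants, and the weird name are all hypothetical. It only shows
how is_client can be derived from is_orig plus a flip bit, and how a repeated
flip could be flagged.

```python
from dataclasses import dataclass, field

# Handshake message types used for direction tracking (illustrative names).
HELLO_REQUEST = "hello_request"  # sent by the server
CLIENT_HELLO = "client_hello"    # sent by the client
SERVER_HELLO = "server_hello"    # sent by the server
TRACKED = {HELLO_REQUEST, CLIENT_HELLO, SERVER_HELLO}


@dataclass
class DirectionTracker:
    """Hypothetical model of per-connection direction state."""
    flipped: bool = False  # True once the originator is known to be the server
    flip_count: int = 0
    weirds: list = field(default_factory=list)

    def is_client(self, is_orig: bool) -> bool:
        # Without a flip the originator is the client; after a flip it is
        # the responder.
        return is_orig != self.flipped

    def observe(self, msg_type: str, is_orig: bool) -> None:
        if msg_type not in TRACKED:
            return  # other messages do not influence the direction
        sender_is_client = msg_type == CLIENT_HELLO
        if self.is_client(is_orig) == sender_is_client:
            return  # consistent with the current assumption
        self.flipped = not self.flipped
        self.flip_count += 1
        if self.flip_count > 1:
            # Placeholder name; the actual weird name is not given here.
            self.weirds.append("repeated_connection_flip")
```

In this model, a server hello sent by the originator flips the connection once,
after which is_client(is_orig=False) is true for the responder.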
Addresses GH-2198.
This adds restart request/response event pairs that restart nodes in the running
Zeek cluster. The implementation is very similar to get_id_value, which also
involves distributing a list of nodes to agents and aggregating the responses.
This declares our helper functions for sending events to the Supervisor, and
makes them return the created request objects to enable the caller to modify
them. It also adds a helper for restart and status requests, uses the helpers
throughout the module, and makes all handlers more resilient in case Supervisor
events other than the agent's arrive.
The controller now logs its deployment attempt of a persisted configuration at
startup. This is generally helpful to see recorded, and also explains timeout of
the underlying request in case of failure (which triggers a timeout message).
For the case of a running cluster with no connected agents, use the
g_instances_known table instead of g_instances. The latter reflects the contents
of the last deployed config, not the live scenario of actually attached agents.
The timeout result wasn't actually stored in requests timing out in the
agent. (So far that's for deployment requests.) Also log the timing out of any
request state, similar to the controller.
No functional change, just a consistency tweak. Since agent and controller send
response events via Broker::publish(), the arguments aren't named and so this
only affects the API definition.
* topic/christian/management-deploy: (21 commits)
Management framework: bump external cluster testsuite
Management framework: bump zeek-client
Management framework: rename set_configuration events to stage_configuration
Management framework: trigger deployment upon when instances are ready
Management framework: more resilient node shutdown upon deployment
Management framework: re-trigger deployment upon controller launch
Management framework: move most deployment handling to internal function
Management framework: distinguish internally and externally requested deployments
Management framework: track instances by their Broker IDs
Management framework: tweak Supervisor event logging
Management framework: make helper function a local
Management framework: rename "log_level" to "level"
Management framework: add "finish" callback to requests
Management framework: add a helper for rendering result vectors to a string
Management framework: agents now skip re-deployment of current config
Management framework: suppress notify_agent_hello upon Supervisor peering
Management framework: introduce state machine for configs and persist them
Management framework: introduce deployment API in controller
Management framework: rename agent "set_configuration" to "deploy"
Management framework: consistency fixes to the Result record
...
More resilience: when an agent restarts, it checks in with the controller. If
the controller has deployed a config, this check-in may lead to an internal
notify_agents_ready event. At that point, we now trigger a deployment unless one
is already running. This ensures that any agents not yet
running the current cluster will start to do so, and does nothing when those
agents already run it, since they ignore the request in that case.
When agents had to terminate existing Zeek cluster nodes at the beginning of a
new deployment, they so far used their internal state to look up the nodes and
fired off requests to the Supervisor to shut these down. This has a problem:
when an agent restarts unexpectedly, it has no internal state, and when it then
tries to create nodes that already exist, the Supervisor complains with error
messages.
To avoid this, the agent now tears down all Supervised nodes other than agents
and controllers. In order to do so, it first needs to query the Supervisor for
the current node status, which means there are now two such status requests: one
upon deployment, and one during get_nodes requests. In order to disambiguate
these contexts in the SupervisorControl::status_request/response transactions,
we use the finish() callback in the corresponding request state to continue
execution as needed.
A resilience feature: when a booting controller has a previously deployed
configuration (just reloaded from persistent state), it now triggers a
deployment. If agents run something else at this point, this restores the
controller's understanding of what's deployed. If the agents still run this
configuration, the deployment does nothing, since agents ignore deployment of a
configuration they already run.
The controller now runs most of a config deployment via an internal function,
allowing it to be called from multiple places instead of just the deploy_request
event handler.
The controller's deployment request state now features a bit that indicates
whether the deployment was requested by a client, or triggered internally. This
affects logging and the transmission of deployment response events via Broker,
which are skipped when the deployment is internal.
This is in preparation for resilience features when the controller (re-)boots.
This allows us to handle loss of Broker peerings, updating instance state as we
see instances go away. This also tweaks logging slightly to differentiate
between an instance checking in for the first time, and checking in when the
controller already knows it.
These callbacks are handy for stringing together codepaths separated by event
request/response transactions: when such a transaction completes, the callback
allows locating a parent request for the finished one, to continue its
processing.
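
The callback pattern described above can be sketched in Python as follows. This
is a hedged model only: the Request fields, the global request table, and the
function names are assumptions made for illustration and do not mirror the
framework's actual request records.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

_ids = itertools.count(1)
requests: Dict[int, "Request"] = {}  # pending request state, keyed by ID


@dataclass
class Request:
    id: int
    parent_id: Optional[int] = None  # request that spawned this one, if any
    finish: Optional[Callable[["Request"], None]] = None
    results: list = field(default_factory=list)


def create_request(parent_id=None, finish=None) -> Request:
    req = Request(id=next(_ids), parent_id=parent_id, finish=finish)
    requests[req.id] = req
    return req


def on_response(reqid: int, result: str) -> None:
    """Model of a response event handler for any request type."""
    req = requests.pop(reqid, None)
    if req is None:
        return  # unknown request, or one that already timed out
    req.results.append(result)
    if req.finish is not None:
        req.finish(req)  # continue whatever codepath issued this request


def continue_parent(child: Request) -> None:
    """Example finish callback: hand the child's results to its parent."""
    parent = requests[child.parent_id]
    parent.results.extend(child.results)
```

A deployment could, for instance, create a child status request with
finish=continue_parent, while a get_nodes request would pass a different
callback. The shared response handler then needs no knowledge of either context.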
When an agent is already running the configuration it's asked to deploy,
it will now recognize this and by default do nothing. The requester can force
it if needed, via a new argument to the deploy_request event.
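
A minimal Python sketch of this behavior, under the assumption that the agent
identifies a configuration by its ID (the class, method, and the name of the
"force" argument are hypothetical):

```python
from typing import Optional


class Agent:
    """Hypothetical model of an agent's deployment handling."""

    def __init__(self) -> None:
        self.current_config_id: Optional[str] = None
        self.deployments = 0  # number of deployments actually carried out

    def handle_deploy_request(self, config_id: str, force: bool = False) -> str:
        # Skip the work when the requested config is already running,
        # unless the requester insists.
        if not force and config_id == self.current_config_id:
            return "skipped"
        self.current_config_id = config_id
        self.deployments += 1
        return "deployed"
```

Because a repeated request is a no-op by default, the controller can re-send a
deployment without disrupting agents that are already up to date.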
The agent's Broker::peer_added handler now recognizes the Supervisor and does
not trigger a notify_agent_hello event upon it. It might still send such events
repeatedly as other things peer with the agent.
The controller now knows three states that a cluster configuration can be in:
- STAGED: as uploaded by the client
- READY: with needed tweaks applied, e.g. to fill in ports
- DEPLOYED: as sent off to agents for deployment
These states aren't exclusive: they represent checkpoints that a config goes
through from upload through deployment. A deployed configuration will also exist
in its STAGED and READY versions, unless a client has uploaded a new
configuration, which will overwrite the STAGED and READY ones.
The controller saves all of these in a table, which lets us use Broker to
persist all states to disk. We use &broker_allow_complex_type, since we only
ever store entire configurations.
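
The three checkpoints can be modeled with a small Python sketch. It assumes
that the READY version is derived at upload time and that deployment promotes
the READY version. The prepare() tweak is a stand-in, and none of the names
below are the controller's actual identifiers.

```python
import copy
from enum import Enum
from typing import Dict


class ConfigState(Enum):
    STAGED = "staged"      # as uploaded by the client
    READY = "ready"        # with needed tweaks applied
    DEPLOYED = "deployed"  # as sent off to agents


def prepare(config: dict) -> dict:
    """Stand-in for the real tweaks, such as filling in ports."""
    ready = copy.deepcopy(config)
    ready["prepared"] = True
    return ready


class Controller:
    def __init__(self) -> None:
        # One table holding whole configurations, keyed by state. Persisting
        # a single table is what makes saving all states straightforward.
        self.configs: Dict[ConfigState, dict] = {}

    def stage(self, config: dict) -> None:
        # A new upload overwrites STAGED and READY but leaves DEPLOYED alone.
        self.configs[ConfigState.STAGED] = copy.deepcopy(config)
        self.configs[ConfigState.READY] = prepare(config)

    def deploy(self) -> None:
        self.configs[ConfigState.DEPLOYED] = copy.deepcopy(
            self.configs[ConfigState.READY])
```

After stage(a), deploy(), stage(b), the table holds b as STAGED and READY while
a remains the DEPLOYED configuration.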
This splits uploading a configuration and deploying it to the instances into
separate event transactions. set_configuration_request/response remains, but now
only conducts validation and storage of the new configuration (upon validation
success, and not yet persisted to disk). The response event indicates success or
the list of validation errors. Successful upload now returns the configuration's
ID in the result record's data struct.
The new deploy_request/response event takes a previously uploaded configuration
and deploys it to the agents.
The controller now tracks uploaded and deployed configurations
separately. Uploading assigns g_config_staged; deployment assigns
g_config_deployed. Deployment does not affect g_config_staged.
The get_config_request/response event pair now allows selecting the
configuration the caller would like to retrieve.
This renames the agent's functionality for setting a configuration to reflect
the controller's upcoming separation of set_configuration and deployment.
The instance and error fields are now optional instead of defaulting to empty
strings, which caused minor output deviations in the client.
Agents now ensure that any Result record they create has the instance field
filled in.
During `set_configuration_request` handling the controller now validates
received configurations, checking for a few common gotchas around naming and
port use. Validation continues once it finds a problem, resulting in a list
summarizing all identified problems.
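
As an illustration of the collect-all-errors approach, here is a hedged Python
sketch. The two checks shown (duplicate node names, and a port used twice on
the same instance) are examples chosen for this sketch, not a statement of which
checks the controller performs. The node dictionary layout is likewise assumed.

```python
from typing import Dict, List, Tuple


def validate(nodes: List[dict]) -> List[str]:
    """Return all problems found; an empty list means the config is valid."""
    errors: List[str] = []
    seen_names = set()
    seen_ports: Dict[Tuple[str, int], str] = {}

    for node in nodes:
        name = node["name"]
        if name in seen_names:
            errors.append(f"duplicate node name '{name}'")
        seen_names.add(name)

        port = node.get("port")
        if port is not None:
            key = (node["instance"], port)
            if key in seen_ports:
                errors.append(
                    f"port {port} on instance '{node['instance']}' used by "
                    f"'{seen_ports[key]}' and '{name}'")
            else:
                seen_ports[key] = name

    # No early return above: every node is checked even after a failure.
    return errors
```

Reporting every problem in one response saves the client repeated
upload-and-fix roundtrips.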
The numbering process now accounts for the possibility of colliding with the
agent port, as well as with ports explicitly assigned in the configuration. It
also avoids nondeterminism that could result from traversal of sets.
It helps during testing to be able to control whether the Supervisor process
also routes node output to the console, in addition to writing to output
files. Since the Supervisor runs as the main process in Docker containers, its
output becomes visible in "docker logs" that way, simplifying diagnostics.
When the controller receives a configuration with no instances (and thus no
nodes), it needs no roundtrip to agents and can send the response right away.
This makes agents handle log archival automatically. By default, they invoke
zeek-archiver once every log rotation interval to archive rotated files from the
log-queue spool directory into the installation's log directory. The user can
disable the feature, customize the command to invoke, and adjust the rotation
interval.
Up to now, agents and controllers listened locally only, and the Supervisor
(which listens when we run an agent) listened globally. It's now the other way
around: controllers and agents listen globally and the Supervisor, when
listening, does so locally.
This enables the controller to assign listening ports to managers, loggers, and
proxies. (We don't currently make the workers listen.) The feature is controlled
by the Management::Controller::auto_assign_ports flag. When enabled (the
default), enumeration starts from Management::Controller::auto_assign_start_port,
beginning with the manager, then the logger(s), then the proxy(s). When the feature
is disabled and nodes that require a port lack it, the controller rejects the
configuration.
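
The enumeration can be sketched in Python as below. This is a model under
stated assumptions: the node dictionaries, the function name, and the choice to
sort by node name within a role are hypothetical. The sketch also skips ports
that are reserved or explicitly assigned, which the numbering process accounts
for in a later change.

```python
from typing import Iterable, List

# Roles that need a listening port, in enumeration order. Workers do not listen.
ROLE_ORDER = {"manager": 0, "logger": 1, "proxy": 2}


def assign_ports(nodes: List[dict], start_port: int,
                 reserved: Iterable[int] = (),
                 auto_assign: bool = True) -> List[str]:
    """Fill in missing ports in place; return a list of errors, if any."""
    needing = [n for n in nodes
               if n["role"] in ROLE_ORDER and n.get("port") is None]

    if not auto_assign:
        # With the feature disabled, a missing port rejects the configuration.
        return [f"node '{n['name']}' requires a port" for n in needing]

    # Ports that must not be handed out: reserved ones (such as the agent's)
    # and any explicitly assigned in the configuration.
    taken = set(reserved) | {n["port"] for n in nodes
                             if n.get("port") is not None}

    port = start_port
    # Sorting a list, instead of iterating a set, keeps the result deterministic.
    for node in sorted(needing,
                       key=lambda n: (ROLE_ORDER[n["role"]], n["name"])):
        while port in taken:
            port += 1
        node["port"] = port
        taken.add(port)
    return []
```

For example, with a start port of 2200 and port 2201 reserved, a manager, two
loggers, and a proxy would receive 2200, 2202, 2203, and 2204 in that order,
while workers stay without a port.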
The get-nodes command also benefits from showing the state on connected agents
more broadly (as opposed to just the one for the current configuration).
Also a bugfix: ensure we use an agent's IP address as seen by the
controller. This avoids reporting "0.0.0.0" in some cases.
This response so far contained only the connected instances that are relevant to
the current configuration, but this isn't very helpful when troubleshooting
instance connectivity. It now reports all currently connected instances, with
network addresses & ports as known to Broker.
This swaps the host event argument for the Broker ID. The latter is more useful,
since the sending agent doesn't necessarily know its IP address as visible to
the controller, and the controller can pull up the full Broker context via the
ID.
It also adds an explicit argument to the event to indicate whether the agent
connected to the controller or vice versa. This simplifies the controller's
internal logic.
Also minor tweaks to logging to show Broker IDs.
* zeek-as-org/as-org:
Mark lookup_asn() BIF as deprecated in v6.1
Define geo_autonomous_system record type
Add lookup_autonomous_system() BIF that returns AS number and org
* topic/christian/gh-2134-fix-intel-test-races:
Expand scripts.base.frameworks.intel.cluster-transparency test
Fix races in scripts.base.frameworks.intel.cluster-transparency-with-proxy test
Add Intel::send_store_on_node_up boolean to control min_data_store delivery
This exposes Broker's new WebSocket support in Zeek. To enable it,
call `Broker::listen_websocket()`. Zeek will then start listening on
port 9997 for incoming WebSocket connections.
See the Broker documentation for a description of the message format
expected over these WebSocket connections.