It is not safe to use the same socket from different threads, but the
current code used the xsub socket directly from the main thread (to set up
subscriptions) and from the internal thread for polling and reading.
Leverage the PAIR socket already used to forward publish operations to the
internal thread for subscribe and unsubscribe operations as well.
The failure mode was a bit annoying: closing the context would hang
indefinitely in zmq_ctx_term().
This is a cluster backend implementation using a central XPUB/XSUB proxy
that by default runs on the manager node. Logging is implemented using
PUSH/PULL sockets between the logger and the other nodes, rather than going
through XPUB/XSUB.
The test-all-policy-cluster baseline changed: previously, Broker::peer()
would be called from setup-connections.zeek, keeping the IO loop
alive. With the ZeroMQ backend, the IO loop is only alive once
Cluster::init() is called, and that call no longer happens here.
- This gives the cluster controller and agent the common name "Management
framework" and moves the sources from "policy/frameworks/cluster" to
"policy/frameworks/management". This avoids ambiguity with the existing
cluster framework.
- It renames the "ClusterController" and "ClusterAgent" script modules to
"Management::Controller" and "Management::Agent", respectively. This allows us
to anchor tooling common to both controller and agent at the "Management"
module.
- It moves common configuration settings, logging, requests, types, and
utilities to the common "Management" module.
- It removes the explicit "::Types" submodule (so a request/response result is
now a Management::Result, not a Management::Types::Result), which makes
typenames more readable.
- It updates tests that depend on module naming and the full set of scripts.
Mostly trivial changes, except for one aspect: if a module exports a record type
and that record bears Zeekygen comments, then redefs that add to the record from
another module cannot be private to that module. Zeekygen will complain with
"unknown target" errors, even when such redefs have Zeekygen comments. So this
commit also adds two export blocks that aren't technically required at this point.
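As a minimal sketch of the pattern (module and field names made up for
illustration):

    module A;

    export {
        ## A record type documented via Zeekygen comments.
        type Info: record {
            ## The original field.
            msg: string &default="";
        };
    }

    module B;

    # Without this export block, Zeekygen reports "unknown target" errors
    # for the redef below, even though the redef carries its own comments.
    export {
        redef record A::Info += {
            ## A field contributed by module B.
            num: count &default=0;
        };
    }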
This allows querying the status of Zeek nodes currently running in a cluster.
The controller relays the request to all instances and accumulates their
responses.
The response back to the client contains one Result record per instance
response, each carrying a ClusterController::Types::NodeState vector in its
$data member to convey the state of each node at that instance.
The NodeState record tracks the name of the node, its role in the controller
framework (if any), its role in the data cluster (if any), as well as its PID
and listening port, if any.
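In rough shape (a sketch; the actual field names and types may differ):

    type NodeState: record {
        ## Name of the node, e.g. "worker-01".
        node: string;
        ## The node's role in the controller framework, if any.
        mgmt_role: ClusterController::Types::Role &optional;
        ## The node's role in the data cluster, if any.
        cluster_role: Supervisor::ClusterRole &optional;
        ## Process ID, if known.
        pid: int &optional;
        ## Listening port, if any.
        p: port &optional;
    };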
This makes cluster node listening ports &optional, and maps absent values to
0/unknown, the value the cluster framework currently uses to indicate that
listening isn't desired.
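Reusing the NodeState sketch above, the translation amounts to something like:

    # Map an absent listening port to the framework's "not listening" value.
    function render_port(ns: NodeState): port
        {
        return ns?$p ? ns$p : 0/unknown;
        }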
It's easier to track outstanding controller/agent requests via a simple set of
pending agent names, and we can remove all of the result aggregation logic since
we can simply re-use the results reported by the agents.
This can serve as a template for request-response patterns where a client's
request triggers a request to all agents, followed by a response to the client
once all agents have responded. Once we have a few more of those, it'll become
clearer how to abstract this further.
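A sketch of the pattern, with hypothetical type and function names:

    type Result: record {
        agent: string;
        success: bool &default=T;
    };

    type Request: record {
        id: string;
        ## Agents we still expect to respond.
        pending_agents: set[string];
        ## Results reported by agents, reused verbatim for the client response.
        results: vector of Result &default=vector();
    };

    global g_requests: table[string] of Request;

    function handle_agent_response(reqid: string, res: Result)
        {
        if ( reqid !in g_requests )
            return;

        local req = g_requests[reqid];
        delete req$pending_agents[res$agent];
        req$results += res;

        # Once the pending set drains, respond to the client with the
        # accumulated agent results, without further aggregation.
        if ( |req$pending_agents| == 0 )
            delete g_requests[reqid];
        }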
This changes the agent-controller communication to remove the need for ongoing
pinging of the controller by agents not actively "in service". Instead, agents
now send the notify_agent_hello event to the controller to report their
identity. The controller puts them into service via an agent_welcome_request/
response pair, and takes them out of service via agent_standby_request/response.
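The events involved look roughly as follows (signatures illustrative; Result
stands in for the controller's result type):

    global notify_agent_hello: event(instance: string, host: addr, api_version: count);
    global agent_welcome_request: event(reqid: string);
    global agent_welcome_response: event(reqid: string, result: Result);
    global agent_standby_request: event(reqid: string);
    global agent_standby_response: event(reqid: string, result: Result);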
This removes the on_change handler from the set of agents that is ready for
service, because not every change to this set is now a suitable time to
potentially send out the configuration. We now invoke this check explicitly in
the two situations where it's warranted: when an agent reports ready for service,
and when we've received a new configuration.
This has no practical relevance other than allowing the two to be loaded at the
same time, which some of our (cluster-unrelated) tests require. Absence of
namespacing would trigger symbol clashes at this point.
This now features support for the test_timeout_request/response events, as
supported by the client, and also adds a timeout event for set_configuration, in
case agents do not respond in time.
Includes corresponding zeek-client submodule bump.
This establishes a timeout controlled via ClusterController::request_timeout,
triggering a ClusterController::Request::request_expired event whenever the
timeout fires before a request's state has been finalized by its normal
processing.
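Conceptually, request creation schedules the expiration along these lines (a
sketch; the Request record and g_requests table are assumed as sketched
earlier):

    function create(reqid: string): Request
        {
        local req = Request($id=reqid);
        g_requests[reqid] = req;

        # If normal processing hasn't finalized the request by the time the
        # timer fires, the request_expired handler performs the cleanup.
        schedule ClusterController::request_timeout {
            ClusterController::Request::request_expired(req)
            };

        return req;
        }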
This changes the basic agent-management model to one in which the configurations
received from the client define not just the data cluster, but also the set of
acceptable instances. Unless connectivity already exists, the controller will
establish peerings with new agents that listen, or wait for ones that connect to
the controller to check in.
Once all required agents are available, the controller triggers the new
notify_agents_ready event, an agent/controller-level "cluster-is-ready"
event. The controller also uses this event to submit a pending config update to
the now-ready instances.
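The readiness check amounts to something like this (a sketch; the state tables
are hypothetical, notify_agents_ready per the above):

    global g_instances: set[string];        # Instances the configuration requires.
    global g_instances_known: set[string];  # Instances whose agents have checked in.

    global notify_agents_ready: event(instances: set[string]);

    function check_instances_ready()
        {
        for ( inst in g_instances )
            {
            if ( inst !in g_instances_known )
                return;
            }

        # Every required agent is available; a handler for this event can
        # now submit any pending configuration update.
        event notify_agents_ready(g_instances_known);
        }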
Prior to this, static configuration needed to be in place to configure the
controller/agent layout. The configuration update can now include new instances
that the controller will connect to, assuming they're instances with a listening
agent.
Request records for configuration updates now store the full configuration. The
ClusterController::Request module now provides a to_string() function for
rendering requests to a string.
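The rendering boils down to something like this (a sketch; it assumes the
Request record carries at least an id and a finished flag):

    function to_string(req: Request): string
        {
        # Render the request's identity and completion state.
        return fmt("[request %s %s]", req$id,
                   req$finished ? "finished" : "pending");
        }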
* topic/christian/cluster-controller:
Add a cluster controller testcase for agent-controller checkin
Add zeek-client via new submodule
Update baselines affected by cluster controller changes
Introduce cluster controller and cluster agent scripting
Establish a separate init script when using the supervisor
Add optional bare-mode boolean flag to Supervisor's node configuration
Add support for making the supervisor listen for requests
Add support for setting environment variables via supervisor
This is a preliminary implementation of a subset of the functionality set out in
our cluster controller architecture. The controller is the central management
node, existing once in any Zeek cluster. The agent is a node that runs once per
instance, where an instance will commonly be a physical machine. The agent in
turn manages the "data cluster", i.e. the traditional notion of a Zeek cluster
with manager, worker nodes, etc.
Agent and controller live in the policy folder, and are activated when loading
policy/frameworks/cluster/agent and policy/frameworks/cluster/controller,
respectively. Both run in nodes forked by the supervisor. When Zeek doesn't use
the supervisor, they do nothing. Otherwise, boot.zeek instructs the supervisor
to create the respective node, running main.zeek.
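In outline, boot.zeek's instruction to the supervisor looks like this (a
simplified sketch using the Supervisor API; node naming is illustrative):

    event zeek_init()
        {
        # Only the supervisor process creates the node; forked nodes skip this.
        if ( ! Supervisor::is_supervisor() )
            return;

        local nc = Supervisor::NodeConfig($name="controller");
        local res = Supervisor::create(nc);

        if ( res != "" )
            print fmt("failed to create controller node: %s", res);
        }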
Both controller and agent have their own config.zeek with relevant knobs. For
both, controller/types.zeek provides common data types, and controller/log.zeek
provides basic logging (without communicating with a logger node, since such a
node might not exist).
A primitive request-tracking abstraction can be found in controller/request.zeek
to track outstanding request events and their subsequent responses.
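A rough sketch of such an abstraction (function and field names illustrative):

    type Request: record {
        id: string;
        finished: bool &default=F;
    };

    global g_requests: table[string] of Request;

    # Register a new outstanding request under its ID.
    function create(reqid: string): Request
        {
        local req = Request($id=reqid);
        g_requests[reqid] = req;
        return req;
        }

    # Mark a request as completed and drop its state.
    function finish(reqid: string): bool
        {
        if ( reqid !in g_requests )
            return F;

        g_requests[reqid]$finished = T;
        delete g_requests[reqid];
        return T;
        }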