Commit graph

42 commits

Author SHA1 Message Date
Arne Welzel
073de9f5fd cluster/zeromq: Drop events when overloaded
When either the XPUB socket's hwm is reached, or the onloop queue is
full, drop the events. Users can set ths xpub_sndhwm and
onloop_queue_hwm to 0 to avoid these drops at the risk of unbounded
memory growth.
2025-07-29 11:23:53 +02:00
Arne Welzel
5dc4586b70 cluster/zeromq: Support local XPUB/XSUB hwm and buf configurability 2025-07-29 11:23:53 +02:00
Arne Welzel
26f5166d7a cluster/telemetry: Move topic_normalization redef to zeromq 2025-06-26 15:22:11 +02:00
Arne Welzel
89c0b0faf3 cluster/zeromq: Hook up and enable IPV6 by default
ZeroMQ's IPv6 support isn't enabled by default, resulting in
"No such device" errors when attempting to listen on an IPv6
address. This change adds a ipv6 option to the ZeroMQ module
and enables it by default. Further, adds a test configuring
everything to listen on IPv6 ::1 as well, and one test to provoke
the original error. This also regularizes some error messages.

The addr_to_uri() calls weren't actually needed, but they apparently do
not hurt and the result is easier on the eyes, so use them :-)
2025-06-24 17:12:45 +02:00
Arne Welzel
cf43cf1809 cluster/zeromq/connect: Make failures fatal
The cluster is borked if the initialization fails, so may as well just
completely abort Zeek at that point with a fatal error. There's no real
point in continuing to run.
2025-06-20 13:03:47 +02:00
Arne Welzel
643b926625 cluster/zeromq: Implement DoReadyToPublishCallback()
The ZeroMQ heuristic for "ready to publish" is to create an unique and
ephemeral subscription using the XSUB socket and observe it arrive on the
XPUB socket. At this point, visibility into other node's subscriptions
is provided.
2025-04-25 09:57:06 +00:00
Arne Welzel
2963c49f27 cluster/zeromq: Fix node_topic() and nodeid_topic()
Due to prefix matching, worker-1's node_topic() also matched worker-10,
worker-11, etc. Suffix the node topic with a `.`. The original implementation
came from NATS, where subjects are separated by `.`.

Adapt nodeid_topic() for consistency.
2025-03-24 18:36:26 +01:00
Arne Welzel
cc0c48423d cluster/backends/zeromq: Fix rst link in docs 2025-03-12 10:11:25 +01:00
Arne Welzel
aad512c616 cluster/zeromq: Support configuring IO threads for proxy thread 2025-03-10 17:07:30 +01:00
Arne Welzel
ba7b605a97 cluster/zeromq: Move variable lookups from DoInit() to DoInitPostScript() 2025-03-10 17:07:30 +01:00
Benjamin Bannier
5d44073b94 Bump pre-commit hooks 2025-03-04 08:14:26 +01:00
Arne Welzel
fa22f91ca4 cluster/zeromq: Fix XSUB threading issues
It is not safe to use the same socket from different threads, but the
current code used the xsub socket directly from the main thread (to setup
subscriptions) and from the internal thread for polling and reading.

Leverage the PAIR socket already in use for forwarding publish operations
to the internal thread also for subscribe and unsubscribe.

The failure mode is/was a bit annoying. Essentially, closing of the
context would hang indefinitely in zmq_ctx_term().
2025-02-05 10:39:56 +01:00
Arne Welzel
35c79ab2e3 cluster/backend/zeromq: Add ZeroMQ based cluster backend
This is a cluster backend implementation using a central XPUB/XSUB proxy
that by default runs on the manager node. Logging is implemented leveraging
PUSH/PULL sockets between logger and other nodes, rather than going
through XPUB/XSUB.

The test-all-policy-cluster baseline changed: Previously, Broker::peer()
would be called from setup-connections.zeek, causing the IO loop to be
alive. With the ZeroMQ backend, the IO loop is only alive when
Cluster::init() is called, but that doesn't happen anymore.
2024-12-10 20:33:02 +01:00
Arne Welzel
416887157c cluster_started: No Broker::auto_publish() use 2024-11-14 12:59:22 +01:00
Jan Grashoefer
0cd32ba07c Add cluster_started and node_fully_connected events. 2023-04-21 19:04:52 +02:00
Christian Kreibich
54aaf3a623 Reorg of the cluster controller to new "Management framework" layout
- This gives the cluster controller and agent the common name "Management
framework" and changes the start directory of the sources from
"policy/frameworks/cluster" to "policy/frameworks/management". This avoids
ambiguity with the existing cluster framework.

- It renames the "ClusterController" and "ClusterAgent" script modules to
"Management::Controller" and "Management::Agent", respectively. This allows us
to anchor tooling common to both controller and agent at the "Management"
module.

- It moves common configuration settings, logging, requests, types, and
utilities to the common "Management" module.

- It removes the explicit "::Types" submodule (so a request/response result is
now a Management::Result, not a Management::Types::Result), which makes
typenames more readable.

- It updates tests that depend on module naming and full set of scripts.
2022-02-09 18:09:42 -08:00
Christian Kreibich
3e0a86e3b3 Updates to the cluster controller scripts to fix the docs build
Mostly trivial changes, except for one aspect: if a module exports a record type
and that record bears Zeekygen comments, then redefs that add to the record in
another module cannot be private to that module. Zeekygen will complain with
"unknown target" errors, even when such redefs have Zeekygen comments. So this
commits also adds two export-blocks that aren't technically required at this point.
2022-02-09 12:28:47 -08:00
Christian Kreibich
7db8634c8b Add ClusterController::API::get_nodes_request/response event pair
This allows querying the status of Zeek nodes currently running in a cluster.
The controller relays the request to all instances and accumulates their
responses.

The response back to the client contains one Result record per instance
response, each of which carrying a ClusterController::Types::NodeState vector in
its $data member to convey the state of each node at that instance.

The NodeState record tracks the name of the node, its role in the controller (if
any), its role in the data cluster (if any), as well as PID and listening port,
if any.
2022-02-02 22:59:22 -08:00
Christian Kreibich
791e5545b1 Support optional listening ports for cluster nodes
This makes cluster node listening ports &optional, and maps absent values to
0/unknown, the value the cluster framework currently uses to indicate that
listening isn't desired.
2022-02-02 16:10:46 -08:00
Christian Kreibich
c79c2a2b00 Don't auto-publish Supervisor response events in the cluster agent
This was an oversight: we auto-publish the agent's requests _to_ the supervisor,
not the latter's responses.
2022-01-31 18:42:53 -08:00
Christian Kreibich
ad4744eba6 Make members of the ClusterController::Types::State enum all-caps
A consistency tweak since we mostly use all-caps elsewhere as well.
2022-01-31 18:42:03 -08:00
Christian Kreibich
3da95de5b8 Be more conservative with triggering request timeout events 2022-01-31 18:38:40 -08:00
Christian Kreibich
4b5584a85d Move redefs of ClusterController::Request::Request to their places of use
The Request module does not need to know about additional state tucked onto it
by its users.
2022-01-31 18:29:58 -08:00
Christian Kreibich
f9ac03d6e3 Simplify ClusterController::API::set_configuration_request/response
It's easier to track outstanding controller/agent requests via a simple set of
pending agent names, and we can remove all of the result aggregation logic since
we can simply re-use the results reported by the agents.

This can serve as a template for request-response patterns where a client's
request triggers a request to all agents, followed by a response to the client
once all agents have responded. Once we have a few more of those, it'll become
clearer how to abstract this further.
2022-01-31 17:45:14 -08:00
Christian Kreibich
5a72864ae8 Docs/comment pass over the cluster controller framework 2022-01-03 00:31:03 -08:00
Christian Kreibich
ac40d5c5b2 Remove periodic pinging of controller by agents
This changes the agent-controller communication to remove the need for ongoing
pinging of the controller by agents not actively "in service". Instead, agents
now use the notify_agent_hello event to the controller to report only their
identity. The controller puts them into service via an agent_welcome_request/
response pair, and takes them out of service via agent_standby_request/response.

This removes the on_change handler from the set of agents that is ready for
service, because not every change to this set is now a suitable time to
potentially send out the configuration. We now invoke this check explicitly in
the two situations where it's warranted: when a agent reports ready for service,
and when we've received a new configuration.
2021-12-21 16:44:04 -08:00
Christian Kreibich
8463f14a52 Move cluster controller/agent main.zeek scripts into their own modules
This has no practical relevance other than allowing the two to be loaded a the
same time, which some of our (cluster-unrelated) tests require. Absence of
namespacing would trigger symbol clashes at this point.
2021-12-21 14:52:29 -08:00
Christian Kreibich
30db1b3bfb First uses of request state timeouts
This now features support for the test_timeout_request/response events, as
supported by the client, and also adds a timeout event for set_configuration, in
case agents do not respond in time.

Includes corresponding zeek-client submodule bump.
2021-12-21 14:52:29 -08:00
Christian Kreibich
1e823f931e Add expiration mechanism to client request state.
This establishes a timeout controlled via ClusterController::request_timeout,
triggering a ClusterController::Request::request_expired event whenever a
timeout rolls around before request state has been finalized by a request's
normal processing.
2021-12-21 14:52:29 -08:00
Christian Kreibich
fc9679e510 Move get_instances_response event to using a Result record
Includes corresponding zeek-client bump.
2021-12-21 14:52:29 -08:00
Christian Kreibich
1461d56340 Track successful config deployment in cluster controller
This allows us to start returning deployed configurations to the client upon
request.
2021-12-21 14:52:29 -08:00
Christian Kreibich
09d9be3433 Add ClusterController::API::notify_agents_ready event
This changes the basic agent-management model to one in which the configurations
received from the client define not just the data cluster, but also set the set
of acceptable instances. Unless connectivity already exists, the controller will
establish peerings with new agents that listen, or wait for ones that connect to
the controller to check in.

Once all required agents are available, the controller triggers the new
notify_agents_ready event, an agent/controller-level "cluster-is-ready"
event. The controller also uses this event to submit a pending config update to
the now-ready instances.
2021-12-21 14:52:29 -08:00
Christian Kreibich
b57be021b7 Make all globals start with a "g_" prefix
This makes it easier to spot them in code, and is shorter than using explicit
namespacing.
2021-12-21 14:52:28 -08:00
Christian Kreibich
14a8c979c1 Add missing debug() log function to log module's API 2021-12-21 14:52:28 -08:00
Christian Kreibich
a56ee6b9a6 Add separate utility module for controller and agent
We can figure out later whether & where to re-settle helper functions that end
up in there.
2021-12-21 14:52:28 -08:00
Christian Kreibich
ddbd83fee4 Support for dropping instances no longer needed after config updates
This sends such expired instances empty configurations that will cause them to
shut down their remaining data cluster nodes.
2021-12-21 14:52:28 -08:00
Christian Kreibich
8eee5bb3d2 Additional infrastructure for printing types
Also added convenience for instantiating (dummy) configuration records.
2021-12-21 14:52:28 -08:00
Christian Kreibich
5cb44c2f69 Support on-demand peering with agents when receiving new cluster configuration
Prior to this, static configuration needed to be in place to configure the
controller/agent layout. The configuration update can now include new instances
that the controller will connect to, assuming they're instances with a listening
agent.
2021-12-21 14:52:28 -08:00
Christian Kreibich
484f79f599 Expand requests support in the controller
Request records for configuration updates now store the full configuration. The
ClusterController::Request module now provies a to_string() function for
rendering requests to a string.
2021-12-21 14:52:28 -08:00
Christian Kreibich
aceb05099a Whitespace tweaks in cluster controller and agent scripts 2021-12-21 14:52:28 -08:00
Christian Kreibich
8db985ea78 Merge branch 'topic/christian/cluster-controller'
* topic/christian/cluster-controller:
  Add a cluster controller testcase for agent-controller checkin
  Add zeek-client via new submodule
  Update baselines affected by cluster controller changes
  Introduce cluster controller and cluster agent scripting
  Establish a separate init script when using the supervisor
  Add optional bare-mode boolean flag to Supervisor's node configuration
  Add support for making the supervisor listen for requests
  Add support for setting environment variables via supervisor
2021-07-08 16:51:11 -07:00
Christian Kreibich
c744702f94 Introduce cluster controller and cluster agent scripting
This is a preliminary implementation of a subset of the functionality set out in
our cluster controller architecture. The controller is the central management
node, existing once in any Zeek cluster. The agent is a node that runs once per
instance, where an instance will commonly be a physical machine. The agent in
turn manages the "data cluster", i.e. the traditional notion of a Zeek cluster
with manager, worker nodes, etc.

Agent and controller live in the policy folder, and are activated when loading
policy/frameworks/cluster/agent and policy/frameworks/cluster/controller,
respectively. Both run in nodes forked by the supervisor. When Zeek doesn't use
the supervisor, they do nothing. Otherwise, boot.zeek instructs the supervisor
to create the respective node, running main.zeek.

Both controller and agent have their own config.zeek with relevant knobs. For
both, controller/types.zeek provides common data types, and controller/log.zeek
provides basic logging (without logger communication -- no such node might
exist).

A primitive request-tracking abstraction can be found in controller/request.zeek
to track outstanding request events and their subsequent responses.
2021-07-08 13:12:53 -07:00