Commit graph

149 commits

Author SHA1 Message Date
Arne Welzel
2f15f5ce6a cluster: Introduce pubsub.zeek and types.zeek
Allow access to the Cluster's subscribe(), unsubscribe(), publish(),
publish_hrw() and publish_rr() methods by loading only the
base/frameworks/cluster/pubsub, rather than everything that
__load__.zeek or also main.zeek pulls in.

This can be used by other scripts to use these functions without
relying or expecting the rest of the cluster framework to be loaded
already. Concretely, this is needed to move the Supervisor framework
from Broker to Cluster.
2025-09-25 19:29:55 +02:00
Tim Wojtulewicz
d95affde4d Remove deprecations tagged for v8.1 2025-08-12 10:19:03 -07:00
Arne Welzel
1a87ebab72 cluster: Add on_subscribe() and on_unsubscribe() hooks
Closes #4176
2025-08-01 14:06:19 +02:00
Benjamin Bannier
d5fd29edcd Prefer explicit construction to coercion in record initialization
While we support initializing records via coercion from an expression
list, e.g.,

    local x: X = [$x1=1, $x2=2];

this can sometimes obscure the code to readers, e.g., when assigning to
value declared and typed elsewhere. The language runtime has a similar
overhead since instead of just constructing a known type it needs to
check at runtime that the coercion from the expression list is valid;
this can be slower than just writing the readible code in the first
place, see #4559.

With this patch we use explicit construction, e.g.,

    local x = X($x1=1, $x2=2);
2025-07-11 16:28:37 -07:00
Arne Welzel
1d931b5a2f cluster/WebSocket: Include X-Application-Name in cluster.log
A bit ad-hoc formatting for the log, but that's mostly because cluster.log
only has message field and I don't think having a dedicated application_name
column is worth it. That could also be added by custom scripts if it's really
wanted for a given deployment.
2025-06-30 17:55:24 +02:00
Arne Welzel
26f5166d7a cluster/telemetry: Move topic_normalization redef to zeromq 2025-06-26 15:22:11 +02:00
Arne Welzel
4c34274a6c cluster: Introduce telemetry component 2025-06-25 16:59:49 +02:00
Arne Welzel
544d571089 cluster/websocket: Deprecate $listen_host, introduce $listen_addr
This only changes the script-layer API, but keeps the std::string host
in the C++ layer's ServerOptions. Mostly because the ixwebsocket library
takes host as std::string. Also, maybe at  some point we'd want to
support something scheme-based like unix:///var/run/zeek.sock and placing
that in a string could not be totally wrong.

Add tests for IPV6, too.
2025-05-30 11:02:41 +02:00
Arne Welzel
a61aff010f cluster/websocket: Propagate code and reason to websocket_client_lost()
This allows to get visibility into the reason why ixwebsocket or the
client decided to disconnect.

Closed #4440
2025-05-13 18:26:03 +02:00
Arne Welzel
aaddeb19ad cluster/websocket: Support configurable ping interval
Primarily for testing purposes and maybe the hard-coded 5 seconds is too
aggressive for some deployments, so makes sense for it to be
configurable.
2025-05-13 18:26:03 +02:00
Christian Kreibich
738ce1c235 Bugfix: accurately track Broker buffer overflows w/ multiple peerings
When a node restarts or a peering between two nodes starts over for other
reasons, the internal tracking in the Broker manager resets its state (since
it's per-peering), and thus the message overflow counter. The script layer was
unaware of this, and threw errors when trying to reset the corresponding counter
metric down to zero at sync time.

We now track past buffer overflows via a separate epoch table, using Broker peer
ID comparisons to identify new peerings, and set the counter to the sum of past
and current overflows.

I considered just making this a gauge, but it seems more helpful to be able to
look at a counter to see whether any messages have ever been dropped over the
lifetime of the node process.

As an aside, this now also avoids repeatedly creating the labels vector,
re-using the same one for each metric.

Thanks to @pbcullen for identifying this one!
2025-05-07 17:27:38 -07:00
Christian Kreibich
68fadd0464 Lower listen/connect retry intervals in Broker and the cluster framework to 1sec
The former defaults (30sec, 1min) can slow down cluster startup and recovery
considerably, and other systems have more aggressive intervals still.
2025-04-25 10:22:35 -07:00
Christian Kreibich
88a0cda8ca Add cluster framework telemetry for Broker's send-buffer use
This hooks into Telemetry::sync() to update Broker-level metrics tracking the
peerings' send buffer state. We do this in the cluster framework so we can label
the resulting metrics with Zeek cluster node names, not Broker's endpoint IDs.
2025-04-25 09:14:33 -07:00
Arne Welzel
011029addc cluster/websocket: Make websocket dispatcher queue size configurable
Limit the number WebSocket events queued from external clients to
dispatcher instances to produce back pressure to the clients if
Zeek's IO loop is overloaded.
2025-04-23 14:27:43 +02:00
Arne Welzel
3d3b7a0759 cluster/Backend: Add ProcessError()
Allow backends to pass errors to a strategy. Locally, these raise
Cluster::Backend::error() events that are logged to the reporter
as errors.
2025-04-23 14:19:08 +02:00
Arne Welzel
6032741868 cluster/websocket: Implement WebSocket server 2025-03-10 17:07:30 +01:00
Arne Welzel
769044e8e1 cluster/Backend: Pass node_id via Init() 2025-02-05 10:39:56 +01:00
Tim Wojtulewicz
25554fa668 Merge remote-tracking branch 'origin/topic/awelzel/fix-cluster-publish-any'
* origin/topic/awelzel/fix-cluster-publish-any:
  cluster/Backend: Handle unspecified table/set
  cluster: Fix Cluster::publish() of Broker::Data
  cluster: Be noisy when attempting to connect to an unknown node
2024-12-12 13:17:08 -07:00
Arne Welzel
271fc15041 cluster: Be noisy when attempting to connect to an unknown node
Mostly due to spending too much time wondering why nodes didn't connect
when there was a mismatch between "manager" and "manager-1" in the
cluster layout. Remove manager from test-all-policy-cluster test to
avoid connection attempts in this test.
2024-12-12 13:01:04 +01:00
Justin Azoff
10438408a5 Pre-compute the node topics for all pool entries.
A zeek script profile showed a small percentage of time spent in
Cluster::node_topic, but this never changes and can be cached.
2024-12-11 15:57:01 -05:00
Arne Welzel
a2249f7ecb cluster: Add Cluster::node_id(), allow redef of node_topic(), nodeid_topic()
This provides a way for non-broker cluster backends to override a
node's identifier and its own topics that it listens on by default.
2024-12-10 20:33:02 +01:00
Christian Kreibich
ead6134501 Add backpressure disconnect notification to cluster.log and via telemetry
This adds a Broker-specific script to the cluster framework, loaded only when
Zeek is running in cluster mode. It adds logging in cluster.log as well as
telemetry via a metrics counter for Broker-observed backpressure disconnects.

The new zeek_broker_backpressure_disconnects counter, labeled by the neighboring
peer that the reporting node has determined to be unresponsive, counts the
number of unpeerings for this reason.

Here the node "worker" has observed node "proxy" falling behind once:

# HELP zeek_broker_backpressure_disconnects_total Number of Broker peering drops due to a neighbor falling too far behind in message I/O
# TYPE zeek_broker_backpressure_disconnects_total counter
zeek_broker_backpressure_disconnects_total{endpoint="worker",peer="proxy"} 1

Includes small btest baseline update to reflect @load of a new script.
2024-12-06 15:18:05 -08:00
Christian Kreibich
46a11ec37d Add Cluster::nodeid_to_node() helper function
This translates backend-specific node identifiers (like Broker IDs) to
cluster nodes and their names, if available.
2024-12-06 15:18:05 -08:00
Christian Kreibich
e81856a4af No need to namespace Cluster:: functions in their own namespace 2024-12-06 15:18:05 -08:00
Arne Welzel
fc12be1f17 cluster/setup-connections: Switch to Cluster::subscribe(), short-circuit broker
For the time being, this is easiest, otherwise we'd need to
conditionally load a broker-specific policy script based on
Cluster::backend being set.
2024-11-26 12:58:23 +01:00
Arne Welzel
ef04a199c8 cluster: Add Cluster scoped bifs
... and a broker based test using Cluster::publish() and
Cluster::subscribe().
2024-11-26 12:58:23 +01:00
Arne Welzel
fb23a06f6f cluster/Backend: Interface for cluster backends 2024-11-22 10:43:50 +01:00
Arne Welzel
6bb7b9d726 scripts/base/cluster: Move active node management into node_down()
With the idea of an alternative cluster backend, we should
not maintain Cluster state within low-level Broker events.
2024-09-27 15:32:09 +02:00
Tim Wojtulewicz
4e9d843cec Remove deprecated Cluster::Node::interface field 2024-08-07 11:58:22 -07:00
Tim Wojtulewicz
a716903f3a Remove deprecated time machine settings 2024-08-07 11:58:21 -07:00
Christian Kreibich
fa6361af56 Management framework: propagate metrics port from agent
This propagates the metrics port from the node config passed through the
supervisor all the way into the script layer.
2024-07-08 23:05:24 -07:00
Christian Kreibich
737b1a2013 Remove the Supervisor's internal ClusterEndpoint struct.
This eliminates one place in which we currently need to mirror changes to the
script-land Cluster::Node record. Instead of keeping an exact in-core equivalent, the
Supervisor now treats the data structure as opaque, and stores the whole cluster
table as a JSON string.

We may replace the script-layer Supervisor::ClusterEndpoint in the future, using
Cluster::Node directly. But that's a more invasive change that will affect how
people invoke Supervisor::create() and similars.

Relying on JSON for serialization has the side-effect of removing the
Supervisor's earlier quirk of using 0/tcp, not 0/unknown, to indicate unused
ports in the Supervisor::ClusterEndpoint record.
2024-07-02 14:52:17 -07:00
Christian Kreibich
a98ec6b08b Provide a script-layer equivalent to Supervisor::__init_cluster().
If the script layer is able to access the current node's config via
Supervisor::node(), it can handle populating Cluster::nodes. That code
is much more straightforward than an equivalent in-core implementation
(especially with the upcoming change to the cluster table's implementation).
This introduces base/frameworks/cluster/supervisor.zeek and
Cluster::Supervisor::__init_cluster_nodes() for that purpose.

The @load of the Supervisor API in cluster/main.zeek isn't technically
necessary since we already load it explicitly even in init-bare.zeek,
but being explicit seems better.
2024-07-02 14:52:13 -07:00
Tim Wojtulewicz
e93e4cc26d Add a services.json endpoint for Prometheus service discovery 2024-05-31 13:30:31 -07:00
Christian Kreibich
873d734c79 Do not default PoolSpec topics to the empty string.
Similar to `node_topic`, we already spell out a topic in the existing use and
there's no obviously meaningful default value.
2024-02-05 18:03:08 -08:00
Christian Kreibich
8437012346 Do not default to proxy nodes in Broker::PoolSpec
This requires pool creation to spell out a spec explicitly, which the only code
using these types already does. There's no reason for pools to automatically
refer to proxies.
2024-02-05 17:51:11 -08:00
Arne Welzel
cd24acdfc8 time machine: Mark leftovers for removal in v7.1
I suspect we could just drop these directly, but lets follow the
deprecation cycle.
2023-11-07 16:06:16 +01:00
Arne Welzel
d88b147ac9 cluster: Deprecate the Cluster::Node$interface field
This field isn't required by a worker and it's certainly not used by a
worker to listen on that specific interface. It also isn't required to
be set consistently and its use in-tree limited to the old load-balancing
script.

There's a bif called packet_source() which on a worker will provide
information about the actually used packet source.

Relates to zeek/zeek#2877.
2023-11-07 16:06:16 +01:00
Tim Wojtulewicz
4229af6820 Remove deprecations tagged for v6.1 2023-06-14 10:07:22 -07:00
Arne Welzel
7a043e5e8f all: Fix typos identified by typos pre-commit hook 2023-06-13 17:57:32 +02:00
Arne Welzel
f53aefdd5b Merge branch 'topic/awelzel/3112-log-suffix-left-over-log-rotation'
* topic/awelzel/3112-log-suffix-left-over-log-rotation:
  cluster/logger: Fix leftover-log-rotation in multi-logger setups
  cluster/logger: Fix global var reference
2023-06-13 17:33:56 +02:00
Arne Welzel
6d1991fb6a cluster/logger: Fix leftover-log-rotation in multi-logger setups
Populating log_metadata during zeek_init() is too late for the
leftover-log-rotation functionality, so do it at script parse time.

Also, prepend archiver_ to the log_metadata table and encoding function
due to being in the global namespace and to align with the
archiver_rotation_format_func. This hasn't been in a released
version yet, so fine to rename still.

Closes #3112
2023-06-13 10:47:20 +02:00
Arne Welzel
27432c457c cluster/logger: Fix global var reference 2023-06-13 10:47:20 +02:00
Arne Welzel
eef7acc1e9 cluster/main: Remove extra @if ( Cluster::is_enabled() )
These have been discussed in the context of "@if &analyze" [1] and
am much in favor for not disabling/removing ~100 lines (more than
fits on a single terminal) out from the middle of a file. There's no
performance impact for having these handlers enabled unconditionally.
Also, any future work on "@if &analyze" will look at them again which
we could also skip.

This also reverts back to the behavior where the Cluster::LOG stream
is created even in non cluster setups like in previous Zeek versions.
As long as no one writes to it there's essentially no difference. If
someone does write to Cluster::LOG, I'd argue not black holing these
messages is better. Schema generators using Log::active_streams will
continue to discover Cluster::LOG even if they run in non-cluster
mode.

https://github.com/zeek/zeek/pull/3062#discussion_r1200498905
2023-06-06 15:20:10 +02:00
Tim Wojtulewicz
5a3abbe364 Revert "Merge remote-tracking branch 'origin/topic/vern/at-if-analyze'"
This reverts commit 4e797ddbbc, reversing
changes made to 3ac28ba5a2.
2023-05-31 09:20:33 +02:00
Vern Paxson
890010915a change base scripts to use run-time if's or @if ... &analyze 2023-05-19 13:26:27 -07:00
Arne Welzel
c2a07476cc Merge remote-tracking branch 'jgras/topic/jgras/cluster-active-node-count-fix'
* jgras/topic/jgras/cluster-active-node-count-fix:
  Fix get_active_node_count for node types not present.

Changed over to explicit existence check instead to avoid the set()
creation upon missed lookups.
2023-05-17 10:37:00 +02:00
Jan Grashoefer
e4f654c14c Fix get_active_node_count for node types not present. 2023-05-16 17:47:50 +02:00
Arne Welzel
9330a74fe1 Merge remote-tracking branch 'origin/topic/awelzel/zeek-archiver-multiple-loggers'
* origin/topic/awelzel/zeek-archiver-multiple-loggers:
  cluster/supervisor: Multi-logger awareness
  Bump zeek-archiver submodule
2023-05-09 15:20:53 +02:00
Arne Welzel
c813872915 cluster/supervisor: Multi-logger awareness
When multiple loggers are configured in a Supervisor controlled cluster
configuration, encode extra information into the rotated filename to
identify which logger produced the log.

This is similar to the approach taken for ZeekControl, re-using the
log_suffix terminology, but as there's only a single zeek-archiver
process and no postprocessors and no other side-channel for additional
information, we encode extra metadata into the filename. zeek-archiver
is extended to recognize the special metadata part of the filename.

This also solves the issue that multiple loggers in a supervisor setup
overwrite each others log files within a single log-queue directory.
2023-05-05 12:27:25 +02:00