Allow access to the Cluster's subscribe(), unsubscribe(), publish(),
publish_hrw() and publish_rr() methods by loading only the
base/frameworks/cluster/pubsub, rather than everything that
__load__.zeek or main.zeek pulls in.
This allows other scripts to use these functions without relying on,
or expecting, the rest of the cluster framework to be loaded
already. Concretely, this is needed to move the Supervisor framework
from Broker to Cluster.
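For illustration, a script that only needs publish/subscribe could look
roughly like this (the topic and event are made up for the example):

@load base/frameworks/cluster/pubsub

# Example event, not part of the framework.
global example_ping: event(n: count);

event zeek_init()
    {
    # Only the pub/sub functions are pulled in; the rest of the
    # cluster framework may or may not be loaded at this point.
    Cluster::subscribe("example/topic");
    Cluster::publish("example/topic", example_ping, 42);
    }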
While we support initializing records via coercion from an expression
list, e.g.,
local x: X = [$x1=1, $x2=2];
this can sometimes obscure the code to readers, e.g., when assigning to a
value declared and typed elsewhere. The language runtime also incurs
overhead, since instead of just constructing a known type it needs to
check at runtime that the coercion from the expression list is valid;
this can be slower than just writing the more readable code in the first
place, see #4559.
With this patch we use explicit construction, e.g.,
local x = X($x1=1, $x2=2);
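For illustration, the difference is most visible when the declaration lives
elsewhere (X here is a made-up record type):

type X: record {
    x1: count;
    x2: count;
};

global x: X;   # declared and typed here ...

event zeek_init()
    {
    # ... assigned much later: the coercion hides what gets constructed,
    x = [$x1=1, $x2=2];
    # while explicit construction names the type at the assignment site.
    x = X($x1=1, $x2=2);
    }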
The formatting for the log is a bit ad hoc, but that's mostly because cluster.log
only has a message field and I don't think a dedicated application_name
column is worth it. That could also be added by custom scripts if it's really
wanted for a given deployment.
This only changes the script-layer API, but keeps the std::string host
in the C++ layer's ServerOptions. Mostly because the ixwebsocket library
takes the host as a std::string. Also, maybe at some point we'd want to
support something scheme-based like unix:///var/run/zeek.sock, and keeping
that in a string would not be totally wrong.
Add tests for IPv6, too.
When a node restarts or a peering between two nodes starts over for other
reasons, the internal tracking in the Broker manager resets its state (since
it's per-peering), and thus the message overflow counter. The script layer was
unaware of this, and threw errors when trying to reset the corresponding counter
metric down to zero at sync time.
We now track past buffer overflows via a separate epoch table, using Broker peer
ID comparisons to identify new peerings, and set the counter to the sum of past
and current overflows.
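The bookkeeping boils down to something like the following sketch (type and
function names are illustrative, not the actual implementation):

type PeerEpoch: record {
    peer_id: string;        # Broker peer ID of the current peering
    past_overflows: count;  # overflows accumulated over earlier peerings
    last_seen: count;       # last value observed for the current peering
};

# Keyed by cluster node name.
global epochs: table[string] of PeerEpoch;

# Value to report for the counter metric: past plus current overflows.
function total_overflows(node: string, peer_id: string, current: count): count
    {
    if ( node !in epochs )
        epochs[node] = PeerEpoch($peer_id=peer_id, $past_overflows=0, $last_seen=current);
    else if ( epochs[node]$peer_id != peer_id )
        {
        # New peering: Broker's per-peering state started over, so fold
        # the last value seen for the old peering into the past total.
        epochs[node]$past_overflows += epochs[node]$last_seen;
        epochs[node]$peer_id = peer_id;
        }

    epochs[node]$last_seen = current;
    return epochs[node]$past_overflows + current;
    }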
I considered just making this a gauge, but it seems more helpful to be able to
look at a counter to see whether any messages have ever been dropped over the
lifetime of the node process.
As an aside, this now also avoids repeatedly creating the labels vector
by reusing the same one for each metric.
Thanks to @pbcullen for identifying this one!
This hooks into Telemetry::sync() to update Broker-level metrics tracking the
peerings' send buffer state. We do this in the cluster framework so we can label
the resulting metrics with Zeek cluster node names, not Broker's endpoint IDs.
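In outline, the mechanism is a Telemetry::sync() hook along these lines (the
metric, its option field names, and the buffer-stats helper are placeholders,
not the actual implementation):

global send_buffer_fill = Telemetry::register_gauge_family(Telemetry::MetricOpts(
    $prefix="zeek",
    $name="broker-peer-buffer-messages",
    $unit="",
    $help_text="Messages queued in a peering's send buffer",
    $label_names=vector("peer")));

# Placeholder for however the per-peering send buffer levels are obtained,
# keyed by cluster node name rather than Broker endpoint ID.
function get_peer_buffer_levels(): table[string] of double
    {
    return table();
    }

hook Telemetry::sync()
    {
    for ( node_name, level in get_peer_buffer_levels() )
        Telemetry::gauge_family_set(send_buffer_fill, vector(node_name), level);
    }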
Limit the number of WebSocket events queued from external clients to
dispatcher instances, producing back pressure on the clients if
Zeek's IO loop is overloaded.
* origin/topic/awelzel/fix-cluster-publish-any:
cluster/Backend: Handle unspecified table/set
cluster: Fix Cluster::publish() of Broker::Data
cluster: Be noisy when attempting to connect to an unknown node
Mostly due to spending too much time wondering why nodes didn't connect
when there was a mismatch between "manager" and "manager-1" in the
cluster layout. Remove the manager from the test-all-policy-cluster test to
avoid connection attempts in this test.
This adds a Broker-specific script to the cluster framework, loaded only when
Zeek is running in cluster mode. It adds logging in cluster.log as well as
telemetry via a metrics counter for Broker-observed backpressure disconnects.
The new zeek_broker_backpressure_disconnects counter, labeled by the neighboring
peer that the reporting node has determined to be unresponsive, counts the
number of unpeerings for this reason.
Here the node "worker" has observed node "proxy" falling behind once:
# HELP zeek_broker_backpressure_disconnects_total Number of Broker peering drops due to a neighbor falling too far behind in message I/O
# TYPE zeek_broker_backpressure_disconnects_total counter
zeek_broker_backpressure_disconnects_total{endpoint="worker",peer="proxy"} 1
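Conceptually, the new script amounts to something like the following sketch
(the option field names, the message check, and using the raw endpoint ID as
the label are simplifications, not the exact implementation):

global backpressure_disconnects_cf = Telemetry::register_counter_family(
    Telemetry::MetricOpts(
        $prefix="zeek",
        $name="broker-backpressure-disconnects",
        $unit="",
        $help_text="Number of Broker peering drops due to a neighbor falling too far behind in message I/O",
        $label_names=vector("peer")));

event Broker::peer_removed(endpoint: Broker::EndpointInfo, msg: string)
    {
    # Assumption in this sketch: backpressure-caused unpeerings can be
    # recognized from the removal message.
    if ( "backpressure" !in msg )
        return;

    # The real script labels with the cluster node name; the Broker
    # endpoint ID keeps this sketch self-contained.
    Telemetry::counter_family_inc(backpressure_disconnects_cf, vector(endpoint$id));
    }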
Includes a small btest baseline update to reflect the @load of a new script.
This eliminates one place in which we currently need to mirror changes to the
script-land Cluster::Node record. Instead of keeping an exact in-core equivalent, the
Supervisor now treats the data structure as opaque, and stores the whole cluster
table as a JSON string.
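In script land this relies on nothing more exotic than JSON conversion; for
illustration:

event zeek_init()
    {
    # The result is an opaque string; nothing here needs to mirror the
    # layout of Cluster::Node.
    local serialized = to_json(Cluster::nodes);
    print serialized;
    }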
We may replace the script-layer Supervisor::ClusterEndpoint in the future, using
Cluster::Node directly. But that's a more invasive change that will affect how
people invoke Supervisor::create() and similar functions.
Relying on JSON for serialization has the side-effect of removing the
Supervisor's earlier quirk of using 0/tcp, not 0/unknown, to indicate unused
ports in the Supervisor::ClusterEndpoint record.
If the script layer is able to access the current node's config via
Supervisor::node(), it can handle populating Cluster::nodes. That code
is much more straightforward than an equivalent in-core implementation
(especially with the upcoming change to the cluster table's implementation).
This introduces base/frameworks/cluster/supervisor.zeek and
Cluster::Supervisor::__init_cluster_nodes() for that purpose.
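A condensed sketch of what that population looks like (the role mapping is
spelled out here for illustration and may differ from the actual
implementation):

module Cluster::Supervisor;

const role_map: table[Supervisor::ClusterRole] of Cluster::NodeType = {
    [Supervisor::LOGGER] = Cluster::LOGGER,
    [Supervisor::MANAGER] = Cluster::MANAGER,
    [Supervisor::PROXY] = Cluster::PROXY,
    [Supervisor::WORKER] = Cluster::WORKER,
};

function __init_cluster_nodes(): bool
    {
    local config = Supervisor::node();

    if ( |config$cluster| == 0 )
        return F;

    for ( name, ep in config$cluster )
        Cluster::nodes[name] = Cluster::Node($node_type=role_map[ep$role],
                                             $ip=ep$host, $p=ep$p);

    return T;
    }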
The @load of the Supervisor API in cluster/main.zeek isn't technically
necessary since we already load it explicitly even in init-bare.zeek,
but being explicit seems better.
This requires pool creation to spell out a spec explicitly, which the only code
using these types already does. There's no reason for pools to automatically
refer to proxies.
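For example, a custom pool aimed at proxies now names that explicitly in its
spec (the topic is chosen for illustration):

global my_pool_spec = Cluster::PoolSpec(
    $topic="zeek/cluster/pool/example",
    $node_type=Cluster::PROXY);

global my_pool: Cluster::Pool;

event zeek_init()
    {
    my_pool = Cluster::register_pool(my_pool_spec);
    }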
This field isn't required by a worker and it's certainly not used by a
worker to listen on that specific interface. It also isn't required to
be set consistently, and its in-tree use is limited to the old load-balancing
script.
There's a bif called packet_source() which, on a worker, provides
information about the packet source actually in use.
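For example, a worker that wants to know where its packets come from can ask
the packet source directly:

event network_time_init()
    {
    # By now the packet source is open and describes what the worker
    # actually captures from.
    local src = packet_source();
    if ( src$live )
        print fmt("capturing live traffic from %s", src$path);
    }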
Relates to zeek/zeek#2877.
* topic/awelzel/3112-log-suffix-left-over-log-rotation:
cluster/logger: Fix leftover-log-rotation in multi-logger setups
cluster/logger: Fix global var reference
Populating log_metadata during zeek_init() is too late for the
leftover-log-rotation functionality, so do it at script parse time.
Also, prepend archiver_ to the log_metadata table and encoding function,
since they live in the global namespace, and to align with
archiver_rotation_format_func. This hasn't been in a released
version yet, so it's still fine to rename.
Closes #3112
These have been discussed in the context of "@if &analyze" [1] and
I'm much in favor of not disabling/removing ~100 lines (more than
fits on a single terminal) from the middle of a file. There's no
performance impact from having these handlers enabled unconditionally.
Also, any future work on "@if &analyze" would otherwise need to revisit
them, which we can now skip.
This also reverts to the behavior where the Cluster::LOG stream
is created even in non-cluster setups, as in previous Zeek versions.
As long as no one writes to it there's essentially no difference. If
someone does write to Cluster::LOG, I'd argue that not black-holing these
messages is better. Schema generators using Log::active_streams will
continue to discover Cluster::LOG even if they run in non-cluster
mode.
https://github.com/zeek/zeek/pull/3062#discussion_r1200498905
* jgras/topic/jgras/cluster-active-node-count-fix:
Fix get_active_node_count for node types not present.
Changed over to an explicit existence check to avoid the set()
creation upon missed lookups.
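The pattern, shown with an illustrative table rather than the framework's
actual variables:

global active_nodes: table[Cluster::NodeType] of set[string] = table();

function get_active_node_count(node_type: Cluster::NodeType): count
    {
    # Checking membership first avoids materializing an empty set()
    # for node types that never showed up.
    return node_type in active_nodes ? |active_nodes[node_type]| : 0;
    }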
When multiple loggers are configured in a Supervisor-controlled cluster
setup, encode extra information into the rotated filename to
identify which logger produced the log.
This is similar to the approach taken for ZeekControl, re-using the
log_suffix terminology, but as there's only a single zeek-archiver
process, no postprocessors, and no other side channel for additional
information, we encode the extra metadata into the filename. zeek-archiver
is extended to recognize the special metadata part of the filename.
This also solves the issue that multiple loggers in a supervisor setup
overwrite each other's log files within a single log-queue directory.