Add peer buffer update tracking to the Broker manager's event_observer

This implements basic tracking of each peering's current fill level, the maximum level over a recent time interval (via a new Broker::buffer_stats_reset_interval tunable, defaulting to 1min), and the number of times a buffer overflows. For the disconnect policy this is the number of depeerings, but for drop_newest and drop_oldest it is the number of messages lost.

This doesn't use "proper" telemetry metrics for a few reasons: this tracking is Broker-specific, so we need to track each peering via endpoint_ids, while we want the metrics to use Cluster node name labels, and the latter live in the script layer. Using broker::endpoint_id directly as keys also means we rely on their ability to hash in STL containers, which should be fast.

This does not track the buffer levels for Broker "clients" (as opposed to "peers"), i.e. WebSockets, since we currently don't have a way to name these, and we don't want to use ephemeral Broker IDs in their telemetry.

To make the stats accessible to the script layer, the Broker manager (via a new helper class that lives in the event_observer) maintains a TableVal mapping Broker IDs to a new BrokerPeeringStats record. The table's members get updated every time the table is requested. This minimizes new val instantiation and allows the script layer to customize the BrokerPeeringStats record by redef'ing, updating fields, etc. Since we can't use Zeek vals outside the main thread, this requires some care: all table updates happen only in the Zeek-side table updater, PeerBufferState::GetPeeringStatsTable().
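
The bookkeeping described above can be illustrated with a small standalone sketch. It is not the code from this commit: PeerBufferTracker, PeeringStats, Push(), Pull(), Get(), and the fixed per-peer capacity are hypothetical names and simplifications, and std::string stands in for broker::endpoint_id. It only shows the idea of keeping a current level, a recent maximum that restarts after a reset interval, and an overflow counter per peer, guarded by a mutex because observer callbacks may run off the main thread.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for broker::endpoint_id (assumption: any hashable key works).
using PeerId = std::string;

// Hypothetical per-peering counters, loosely modeled on the commit description.
struct PeeringStats {
    uint64_t queued = 0;              // current fill level
    uint64_t max_queued_recently = 0; // maximum level since the last reset
    uint64_t overflows = 0;           // pushes that found the buffer full
};

class PeerBufferTracker {
public:
    using Clock = std::chrono::steady_clock;

    PeerBufferTracker(uint64_t capacity, Clock::duration reset_interval)
        : capacity_(capacity), reset_interval_(reset_interval) {}

    // A message is queued for `peer`. A push into a full buffer is counted
    // as an overflow and does not change the fill level.
    void Push(const PeerId& peer, Clock::time_point now) {
        std::lock_guard<std::mutex> lock(mutex_);
        Entry& e = Lookup(peer, now);
        if (e.stats.queued >= capacity_) {
            ++e.stats.overflows;
            return;
        }
        ++e.stats.queued;
        e.stats.max_queued_recently =
            std::max(e.stats.max_queued_recently, e.stats.queued);
    }

    // A message for `peer` leaves the buffer.
    void Pull(const PeerId& peer, Clock::time_point now) {
        std::lock_guard<std::mutex> lock(mutex_);
        Entry& e = Lookup(peer, now);
        if (e.stats.queued > 0)
            --e.stats.queued;
    }

    // Returns a copy, so the caller never touches shared state unlocked.
    PeeringStats Get(const PeerId& peer) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = entries_.find(peer);
        return it == entries_.end() ? PeeringStats{} : it->second.stats;
    }

private:
    struct Entry {
        PeeringStats stats;
        Clock::time_point last_reset;
    };

    // Finds or creates the entry and restarts the "recent" window once the
    // reset interval has elapsed. Callers must hold mutex_.
    Entry& Lookup(const PeerId& peer, Clock::time_point now) {
        auto it = entries_.emplace(peer, Entry{PeeringStats{}, now}).first;
        Entry& e = it->second;
        if (now - e.last_reset >= reset_interval_) {
            e.stats.max_queued_recently = e.stats.queued;
            e.last_reset = now;
        }
        return e;
    }

    uint64_t capacity_;
    Clock::duration reset_interval_;
    mutable std::mutex mutex_;
    std::unordered_map<PeerId, Entry> entries_;
};
```

Passing the timestamp in explicitly keeps the sketch deterministic to exercise; a real observer would presumably read a clock itself. Returning copies from Get() mirrors the constraint above that script-layer values are only touched from the main thread.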
2025-10-02 06:38:20 +00:00 · 2025-04-15 18:08:16 -07:00 · 2025-04-15 18:08:16 -07:00 · f5fbad23ff
commit f5fbad23ff
parent 23554280e0
7 changed files with 241 additions and 9 deletions
--- a/scripts/base/frameworks/broker/main.zeek
+++ b/scripts/base/frameworks/broker/main.zeek
@@ -104,6 +104,10 @@ export {
 	## Same as :zeek:see:`Broker::peer_overflow_policy` but for WebSocket clients.
 	const web_socket_overflow_policy = "disconnect" &redef;

+	## How frequently Zeek resets some peering/client buffer statistics,
+	## such as ``max_queued_recently`` in :zeek:see:`BrokerPeeringStats`.
+	const buffer_stats_reset_interval = 1min &redef;
+
 	## The CAF scheduling policy to use.  Available options are "sharing" and
 	## "stealing".  The "sharing" policy uses a single, global work queue along
 	## with mutex and condition variable used for accessing it, which may be
@@ -392,6 +396,12 @@ export {
 	## Returns: a unique identifier for the local broker endpoint.
 	global node_id: function(): string;

+	## Obtain each peering's send-buffer statistics. The keys are Broker
+	## endpoint IDs.
+	##
+	## Returns: per-peering statistics.
+	global peering_stats: function(): table[string] of BrokerPeeringStats;
+
 	## Sends all pending log messages to remote peers.  This normally
 	## doesn't need to be used except for test cases that are time-sensitive.
 	global flush_logs: function(): count;
@@ -554,6 +564,11 @@ function node_id(): string
 	return __node_id();
 	}

+function peering_stats(): table[string] of BrokerPeeringStats
+	{
+	return __peering_stats();
+	}
+
 function flush_logs(): count
 	{
 	return __flush_logs();