16818 commits

Author SHA1 Message Date
Christian Kreibich
ace5c11048 Bugfix: accurately track Broker buffer overflows w/ multiple peerings
When a node restarts or a peering between two nodes starts over for other
reasons, the internal tracking in the Broker manager resets its state (since
it's per-peering), and thus the message overflow counter. The script layer was
unaware of this, and threw errors when trying to reset the corresponding counter
metric down to zero at sync time.

We now track past buffer overflows via a separate epoch table, using Broker peer
ID comparisons to identify new peerings, and set the counter to the sum of past
and current overflows.

I considered just making this a gauge, but it seems more helpful to be able to
look at a counter to see whether any messages have ever been dropped over the
lifetime of the node process.

As an aside, this now also avoids repeatedly creating the labels vector,
re-using the same one for each metric.

Thanks to @pbcullen for identifying this one!
2025-05-07 17:30:45 -07:00
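
The epoch-table idea described in ace5c11048 can be sketched roughly as follows. This is a minimal Python illustration, not the Zeek script-layer code: the class name, field names, and the sync() signature are all hypothetical, and it assumes only what the message states, namely that a changed Broker peer ID marks a new peering whose internal overflow count restarts at zero.

```python
from dataclasses import dataclass, field


@dataclass
class OverflowTracker:
    """Keeps a monotonic overflow total across peering restarts (hypothetical)."""

    # Per node name: peer ID of the peering seen at the last sync.
    last_peer_id: dict = field(default_factory=dict)
    # Per node name: overflow count reported at the last sync.
    last_count: dict = field(default_factory=dict)
    # Per node name: overflows accumulated by earlier peerings (the "epochs").
    past: dict = field(default_factory=dict)

    def sync(self, node: str, peer_id: str, current: int) -> int:
        """Return the counter value to export for this node."""
        previous_id = self.last_peer_id.get(node)
        if previous_id is not None and previous_id != peer_id:
            # A new peer ID means a new peering whose internal count restarted
            # at zero, so bank what the previous peering had reported.
            self.past[node] = self.past.get(node, 0) + self.last_count.get(node, 0)
        self.last_peer_id[node] = peer_id
        self.last_count[node] = current
        # Exported value: sum of past and current overflows, never decreasing.
        return self.past.get(node, 0) + current
```

Because the exported value is the sum of past and current overflows, it never has to be reset to zero when a peering starts over, which is the property a counter metric needs.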
Christian Kreibich
d9f11643a2 Use Broker peering directionality when re-peering after backpressure overflows
This avoids creating pointless connection reattempts to ephemeral TCP
client-side ports, which have been cluttering up the Broker logs since 7.1.

(cherry picked from commit 549e678dff)
2025-04-29 17:00:50 -07:00
Christian Kreibich
4372cdfe2a Expand Broker APIs to allow tracking directionality of peering establishment
This provides ways to figure out for a given peer, or a given address/port pair,
whether the local node originally established the peering.

(cherry picked from commit b430d5235c)
2025-04-29 17:00:30 -07:00
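
As a rough illustration of how the directionality information from 4372cdfe2a could drive the re-peering decision in d9f11643a2, here is a small Python sketch. The Peering type and repeer_targets() are hypothetical names invented for this example; the actual Broker and Zeek APIs are not shown in these commit messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peering:
    """Hypothetical record of a lost peering."""

    address: str
    port: int
    # True if the local node originally established this peering.
    locally_established: bool


def repeer_targets(lost):
    """Return (address, port) pairs worth reconnecting to.

    Peerings that the remote side established are skipped: the port we saw
    for them is the remote's ephemeral client-side port, and nothing is
    listening there.
    """
    return [(p.address, p.port) for p in lost if p.locally_established]
```

For example, given one peering the local node initiated and one it merely accepted, only the first yields a reconnect attempt.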
Christian Kreibich
458b887df1 Lower listen/connect retry intervals in Broker and the cluster framework to 1sec
The former defaults (30sec, 1min) can slow down cluster startup and recovery
considerably, and other systems have more aggressive intervals still.

(cherry picked from commit 68fadd0464)
2025-04-29 16:47:13 -07:00
Christian Kreibich
446f49e6bc Switch Broker's default backpressure policy to drop_oldest, bump buffer sizes
At every site where we've dug into backpressure disconnect findings, it has been
the case that the default values were too small. 8192, so 4x the old default,
suffices at every site to drown out premature disconnects.

With metrics now available for the send buffers regardless of backpressure
overflow policy, this also switches the default from "disconnect" to
"drop_oldest" (for both peers and websockets), meaning that peerings remain
untouched but the oldest queued message simply gets dropped when a new message
is enqueued. With this policy, the number of backpressure overflows is then
simply the count of discarded messages, something that users can tune to see
drop to zero in everyday use.  Another benefit is that marginal overflows cause
less message loss than when an entire buffer's worth (plus potentially more
in-flight messages) gets thrown out with a disconnect.

(cherry picked from commit 841a40ff88)
2025-04-29 16:47:13 -07:00
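
The drop_oldest behavior described in 446f49e6bc can be pictured with a small bounded queue. This is an illustrative Python sketch only, not Broker's implementation; the class and attribute names are assumptions made for the example.

```python
from collections import deque


class DropOldestBuffer:
    """Bounded send buffer that discards the oldest message when full."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.queue = deque()
        # With drop_oldest, this equals the number of discarded messages.
        self.overflows = 0

    def enqueue(self, message) -> None:
        if len(self.queue) >= self.capacity:
            # Buffer is full: drop the oldest queued message, keep the peering.
            self.queue.popleft()
            self.overflows += 1
        self.queue.append(message)
```

Enqueuing five messages into a buffer of capacity three leaves the three newest messages queued and records two overflows, whereas a disconnect policy would have discarded the whole buffer.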
Christian Kreibich
e6705732ec Add basic btest to verify that Broker peering telemetry is available.
This differs from the upstream version in that it explicitly invokes
Telemetry::sync(), since 7.0.x doesn't have the on-demand invocation
of the hook at scrape & collection time.

(cherry picked from commit 35ab9d5c80)
2025-04-29 16:47:07 -07:00
Christian Kreibich
8b9b16d7a8 Add cluster framework telemetry for Broker's send-buffer use
This hooks into Telemetry::sync() to update Broker-level metrics tracking the
peerings' send buffer state. We do this in the cluster framework so we can label
the resulting metrics with Zeek cluster node names, not Broker's endpoint IDs.

(cherry picked from commit 88a0cda8ca)
2025-04-29 15:19:38 -07:00
Tim Wojtulewicz
c78335c47b Fix use-after-move in recent broker changes
(cherry picked from commit f8d2f30cec)
2025-04-29 15:08:05 -07:00
Christian Kreibich
d5bbf05a32 Add peer buffer update tracking to the Broker manager's event_observer
This implements basic tracking of each peering's current fill level, the maximum
level over a recent time interval (via a new Broker::buffer_stats_reset_interval
tunable, defaulting to 1min), and the number of times a buffer overflows. For
the disconnect policy this is the number of depeerings, but for drop_newest and
drop_oldest it implies the number of messages lost.

This doesn't use "proper" telemetry metrics for a few reasons: this tracking is
Broker-specific, so we need to track each peering via endpoint_ids, while we
want the metrics to use Cluster node name labels, and the latter live in the
script layer. Using broker::endpoint_id directly as keys also means we rely on
their ability to hash in STL containers, which should be fast.

This does not track the buffer levels for Broker "clients" (as opposed to
"peers"), i.e. WebSockets, since we currently don't have a way to name these,
and we don't want to use ephemeral Broker IDs in their telemetry.

To make the stats accessible to the script layer the Broker manager (via a new
helper class that lives in the event_observer) maintains a TableVal mapping
Broker IDs to a new BrokerPeeringStats record. The table's members get updated
every time that table is requested. This minimizes new val instantiation and
allows the script layer to customize the BrokerPeeringStats record by redef'ing,
updating fields, etc. Since we can't use Zeek vals outside the main thread, this
requires some care so all table updates happen only in the Zeek-side table
updater, PeerBufferState::GetPeeringStatsTable().

(cherry picked from commit f5fbad23ff)
2025-04-29 15:08:05 -07:00
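
The per-peering bookkeeping described in d5bbf05a32 (current fill level, maximum over a recent interval, overflow count) might look roughly like this. It is a Python sketch under stated assumptions: the field names are hypothetical and do not claim to match the BrokerPeeringStats record, and timestamps are passed in explicitly rather than read from a clock.

```python
from dataclasses import dataclass


@dataclass
class PeeringStats:
    """Hypothetical per-peering send-buffer statistics."""

    # Mirrors the role of Broker::buffer_stats_reset_interval (1min default).
    reset_interval: float = 60.0
    queued: int = 0               # current fill level
    max_queued_recently: int = 0  # peak fill level in the current window
    overflows: int = 0            # number of buffer overflows seen so far
    window_start: float = 0.0     # start time of the current window

    def update(self, now: float, queued: int, overflowed: bool = False) -> None:
        if now - self.window_start >= self.reset_interval:
            # The window has elapsed: start a new one and forget the old peak.
            self.window_start = now
            self.max_queued_recently = 0
        self.queued = queued
        self.max_queued_recently = max(self.max_queued_recently, queued)
        if overflowed:
            self.overflows += 1
```

Only the recent maximum is reset per window; the overflow count keeps accumulating for the lifetime of the peering.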
Christian Kreibich
3bf709f705 Rename the Broker manager's LoggerAdapter
This is about to do more than just log handling, so this renames it simply to
Observer, reflecting the fact that it implements broker::event_observer.

(cherry picked from commit 23554280e0)
2025-04-29 15:08:05 -07:00
Christian Kreibich
ae14eff1f6 Add event observer via Broker's new API
This is a heavily modified version of 30615f425e,
part of PR #3998, removing all of the logging-specific parts. It only
establishes the basic adapter and the broker::logging() call to register it.
2025-04-29 15:08:05 -07:00
Christian Kreibich
6e8906c0d8 Update scripts.base.frameworks.telemetry.internal-metrics baseline
With the Broker submodule bump, the broker_buffered_messages metric no longer
exists.
2025-04-29 15:08:05 -07:00
Christian Kreibich
5b29dad2c7 Bump Broker to pull in new observer API 2025-04-29 14:49:15 -07:00
Christian Kreibich
186dbe085f Expand documentation of Broker events.
(cherry picked from commit feb2aa890d)
2025-04-08 15:09:44 -07:00
Christian Kreibich
90ecf7ff0d Add backpressure disconnect notification to cluster.log and via telemetry
This adds a Broker-specific script to the cluster framework, loaded only when
Zeek is running in cluster mode. It adds logging in cluster.log as well as
telemetry via a metrics counter for Broker-observed backpressure disconnects.

The new zeek_broker_backpressure_disconnects counter, labeled by the neighboring
peer that the reporting node has determined to be unresponsive, counts the
number of unpeerings for this reason.

Here the node "worker" has observed node "proxy" falling behind once:

# HELP zeek_broker_backpressure_disconnects_total Number of Broker peering drops due to a neighbor falling too far behind in message I/O
# TYPE zeek_broker_backpressure_disconnects_total counter
zeek_broker_backpressure_disconnects_total{endpoint="worker",peer="proxy"} 1

Includes small btest baseline update to reflect @load of a new script.

(cherry picked from commit ead6134501)
2025-04-08 15:09:44 -07:00
Christian Kreibich
67f135f57a Remove unneeded @loads from base/misc/version.zeek
This module is loaded by the telemetry framework, which we're now loading via
the cluster framework, i.e. also in bare mode. The resulting additional
thread (for creating reporter.log) trips up a number of btest baselines.

version.zeek doesn't use any of the string helper functions.

(cherry picked from commit d260a5b7a9)
2025-04-08 15:09:44 -07:00
Christian Kreibich
06fa47e21d Add Cluster::nodeid_to_node() helper function
This translates backend-specific node identifiers (like Broker IDs) to
cluster nodes and their names, if available.

(cherry picked from commit 46a11ec37d)
2025-04-08 15:09:44 -07:00
Christian Kreibich
1cbbbc5c40 Support re-peering with Broker peers that fall behind
This adds re-peering at the Broker level for peers that Broker decided to
unpeer. We keep this at the Broker level since this behavior is specific to
it (as opposed to other cluster backends).

Includes baseline updates for btests that pick up on the new script's @load.

(cherry picked from commit 0010e65f6d)
2025-04-08 15:09:44 -07:00
Dominik Charousset
eeb0e7184d Add Zeek-level configurability of Broker slow-peer disconnects
(cherry picked from commit 4c4eb4b8e2)
2025-04-08 15:09:44 -07:00
Christian Kreibich
f7e8fe1d68 Bump Broker to pull in disconnect feature and infinite-loop fix
(cherry picked from commit b9df1674b7)
2025-04-08 15:09:41 -07:00
Christian Kreibich
11701d4734 No need to namespace Cluster:: functions in their own namespace
(cherry picked from e81856a4af)
2025-04-08 14:50:50 -07:00
Christian Kreibich
2ad80f8fb2 Telemetry framework: move BIFs to the primary-bif stage
This moves the Telemetry framework's BIF-defined functionality from the
secondary-BIFs stage to the primary one. That is, this functionality is now
available from the end of init-bare.zeek, not only after the end of
init-frameworks-and-bifs.zeek.

This allows us to use script-layer telemetry in Zeek's own code that gets
pulled in during init-frameworks-and-bifs.

This change splits up the BIF features into functions, constants, and types,
because that's the granularity most workable in Func.cc and NetVar. It also now
defines the Telemetry::MetricsType enum once, not redundantly in BIFs and script
layer.

Due to subtle load ordering issues between the telemetry and cluster frameworks
this pushes the redef stage of Telemetry::metrics_port and address into
base/frameworks/telemetry/options.zeek, which is loaded sufficiently late in
init-frameworks-and-bifs.zeek to sidestep those issues. (When not doing this,
the effect is that the redef in telemetry/main.zeek doesn't yet find the
cluster-provided values, and Zeek does not end up listening on these ports.)

The need to add basic Zeek headers in script_opt/ZAM/ZBody.cc as a side-effect
of this is curious, but looks harmless.

Also includes baseline updates for the usual btests and adds a few doc strings.

(cherry picked from commit 71f7e89974)
2025-04-08 14:50:45 -07:00
Christian Kreibich
5503688758 Minor comment tweaks for init-frameworks-and-bifs.zeek
(cherry picked from acdd7a7934)
2025-04-08 14:50:28 -07:00
Tim Wojtulewicz
3e5060018a Update docs submodule to fix RTD [nomail] [skip ci] 2025-03-20 13:48:45 -07:00
Tim Wojtulewicz
9f8e27118e Update CHANGES, VERSION, and NEWS for 7.0.6 release 2025-03-20 12:24:26 -07:00
Tim Wojtulewicz
89376095dc Update zeekctl submodule to fix a couple btests 2025-03-19 13:04:31 -07:00
Tim Wojtulewicz
3e8af6497e Update zeekjs to v0.16.0 2025-03-19 10:43:17 -07:00
Tim Wojtulewicz
5051cce720 Updating CHANGES and VERSION. 2025-03-19 10:43:02 -07:00
Tim Wojtulewicz
c30b835a14 Update mozilla-ca-list.zeek and ct-list.zeek to NSS 3.109 2025-03-18 17:59:01 -07:00
Tim Wojtulewicz
a041080e3f Update core/vntag-in-vlan baseline to remove ip_proto field for 7.0 2025-03-18 17:03:05 -07:00
Tim Wojtulewicz
fc3001c76a CI: Force rebuild of tumbleweed docker image 2025-03-18 16:33:45 -07:00
Tim Wojtulewicz
e2b2c79306 Merge remote-tracking branch 'origin/topic/timw/ci-macos-upgrade-pip'
* origin/topic/timw/ci-macos-upgrade-pip:
  CI: Unconditionally upgrade pip on macOS

(cherry picked from commit e8d91c8227)
2025-03-18 16:21:45 -07:00
Tim Wojtulewicz
ed32ee73fa Merge remote-tracking branch 'origin/topic/timw/ci-macos-sequoia'
* origin/topic/timw/ci-macos-sequoia:
  ci/init-external-repo.sh: Use regex to match macos cirrus task
  CI: Change macOS runner to Sequoia

(cherry picked from commit 43f108bb71)
2025-03-18 16:21:13 -07:00
Tim Wojtulewicz
eed9858bc4 CI: Update freebsd to 13.4 and 14.2 2025-03-18 16:20:06 -07:00
Tim Wojtulewicz
ed081212ae Merge remote-tracking branch 'origin/topic/timw/vntag-in-vlan'
* origin/topic/timw/vntag-in-vlan:
  Add analyzer registration from VLAN to VNTAG

(cherry picked from commit cb5e3d0054)
2025-03-18 16:18:13 -07:00
Arne Welzel
ec04c925a0 Merge remote-tracking branch 'origin/topic/awelzel/2311-load-plugin-bare-mode'
* origin/topic/awelzel/2311-load-plugin-bare-mode:
  scan.l: Fix @load-plugin scripts loading
  scan.l: Extract switch_to() from load_files()
  ScannedFile: Allow skipping canonicalization

(cherry picked from commit a3a08fa0f3)
2025-03-18 16:16:39 -07:00
Arne Welzel
de8127f3cd Merge remote-tracking branch 'origin/topic/awelzel/4198-4201-quic-maintenance'
* origin/topic/awelzel/4198-4201-quic-maintenance:
  QUIC/decrypt_crypto: Rename all_data to data
  QUIC: Confirm before forwarding data to SSL
  QUIC: Parse all QUIC packets in a UDP datagram
  QUIC: Only slurp till packet end, not till &eod

(cherry picked from commit 44304973fb)
2025-03-18 16:15:34 -07:00
Arne Welzel
b5774f2de9 Merge remote-tracking branch 'origin/topic/vern/ZAM-field-assign-in-op'
* origin/topic/vern/ZAM-field-assign-in-op:
  pre-commit: Bump spicy-format to 0.23
  fix for ZAM optimization of assigning a record field to result of "in" operation

(cherry picked from commit 991bc9644d)
2025-03-18 16:13:01 -07:00
Tim Wojtulewicz
7c8a7680ba Update CHANGES, VERSION, and NEWS for 7.0.5 release 2024-12-16 11:12:48 -07:00
Tim Wojtulewicz
26b50908e1 Merge remote-tracking branch 'security/topic/timw/7.0.5-patches' into release/7.0
* security/topic/timw/7.0.5-patches:
  QUIC/decrypt_crypto: Actually check if decryption was successful
  QUIC/decrypt_crypto: Limit payload_length to 10k
  QUIC/decrypt_crypto: Fix decrypting into too small stack buffer
2024-12-16 10:21:59 -07:00
Arne Welzel
c2f2388f18 QUIC/decrypt_crypto: Actually check if decryption was successful
...and bail if it wasn't.

PCAP was produced using OSS-Fuzz input from issue 383379789.
2024-12-13 13:10:45 -07:00
Arne Welzel
d745d746bc QUIC/decrypt_crypto: Limit payload_length to 10k
Given we dynamically allocate memory for decryption, employ a limit
that is unlikely to be hit, but allows for large payloads produced
by the fuzzer or jumbo frames.
2024-12-13 13:10:45 -07:00
Arne Welzel
5fbb6b4599 QUIC/decrypt_crypto: Fix decrypting into too small stack buffer
A QUIC initial packet larger than 1500 bytes could lead to crashes
due to the usage of a fixed size stack buffer for decryption.

Allocate the necessary memory dynamically on the heap instead.
2024-12-13 13:10:45 -07:00
Tim Wojtulewicz
7c463b5f92 Update docs submodule [nomail] [skip ci] 2024-12-13 13:08:51 -07:00
Tim Wojtulewicz
e7f694bcbb Merge remote-tracking branch 'origin/topic/vern/ZAM-tbl-iteration-memory-mgt-fix'
* origin/topic/vern/ZAM-tbl-iteration-memory-mgt-fix:
  fix for memory management associated with ZAM table iteration

(cherry picked from commit 805e9db588)
2024-12-13 12:27:16 -07:00
Arne Welzel
f54416eae4 Merge remote-tracking branch 'origin/topic/christian/fix-zam-analyzer-name'
* origin/topic/christian/fix-zam-analyzer-name:
  Fix ZAM's implementation of Analyzer::name() BiF

(cherry picked from commit e100a8e698)
2024-12-12 13:14:10 -07:00
Arne Welzel
68bfe8d1c0 Merge remote-tracking branch 'origin/topic/vern/zam-exception-leaks'
* origin/topic/vern/zam-exception-leaks:
  More robust memory management for ZAM execution - fixes #4052

(cherry picked from commit c3b30b187e)
2024-12-12 13:05:13 -07:00
Arne Welzel
cf97ed6ac1 Merge remote-tracking branch 'origin/topic/awelzel/bump-zeekjs-0-14-0'
* origin/topic/awelzel/bump-zeekjs-0-14-0:
  Bump zeekjs to v0.14.0

(cherry picked from commit aac640ebff)
2024-12-12 12:45:14 -07:00
Benjamin Bannier
35cd891d6e Merge remote-tracking branch 'origin/topic/bbannier/doc-have-spicy'
(cherry picked from commit 4a96d34af6)
2024-12-12 12:43:43 -07:00
Tim Wojtulewicz
f300ddb9fe Update CHANGES, VERSION, and NEWS for 7.0.4 release 2024-11-19 12:35:32 -07:00