Commit graph

3011 commits

Author SHA1 Message Date
Christian Kreibich
738ce1c235 Bugfix: accurately track Broker buffer overflows w/ multiple peerings
When a node restarts or a peering between two nodes starts over for other
reasons, the internal tracking in the Broker manager resets its state (since
it's per-peering), and thus the message overflow counter. The script layer was
unaware of this, and threw errors when trying to reset the corresponding counter
metric down to zero at sync time.

We now track past buffer overflows via a separate epoch table, using Broker peer
ID comparisons to identify new peerings, and set the counter to the sum of past
and current overflows.

I considered just making this a gauge, but it seems more helpful to be able to
look at a counter to see whether any messages have ever been dropped over the
lifetime of the node process.

As an aside, this now also avoids repeatedly creating the labels vector,
re-using the same one for each metric.

Thanks to @pbcullen for identifying this one!
2025-05-07 17:27:38 -07:00
Tim Wojtulewicz
2cf8497bf7 Merge remote-tracking branch 'origin/topic/timw/update-ct-ca-lists'
* origin/topic/timw/update-ct-ca-lists:
  External tests: add removed logs to CT list to prevent baseline changes
  Update Mozilla CA list and CT list to NSS 3.110
2025-04-29 08:53:04 -07:00
Kshitiz Bartariya
40935c31b1 Ignore case when matching prefix in http analyzer 2025-04-25 10:33:11 -07:00
Christian Kreibich
68fadd0464 Lower listen/connect retry intervals in Broker and the cluster framework to 1sec
The former defaults (30sec, 1min) can slow down cluster startup and recovery
considerably, and other systems have more aggressive intervals still.
2025-04-25 10:22:35 -07:00
Christian Kreibich
841a40ff88 Switch Broker's default backpressure policy to drop_oldest, bump buffer sizes
At every site where we've dug into backpressure disconnect findings, it has been
the case that the default values were too small. 8192, so 4x the old default,
suffices at every site to drown out premature disconnects.

With metrics now available for the send buffers regardless of backpressure
overflow policy, this also switches the default from "disconnect" to
"drop_oldest" (for both peers and websockets), meaning that peerings remain
untouched but the oldest queued message simply gets dropped when a new message
is enqueued. With this policy, the number of backpressure overflows is then
simply the count of discarded messages, something that users can tune to see
drop to zero in everyday use.  Another benefit is that marginal overflows cause
less message loss than when an entire buffer's worth (plus potentially more
in-flight messages) gets thrown out with a disconnect.
2025-04-25 10:22:35 -07:00
Christian Kreibich
5008f586ea Deprecate Broker::congestion_queue_size and stop using it internally
Since a reorg in the Broker library (commit b04195183) that revamped flow
control and that we pulled in with Zeek 5.0, this setting hasn't done
anything. Broker's endpoint::make_subscriber() and
endpoint::make_status_subscriber() take a queue size argument (with a default
value) that simply gets dropped in the eventual subscriber::make() call. See:

b041951835 (diff-5c0d2baa7981caeb6a4080708ddca6ad929746d10c73d66598e46d7c2c03c8deL34-R178)
2025-04-25 10:22:35 -07:00
Christian Kreibich
c1a5f70df8 Merge branch 'topic/christian/broker-backpressure-metrics'
* topic/christian/broker-backpressure-metrics:
  Add basic btest to verify that Broker peering telemetry is available.
  Add cluster framework telemetry for Broker's send-buffer use
  Add peer buffer update tracking to the Broker manager's event_observer
  Rename the Broker manager's LoggerAdapter
  Avoid race in the cluster/broker/publish-any btest
2025-04-25 10:04:09 -07:00
Christian Kreibich
88a0cda8ca Add cluster framework telemetry for Broker's send-buffer use
This hooks into Telemetry::sync() to update Broker-level metrics tracking the
peerings' send buffer state. We do this in the cluster framework so we can label
the resulting metrics with Zeek cluster node names, not Broker's endpoint IDs.
2025-04-25 09:14:33 -07:00
Christian Kreibich
f5fbad23ff Add peer buffer update tracking to the Broker manager's event_observer
This implements basic tracking of each peering's current fill level, the maximum
level over a recent time interval (via a new Broker::buffer_stats_reset_interval
tunable, defaulting to 1min), and the number of times a buffer overflows. For
the disconnect policy this is the number of depeerings, but for drop_newest and
drop_oldest it implies the number of messages lost.

This doesn't use "proper" telemetry metrics for a few reasons: this tracking is
Broker-specific, so we need to track each peering via endpoint_ids, while we
want the metrics to use Cluster node name labels, and the latter live in the
script layer. Using broker::endpoint_id directly as keys also means we rely on
their ability to hash in STL containers, which should be fast.

This does not track the buffer levels for Broker "clients" (as opposed to
"peers"), i.e. WebSockets, since we currently don't have a way to name these,
and we don't want to use ephemeral Broker IDs in their telemetry.

To make the stats accessible to the script layer the Broker manager (via a new
helper class that lives in the event_observer) maintains a TableVal mapping
Broker IDs to a new BrokerPeeringStats record. The table's members get updated
every time that table is requested. This minimizes new val instantiation and
allows the script layer to customize the BrokerPeeringStats record by redefing,
updating fields, etc. Since we can't use Zeek vals outside the main thread, this
requires some care so all table updates happen only in the Zeek-side table
updater, PeerBufferState::GetPeeringStatsTable().
2025-04-24 22:47:18 -07:00
Tim Wojtulewicz
3ab83a3f74 Minor changes to storage framework script docs 2025-04-24 11:11:08 -07:00
Steve Smoot
9ef579b09e Change from_json to return an error rather than print it. 2025-04-23 15:56:12 -07:00
Tim Wojtulewicz
cb35da08bc Update Mozilla CA list and CT list to NSS 3.110 2025-04-23 10:41:19 -07:00
Arne Welzel
011029addc cluster/websocket: Make websocket dispatcher queue size configurable
Limit the number WebSocket events queued from external clients to
dispatcher instances to produce back pressure to the clients if
Zeek's IO loop is overloaded.
2025-04-23 14:27:43 +02:00
Arne Welzel
ab25e5d24b broker/main: Reference Cluster::publish() for auto_publish() deprecation
In hindsight, this is the better thing to do and with Zeek 7.2 we should
be confident enough that it'll work.
2025-04-23 14:27:43 +02:00
Arne Welzel
a7423104e1 broker/main: Deprecate Broker::listen_websocket()
Optimistically deprecate Broker::listen_websocket() and promote
Cluster::listen_websocket() instead.
2025-04-23 14:27:43 +02:00
Arne Welzel
3d3b7a0759 cluster/Backend: Add ProcessError()
Allow backends to pass errors to a strategy. Locally, these raise
Cluster::Backend::error() events that are logged to the reporter
as errors.
2025-04-23 14:19:08 +02:00
Christian Kreibich
549e678dff Use Broker peering directionality when re-peering after backpressure overflows
This avoids creating pointless connection reattempts to ephemeral TCP
client-side ports, which have been cluttering up the Broker logs since 7.1.
2025-04-21 14:08:42 -07:00
Christian Kreibich
b430d5235c Expand Broker APIs to allow tracking directionality of peering establishment
This provides ways to figure out for a given peer, or a given address/port pair,
whether the local node originally established the peering.
2025-04-21 14:08:42 -07:00
Arne Welzel
b8e573a3b9 ldap: Clean up from code review
Co-authored-by: Benjamin Bannier <benjamin.bannier@corelight.com>
2025-04-15 20:10:56 +02:00
Arne Welzel
07bf7f8b18 ldap: Add Sicily Authentication constants
The aduser1-ntlm.pcap contains bindRequest messages using Microsoft AD
specific Sicily Authentication [1]. Add the entries to the enum so we
don't log undefined for these and also check the NTLMSSP signature.

[1] https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-adts/8b9dbfb2-5b6a-497a-a533-7e709cb9a982
2025-04-15 20:10:56 +02:00
Tim Wojtulewicz
cb1ef47a31 Add STORAGE_ prefixes for backends and serializers 2025-04-14 10:11:13 -07:00
Tim Wojtulewicz
e545fe8256 Ground work for pluggable storage serializers 2025-04-14 10:02:35 -07:00
Arne Welzel
6bc36e8cf8 broker/main: Adapt enum values to agree with comm.bif
Logic to detect this error already existed, but due to enum identifiers
not having a value set, it never triggered before.

Should probably backport this one.
2025-04-04 15:36:42 +02:00
Tim Wojtulewicz
55e458c5f7 Add comment annotation to disable copying redef value into docs 2025-04-01 10:23:55 -07:00
Arne Welzel
14697ea6ba Merge remote-tracking branch 'origin/topic/neverlord/broker-logging'
* origin/topic/neverlord/broker-logging:
  Integrate review feedback
  Hook into Broker logs via its new API
2025-03-31 18:53:43 +02:00
Christian Kreibich
98c203b8cb Add "U" to QUIC history docstrings and expand version string docs
Looks like we overlooked documenting "U" in zeek/zeek#3526 .
2025-03-27 13:29:40 -07:00
Christian Kreibich
2199cb1ddd Remove "experimental" from the QUIC history field's comment string [skip ci]
We're unlikely to fundamentally change (or remove) this field at this point, and
some users wondered whether we might do so, given the labeling.
2025-03-26 14:03:52 -07:00
Tim Wojtulewicz
43faea880b Add analyzer registration from VLAN to VNTAG 2025-03-18 11:51:27 -07:00
Tim Wojtulewicz
c7015e8250 Split storage.bif file into events/sync/async, add more comments 2025-03-18 10:20:34 -07:00
Tim Wojtulewicz
f40947f6ac Update comments in script files, run zeek-format on all of them 2025-03-18 10:20:34 -07:00
Tim Wojtulewicz
a40db844eb Redis: Handle disconnection correctly via callback 2025-03-18 10:20:34 -07:00
Tim Wojtulewicz
c7503654e8 Add IN_PROGRESS return code, handle for async backends 2025-03-18 10:20:34 -07:00
Tim Wojtulewicz
9ed3e33f97 Completely rework return values from storage operations 2025-03-18 10:20:33 -07:00
Tim Wojtulewicz
40f60f26b3 Run expiration on a separate thread 2025-03-18 10:20:33 -07:00
Tim Wojtulewicz
a485b1d237 Make backend options a record, move actual options to be sub-records 2025-03-18 10:20:33 -07:00
Tim Wojtulewicz
28951dccf1 Split sync and async into separate script-land namespaces 2025-03-18 10:20:33 -07:00
Tim Wojtulewicz
f1a7376e0a Return generic result for get operations that includes error messages 2025-03-18 09:32:34 -07:00
Tim Wojtulewicz
4695060d75 Allow opening and closing backends to be async 2025-03-18 09:32:34 -07:00
Tim Wojtulewicz
7ad6a05f5b Add infrastructure for asynchronous storage operations 2025-03-18 09:32:34 -07:00
Tim Wojtulewicz
d07d27453a Add infrastructure for automated expiration of storage entries
This is used for backends that don't support expiration natively.
2025-03-18 09:32:34 -07:00
Tim Wojtulewicz
8dee733a7d Change args to Storage::put to be a record
The number of args being passed to the put() methods was getting to be
fairly long, with more on the horizon. Changing to a record means simplifying
things a little bit.
2025-03-18 09:32:34 -07:00
Tim Wojtulewicz
69d940533d Pass key/value types for validation when opening backends 2025-03-18 09:32:34 -07:00
Tim Wojtulewicz
2ea0f3e70a Lay out initial parts for the Storage framework
This includes a manager, component manager, BIF and script code, and
parts to support new storage backend plugins.
2025-03-18 09:32:34 -07:00
Arne Welzel
6032741868 cluster/websocket: Implement WebSocket server 2025-03-10 17:07:30 +01:00
Johanna Amann
b8c135d7cb Remove violating analyzer from services field again
This reverts some of the recent DPD changes; specifically violations
trigger removal from the services field, again, by default.

Discussion in GH-4521
2025-03-04 15:10:49 +00:00
Arne Welzel
776c003033 PacketAnalyzer::Geneve: Add get_options()
Allow to extract Geneve options on-demand, for example during a
new_connection() event.
2025-02-22 12:19:42 -08:00
Dominik Charousset
20b3eca257 Integrate review feedback 2025-02-15 16:37:24 +01:00
Dominik Charousset
30615f425e Hook into Broker logs via its new API
The new Broker API allows us to provide a custom logger to Broker that
pulls previously unattainable context information out of Broker to put
them into broker.log for users of Zeek.

Since Broker log events happen asynchronously, we cache them in a queue
and use a flare to notify Zeek of activity. Furthermore, the Broker
manager now implements the `ProcessFd` function to avoid unnecessary
polling of the new log queue. As a side effect, data stores are polled
less as well.
2025-02-08 16:28:02 +01:00
Johanna Amann
fc233fd8d0 Merge remote-tracking branch 'origin/topic/johanna/dpd-changes'
* origin/topic/johanna/dpd-changes:
  DPD: failed services logging alignment
  DPD: update test baselines; change options for external tests.
  DPD: change policy script for service violation logging; add NEWS
  DPD changes - small script fixes and renames.
  Update public and private test suite for DPD changes.
  Allow to track service violations in conn.log.
  Make conn.log service field ordered
  DPD: change handling of pre-confirmation violations, remove max_violations
  DPD: log analyzers that have confirmed
  IRC analyzer - make protocol confirmation more robust.
2025-02-07 07:35:30 +00:00
Johanna Amann
c402c28f7e Merge remote-tracking branch 'origin/topic/johanna/sslindentation'
* origin/topic/johanna/sslindentation:
  SSL main.zeek - fix indentation
2025-02-06 17:00:40 +00:00