Commit graph

1352 commits

Author SHA1 Message Date
Arne Welzel
473723cc47 Attr: Deprecate using &default and &optional together on record fields
Since &default implies re-initialization of the field, using it together
with &optional doesn't make much sense.
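Schematically, the deprecated combination looks like this (record and
field names made up for illustration):

    type Info: record {
        # Deprecated: &default already guarantees the field a value,
        # so &optional adds nothing here.
        duration: interval &default=0sec &optional;

        # Use one or the other instead:
        timeout: interval &default=30sec;
        note: string &optional;
    };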
2025-07-30 10:26:06 +02:00
Tim Wojtulewicz
7e3ed2010d Add flag to force synchronous mode when calling storage script-land functions 2025-07-23 13:14:34 -07:00
Tim Wojtulewicz
a0ffe7f748 Add storage metrics for operations, expirations, data transferred 2025-07-18 14:28:04 -07:00
Benjamin Bannier
d5fd29edcd Prefer explicit construction to coercion in record initialization
While we support initializing records via coercion from an expression
list, e.g.,

    local x: X = [$x1=1, $x2=2];

this can sometimes obscure the code to readers, e.g., when assigning to a
value declared and typed elsewhere. The language runtime has similar
overhead, since instead of just constructing a known type it needs to
check at runtime that the coercion from the expression list is valid;
this can be slower than just writing the more readable code in the first
place, see #4559.

With this patch we use explicit construction, e.g.,

    local x = X($x1=1, $x2=2);
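Both snippets assume a record type along these lines (definition made up
for illustration):

    type X: record {
        x1: count;
        x2: count;
    };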
2025-07-11 16:28:37 -07:00
Arne Welzel
df581c59b4 scripts: Use tpe instead of type_, again
The .rst generation doesn't escape the trailing `_`, and the docs build
then gets upset because `type` is treated as a reference target.

For better or worse, revert to using tpe, though I acknowledge this
means we need to be careful with trailing underscores because our docs
build is so fragile.

Partly reverts b9eabbabba.
2025-07-03 20:25:34 +02:00
Benjamin Bannier
b9eabbabba Bump pre-commit hooks 2025-07-01 10:39:47 +02:00
Arne Welzel
1d931b5a2f cluster/WebSocket: Include X-Application-Name in cluster.log
The formatting for the log is a bit ad hoc, but that's mostly because cluster.log
only has a message field and I don't think a dedicated application_name
column is worth it. One could also be added by custom scripts if it's really
wanted for a given deployment.
2025-06-30 17:55:24 +02:00
Arne Welzel
26f5166d7a cluster/telemetry: Move topic_normalization redef to zeromq 2025-06-26 15:22:11 +02:00
Arne Welzel
4c34274a6c cluster: Introduce telemetry component 2025-06-25 16:59:49 +02:00
Arne Welzel
4b472f2771 Merge remote-tracking branch 'origin/topic/awelzel/telemetry-endpoint-to-node-rename'
* origin/topic/awelzel/telemetry-endpoint-to-node-rename:
  telemetry: Rename endpoint label to node label
2025-06-25 09:33:55 +02:00
Arne Welzel
eea194ddd8 telemetry: Rename endpoint label to node label
Using a label named "endpoint" is not intuitive and requires explaining to
users that it's really just the Cluster::node value. Change the label to
"node", so that we don't need to do the explaining.

This probably breaks some existing users of the Prometheus metrics, but after
looking more at metrics recently, "endpoint" really is a thorn in my side.
2025-06-25 09:33:01 +02:00
Christian Kreibich
1dcd13a019 Fix a typo. 2025-06-05 17:51:54 -07:00
Johanna Amann
58613f0313 Introduce new c$failed_analyzers field
This field is used internally to track which analyzers have already had a
violation, mostly to prevent duplicate logging.

In the past, c$service_violation was used for a similar purpose -
however it has slightly different semantics. Where c$failed_analyzers
tracks analyzers that were removed due to a violation,
c$service_violation tracks violations - and doesn't care if an analyzer
was actually removed due to it.
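Schematically, attaching such a field looks like this (the exact
attributes are my assumption):

    redef record connection += {
        # Analyzers removed due to a violation, tracked to avoid
        # duplicate logging.
        failed_analyzers: set[string] &default=set();
    };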
2025-06-04 12:07:13 +01:00
Johanna Amann
42ba2fcca0 Settle on analyzer.log for the dpd.log replacement
This commit renames analyzer-failed.log to analyzer.log and updates the
respective NEWS entry.
2025-06-03 17:33:36 +01:00
Johanna Amann
130c89a0a7 dpd->analyzer.log change - rename files
To address review feedback in GH-4362: rename analyzer-failed-log.zeek
to logging.zeek, analyzer-debug-log.zeek to debug-logging.zeek, and
dpd-log.zeek to deprecated-dpd-log.zeek.

Includes the respective test, NEWS, etc. updates.
2025-06-03 16:32:52 +01:00
Johanna Amann
af77a7a83b Analyzer failure logging: tweaks and test fixes
The main part of this commit is changes to tests. A lot of the tests
that previously relied on analyzer.log or dpd.log now use the new
analyzer-failed.log.

I verified all the changes and, as far as I can tell, everything
behaves as it should. This includes the external test baselines.

This change also enables logging of file and packet analyzers to
analyzer-failed.log and fixes some small behavior issues.

The analyzer_failed event is no longer raised when the removal of an
analyzer is vetoed.

If an analyzer is no longer active when an analyzer violation is raised,
the analyzer_failed event is still raised. This can happen, e.g.,
when an analyzer error occurs at the very end of a connection. This
makes the behavior more similar to what happened in the past, and also
seems intuitively sensible.

A bug introduced in the failed service logging was fixed.
2025-06-03 15:56:42 +01:00
Johanna Amann
8c814fa88c Introduce analyzer-failed.log, as a replacement for dpd.log
Analyzer-failed.log is, essentially, the replacement for dpd.log. The
name should make more sense, since it now logs analyzer failures. For
protocol analyzers specifically, these are failures that lead to the
analyzer being disabled.
2025-06-03 15:17:26 +01:00
Johanna Amann
c55e21da71 Rename analyzer.log to analyzer-debug.log; move to policy
The current analyzer.log is more useful for debugging than for
operational purposes. Hence it is now disabled by default, moved to a
policy script, and renamed to analyzer-debug.log.

Furthermore, logging of analyzer confirmations and of disabled analyzers
is now enabled by default.
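Assuming the final script name from the rename commit above, getting the
debug log back should be a one-liner:

    @load frameworks/analyzer/debug-logging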
2025-06-03 15:17:26 +01:00
Johanna Amann
6183c5086b Move dpd.log to policy script
This is the first phase of moving from the current dpd log to a more
modern logfile, without some of the weirdnesses that the current dpd log
contains.

Tests will not pass in the current state; this is just splitting out
functionality.
2025-06-03 15:17:26 +01:00
Arne Welzel
7eb849ddf4 intel: Add indicator_inserted and indicator_removed hooks
This change adds two new hooks to the Intel framework that can be used
to intercept added and removed indicators and their type.

These hooks are fairly low-level. One immediate use case is to count the
number of indicators loaded per Intel::Type and enable or disable the
corresponding event groups of the intel/seen scripts.

I attempted to gauge the overhead, and while it's definitely there, the
hooks add somewhere around ~0.5 seconds when loading a file with ~500k
DOMAIN entries populated via the min_data_store mechanism. While that
doesn't sound great, it actually takes the manager on my system 2.5
seconds to serialize and Cluster::publish() the min_data_store alone,
and it does that serially for every active worker. Mostly to say that
the bigger overhead in that area is the manager doing redundant work
per worker.
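A sketch of the counting use case (hook parameter names assumed from the
description above):

    global indicator_counts: table[Intel::Type] of count &default=0;

    hook Intel::indicator_inserted(indicator: string, indicator_type: Intel::Type)
        {
        ++indicator_counts[indicator_type];
        }

    hook Intel::indicator_removed(indicator: string, indicator_type: Intel::Type)
        {
        --indicator_counts[indicator_type];
        }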

Co-authored-by: Mohan Dhawan <mohan@corelight.com>
2025-06-02 09:50:48 +02:00
Arne Welzel
544d571089 cluster/websocket: Deprecate $listen_host, introduce $listen_addr
This only changes the script-layer API, but keeps the std::string host
in the C++ layer's ServerOptions, mostly because the ixwebsocket library
takes the host as std::string. Also, maybe at some point we'd want to
support something scheme-based like unix:///var/run/zeek.sock, and placing
that in a string wouldn't be totally wrong.

Add tests for IPv6, too.
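A sketch of the new spelling (the options record and field names are
assumptions here):

    event zeek_init()
        {
        local opts = Cluster::WebSocketServerOptions($listen_addr=127.0.0.1,
                                                     $listen_port=8080/tcp);
        Cluster::listen_websocket(opts);
        }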
2025-05-30 11:02:41 +02:00
Arne Welzel
a61aff010f cluster/websocket: Propagate code and reason to websocket_client_lost()
This gives visibility into why ixwebsocket or the
client decided to disconnect.
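A handler might now look like this (parameter names and types assumed):

    event Cluster::websocket_client_lost(endpoint: string, code: count, reason: string)
        {
        print fmt("WebSocket client %s gone: code=%d, reason=%s", endpoint, code, reason);
        }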

Closes #4440
2025-05-13 18:26:03 +02:00
Arne Welzel
aaddeb19ad cluster/websocket: Support configurable ping interval
Primarily for testing purposes; also, the hard-coded 5 seconds may be too
aggressive for some deployments, so it makes sense for this to be
configurable.
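Hypothetically, assuming the interval ends up as a field on the same
server options record:

    event zeek_init()
        {
        local opts = Cluster::WebSocketServerOptions($ping_interval=30sec);
        Cluster::listen_websocket(opts);
        }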
2025-05-13 18:26:03 +02:00
Christian Kreibich
738ce1c235 Bugfix: accurately track Broker buffer overflows w/ multiple peerings
When a node restarts or a peering between two nodes starts over for other
reasons, the internal tracking in the Broker manager resets its state (since
it's per-peering), and thus the message overflow counter. The script layer was
unaware of this, and threw errors when trying to reset the corresponding counter
metric down to zero at sync time.

We now track past buffer overflows via a separate epoch table, using Broker peer
ID comparisons to identify new peerings, and set the counter to the sum of past
and current overflows.

I considered just making this a gauge, but it seems more helpful to be able to
look at a counter to see whether any messages have ever been dropped over the
lifetime of the node process.

As an aside, this now also avoids repeatedly creating the labels vector,
re-using the same one for each metric.

Thanks to @pbcullen for identifying this one!
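A schematic of the epoch bookkeeping in script terms (all names made up;
the real logic lives in the cluster framework):

    # Overflows banked from finished peering epochs, per node name.
    global past_overflows: table[string] of count &default=0;
    # Broker peer ID last seen per node; a new ID means a new peering.
    global epoch_ids: table[string] of string;
    # Last per-epoch overflow count seen, so it can be banked on reset.
    global last_seen: table[string] of count &default=0;

    function total_overflows(node: string, peer_id: string, cur: count): count
        {
        if ( node in epoch_ids && epoch_ids[node] != peer_id )
            past_overflows[node] += last_seen[node];

        epoch_ids[node] = peer_id;
        last_seen[node] = cur;
        return past_overflows[node] + cur;
        }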
2025-05-07 17:27:38 -07:00
Christian Kreibich
68fadd0464 Lower listen/connect retry intervals in Broker and the cluster framework to 1sec
The former defaults (30sec, 1min) can slow down cluster startup and recovery
considerably, and other systems have more aggressive intervals still.
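Expressed as redefs, the new defaults amount to (option names from the
Broker framework, to the best of my knowledge):

    redef Broker::default_listen_retry = 1sec;
    redef Broker::default_connect_retry = 1sec;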
2025-04-25 10:22:35 -07:00
Christian Kreibich
841a40ff88 Switch Broker's default backpressure policy to drop_oldest, bump buffer sizes
At every site where we've dug into backpressure disconnect findings, it has been
the case that the default values were too small. A buffer size of 8192, 4x the
old default of 2048, suffices at every site to eliminate premature disconnects.

With metrics now available for the send buffers regardless of backpressure
overflow policy, this also switches the default from "disconnect" to
"drop_oldest" (for both peers and websockets), meaning that peerings remain
untouched but the oldest queued message simply gets dropped when a new message
is enqueued. With this policy, the number of backpressure overflows is then
simply the count of discarded messages, something that users can tune to see
drop to zero in everyday use.  Another benefit is that marginal overflows cause
less message loss than when an entire buffer's worth (plus potentially more
in-flight messages) gets thrown out with a disconnect.
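Deployments that prefer the old behavior should be able to redef the
options back (option names assumed):

    redef Broker::peer_buffer_size = 2048;
    redef Broker::peer_overflow_policy = "disconnect";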
2025-04-25 10:22:35 -07:00
Christian Kreibich
5008f586ea Deprecate Broker::congestion_queue_size and stop using it internally
Since a reorg in the Broker library (commit b04195183) that revamped flow
control and that we pulled in with Zeek 5.0, this setting hasn't done
anything. Broker's endpoint::make_subscriber() and
endpoint::make_status_subscriber() take a queue size argument (with a default
value) that simply gets dropped in the eventual subscriber::make() call. See:

b041951835 (diff-5c0d2baa7981caeb6a4080708ddca6ad929746d10c73d66598e46d7c2c03c8deL34-R178)
2025-04-25 10:22:35 -07:00
Christian Kreibich
88a0cda8ca Add cluster framework telemetry for Broker's send-buffer use
This hooks into Telemetry::sync() to update Broker-level metrics tracking the
peerings' send buffer state. We do this in the cluster framework so we can label
the resulting metrics with Zeek cluster node names, not Broker's endpoint IDs.
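Schematically (the actual metric updates live in the cluster framework
scripts):

    hook Telemetry::sync()
        {
        # Refresh per-peering send-buffer metrics, labeled by Zeek
        # cluster node name, before they get scraped.
        }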
2025-04-25 09:14:33 -07:00
Christian Kreibich
f5fbad23ff Add peer buffer update tracking to the Broker manager's event_observer
This implements basic tracking of each peering's current fill level, the maximum
level over a recent time interval (via a new Broker::buffer_stats_reset_interval
tunable, defaulting to 1min), and the number of times a buffer overflows. For
the disconnect policy this is the number of depeerings, but for drop_newest and
drop_oldest it implies the number of messages lost.

This doesn't use "proper" telemetry metrics for a few reasons: this tracking is
Broker-specific, so we need to track each peering via endpoint_ids, while we
want the metrics to use Cluster node name labels, and the latter live in the
script layer. Using broker::endpoint_id directly as keys also means we rely on
their ability to hash in STL containers, which should be fast.

This does not track the buffer levels for Broker "clients" (as opposed to
"peers"), i.e. WebSockets, since we currently don't have a way to name these,
and we don't want to use ephemeral Broker IDs in their telemetry.

To make the stats accessible to the script layer the Broker manager (via a new
helper class that lives in the event_observer) maintains a TableVal mapping
Broker IDs to a new BrokerPeeringStats record. The table's members get updated
every time that table is requested. This minimizes new val instantiation and
allows the script layer to customize the BrokerPeeringStats record via redef,
updating fields, etc. Since we can't use Zeek vals outside the main thread, this
requires some care so all table updates happen only in the Zeek-side table
updater, PeerBufferState::GetPeeringStatsTable().
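Since the record is script-extensible, a deployment could add its own
fields, e.g. (namespace and field name made up):

    redef record Broker::BrokerPeeringStats += {
        note: string &optional;
    };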
2025-04-24 22:47:18 -07:00
Arne Welzel
011029addc cluster/websocket: Make websocket dispatcher queue size configurable
Limit the number of WebSocket events queued from external clients to
dispatcher instances, producing back pressure on the clients if
Zeek's IO loop is overloaded.
2025-04-23 14:27:43 +02:00
Arne Welzel
ab25e5d24b broker/main: Reference Cluster::publish() for auto_publish() deprecation
In hindsight, this is the better thing to do and with Zeek 7.2 we should
be confident enough that it'll work.
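A migration sketch (event name made up; the topic constant comes from
the cluster framework):

    global my_event: event(n: count);

    event zeek_init()
        {
        # Previously: Broker::auto_publish(Cluster::manager_topic, my_event);
        # Now, publish explicitly at the point the event is generated:
        Cluster::publish(Cluster::manager_topic, my_event, 42);
        }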
2025-04-23 14:27:43 +02:00
Arne Welzel
a7423104e1 broker/main: Deprecate Broker::listen_websocket()
Optimistically deprecate Broker::listen_websocket() and promote
Cluster::listen_websocket() instead.
2025-04-23 14:27:43 +02:00
Arne Welzel
3d3b7a0759 cluster/Backend: Add ProcessError()
Allow backends to pass errors to a strategy. Locally, these raise
Cluster::Backend::error() events that are logged to the reporter
as errors.
2025-04-23 14:19:08 +02:00
Christian Kreibich
549e678dff Use Broker peering directionality when re-peering after backpressure overflows
This avoids creating pointless connection reattempts to ephemeral TCP
client-side ports, which have been cluttering up the Broker logs since 7.1.
2025-04-21 14:08:42 -07:00
Christian Kreibich
b430d5235c Expand Broker APIs to allow tracking directionality of peering establishment
This provides ways to figure out for a given peer, or a given address/port pair,
whether the local node originally established the peering.
2025-04-21 14:08:42 -07:00
Tim Wojtulewicz
cb1ef47a31 Add STORAGE_ prefixes for backends and serializers 2025-04-14 10:11:13 -07:00
Tim Wojtulewicz
e545fe8256 Ground work for pluggable storage serializers 2025-04-14 10:02:35 -07:00
Arne Welzel
6bc36e8cf8 broker/main: Adapt enum values to agree with comm.bif
Logic to detect this error already existed, but due to enum identifiers
not having a value set, it never triggered before.

Should probably backport this one.
2025-04-04 15:36:42 +02:00
Arne Welzel
14697ea6ba Merge remote-tracking branch 'origin/topic/neverlord/broker-logging'
* origin/topic/neverlord/broker-logging:
  Integrate review feedback
  Hook into Broker logs via its new API
2025-03-31 18:53:43 +02:00
Tim Wojtulewicz
c7015e8250 Split storage.bif file into events/sync/async, add more comments 2025-03-18 10:20:34 -07:00
Tim Wojtulewicz
f40947f6ac Update comments in script files, run zeek-format on all of them 2025-03-18 10:20:34 -07:00
Tim Wojtulewicz
9ed3e33f97 Completely rework return values from storage operations 2025-03-18 10:20:33 -07:00
Tim Wojtulewicz
a485b1d237 Make backend options a record, move actual options to be sub-records 2025-03-18 10:20:33 -07:00
Tim Wojtulewicz
28951dccf1 Split sync and async into separate script-land namespaces 2025-03-18 10:20:33 -07:00
Tim Wojtulewicz
f1a7376e0a Return generic result for get operations that includes error messages 2025-03-18 09:32:34 -07:00
Tim Wojtulewicz
4695060d75 Allow opening and closing backends to be async 2025-03-18 09:32:34 -07:00
Tim Wojtulewicz
7ad6a05f5b Add infrastructure for asynchronous storage operations 2025-03-18 09:32:34 -07:00
Tim Wojtulewicz
d07d27453a Add infrastructure for automated expiration of storage entries
This is used for backends that don't support expiration natively.
2025-03-18 09:32:34 -07:00
Tim Wojtulewicz
8dee733a7d Change args to Storage::put to be a record
The number of args being passed to the put() methods was getting
fairly long, with more on the horizon. Changing to a record simplifies
things a bit.
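Roughly, the call shape becomes (record and field names here are my
assumption, not the final API):

    event zeek_init()
        {
        local args = Storage::PutArgs($key="key1", $value="value1",
                                      $overwrite=T, $expire_time=30sec);
        # ... pass args to the put() call for an open backend.
        }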
2025-03-18 09:32:34 -07:00
Tim Wojtulewicz
69d940533d Pass key/value types for validation when opening backends 2025-03-18 09:32:34 -07:00