Commit graph

3411 commits

Author SHA1 Message Date
Christian Kreibich
ace5c11048 Bugfix: accurately track Broker buffer overflows w/ multiple peerings
When a node restarts or a peering between two nodes starts over for other
reasons, the internal tracking in the Broker manager resets its state (since
it's per-peering), and thus the message overflow counter. The script layer was
unaware of this, and threw errors when trying to reset the corresponding counter
metric down to zero at sync time.

We now track past buffer overflows via a separate epoch table, using Broker peer
ID comparisons to identify new peerings, and set the counter to the sum of past
and current overflows.
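
As a rough illustration of this accounting, here is a Python sketch (not the
actual Zeek script; the class, method, and peer ID strings are all
hypothetical). It assumes the per-peering count restarts at zero whenever the
peer ID changes, and folds the last count of the previous peering into a
running total:

```python
class OverflowTracker:
    """Keeps a monotonic overflow total per node across re-peerings."""

    def __init__(self):
        # node name -> (peer ID last seen,
        #               overflows accumulated in earlier peerings,
        #               most recent per-peering overflow count)
        self._epochs = {}

    def sync(self, node, peer_id, current):
        """Return the counter value to export for `node` at sync time."""
        last_id, past, last_count = self._epochs.get(node, (peer_id, 0, 0))
        if last_id != peer_id:
            # New peering: the per-peering count restarted from zero, so
            # fold the previous peering's final count into the past total.
            past += last_count
        self._epochs[node] = (peer_id, past, current)
        return past + current


tracker = OverflowTracker()
print(tracker.sync("proxy", "id-a", 3))  # first peering
print(tracker.sync("proxy", "id-a", 5))  # same peering, count grew
print(tracker.sync("proxy", "id-b", 2))  # re-peered: 5 past + 2 current
```

The exported value never decreases, which is what a counter metric requires.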

I considered just making this a gauge, but it seems more helpful to be able to
look at a counter to see whether any messages have ever been dropped over the
lifetime of the node process.

As an aside, this now also avoids repeatedly creating the labels vector,
re-using the same one for each metric.

Thanks to @pbcullen for identifying this one!
2025-05-07 17:30:45 -07:00
Christian Kreibich
d9f11643a2 Use Broker peering directionality when re-peering after backpressure overflows
This avoids creating pointless connection reattempts to ephemeral TCP
client-side ports, which have been cluttering up the Broker logs since 7.1.

(cherry picked from commit 549e678dff)
2025-04-29 17:00:50 -07:00
Christian Kreibich
4372cdfe2a Expand Broker APIs to allow tracking directionality of peering establishment
This provides ways to figure out for a given peer, or a given address/port pair,
whether the local node originally established the peering.

(cherry picked from commit b430d5235c)
2025-04-29 17:00:30 -07:00
Christian Kreibich
458b887df1 Lower listen/connect retry intervals in Broker and the cluster framework to 1sec
The former defaults (30sec, 1min) can slow down cluster startup and recovery
considerably, and other systems have more aggressive intervals still.

(cherry picked from commit 68fadd0464)
2025-04-29 16:47:13 -07:00
Christian Kreibich
446f49e6bc Switch Broker's default backpressure policy to drop_oldest, bump buffer sizes
At every site where we've dug into backpressure disconnect findings, the
default buffer sizes turned out to be too small. A size of 8192, 4x the old
default, suffices at every one of those sites to eliminate premature disconnects.

With metrics now available for the send buffers regardless of backpressure
overflow policy, this also switches the default from "disconnect" to
"drop_oldest" (for both peers and websockets), meaning that peerings remain
untouched but the oldest queued message simply gets dropped when a new message
is enqueued. With this policy, the number of backpressure overflows is then
simply the count of discarded messages, something that users can tune to see
drop to zero in everyday use.  Another benefit is that marginal overflows cause
less message loss than when an entire buffer's worth (plus potentially more
in-flight messages) gets thrown out with a disconnect.
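
To make the drop_oldest behavior concrete, here is a minimal Python sketch of
a bounded send buffer. It is not Broker's implementation, and the class and
attribute names are hypothetical. It only shows that each overflow discards
exactly one message, the oldest one:

```python
from collections import deque


class DropOldestBuffer:
    """Bounded FIFO that discards the oldest entry when full."""

    def __init__(self, capacity):
        self._queue = deque()
        self._capacity = capacity
        self.overflows = 0  # number of discarded messages

    def enqueue(self, msg):
        if len(self._queue) >= self._capacity:
            # Buffer is full: drop the oldest queued message, keep the peering.
            self._queue.popleft()
            self.overflows += 1
        self._queue.append(msg)

    def contents(self):
        return list(self._queue)


buf = DropOldestBuffer(capacity=3)
for msg in ["a", "b", "c", "d", "e"]:
    buf.enqueue(msg)
print(buf.contents(), buf.overflows)
```

Here two messages ("a" and "b") are lost, whereas a disconnect would have
discarded the entire buffer.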

(cherry picked from commit 841a40ff88)
2025-04-29 16:47:13 -07:00
Christian Kreibich
8b9b16d7a8 Add cluster framework telemetry for Broker's send-buffer use
This hooks into Telemetry::sync() to update Broker-level metrics tracking the
peerings' send buffer state. We do this in the cluster framework so we can label
the resulting metrics with Zeek cluster node names, not Broker's endpoint IDs.

(cherry picked from commit 88a0cda8ca)
2025-04-29 15:19:38 -07:00
Christian Kreibich
d5bbf05a32 Add peer buffer update tracking to the Broker manager's event_observer
This implements basic tracking of each peering's current fill level, the maximum
level over a recent time interval (via a new Broker::buffer_stats_reset_interval
tunable, defaulting to 1min), and the number of times a buffer overflows. For
the disconnect policy this is the number of depeerings, but for drop_newest and
drop_oldest it implies the number of messages lost.
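
The three tracked quantities can be sketched in Python as follows. This is an
illustration only: the names are hypothetical, and the exact way the real
tracker resets its maximum after the interval is an assumption here.

```python
class PeerBufferStats:
    """Per-peering buffer stats: fill level, recent maximum, overflows."""

    def __init__(self, reset_interval=60.0):
        self.reset_interval = reset_interval  # seconds; 1min by default
        self.current = 0       # current fill level
        self.recent_max = 0    # maximum level within the current window
        self.overflows = 0     # number of buffer overflows
        self._window_start = 0.0

    def update(self, now, fill_level, overflowed=False):
        if now - self._window_start >= self.reset_interval:
            # Assumed behavior: start a fresh window for the maximum.
            self.recent_max = 0
            self._window_start = now
        self.current = fill_level
        self.recent_max = max(self.recent_max, fill_level)
        if overflowed:
            self.overflows += 1


stats = PeerBufferStats()
stats.update(now=0.0, fill_level=10)
stats.update(now=10.0, fill_level=40)
stats.update(now=20.0, fill_level=5)
print(stats.current, stats.recent_max, stats.overflows)
```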

This doesn't use "proper" telemetry metrics for a few reasons: this tracking is
Broker-specific, so we need to track each peering via endpoint_ids, while we
want the metrics to use Cluster node name labels, and the latter live in the
script layer. Using broker::endpoint_id directly as keys also means we rely on
their ability to hash in STL containers, which should be fast.

This does not track the buffer levels for Broker "clients" (as opposed to
"peers"), i.e. WebSockets, since we currently don't have a way to name these,
and we don't want to use ephemeral Broker IDs in their telemetry.

To make the stats accessible to the script layer the Broker manager (via a new
helper class that lives in the event_observer) maintains a TableVal mapping
Broker IDs to a new BrokerPeeringStats record. The table's members get updated
every time that table is requested. This minimizes new val instantiation and
allows the script layer to customize the BrokerPeeringStats record by redefing,
updating fields, etc. Since we can't use Zeek vals outside the main thread, this
requires some care so all table updates happen only in the Zeek-side table
updater, PeerBufferState::GetPeeringStatsTable().

(cherry picked from commit f5fbad23ff)
2025-04-29 15:08:05 -07:00
Christian Kreibich
90ecf7ff0d Add backpressure disconnect notification to cluster.log and via telemetry
This adds a Broker-specific script to the cluster framework, loaded only when
Zeek is running in cluster mode. It adds logging in cluster.log as well as
telemetry via a metrics counter for Broker-observed backpressure disconnects.

The new zeek_broker_backpressure_disconnects counter, labeled by the neighboring
peer that the reporting node has determined to be unresponsive, counts the
number of unpeerings for this reason.

Here the node "worker" has observed node "proxy" falling behind once:

# HELP zeek_broker_backpressure_disconnects_total Number of Broker peering drops due to a neighbor falling too far behind in message I/O
# TYPE zeek_broker_backpressure_disconnects_total counter
zeek_broker_backpressure_disconnects_total{endpoint="worker",peer="proxy"} 1

Includes small btest baseline update to reflect @load of a new script.

(cherry picked from commit ead6134501)
2025-04-08 15:09:44 -07:00
Christian Kreibich
67f135f57a Remove unneeded @loads from base/misc/version.zeek
This module is loaded by the telemetry framework, which we're now loading via
the cluster framework, i.e. also in bare mode. The resulting additional
thread (for creating reporter.log) trips up a number of btest baselines.

version.zeek doesn't use any of the string helper functions.

(cherry picked from commit d260a5b7a9)
2025-04-08 15:09:44 -07:00
Christian Kreibich
06fa47e21d Add Cluster::nodeid_to_node() helper function
This translates backend-specific node identifiers (like Broker IDs) to
cluster nodes and their names, if available.

(cherry picked from commit 46a11ec37d)
2025-04-08 15:09:44 -07:00
Christian Kreibich
1cbbbc5c40 Support re-peering with Broker peers that fall behind
This adds re-peering at the Broker level for peers that Broker decided to
unpeer. We keep this at the Broker level since this behavior is specific to
it (as opposed to other cluster backends).

Includes baseline updates for btests that pick up on the new script's @load.

(cherry picked from commit 0010e65f6d)
2025-04-08 15:09:44 -07:00
Dominik Charousset
eeb0e7184d Add Zeek-level configurability of Broker slow-peer disconnects
(cherry picked from commit 4c4eb4b8e2)
2025-04-08 15:09:44 -07:00
Christian Kreibich
11701d4734 No need to namespace Cluster:: functions in their own namespace
(cherry picked from e81856a4af)
2025-04-08 14:50:50 -07:00
Christian Kreibich
2ad80f8fb2 Telemetry framework: move BIFs to the primary-bif stage
This moves the Telemetry framework's BIF-defined functionality from the
secondary-BIFs stage to the primary one. That is, this functionality is now
available from the end of init-bare.zeek, not only after the end of
init-frameworks-and-bifs.zeek.

This allows us to use script-layer telemetry in Zeek's own code that gets
pulled in during init-frameworks-and-bifs.

This change splits up the BIF features into functions, constants, and types,
because that's the granularity most workable in Func.cc and NetVar. It also now
defines the Telemetry::MetricsType enum once, not redundantly in BIFs and script
layer.

Due to subtle load ordering issues between the telemetry and cluster frameworks
this pushes the redef stage of Telemetry::metrics_port and address into
base/frameworks/telemetry/options.zeek, which is loaded sufficiently late in
init-frameworks-and-bifs.zeek to sidestep those issues. (When not doing this,
the effect is that the redef in telemetry/main.zeek doesn't yet find the
cluster-provided values, and Zeek does not end up listening on these ports.)

The need to add basic Zeek headers in script_opt/ZAM/ZBody.cc as a side-effect
of this is curious, but looks harmless.

Also includes baseline updates for the usual btests and adds a few doc strings.

(cherry picked from commit 71f7e89974)
2025-04-08 14:50:45 -07:00
Christian Kreibich
5503688758 Minor comment tweaks for init-frameworks-and-bifs.zeek
(cherry picked from acdd7a7934)
2025-04-08 14:50:28 -07:00
Tim Wojtulewicz
c30b835a14 Update mozilla-ca-list.zeek and ct-list.zeek to NSS 3.109 2025-03-18 17:59:01 -07:00
Tim Wojtulewicz
ed081212ae Merge remote-tracking branch 'origin/topic/timw/vntag-in-vlan'
* origin/topic/timw/vntag-in-vlan:
  Add analyzer registration from VLAN to VNTAG

(cherry picked from commit cb5e3d0054)
2025-03-18 16:18:13 -07:00
Arne Welzel
43ab74b70f Merge branch 'sqli-spaces-encode-to-plus' of https://github.com/cooper-grill/zeek
* 'sqli-spaces-encode-to-plus' of https://github.com/cooper-grill/zeek:
  account for spaces encoding to plus signs in sqli regex detection

(cherry picked from commit 5200b84fb3)
2024-11-19 09:33:22 -07:00
Arne Welzel
056b70bd2d Merge remote-tracking branch 'origin/topic/awelzel/community-id-new-connection'
* origin/topic/awelzel/community-id-new-connection:
  policy/community-id: Populate conn$community_id in new_connection()

(cherry picked from commit d3579c1f34)
2024-11-14 12:15:27 -07:00
Tim Wojtulewicz
88c37d0be8 Merge remote-tracking branch 'origin/topic/awelzel/3936-pop3-and-redis'
* origin/topic/awelzel/3936-pop3-and-redis:
  pop3: Remove unused headers
  pop3: Prevent unbounded state growth
  btest/pop3: Add somewhat more elaborate testing

(cherry picked from commit 702fb031a4)
2024-09-23 11:12:54 -07:00
Robin Sommer
15be682f63 Merge remote-tracking branch 'origin/topic/robin/gh-3881-spicy-ports'
* origin/topic/robin/gh-3881-spicy-ports:
  Spicy: Register well-known ports through an event handler.
  Revert "Remove deprecated port/ports fields for spicy analyzers"

(cherry picked from commit a2079bcda6)
2024-08-30 13:26:16 -07:00
Arne Welzel
6f65b88f1b Merge remote-tracking branch 'origin/topic/awelzel/ldap-extended-request-response-starttls'
* origin/topic/awelzel/ldap-extended-request-response-starttls:
  ldap: Add heuristic for wrap tokens
  ldap: Ignore ec/rrc for sealed wrap tokens
  ldap: Add LDAP sample with SASL-SRP mechanism
  ldap: Reintroduce encryption after SASL heuristic
  ldap: Fix assuming GSS-SPNEGO for all bindResponses
  ldap: Implement extended request/response and StartTLS support

(cherry picked from commit 6a6a5c3d0d)
2024-08-30 11:47:08 -07:00
Arne Welzel
0fd6672dde Merge branch 'fix-http-password-capture' of https://github.com/p-l-/zeek
* 'fix-http-password-capture' of https://github.com/p-l-/zeek:
  http: fix password capture when enabled

(cherry picked from commit c27e18631c)
2024-08-30 11:34:24 -07:00
Tim Wojtulewicz
dd4597865a Merge remote-tracking branch 'origin/topic/timw/telemetry-threading'
* origin/topic/timw/telemetry-threading:
  Process metric callbacks from the main-loop thread

(cherry picked from commit 3c3853dc7d)
2024-08-30 11:29:17 -07:00
Tim Wojtulewicz
746ae4d2cc Merge remote-tracking branch 'origin/topic/johanna/update-the-ct-list-and-the-ca-list-again'
* origin/topic/johanna/update-the-ct-list-and-the-ca-list-again:
  Update Mozilla CA list and CT list

(cherry picked from commit cb88f6316c)
2024-07-23 08:55:11 -07:00
Arne Welzel
8014c4b8c3 telemetry: Deprecate prometheus.zeek policy script
With Cluster::Node$metrics_port being optional, there's not really
a need for the extra script. New rule: if a metrics_port is set, the
node will attempt to listen on it.

Users can still redef Telemetry::metrics_port *after*
base/frameworks/telemetry was loaded to change the port defined
in cluster-layout.zeek.

(cherry picked from commit bf9704f339)
2024-07-23 10:05:46 +02:00
Jan Grashoefer
0c06c604ab Add logging of disabled analyzers to analyzer.log 2024-07-09 18:22:43 +02:00
Christian Kreibich
8a4fb0ee19 Management framework: augment deployed configs with instance IP addresses
The controller learns IP addresses from agents that peer with it, but that
information has so far gotten lost when resulting configs get pushed out to the
agents. This makes these updates include that information.
2024-07-08 23:05:24 -07:00
Christian Kreibich
742f7fe340 Management framework: add auto-enumeration of metrics ports
This is quite redundant with the enumeration for Broker ports,
unfortunately. But the logic is subtly different: all nodes obtain a telemetry
port, while not all nodes require a Broker port, for example, and in the metrics
port assignment we also cross-check selected Broker ports. I found more unified
code actually harder to read in the end.

The logic for the two sets remains the same: from a start point, ports get
enumerated sequentially that aren't otherwise taken. These ports are assumed
available; there's nothing that checks their availability -- for now.
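
The enumeration described above can be sketched in Python like this. The
function name, node names, and the set of already-taken ports are hypothetical;
only the start port of 9000 comes from this change.

```python
def enumerate_ports(nodes, start, taken):
    """Assign each node the next sequential port not otherwise taken."""
    assigned = {}
    used = set(taken)
    port = start
    for node in nodes:
        # Skip ports that are already in use, e.g. selected Broker ports.
        while port in used:
            port += 1
        assigned[node] = port
        used.add(port)
        port += 1
    return assigned


broker_ports = {9000, 9002}  # hypothetical ports already selected for Broker
print(enumerate_ports(["manager", "logger", "worker-1"], 9000, broker_ports))
```

As in the change itself, nothing here checks whether a port is actually free
on the host; the ports are simply assumed to be available.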

The default start port is 9000. I considered 9090, to align with the Prometheus
default, but counting upward from there is likely to hit trouble with the Broker
default ports (9999/9997), used by the Supervisor. Counting downward is a bit
unnatural, and shifting the Broker default ports brings subtle ordering issues.

This also changes the node ordering logic slightly since it seems more intuitive
to keep sequential ports on a given instance, instead of striping across them.
2024-07-08 23:05:24 -07:00
Christian Kreibich
fa6361af56 Management framework: propagate metrics port from agent
This propagates the metrics port from the node config passed through the
supervisor all the way into the script layer.
2024-07-08 23:05:24 -07:00
Christian Kreibich
563704a26e Management framework: add metrics port in management & Supervisor node records
This allows setting a metrics port when creating new nodes.
2024-07-08 23:05:24 -07:00
Christian Kreibich
3ecacf4f50 Comment-only tweaks for telemetry-related settings.
These weren't quite accurate any more.
2024-07-08 23:05:24 -07:00
Christian Kreibich
737b1a2013 Remove the Supervisor's internal ClusterEndpoint struct.
This eliminates one place in which we currently need to mirror changes to the
script-land Cluster::Node record. Instead of keeping an exact in-core equivalent, the
Supervisor now treats the data structure as opaque, and stores the whole cluster
table as a JSON string.

We may replace the script-layer Supervisor::ClusterEndpoint in the future, using
Cluster::Node directly. But that's a more invasive change that will affect how
people invoke Supervisor::create() and similar functions.

Relying on JSON for serialization has the side-effect of removing the
Supervisor's earlier quirk of using 0/tcp, not 0/unknown, to indicate unused
ports in the Supervisor::ClusterEndpoint record.
2024-07-02 14:52:17 -07:00
Christian Kreibich
a98ec6b08b Provide a script-layer equivalent to Supervisor::__init_cluster().
If the script layer is able to access the current node's config via
Supervisor::node(), it can handle populating Cluster::nodes. That code
is much more straightforward than an equivalent in-core implementation
(especially with the upcoming change to the cluster table's implementation).
This introduces base/frameworks/cluster/supervisor.zeek and
Cluster::Supervisor::__init_cluster_nodes() for that purpose.

The @load of the Supervisor API in cluster/main.zeek isn't technically
necessary since we already load it explicitly even in init-bare.zeek,
but being explicit seems better.
2024-07-02 14:52:13 -07:00
Robin Sommer
4fc57294f1 Spicy: Provide runtime API to access Zeek-side globals.
This allows reading Zeek global variables from inside Spicy code. The
main challenge here is supporting all of Zeek's data types in a
type-safe manner.

The most straightforward API is a set of functions
`get_<type>(<id>)`, where `<type>` is the Zeek-side type
name (e.g., `count`, `string`, `bool`) and `<id>` is the fully scoped
name of the Zeek-side global (e.g., `MyModule::Boolean`). These
functions then return the corresponding Zeek value, converted into an
appropriate Spicy type. Example:

    Zeek:
        module Foo;

        const x: count = 42;
        const y: string = "xxx";

    Spicy:
        import zeek;

        assert zeek::get_count("Foo::x") == 42;
        assert zeek::get_string("Foo::y") == b"xxx"; # returns bytes(!)

For container types, the `get_*` functions return an opaque type that
can be used to access the containers' values. An additional set of
functions `as_<type>` allows converting opaque values of atomic
types to Spicy equivalents. Example:

    Zeek:
        module Foo;

        const s: set[count] = { 1, 2 };
        const t: table[count] of string = { [1] = "One", [2] = "Two" };

    Spicy:
        import zeek;

        # Check set membership.
        local set_ = zeek::get_set("Foo::s");
        assert zeek::set_contains(set_, 1) == True;

        # Look up table element.
        local table_ = zeek::get_table("Foo::t");
        local value = zeek::table_lookup(table_, 1);
        assert zeek::as_string(value) == b"One";

There are also functions for accessing elements of Zeek-side vectors
and records.

If any of these `zeek::*` conversion functions fails (e.g., due to a
global of that name not existing), it will throw an exception.

Design considerations:

    - We support only reading Zeek variables, not writing. This is
      both to simplify the API, and also conceptually to avoid
      offering backdoors into Zeek state that could end up with a very
      tight coupling of Spicy and Zeek code.

    - We accept that a single access might be relatively slow due to
      name lookup and data conversion. This is primarily meant for
      configuration-style data, not for transferring lots of dynamic
      state over.

    - In that spirit, we don't support deep-copying complex data types
      from Zeek over to Spicy. This is (1) to avoid performance
      problems when accidentally copying large containers over,
      potentially even at every access; and (2) to avoid the two sides
      getting out of sync if one ends up modifying a container without
      the other being able to see it.
2024-06-20 12:02:54 +02:00
Robin Sommer
93dd9d6797 Spicy: Reformat zeek.spicy with spicy-format. 2024-06-19 10:22:36 +02:00
Tim Wojtulewicz
d549e3d56a Add Telemetry::metrics_address option 2024-06-07 09:28:27 -07:00
Tim Wojtulewicz
99e64aa113 Restore label_names field in MetricOpts record 2024-06-04 14:14:58 -07:00
Tim Wojtulewicz
433c257886 Move telemetry label names out of opts records, into main metric records 2024-06-04 14:14:58 -07:00
Tim Wojtulewicz
87717fed0a Remove prefix column from telemetry.log 2024-06-04 14:14:58 -07:00
Tim Wojtulewicz
93717ca8f8 Remove is_sum arguments from counters and gauges 2024-05-31 13:36:37 -07:00
Tim Wojtulewicz
46ff48c29a Change all instruments to only handle doubles 2024-05-31 13:36:37 -07:00
Tim Wojtulewicz
e3e806ca23 Remove all of the ZEEK_METRICS_ environment variables 2024-05-31 13:36:37 -07:00
Tim Wojtulewicz
635198793d Fix header comments in scripts/policy/frameworks/telemetry/prometheus.zeek 2024-05-31 13:36:37 -07:00
Tim Wojtulewicz
9fb952a5f3 Regenerate docs [nomail] 2024-05-31 13:30:32 -07:00
Tim Wojtulewicz
53c3d2032a Remove the is_sum argument from BIF histogram creation methods 2024-05-31 13:30:31 -07:00
Tim Wojtulewicz
4361880e09 Remove Telemetry::metrics_export_prefixes option 2024-05-31 13:30:31 -07:00
Tim Wojtulewicz
e195d3d778 Fix some determinism issues with btests 2024-05-31 13:30:31 -07:00
Tim Wojtulewicz
017ee4509c Update telemetry log policy due to the fact that unit will not be filled in anymore 2024-05-31 13:30:31 -07:00
Tim Wojtulewicz
84aa308527 Rework everything to access the prometheus-cpp objects more directly 2024-05-31 13:30:31 -07:00