This implements basic tracking of each peering's current fill level, the maximum
level over a recent time interval (via a new Broker::buffer_stats_reset_interval
tunable, defaulting to 1min), and the number of times a buffer overflows. For
the disconnect policy this is the number of depeerings, but for drop_newest and
drop_oldest it implies the number of messages lost.
This doesn't use "proper" telemetry metrics for a few reasons: this tracking is
Broker-specific, so we need to track each peering via endpoint_ids, while we
want the metrics to use Cluster node name labels, and the latter live in the
script layer. Using broker::endpoint_id directly as keys also means we rely on
their ability to hash in STL containers, which should be fast.
This does not track the buffer levels for Broker "clients" (as opposed to
"peers"), i.e. WebSockets, since we currently don't have a way to name these,
and we don't want to use ephemeral Broker IDs in their telemetry.
To make the stats accessible to the script layer the Broker manager (via a new
helper class that lives in the event_observer) maintains a TableVal mapping
Broker IDs to a new BrokerPeeringStats record. The table's members get updated
every time that table is requested. This minimizes new val instantiation and
allows the script layer to customize the BrokerPeeringStats record by redefing,
updating fields, etc. Since we can't use Zeek vals outside the main thread, this
requires some care so all table updates happen only in the Zeek-side table
updater, PeerBufferState::GetPeeringStatsTable().
This also includes some test baseline updates, due to recent QUIC
changes.
* origin/master: (39 commits)
Update doc submodule [nomail] [skip ci]
Bump cluster testsuite to pull in resilience to agent connection timing [skip ci]
IPv6 support for detect-external-names and testcase
Add `skip_resp_host_port_pairs` option.
util/init_random_seed: write_file implies deterministic
external/subdir-btest.cfg: Set OPENSSL_ENABLE_SHA1_SIGNATURES=1
btest/x509_verify: Drop OpenSSL 1.0 hack
testing/btest: Use OPENSSL_ENABLE_SHA1_SIGNATURES
Add ZAM baseline for new scripts.base.protocols.quic.analyzer-confirmations btest
QUIC/decrypt_crypto: Rename all_data to data
QUIC: Confirm before forwarding data to SSL
QUIC: Parse all QUIC packets in a UDP datagram
QUIC: Only slurp till packet end, not till &eod
Remove unused SupervisedNode::InitCluster declaration
Update doc submodule [nomail] [skip ci]
Bump cluster testsuite to pull in updated Prometheus tests
Make enc_part value from kerberos response available to scripts
Management framework: move up addition of agent IPs into deployable cluster configs
Support multiple instances per host addr in auto metrics generation
When auto-generating metrics ports for worker nodes, get them more uniform across instances.
...
This changes service set in the connection record, and thus also the
conn.log service field to being ordered. Speficically, the order of the
entries in the service field will be the same order in which protocols
will be confirmed. This means that it now is possible to see which
protocols were layered over each other in which order by looking at the
respective conn.log entry.
This commit revamps the handling of analyzer violations that happen
before an analyzer confirms the protocol.
The current state is that an analyzer is disabled after 5 violations, if
it has not been confirmed. If it has been confirmed, it is disabled
after a single violation.
The reason for this is a historic mistake. In Zeek up to versions 1.5,
analyzers were unconditianally removed when they raised the first
protocol violation.
When this script was ported to the new layout for Zeek 2.0 in
b4b990cfb5, a logic error was introduced
that caused analyzers to no longer be disabled if they were not
confirmed.
This was the state for ~8 years, till the DPD::max_violations options
was added, which instates the current approach of disabling unconfirmed
analyzers after 5 violations. Sadly, there is not much discussion about
this change - from my hazy memory, I think this was discovered during
performance tests and the new behavior was added without checking into
the history of previous changes.
This commit reinstates the originally intended behavior of DPD. When an
analyzer that has not been confirmed raises a protocol violation, it is
immediately removed from the connection. This also makes a lot of sense
- this allows the analyzer to be in a "tasting" phase at the beginning
of the connection, and to error out quickly once it realizes that it was
attached to a connection not containing the desired protocol.
This change also removes the DPD::max_violations option, as it no longer
serves any purpose after this change. (In practice, the option remains
with an &deprecated warning, but it is no longer used for anything).
There are relatively minimal test-baseline changes due to this; they are
mostly triggered by the removal of the data structure and by less
analyzer errors being thrown, as unconfirmed analyzers are disabled
after the first error.
This fixes instances where `zeek:see` was used incorrectly so it was not
rendered correctly. All these instances have been found by looking for
`zeek:see` in the generated HTML where it should not be visible anymore.
I also removed a doc reference to `paraglob_add` which never existed.
* topic/christian/telemetry-make-bifs-primary:
Telemetry framework: move BIFs to the primary-bif stage
Minor comment tweaks for init-frameworks-and-bifs.zeek
This stops invoking Telemetry::sync() via a scheduled event and instead
only invokes it on-demand. This makes metric collection network time
independent and lazier, too.
With Prometheus scrape requests being processed on Zeek's main thread
now, we can safely invoke the script layer Telemetry::sync() hook.
Closes#3947
This moves the Telemetry framework's BIF-defined functionalit from the
secondary-BIFs stage to the primary one. That is, this functionality is now
available from the end of init-bare.zeek, not only after the end of
init-frameworks-and-bifs.zeek.
This allows us to use script-layer telemetry in our Zeek's own code that get
pulled in during init-frameworks-and-bifs.
This change splits up the BIF features into functions, constants, and types,
because that's the granularity most workable in Func.cc and NetVar. It also now
defines the Telemetry::MetricsType enum once, not redundantly in BIFs and script
layer.
Due to subtle load ordering issues between the telemetry and cluster frameworks
this pushes the redef stage of Telemetry::metrics_port and address into
base/frameworks/telemetry/options.zeek, which is loaded sufficiently late in
init-frameworks-and-bifs.zeek to sidestep those issues. (When not doing this,
the effect is that the redef in telemetry/main.zeek doesn't yet find the
cluster-provided values, and Zeek does not end up listening on these ports.)
The need to add basic Zeek headers in script_opt/ZAM/ZBody.cc as a side-effect
of this is curious, but looks harmless.
Also includes baseline updates for the usual btests and adds a few doc strings.
Log flushing is currently triggered based on the threading heartbeat timer
of WriterBackends and the hard-coded WRITE_BUFFER_SIZE 1000.
This change introduces a separate timer that is managed by the logger
manager instead of piggy-backing on the heartbeat timer, as well as a
const &redef for the buffer size.
This allows to modify the log flush frequency and batch size independently
of the threading heartbeat interval. Later, this will allow to re-use the
buffering and flushing logic of writer frontends for non-Broker cluster
backends, too.
One change here is that even frontends that do not have a backend will
be flushed regularly. This is wanted for non-Broker backends and should be
very cheap. Possibly, Broker can piggy back on this timer down the road, too,
rather than using its own script-level timer (see Broker::log_flush()).
The cmds list may grow unbounded due to the POP3 analyzer being in
multiLine mode after seeing `AUTH` in a Redis connection, but never
a `.` terminator. This can easily be provoked by the Redis ping
command.
This adds two heuristics: 1) Forcefully process the oldest commands in
the cmds list and cap it at max_pending_commands. 2) Start raising
analyzer violations if the client has been using more than
max_unknown_client_commands commands (default 10).
Closes#3936
This avoids the callbacks from being processed on the worker thread
spawned by Civetweb. It fixes data race issues with lookups involving
global variables, amongst other threading issues.
The Spicy analyzer is added as a child analyzer when enabled and the
WebSocket.cc logic dispatches between the BinPac and Spicy version.
It substantially slower when tested against a somewhat artificial
2.4GB PCAP. The first flamegraph indicates that the unmask() function
stands out with 35% of all samples, and above it shared_ptr samples.
Don't log them, they are random and arbitrary in the normal case. Users
can do the following to log them if wanted.
redef += WebSocket::Info$client_key += { &log };
redef += WebSocket::Info$server_accept += { &log };
This adds a new WebSocket analyzer that is enabled with the HTTP upgrade
mechanism introduced previously. It is a first implementation in BinPac with
manual chunking of frame payload. Configuration of the analyzer is sketched
via the new websocket_handshake() event and a configuration BiF called
WebSocket::__configure_analyzer(). In short, script land collects WebSocket
related HTTP headers and can forward these to the analyzer to change its
parsing behavior at websocket_handshake() time. For now, however, there's
no actual logic that would change behavior based on agreed upon extensions
exchanged via HTTP headers (e.g. frame compression). WebSocket::Configure()
simply attaches a PIA_TCP analyzer to the WebSocket analyzer for dynamic
protocol detection (or a custom analyzer if set). The added pcaps show this
in action for tunneled ssh, http and https using wstunnel. One test pcap is
Broker's WebSocket traffic from our own test suite, the other is the
Jupyter websocket traffic from the ticket/discussion.
This commit further adds a basic websocket.log that aggregates the WebSocket
specific headers (Sec-WebSocket-*) headers into a single log.
Closes#3424
OSS-Fuzz managed to produce a MIME multipart message construction with
thousands of nested entities (or that's what Zeek makes out of it anyhow).
Prevent such deep analysis by capping at a nesting depth of 100,
preventing unnecessary resource usage. A new weird named exceeded_mime_max_depth
is reported when this limit is reached.
This change reduces the runtime of the OSS-Fuzz reproducer from ~45 seconds
to ~2.5 seconds.
The test PCAP was produced from a Python script using the email package
and sending the rendered version via POST to a HTTP server.
Closes#208
* origin/topic/christian/mmdb-configurability:
Modernize various C++/Zeek-isms in the MMDB code.
Fix MMDB code to re-open explicitly opened DBs correctly
Add btest to verify behavior of re-opened MMDBs opened directly via BIFs
Simplify MMDB code by moving more lookup functionality into MMDB class
Move MMDB logic out of mmdb.bif and into MMDB.cc/h.
Fix mmdb.temporary-error testcase when MMDBs are installed on system
Adapt MMDB BiF code to new script-layer variables
Update btest baselines to reflect introduction of mmdb.bif
Move MaxMind/GeoIP BiF functionality into separate file
Provide script-level configurability of MaxMind DB placement on disk
Sort toplevel .bif list in CMakeLists