
.. _histogram_quantile(): https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile
.. _Prometheus: https://prometheus.io
.. _Prometheus Getting Started Guide: https://prometheus.io/docs/prometheus/latest/getting_started/
.. _Prometheus Metric Types: https://prometheus.io/docs/concepts/metric_types/
.. _Prometheus HTTP Service Discovery: https://prometheus.io/docs/prometheus/latest/http_sd/
.. _prometheus-cpp: https://github.com/jupp0r/prometheus-cpp

.. _framework-telemetry:

===================
Telemetry Framework
===================

.. note::

   This framework changed considerably with Zeek 7, and is not API-compatible
   with earlier versions.  While earlier versions relied on an implementation
   in :ref:`Broker <broker-framework>`, Zeek now maintains its
   own implementation, building on `prometheus-cpp`_, with Broker adding its
   telemetry to Zeek's internal registry of metrics.

The telemetry framework continuously collects metrics during Zeek's operation,
and provides ways to export this telemetry to third-party consumers. Zeek ships
with a pre-defined set of metrics and allows you to add your own, via
script-layer and in-core APIs you use to instrument relevant parts of the
code. Metrics target Zeek's operational behavior, or track characteristics of
monitored traffic. Metrics are not an additional export vehicle for Zeek's
various regular logs. Zeek's telemetry data model closely resembles that of
`Prometheus`_, and supports its text-based exposition format for scraping by
third-party collectors.

This section outlines usage examples, and gives brief API examples for
composing your own metrics. Head to the :zeek:see:`Telemetry` API documentation
for more details.

Metric Types
============

Zeek supports the following metric types:

  Counter
    A continuously increasing value, resetting on process restart.
    Examples of counters are the number of log writes since process start,
    packets processed, or ``process_seconds`` representing CPU usage.

  Gauge
    A gauge metric is a numerical value that can increase and decrease
    over time. Examples are table sizes or the :zeek:see:`val_footprint`
    of Zeek script values over the lifetime of the process. More general
    examples include a temperature or memory usage.

  Histogram
    Pre-configured buckets of observations with corresponding counts.
    Examples of histograms are connection durations, delays, or transfer
    sizes. Generally, it is useful to know the expected range and distribution
    of such values, as the bounds of a histogram's buckets are defined when
    this metric gets created.

Zeek uses :zeek:type:`double` throughout to track metric values. Since
terminology around telemetry can be complex, it helps to know a few additional
terms:

  Labels
    A given metric sometimes doesn't exist in isolation, but comes with
    additional labeling to disambiguate related observations. For example, Zeek
    ships with a gauge called ``zeek_active_sessions`` that labels counts for TCP,
    UDP, and other transport protocols separately. Labels have a name (for
    example, "protocol") that refers to a value (such as "tcp"). A metric can have
    multiple labels. Labels are thus a way to associate textual information with
    the numerical values of metrics.

  Family
    The set of such metrics, differing only by their labeling, is known as a
    Family. Zeek's script-layer metrics API lets you operate on individual
    metrics and families.

Zeek has no equivalent to Prometheus's Summary type. A good reference to
consult for more details is the official `Prometheus Metric Types`_
documentation.

Cluster Considerations
======================

When running Zeek as a cluster, every node maintains its own metrics registry,
independently of the other nodes. Zeek does not automatically synchronize,
centralize, or aggregate metrics across the cluster. Instead, it adds the name
of the node a particular metric originated from at collection time, leaving any
aggregation to post-processing where desired.

.. note::

   This is a departure from the design in earlier versions of Zeek, which could
   (either by default, or after activation) centralize metrics in the cluster's
   manager node.

Accordingly, the :zeek:see:`Telemetry::collect_metrics` and
:zeek:see:`Telemetry::collect_histogram_metrics` functions only return
node-local metrics.
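
Because aggregation is left to post-processing, a consumer has to combine the
per-node values itself. The following Python sketch illustrates one way to do
so: it sums samples that differ only in their node label. The sample values and
the node names ``worker-1`` and ``worker-2`` are made up for illustration, and
the helper ``sum_across_nodes`` is hypothetical rather than part of any Zeek or
Prometheus API. It assumes the node name is carried in an ``endpoint`` label, as
in the Prometheus output shown elsewhere in this document.

.. code-block:: python

   from collections import defaultdict

   # Hypothetical samples as (metric name, labels, value) tuples, e.g. gathered
   # by scraping each node separately. Node names and values are made up.
   samples = [
       ("zeek_active_sessions", {"endpoint": "worker-1", "protocol": "tcp"}, 120.0),
       ("zeek_active_sessions", {"endpoint": "worker-2", "protocol": "tcp"}, 80.0),
       ("zeek_active_sessions", {"endpoint": "worker-1", "protocol": "udp"}, 30.0),
       ("zeek_active_sessions", {"endpoint": "worker-2", "protocol": "udp"}, 45.0),
   ]


   def sum_across_nodes(samples, node_label="endpoint"):
       """Sum the values of samples that differ only in the node label."""
       totals = defaultdict(float)
       for name, labels, value in samples:
           # Drop the node label; the remaining labels identify the series.
           rest = tuple(sorted((k, v) for k, v in labels.items() if k != node_label))
           totals[(name, rest)] += value
       return dict(totals)


   cluster_totals = sum_across_nodes(samples)
   for (name, labels), value in sorted(cluster_totals.items()):
       print(name, dict(labels), value)

When the metrics are scraped by a Prometheus server, the same aggregation can
be expressed in PromQL, for example
``sum without (endpoint) (zeek_active_sessions)``.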

Metrics Export
==============

Zeek supports two mechanisms for exporting telemetry: traditional logs, and
Prometheus-compatible endpoints for scraping by a third-party service. We cover
them in turn.

Zeek Logs
---------

Zeek can export current metrics continuously via :file:`telemetry.log` and
:file:`telemetry_histogram.log`. It does not do so by default. To enable, load the
policy script ``frameworks/telemetry/log`` on the command line, or via
``local.zeek``.

The :zeek:see:`Telemetry::Info` and :zeek:see:`Telemetry::HistogramInfo` records
define the logs.  Both records include a ``peer`` field that conveys the
cluster node the metric originated from.

By default, Zeek reports current telemetry every 60 seconds, as defined by the
:zeek:see:`Telemetry::log_interval`, which you're free to adjust.

Also, by default only metrics whose prefix (namespace) is ``zeek`` or
``process`` are included in the above logs. If you add new metrics with your own
prefix and expect these to be included, redefine the
:zeek:see:`Telemetry::log_prefixes` option::

    @load frameworks/telemetry/log

    redef Telemetry::log_prefixes += { "my_prefix" };

Clearing the set will cause all metrics to be logged. As with any logs, you may
employ :ref:`policy hooks <logging-filtering-log-records>`,
:zeek:see:`Telemetry::log_policy` and
:zeek:see:`Telemetry::log_policy_histogram`, to define potentially more granular
filtering.

Native Prometheus Export
------------------------

Every Zeek process, regardless of whether it's running long-term standalone or
as part of a cluster, can run an HTTP server that renders current telemetry in
Prometheus's `text-based exposition format
<https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md#text-format-example>`_.

The :zeek:see:`Telemetry::metrics_port` variable controls this behavior. Its
default of ``0/unknown`` disables exposing the port; setting it to another TCP
port will enable it. In clusterized operation, the cluster topology can specify
each node's metrics port via the corresponding :zeek:see:`Cluster::Node` field,
and the framework will adjust ``Telemetry::metrics_port`` accordingly.  Both
zeekctl and the management framework let you define specific ports and can also
auto-populate their values, similarly to Broker's listening ports.

To query a node's telemetry, point an HTTP client or Prometheus scraper at the
node's metrics port::

  $ curl -s http://<node>:<node-metrics-port>/metrics
  # HELP exposer_transferred_bytes_total Transferred bytes to metrics services
  # TYPE exposer_transferred_bytes_total counter
  exposer_transferred_bytes_total 0
  ...
  # HELP zeek_event_handler_invocations_total Number of times the given event handler was called
  # TYPE zeek_event_handler_invocations_total counter
  zeek_event_handler_invocations_total{endpoint="manager",name="run_sync_hook"} 2
  ...
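
If you want to process this output without a full Prometheus client, a small
parser often suffices. The following Python sketch turns sample lines of the
text format into ``(name, labels, value)`` tuples. It is deliberately
simplified: it skips ``HELP`` and ``TYPE`` comments, ignores lines carrying
timestamps, and does not unescape label values. The function name
``parse_samples`` is hypothetical.

.. code-block:: python

   import re

   # Sample lines taken from the output above.
   EXPOSITION = '''\
   # HELP zeek_event_handler_invocations_total Number of times the given event handler was called
   # TYPE zeek_event_handler_invocations_total counter
   zeek_event_handler_invocations_total{endpoint="manager",name="run_sync_hook"} 2
   exposer_transferred_bytes_total 0
   '''

   # metric_name{label="value",...} value
   SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
   LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


   def parse_samples(text):
       """Return a (name, labels, value) tuple for each sample line."""
       samples = []
       for line in text.splitlines():
           line = line.strip()
           if not line or line.startswith("#"):
               continue  # skip blank lines and HELP/TYPE comments
           match = SAMPLE_RE.match(line)
           if match is None:
               continue  # ignore lines this simplified parser cannot handle
           name, raw_labels, value = match.groups()
           labels = dict(LABEL_RE.findall(raw_labels or ""))
           samples.append((name, labels, float(value)))
       return samples


   samples = parse_samples(EXPOSITION)
   for name, labels, value in samples:
       print(name, labels, value)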

To simplify telemetry collection from all nodes in a cluster, Zeek supports
`Prometheus HTTP Service Discovery`_ on the manager node. Using this approach, the
endpoint ``http://<manager>:<manager-metrics-port>/services.json`` returns a
JSON data structure that itemizes all metrics endpoints in the
cluster. Prometheus scrapers supporting service discovery then proceed to
collect telemetry from the listed endpoints in turn.

The following is an example service discovery scrape config entry within
Prometheus server's ``prometheus.yml`` configuration file::

    ...
    scrape_configs:
      - job_name: zeek-discovery
        scrape_interval: 5s
        http_sd_configs:
          - url: http://localhost:9991/services.json
            refresh_interval: 10s

See the `Prometheus Getting Started Guide`_ for additional information.
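
To illustrate what a scraper does with the service discovery document, the
following Python sketch expands it into one ``/metrics`` URL per listed
target. It assumes the response follows the Prometheus HTTP SD format, a JSON
list of groups that each carry a ``targets`` list and optional ``labels``. The
addresses and ports in the sample payload are made up, and ``metrics_urls`` is
a hypothetical helper.

.. code-block:: python

   import json

   # Hypothetical response body of the manager's /services.json endpoint,
   # shaped according to the Prometheus HTTP SD format. Addresses and ports
   # are made up for illustration.
   SERVICES_JSON = '''
   [
     {
       "targets": ["10.0.0.1:9991", "10.0.0.2:9992", "10.0.0.3:9993"],
       "labels": {}
     }
   ]
   '''


   def metrics_urls(document):
       """Return one /metrics URL per target listed in an HTTP SD document."""
       urls = []
       for group in json.loads(document):
           for target in group.get("targets", []):
               urls.append(f"http://{target}/metrics")
       return urls


   urls = metrics_urls(SERVICES_JSON)
   print(urls)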

.. note::

   .. versionchanged:: 7.0

   The built-in aggregation for Zeek telemetry to the manager node has been
   removed, in favor of the Prometheus-compatible service discovery
   endpoint. The new approach requires cluster administrators to manage access
   to the additional ports. However, it allows Prometheus to conduct the
   aggregation, instead of burdening the Zeek manager with it, which has
   historically proved expensive.

If these setups aren't right for your environment, you can redefine the
options in ``local.zeek`` to something more suitable. For example,
the following snippet derives the metrics port of each Zeek process by
subtracting 1000 from the cluster port used in ``cluster-layout.zeek``::

    @load base/frameworks/cluster

    global my_node = Cluster::nodes[Cluster::node];
    global my_metrics_port = count_to_port(port_to_count(my_node$p) - 1000, tcp);

    redef Telemetry::metrics_port = my_metrics_port;


Examples of Metrics Application
===============================

Counting Log Writes per Stream
------------------------------

In combination with the :zeek:see:`Log::log_stream_policy` hook, it is
straightforward to record :zeek:see:`Log::write` invocations over the dimension
of the :zeek:see:`Log::ID` value.  This section shows three different approaches
to do this. Which approach is most applicable depends mostly on the expected
script layer performance overhead for updating the metric.  For example, calling
:zeek:see:`Telemetry::counter_with` and :zeek:see:`Telemetry::counter_inc`
within a handler of a high-frequency event may be prohibitive, while for a
low-frequency event it's unlikely to matter.

Assuming a :zeek:see:`Telemetry::metrics_port` of 9090, querying the Prometheus
endpoint using ``curl`` provides output resembling the following for each of
the three approaches.

.. code-block::

   $ curl -s localhost:9090/metrics | grep log_writes
   # HELP zeek_log_writes_total Number of log writes per stream
   # TYPE zeek_log_writes_total counter
   zeek_log_writes_total{endpoint="zeek",log_id="packetfilter_log"} 1
   zeek_log_writes_total{endpoint="zeek",log_id="loadedscripts_log"} 477
   zeek_log_writes_total{endpoint="zeek",log_id="stats_log"} 1
   zeek_log_writes_total{endpoint="zeek",log_id="dns_log"} 200
   zeek_log_writes_total{endpoint="zeek",log_id="ssl_log"} 9
   zeek_log_writes_total{endpoint="zeek",log_id="conn_log"} 215
   zeek_log_writes_total{endpoint="zeek",log_id="captureloss_log"} 1

The above shows a family of 7 ``zeek_log_writes_total`` metrics, each with an
``endpoint`` label (here, ``zeek``, which would be a cluster node name if
scraped from a Zeek cluster) and a ``log_id`` one.

Immediate
^^^^^^^^^

The following example creates a global counter family object and uses
the :zeek:see:`Telemetry::counter_family_inc` helper to increment the
counter metric associated with a string representation of the :zeek:see:`Log::ID`
value.


.. literalinclude:: telemetry/log-writes-immediate.zeek
   :caption: log-writes-immediate.zeek
   :language: zeek
   :linenos:
   :tab-width: 4

With a few lines of scripting code, Zeek now tracks log writes per stream,
ready to be scraped by a Prometheus server.


Cached
^^^^^^

For cases where creating the label value (stringification, :zeek:see:`gsub` and :zeek:see:`to_lower`)
and instantiating the label vector as well as invoking the
:zeek:see:`Telemetry::counter_family_inc` methods cause too much
performance overhead, the counter instances can also be cached in a lookup table.
The counters can then be incremented with :zeek:see:`Telemetry::counter_inc`
directly.

.. literalinclude:: telemetry/log-writes-cached.zeek
   :caption: log-writes-cached.zeek
   :language: zeek
   :linenos:
   :tab-width: 4


For metrics without labels, the metric instances can also be cached as global
variables directly. The following example counts the number of HTTP requests.

.. literalinclude:: telemetry/global-http-counter.zeek
   :caption: global-http-counter.zeek
   :language: zeek
   :linenos:
   :tab-width: 4


Sync
^^^^

In case the scripting overhead of the previous approach is still too high,
individual writes (or events) can be tracked in a table or global variable
and then synchronized / mirrored to concrete counter and gauge instances
during execution of the :zeek:see:`Telemetry::sync` hook.

.. literalinclude:: telemetry/log-writes-sync.zeek
   :caption: log-writes-sync.zeek
   :language: zeek
   :linenos:
   :tab-width: 4

For tracking log writes, this is unlikely to be required (and Zeek exposes
various logging natively through the framework already), but for updating
metrics within high frequency events that otherwise have low script processing
overhead, it's a valuable approach.


.. versionchanged:: 7.1

   The :zeek:see:`Telemetry::sync` hook is invoked on demand only: either when
   one of the :zeek:see:`Telemetry::collect_metrics` or
   :zeek:see:`Telemetry::collect_histogram_metrics` functions is invoked, or
   when the Prometheus endpoint is queried. Calling either of the collection
   BiFs from within the :zeek:see:`Telemetry::sync` hook is an error and
   results in a reporter warning.


.. note::

   In versions before Zeek 7.1, :zeek:see:`Telemetry::sync` was invoked on a
   fixed schedule, potentially resulting in stale metrics at collection time,
   as well as generating small runtime overhead when metrics are not collected.

Table Sizes
-----------

It can be useful to expose the size of tables as metrics, as they often
indicate the approximate amount of state maintained in memory.
As table sizes may increase and decrease, a :zeek:see:`Telemetry::Gauge`
is appropriate for this purpose.

The following example records the size of the :zeek:see:`Tunnel::active` table
and its footprint with two gauges. The gauges are updated during the
:zeek:see:`Telemetry::sync` hook. Note that no labels are in use; both
gauge instances are simple globals.

.. literalinclude:: telemetry/table-size-tracking.zeek
   :caption: table-size-tracking.zeek
   :language: zeek
   :linenos:
   :tab-width: 4

Example representation of these metrics when querying the Prometheus endpoint:

.. code-block::

   $ curl -s localhost:9090/metrics | grep tunnel
   # HELP zeek_monitored_tunnels_active_footprint Footprint of the Tunnel::active table
   # TYPE zeek_monitored_tunnels_active_footprint gauge
   zeek_monitored_tunnels_active_footprint{endpoint="zeek"} 324
   # HELP zeek_monitored_tunnels_active Number of currently active tunnels as tracked in Tunnel::active
   # TYPE zeek_monitored_tunnels_active gauge
   zeek_monitored_tunnels_active{endpoint="zeek"} 12


Instead of tracking footprints per variable, :zeek:see:`global_container_footprints`
could be leveraged to track all global containers at once, using the variable
name as a label.

Connection Durations as Histogram
---------------------------------

To track the distribution of certain measurements, a :zeek:see:`Telemetry::Histogram`
can be used. The histogram's buckets have to be preconfigured.

The following example observes the duration of each connection that Zeek has
monitored.

.. literalinclude:: telemetry/connection-durations.zeek
   :caption: connection-durations.zeek
   :language: zeek
   :linenos:
   :tab-width: 4

Due to the way Prometheus represents histograms and the fact that durations
are broken down by protocol and service in the given example, the resulting
representation becomes rather verbose.

.. code-block::

   $ curl -s localhost:9090/metrics | grep monitored_connection_duration
   # HELP zeek_monitored_connection_duration_seconds Duration of monitored connections
   # TYPE zeek_monitored_connection_duration_seconds histogram
   zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="0.1"} 970
   zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="1"} 998
   zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="10"} 1067
   zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="30"} 1108
   zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="60"} 1109
   zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="+Inf"} 1109
   zeek_monitored_connection_duration_seconds_sum{endpoint="zeek",proto="udp",service="dns"} 1263.085691
   zeek_monitored_connection_duration_seconds_count{endpoint="zeek",proto="udp",service="dns"} 1109
   zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="0.1"} 16
   zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="1"} 54
   zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="10"} 56
   zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="30"} 57
   zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="60"} 57
   zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="+Inf"} 57


To work with histogram data, Prometheus provides specialized query functions,
such as `histogram_quantile()`_.
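
To give a feel for what such a function computes, the following Python sketch
estimates quantiles from the cumulative bucket counts of the ``udp``/``dns``
series in the output above. It uses linear interpolation within a bucket,
similar in spirit to ``histogram_quantile()``; it is an illustration rather
than a reimplementation of Prometheus, and ``estimate_quantile`` is a
hypothetical name. It assumes the lowest bucket starts at 0.

.. code-block:: python

   import math

   # Cumulative bucket counts of the udp/dns series shown above, as
   # (upper bound "le", number of observations <= le) pairs.
   DNS_BUCKETS = [
       (0.1, 970),
       (1.0, 998),
       (10.0, 1067),
       (30.0, 1108),
       (60.0, 1109),
       (math.inf, 1109),
   ]


   def estimate_quantile(q, buckets):
       """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets."""
       total = buckets[-1][1]
       if total == 0:
           return math.nan
       rank = q * total
       lower_bound, lower_count = 0.0, 0
       for upper_bound, count in buckets:
           if count >= rank:
               if math.isinf(upper_bound) or count == lower_count:
                   # Rank falls into the +Inf bucket (or an empty first
                   # bucket): report the highest finite bound seen so far.
                   return lower_bound
               # Interpolate linearly between the bucket's bounds.
               fraction = (rank - lower_count) / (count - lower_count)
               return lower_bound + (upper_bound - lower_bound) * fraction
           lower_bound, lower_count = upper_bound, count
       return lower_bound


   median = estimate_quantile(0.5, DNS_BUCKETS)
   p95 = estimate_quantile(0.95, DNS_BUCKETS)
   print(f"median: {median:.4f}s, 95th percentile: {p95:.4f}s")

For the sample data, this yields roughly 0.057 seconds for the median and
roughly 8.2 seconds for the 95th percentile. The precision of such estimates
depends on how well the bucket bounds match the actual distribution.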

Note that post-processing data from :file:`conn.log` allows calculating a proper
histogram of connection durations, which may be preferable.
The above example is meant for demonstration purposes. Histograms may
primarily be useful for Zeek operational metrics such as processing times,
queueing delays, or response times to external systems.


Exporting the Zeek Version
--------------------------

A common pattern in the Prometheus ecosystem is to expose the version
information of the running process as a gauge metric with a value of 1.

The following example does just that with a Zeek script:

.. literalinclude:: telemetry/version.zeek
   :caption: version.zeek
   :language: zeek
   :linenos:
   :tab-width: 4

In Prometheus's exposition format, this turns into the following:

.. code-block::

   $ curl -s localhost:9090/metrics | grep version
   # HELP zeek_version_info The Zeek version
   # TYPE zeek_version_info gauge
   zeek_version_info{beta="true",commit="0",debug="true",major="7",minor="0",patch="0",version_number="70000",version_string="7.0.0-rc4-debug"} 1
   zeek_version_info{beta="false",commit="289",debug="true",endpoint="zeek",major="5",minor="1",patch="0",version_number="50100",version_string="5.1.0-dev.289-debug"} 1.000000


Zeek already ships with this gauge, via
:doc:`/scripts/base/frameworks/telemetry/main.zeek`. There is no need to add
the above snippet to your site.