mirror of
https://github.com/zeek/zeek.git
synced 2025-10-02 06:38:20 +00:00

This is based on commit 2731def9159247e6da8a3191783c89683363689c from the zeek-docs repo.
403 lines
18 KiB
ReStructuredText
403 lines
18 KiB
ReStructuredText
.. _troubleshooting:
|
|
|
|
===============
|
|
Troubleshooting
|
|
===============
|
|
|
|
This page lists approaches and mentions logs and metrics available
|
|
to understand and debug Zeek's performance.
|
|
|
|
There may be assumptions about Linux deployments regarding kernel features
|
|
and tooling available.
|
|
|
|
Memory Leaks and State Growth
|
|
=============================
|
|
|
|
When memory of any Zeek process continuously grows in production or testing
|
|
settings, there might be a memory leak in Zeek's C++ core or it might be
|
|
scripting state growth. Examples of the latter are a global table that is
|
|
populated but elements aren't expired or removed. Containers attached to
|
|
connections may also cause unbounded state growth when these connections
|
|
are long-lived.
|
|
|
|
For such issues, using jemalloc's memory profiling can be invaluable. A good
|
|
introduction to this topic is `Justin Azoff's profiling presentation`_ (`slides`_).
|
|
|
|
|
|
Jemalloc Memory Profiling
|
|
-------------------------
|
|
|
|
For memory profiling with `jemalloc`_ you need jemalloc compiled with
|
|
profiling enabled. Some Linux distributions provide a libjemalloc package that
|
|
is configured this way. For example, Debian on the amd64 architecture
|
|
has it enabled while Fedora 38 does not. You're advised to verify the
|
|
``config.prof`` line in the jemalloc stats output as shown below.
|
|
|
|
.. note::
|
|
|
|
If your distribution does not provide a suitable libjemalloc package,
|
|
building jemalloc from source configured with the required options
|
|
is reasonably easy.
|
|
|
|
.. code-block:: console
|
|
|
|
$ git clone git@github.com:jemalloc/jemalloc.git
|
|
$ cd jemalloc
|
|
$ git checkout 5.2.1 # or newer releases
|
|
$ ./autogen.sh && ./configure --enable-prof
|
|
|
|
# Optionally, use LD_PRELOAD
|
|
$ export LD_PRELOAD=$(pwd)/lib/libjemalloc.so
|
|
# ...or install the custom libjemalloc build.
|
|
$ sudo make install
|
|
|
|
You can either build Zeek from source and pass the ``--enable-jemalloc`` flag
|
|
(possibly with ``--with-jemalloc=/usr/local/`` for a custom build) to always
|
|
use the jemalloc allocator (recommended), or set ``LD_PRELOAD`` as shown above.
|
|
Using ``LD_PRELOAD`` can be convenient if you're not
|
|
in a position to rebuild Zeek or you're consuming upstream binary packages that
|
|
did not use ``--enable-jemalloc``, or you want to use a custom ad-hoc/patched
|
|
jemalloc build.
|
|
|
|
To verify jemalloc profiling is functional, run the following command and
|
|
check that ``config.prof`` reports ``true``.
|
|
|
|
.. code-block:: console
|
|
|
|
$ MALLOC_CONF="stats_print:true" zeek -e 'event zeek_init() {}' 2>&1 | grep 'config.prof'
|
|
config.prof: true
|
|
config.prof_libgcc: true
|
|
config.prof_libunwind: false
|
|
...
|
|
|
|
|
|
If there is no output or ``config.prof`` says ``false``, verify your Zeek
|
|
and libjemalloc setup.
|
|
|
|
.. note::
|
|
|
|
Neither ``LD_PRELOAD`` nor ``MALLOC_CONF`` work with a setuid or setcap
|
|
``zeek`` binary and you might need to run as root or another privileged
|
|
user instead.
|
|
|
|
At this point you can run Zeek with a ``MALLOC_CONF`` setting that will dump
|
|
memory profiles roughly every 256MB of allocation activity (controlled by the
|
|
``lg_prof_interval`` setting - 2**28 = 256 MB).
|
|
|
|
.. code-block:: console
|
|
|
|
$ MALLOC_CONF="prof:true,prof_prefix:jeprof.out,prof_final:true,lg_prof_interval:28" zeek -C -i eth0
|
|
|
|
The files dumped by jemalloc will have a naming pattern of ``jeprof.out.<pid>...``
|
|
and can be postprocessed with the ``jeprof`` utility.
|
|
|
|
.. code-block:: console
|
|
|
|
$ jeprof $(which zeek) jeprof.out.*
|
|
Welcome to jeprof! For help, type 'help'.
|
|
(jeprof) top
|
|
Total: 1773.2 MB
|
|
381.8 21.5% 21.5% 381.8 21.5% __gnu_cxx::new_allocator::allocate
|
|
232.6 13.1% 34.6% 232.6 13.1% std::__cxx11::basic_string::_M_construct
|
|
147.5 8.3% 43.0% 265.0 14.9% zeek::make_intrusive
|
|
144.0 8.1% 51.1% 144.0 8.1% monitoring_thread_loop
|
|
135.0 7.6% 58.7% 135.0 7.6% zeek::util::safe_realloc
|
|
117.0 6.6% 65.3% 802.6 45.3% yyparse
|
|
63.0 3.6% 68.9% 108.0 6.1% zeeklex
|
|
54.0 3.0% 71.9% 54.0 3.0% zeek::Obj::SetLocationInfo
|
|
49.0 2.8% 74.7% 49.0 2.8% alloc_aligned_chunks
|
|
45.0 2.5% 77.2% 45.0 2.5% zeek::detail::EquivClass::EquivClass
|
|
|
|
|
|
It can be more insightful to generate a graph as SVG or GIF from the ``.heap`` files
|
|
as these make the call chain more visible directly (click image to enlarge).
|
|
|
|
.. code-block:: console
|
|
|
|
$ jeprof $(which zeek) --svg jeprof.out.3075061.* > out.svg
|
|
|
|
.. image:: /images/troubleshooting/http-fake-state-growth.gif
|
|
:alt: State growth in a ``std::vector<std::string>``
|
|
:scale: 10%
|
|
|
|
In above image, ``basic_string _M_construct`` called from ``HTTP_Analyzer DeliverStream``
|
|
is standing out as well as ``new_allocator allocate`` called from ``std::vector _M_realloc_insert``.
|
|
This memory growth was provoked by patching the HTTP analyzer such that all input
|
|
data passed to ``DeliverStream()`` was also copied into a single statically allocated
|
|
``std::vector<std::string>`` and never freed again.
|
|
|
|
ZeekControl Integration
|
|
~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
When working in a ZeekControl based environment, the `zeek-jemalloc`_ plugin
|
|
can help with setting up the required environment variables. The ``.heap``
|
|
files will be located in a worker's individual spool directory and can be
|
|
processed with the ``jeprof`` utility as shown above.
|
|
|
|
.. _zeek-jemalloc: https://github.com/JustinAzoff/zeek-jemalloc-profiling/tree/master
|
|
.. _justin azoff's profiling presentation: https://www.youtube.com/watch?v=gWSXbqxnJfs
|
|
.. _slides: https://old.zeek.org/zeekweek2019/slides/justin-azoff-profiling.pdf
|
|
.. _jemalloc: https://jemalloc.net/
|
|
|
|
|
|
CPU Profiling
|
|
=============
|
|
|
|
When a Zeek worker is using close to all of a single CPU as seen via ``zeekctl top``
|
|
or ``top -p <pid>``, this usually means it is either receiving too many packets
|
|
and is simply overloaded, or there's a performance problem. Particularly at
|
|
low packet rates or with pathological packet streams it is worth debugging
|
|
|
|
Perf and Flame Graphs
|
|
---------------------
|
|
|
|
It can be valuable to leverage the `perf`_ tool on Linux and generate
|
|
`Flame Graphs`_ from the recorded data.
|
|
|
|
.. note::
|
|
|
|
For best results it's recommended to build Zeek and third-party libraries
|
|
used by Zeek with frame pointers enabled setting the ``-fno-omit-frame-pointer``
|
|
compile flag.
|
|
|
|
.. code-block:: console
|
|
|
|
$ CXXFLAGS="-fno-omit-frame-pointer" CFLAGS="-fno-omit-frame-pointer" ./configure --build-type=RelWithDebugInfo ...
|
|
|
|
Using ``-fno-omit-frame-pointer`` may have a performance impact. Therefore,
|
|
Linux distributions may or may not use it by default to compile libraries.
|
|
You're advised to test performance differences in your environment and whether
|
|
having frame pointers available for troubleshooting in production is more
|
|
important than any performance gains.
|
|
|
|
On Ubuntu you may explore using the ``libc6-prof`` for a glibc library
|
|
compiled with frame pointers enabled. On Fedora 38 on the other hand
|
|
most packages should be compiled with
|
|
`frame pointers enabled by default <https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer>`_.
|
|
|
|
Assuming the PID of a Zeek worker is 3639255, a perf profile with call graph
|
|
information can be collected as follows:
|
|
|
|
.. code-block:: console
|
|
|
|
$ perf record -g -p 3639255
|
|
^C[ perf record: Woken up 8 times to write data ]
|
|
[ perf record: Captured and wrote 2.893 MB perf.data (13865 samples) ]
|
|
|
|
The resulting ``perf.data`` file can be visualized and post-processed
|
|
via ``perf report``, ``perf script``, etc.
|
|
|
|
When Zeek workers are pinned to CPUs, it can also be useful to record all
|
|
activity on that CPU via ``perf record -g -C <cpu>`` instead.
|
|
|
|
To produce a flame graph ``perf.data``, run the following command pipeline,
|
|
assuming a git checkout of the `FlameGraph`_ repository at an appropriate
|
|
location.
|
|
|
|
.. code-block:: console
|
|
|
|
$ perf script | /path/to/FlameGraph/stackcollapse-perf.pl | /path/to/FlameGraph/flamegraph.pl > out.svg
|
|
|
|
The resulting flame graph may look as follows:
|
|
|
|
.. image:: /images/troubleshooting/flamegraph.png
|
|
:alt: Example flame graph.
|
|
:scale: 25%
|
|
|
|
Visualizing flame graphs this way removes the time dimension. `FlameScope`_ is
|
|
a project allowing exploration of different time ranges within the recorded data
|
|
which can be valuable if you observe Zeek processes freezing or hanging.
|
|
|
|
.. _perf: https://perf.wiki.kernel.org/index.php/Main_Page
|
|
.. _Flame Graphs: https://www.brendangregg.com/flamegraphs.html
|
|
.. _FlameGraph: https://github.com/brendangregg/FlameGraph
|
|
.. _FlameScope: https://github.com/Netflix/flamescope
|
|
.. _Fedora -fno-omit-framepointers: https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer
|
|
|
|
|
|
Metrics and Stats
|
|
=================
|
|
|
|
Telemetry Framework and Prometheus
|
|
----------------------------------
|
|
|
|
Starting with Zeek 5.1, the script-level as well as C++ API of the :ref:`framework-telemetry`
|
|
is being leveraged more extensively to expose metrics about Zeek's operational behavior.
|
|
Generally we recommend consuming these metrics through the Prometheus endpoint
|
|
exposed on ``http://manager-ip:9911/metrics`` by default.
|
|
|
|
Currently, basic version information, network and process metrics, log records per
|
|
log stream and log writers, data about event invocations as well as Broker
|
|
subsystem metrics are exposed.
|
|
|
|
Below is an example of using ``curl`` to list some of the metrics. In a production
|
|
setup, usually a `Prometheus Server`_ is configured to scrape above endpoint
|
|
which then stores metrics data for later visualization.
|
|
|
|
.. code-block:: console
|
|
|
|
$ curl -s localhost:9911/metrics | grep -E '^(zeek_version|zeek_log|zeek_event|zeek_net|process_|zeek_active_sessions|zeek_total_sessions)'
|
|
zeek_version_info{beta="false",commit="622",debug="false",endpoint="",major="6",minor="0",patch="0",version_number="60000",version_string="6.0.0-dev.622"} 1.000000 1684826824560
|
|
zeek_event_handler_invocations_total{endpoint="",name="zeek_init"} 1 1684826824560
|
|
...
|
|
zeek_event_handler_invocations_total{endpoint="",name="dns_message"} 4 1684826824560
|
|
zeek_event_handler_invocations_total{endpoint="",name="dns_request"} 2 1684826824560
|
|
zeek_event_handler_invocations_total{endpoint="",name="dns_end"} 4 1684826824560
|
|
zeek_event_handler_invocations_total{endpoint="",name="connection_state_remove"} 547 1684826824560
|
|
...
|
|
zeek_event_handler_invocations_total{endpoint="",name="file_hash"} 1628 1684826824560
|
|
zeek_event_handler_invocations_total{endpoint="",name="file_state_remove"} 814 1684826824560
|
|
zeek_net_dropped_packets_total{endpoint=""} 0.000000 1684826824560
|
|
zeek_net_link_packets_total{endpoint=""} 19664.000000 1684826824560
|
|
zeek_net_received_bytes_total{endpoint=""} 1699891.000000 1684826824560
|
|
zeek_net_received_packets_total{endpoint=""} 9832.000000 1684826824560
|
|
...
|
|
zeek_log_writer_writes_total{endpoint="",filter_name="default",module="DNS",path="dns",stream="DNS::LOG",writer="Log::WRITER_ASCII"} 2 1684826824560
|
|
zeek_log_writer_writes_total{endpoint="",filter_name="default",module="HTTP",path="http",stream="HTTP::LOG",writer="Log::WRITER_ASCII"} 819 1684826824560
|
|
zeek_log_writer_writes_total{endpoint="",filter_name="default",module="Conn",path="conn",stream="Conn::LOG",writer="Log::WRITER_ASCII"} 547 1684826824560
|
|
zeek_log_writer_writes_total{endpoint="",filter_name="default",module="Files",path="files",stream="Files::LOG",writer="Log::WRITER_ASCII"} 814 1684826824560
|
|
...
|
|
zeek_log_stream_writes_total{endpoint="",module="DNS",stream="DNS::LOG"} 2 1684826824560
|
|
zeek_log_stream_writes_total{endpoint="",module="HTTP",stream="HTTP::LOG"} 819 1684826824560
|
|
zeek_log_stream_writes_total{endpoint="",module="Conn",stream="Conn::LOG"} 547 1684826824560
|
|
zeek_log_stream_writes_total{endpoint="",module="Files",stream="Files::LOG"} 814 1684826824560
|
|
zeek_active_sessions{endpoint="",protocol="tcp"} 0 1684829159305
|
|
...
|
|
zeek_total_sessions_total{endpoint="",protocol="tcp"} 45101 1684829159305
|
|
zeek_total_sessions_total{endpoint="",protocol="udp"} 39849 1684829159305
|
|
zeek_total_sessions_total{endpoint="",protocol="icmp"} 320 1684829159305
|
|
process_open_fds{endpoint=""} 62 1684826824560
|
|
process_cpu_seconds_total{endpoint=""} 1.950000 1684826824560
|
|
process_virtual_memory_bytes{endpoint=""} 1917345792 1684826824560
|
|
process_resident_memory_bytes{endpoint=""} 268935168 1684826824560
|
|
|
|
|
|
If you prefer to consume metrics via logs, the :file:`telemetry.log`
|
|
(:zeek:see:`Telemetry::Info`) may work. Its
|
|
format is a bit unusual, however. See the :ref:`framework-telemetry`'s
|
|
documentation for more details about the log and how to add further metrics
|
|
from your own Zeek scripts.
|
|
|
|
.. _Prometheus server: https://prometheus.io/
|
|
|
|
|
|
:file:`stats.log`
|
|
-----------------
|
|
|
|
The :file:`stats.log` is enabled when loading the :doc:`/scripts/policy/misc/stats.zeek` script.
|
|
This is the default with the stock :file:`local.zeek` included with Zeek. This
|
|
log provides stats about Zeek's operational behavior in a structured log format.
|
|
|
|
See the :zeek:see:`Stats::Info` record documentation for a description of
|
|
the individual fields.
|
|
|
|
The default reporting interval is 5 minutes. It can make sense to reduce
|
|
this interval for testing or during troubleshooting via
|
|
``redef Stats::report_interval=30sec``. Stats collection may have a
|
|
non-negligible impact on performance and running, for example,
|
|
every second may be detrimental.
|
|
|
|
For historic reasons, this log contains delta values for ``pkts_proc``,
|
|
``bytes_recv``, ``events_proc``, ``tcp_conns``, etc. This can make it
|
|
difficult to use the values as-is in metrics systems that expect counter
|
|
metrics to continuously grow and compute rates or delta values on the fly.
|
|
|
|
.. note::
|
|
|
|
If you're creating your own custom metrics or stats-like log, consider
|
|
using absolute values for counter metrics. Relative values can
|
|
always be derived from two absolute values. The inverse is not true.
|
|
Popular metrics systems usually assume absolute counter values, too.
|
|
|
|
Following an example of a :file:`stats.log` entry:
|
|
|
|
.. code-block:: console
|
|
|
|
$ zeek -C -i eth0 local Stats::report_interval=30sec LogAscii::use_json=T
|
|
$ jq < stats.log
|
|
...
|
|
{
|
|
"ts": 1684828680.616951,
|
|
"peer": "zeek",
|
|
"mem": 344,
|
|
"pkts_proc": 300000,
|
|
"bytes_recv": 78092228,
|
|
"pkts_dropped": 0,
|
|
"pkts_link": 299609,
|
|
"pkt_lag": 0.003422975540161133,
|
|
"events_proc": 448422,
|
|
"events_queued": 448422,
|
|
"active_tcp_conns": 2279,
|
|
"active_udp_conns": 2809,
|
|
"active_icmp_conns": 96,
|
|
"tcp_conns": 6747,
|
|
"udp_conns": 5954,
|
|
"icmp_conns": 48,
|
|
"timers": 67510,
|
|
"active_timers": 35086,
|
|
"files": 8165,
|
|
"active_files": 0,
|
|
"dns_requests": 218,
|
|
"active_dns_requests": 2,
|
|
"reassem_tcp_size": 7816,
|
|
"reassem_file_size": 0,
|
|
"reassem_frag_size": 0,
|
|
"reassem_unknown_size": 0
|
|
}
|
|
|
|
:file:`prof.log`
|
|
----------------
|
|
|
|
The :file:`prof.log` provides aggregated information about Zeek's runtime status
|
|
in a fairly non-structured text format.
|
|
Likely future metrics will be added through the Telemetry framework mentioned
|
|
above, but as of now it does contain information about queue sizes within
|
|
the threading subsystem and other details that are not yet exposed otherwise.
|
|
|
|
To enable :file:`prof.log`, load the :doc:`/scripts/policy/misc/profiling.zeek` script
|
|
in :file:`local.zeek` or start Zeek with ``misc/profiling`` on the command-line:
|
|
|
|
.. code-block:: console
|
|
|
|
$ zeek -C -i eth0 misc/profiling
|
|
|
|
The following provides an example of :file:`prof.log` content:
|
|
|
|
.. code-block:: console
|
|
|
|
$ cat prof.log
|
|
1684828232.344252 Comm: peers=0 stores=1 pending_queries=0 events_in=0 events_out=0 logs_in=0 logs_out=0 ids_in=0 ids_out=0 1684828262.344351 ------------------------
|
|
1684828262.344351 Memory: total=406480K total_adj=149536K malloced: 0K
|
|
1684828262.344351 Run-time: user+sys=53.2 user=44.6 sys=8.6 real=631.1
|
|
1684828262.344351 Conns: total=84712 current=6759/6759
|
|
1684828262.344351 Conns: tcp=3847/3860 udp=2815/2883 icmp=97/98
|
|
1684828262.344351 TCP-States: Inact. Syn. SA Part. Est. Fin. Rst.
|
|
1684828262.344351 TCP-States:Inact.
|
|
1684828262.344351 TCP-States:Syn. 76 36
|
|
1684828262.344351 TCP-States:SA
|
|
1684828262.344351 TCP-States:Part.
|
|
1684828262.344351 TCP-States:Est. 652 2214 36
|
|
1684828262.344351 TCP-States:Fin. 753
|
|
1684828262.344351 TCP-States:Rst. 16 64
|
|
1684828262.344351 Connections expired due to inactivity: 2426
|
|
1684828262.344351 Timers: current=47708 max=47896 lag=0.00s
|
|
1684828262.344351 DNS_Mgr: requests=1596 successful=1596 failed=0 pending=0 cached_hosts=0 cached_addrs=1207
|
|
1684828262.344351 Triggers: total=4900 pending=0
|
|
1684828262.344351 ConnectionDeleteTimer = 905
|
|
1684828262.344351 ConnectionInactivityTimer = 6759
|
|
1684828262.344351 DNSExpireTimer = 1840
|
|
1684828262.344351 FileAnalysisInactivityTimer = 32836
|
|
1684828262.344351 ScheduleTimer = 11
|
|
1684828262.344351 TableValTimer = 34
|
|
1684828262.344351 TCPConnectionAttemptTimer = 166
|
|
1684828262.344351 TCPConnectionExpireTimer = 5156
|
|
1684828262.344351 ThreadHeartbeat = 1
|
|
1684828262.344351 Threads: current=21
|
|
1684828262.344351 dns/Log::WRITER_ASCII in=586 out=258 pending=0/0 (#queue r/w: in=586/586 out=258/258)
|
|
1684828262.344351 known_hosts/Log::WRITER_ASCII in=475 out=258 pending=0/0 (#queue r/w: in=475/475 out=258/258)
|
|
1684828262.344351 software/Log::WRITER_ASCII in=478 out=258 pending=0/0 (#queue r/w: in=478/478 out=258/258)
|
|
...
|
|
1684828262.344351 files/Log::WRITER_ASCII in=483 out=258 pending=0/0 (#queue r/w: in=483/483 out=258/258)
|
|
1684828262.344351 http/Log::WRITER_ASCII in=483 out=258 pending=0/0 (#queue r/w: in=483/483 out=258/258)
|
|
1684828262.344351 weird/Log::WRITER_ASCII in=260 out=257 pending=0/0 (#queue r/w: in=260/260 out=257/257)
|
|
1684828262.344351 conn/Log::WRITER_ASCII in=486 out=257 pending=0/0 (#queue r/w: in=486/486 out=257/257)
|