.. _troubleshooting: =============== Troubleshooting =============== This page lists approaches and mentions logs and metrics available to understand and debug Zeek's performance. There may be assumptions about Linux deployments regarding kernel features and tooling available. Memory Leaks and State Growth ============================= When memory of any Zeek process continuously grows in production or testing settings, there might be a memory leak in Zeek's C++ core or it might be scripting state growth. Examples of the latter are a global table that is populated but elements aren't expired or removed. Containers attached to connections may also cause unbounded state growth when these connections are long-lived. For such issues, using jemalloc's memory profiling can be invaluable. A good introduction to this topic is `Justin Azoff's profiling presentation`_ (`slides`_). Jemalloc Memory Profiling ------------------------- For memory profiling with `jemalloc`_ you need jemalloc compiled with profiling enabled. Some Linux distributions provide a libjemalloc package that is configured this way. For example, Debian on the amd64 architecture has it enabled while Fedora 38 does not. You're advised to verify the ``config.prof`` line in the jemalloc stats output as shown below. .. note:: If your distribution does not provide a suitable libjemalloc package, building jemalloc from source configured with the required options is reasonably easy. .. code-block:: console $ git clone git@github.com:jemalloc/jemalloc.git $ cd jemalloc $ git checkout 5.2.1 # or newer releases $ ./autogen.sh && ./configure --enable-prof # Optionally, use LD_PRELOAD $ export LD_PRELOAD=$(pwd)/lib/libjemalloc.so # ...or install the custom libjemalloc build. $ sudo make install You can either build Zeek from source and pass the ``--enable-jemalloc`` flag (possibly with ``--with-jemalloc=/usr/local/`` for a custom build) to always use the jemalloc allocator (recommended), or set ``LD_PRELOAD`` as shown above. Using ``LD_PRELOAD`` can be convenient if you're not in a position to rebuild Zeek or you're consuming upstream binary packages that did not use ``--enable-jemalloc``, or you want to use a custom ad-hoc/patched jemalloc build. To verify jemalloc profiling is functional, run the following command and check that ``config.prof`` reports ``true``. .. code-block:: console $ MALLOC_CONF="stats_print:true" zeek -e 'event zeek_init() {}' 2>&1 | grep 'config.prof' config.prof: true config.prof_libgcc: true config.prof_libunwind: false ... If there is no output or ``config.prof`` says ``false``, verify your Zeek and libjemalloc setup. .. note:: Neither ``LD_PRELOAD`` nor ``MALLOC_CONF`` work with a setuid or setcap ``zeek`` binary and you might need to run as root or another privileged user instead. At this point you can run Zeek with a ``MALLOC_CONF`` setting that will dump memory profiles roughly every 256MB of allocation activity (controlled by the ``lg_prof_interval`` setting - 2**28 = 256 MB). .. code-block:: console $ MALLOC_CONF="prof:true,prof_prefix:jeprof.out,prof_final:true,lg_prof_interval:28" zeek -C -i eth0 The files dumped by jemalloc will have a naming pattern of ``jeprof.out....`` and can be postprocessed with the ``jeprof`` utility. .. code-block:: console $ jeprof $(which zeek) jeprof.out.* Welcome to jeprof! For help, type 'help'. (jeprof) top Total: 1773.2 MB 381.8 21.5% 21.5% 381.8 21.5% __gnu_cxx::new_allocator::allocate 232.6 13.1% 34.6% 232.6 13.1% std::__cxx11::basic_string::_M_construct 147.5 8.3% 43.0% 265.0 14.9% zeek::make_intrusive 144.0 8.1% 51.1% 144.0 8.1% monitoring_thread_loop 135.0 7.6% 58.7% 135.0 7.6% zeek::util::safe_realloc 117.0 6.6% 65.3% 802.6 45.3% yyparse 63.0 3.6% 68.9% 108.0 6.1% zeeklex 54.0 3.0% 71.9% 54.0 3.0% zeek::Obj::SetLocationInfo 49.0 2.8% 74.7% 49.0 2.8% alloc_aligned_chunks 45.0 2.5% 77.2% 45.0 2.5% zeek::detail::EquivClass::EquivClass It can be more insightful to generate a graph as SVG or GIF from the ``.heap`` files as these make the call chain more visible directly (click image to enlarge). .. code-block:: console $ jeprof $(which zeek) --svg jeprof.out.3075061.* > out.svg .. image:: /images/troubleshooting/http-fake-state-growth.gif :alt: State growth in a ``std::vector`` :scale: 10% In above image, ``basic_string _M_construct`` called from ``HTTP_Analyzer DeliverStream`` is standing out as well as ``new_allocator allocate`` called from ``std::vector _M_realloc_insert``. This memory growth was provoked by patching the HTTP analyzer such that all input data passed to ``DeliverStream()`` was also copied into a single statically allocated ``std::vector`` and never freed again. ZeekControl Integration ~~~~~~~~~~~~~~~~~~~~~~~ When working in a ZeekControl based environment, the `zeek-jemalloc`_ plugin can help with setting up the required environment variables. The ``.heap`` files will be located in a worker's individual spool directory and can be processed with the ``jeprof`` utility as shown above. .. _zeek-jemalloc: https://github.com/JustinAzoff/zeek-jemalloc-profiling/tree/master .. _justin azoff's profiling presentation: https://www.youtube.com/watch?v=gWSXbqxnJfs .. _slides: https://old.zeek.org/zeekweek2019/slides/justin-azoff-profiling.pdf .. _jemalloc: https://jemalloc.net/ CPU Profiling ============= When a Zeek worker is using close to all of a single CPU as seen via ``zeekctl top`` or ``top -p ``, this usually means it is either receiving too many packets and is simply overloaded, or there's a performance problem. Particularly at low packet rates or with pathological packet streams it is worth debugging Perf and Flame Graphs --------------------- It can be valuable to leverage the `perf`_ tool on Linux and generate `Flame Graphs`_ from the recorded data. .. note:: For best results it's recommended to build Zeek and third-party libraries used by Zeek with frame pointers enabled setting the ``-fno-omit-frame-pointer`` compile flag. .. code-block:: console $ CXXFLAGS="-fno-omit-frame-pointer" CFLAGS="-fno-omit-frame-pointer" ./configure --build-type=RelWithDebugInfo ... Using ``-fno-omit-frame-pointer`` may have a performance impact. Therefore, Linux distributions may or may not use it by default to compile libraries. You're advised to test performance differences in your environment and whether having frame pointers available for troubleshooting in production is more important than any performance gains. On Ubuntu you may explore using the ``libc6-prof`` for a glibc library compiled with frame pointers enabled. On Fedora 38 on the other hand most packages should be compiled with `frame pointers enabled by default `_. Assuming the PID of a Zeek worker is 3639255, a perf profile with call graph information can be collected as follows: .. code-block:: console $ perf record -g -p 3639255 ^C[ perf record: Woken up 8 times to write data ] [ perf record: Captured and wrote 2.893 MB perf.data (13865 samples) ] The resulting ``perf.data`` file can be visualized and post-processed via ``perf report``, ``perf script``, etc. When Zeek workers are pinned to CPUs, it can also be useful to record all activity on that CPU via ``perf record -g -C `` instead. To produce a flame graph ``perf.data``, run the following command pipeline, assuming a git checkout of the `FlameGraph`_ repository at an appropriate location. .. code-block:: console $ perf script | /path/to/FlameGraph/stackcollapse-perf.pl | /path/to/FlameGraph/flamegraph.pl > out.svg The resulting flame graph may look as follows: .. image:: /images/troubleshooting/flamegraph.png :alt: Example flame graph. :scale: 25% Visualizing flame graphs this way removes the time dimension. `FlameScope`_ is a project allowing exploration of different time ranges within the recorded data which can be valuable if you observe Zeek processes freezing or hanging. .. _perf: https://perf.wiki.kernel.org/index.php/Main_Page .. _Flame Graphs: https://www.brendangregg.com/flamegraphs.html .. _FlameGraph: https://github.com/brendangregg/FlameGraph .. _FlameScope: https://github.com/Netflix/flamescope .. _Fedora -fno-omit-framepointers: https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer Metrics and Stats ================= Telemetry Framework and Prometheus ---------------------------------- Starting with Zeek 5.1, the script-level as well as C++ API of the :ref:`framework-telemetry` is being leveraged more extensively to expose metrics about Zeek's operational behavior. Generally we recommend consuming these metrics through the Prometheus endpoint exposed on ``http://manager-ip:9911/metrics`` by default. Currently, basic version information, network and process metrics, log records per log stream and log writers, data about event invocations as well as Broker subsystem metrics are exposed. Below is an example of using ``curl`` to list some of the metrics. In a production setup, usually a `Prometheus Server`_ is configured to scrape above endpoint which then stores metrics data for later visualization. .. code-block:: console $ curl -s localhost:9911/metrics | grep -E '^(zeek_version|zeek_log|zeek_event|zeek_net|process_|zeek_active_sessions|zeek_total_sessions)' zeek_version_info{beta="false",commit="622",debug="false",endpoint="",major="6",minor="0",patch="0",version_number="60000",version_string="6.0.0-dev.622"} 1.000000 1684826824560 zeek_event_handler_invocations_total{endpoint="",name="zeek_init"} 1 1684826824560 ... zeek_event_handler_invocations_total{endpoint="",name="dns_message"} 4 1684826824560 zeek_event_handler_invocations_total{endpoint="",name="dns_request"} 2 1684826824560 zeek_event_handler_invocations_total{endpoint="",name="dns_end"} 4 1684826824560 zeek_event_handler_invocations_total{endpoint="",name="connection_state_remove"} 547 1684826824560 ... zeek_event_handler_invocations_total{endpoint="",name="file_hash"} 1628 1684826824560 zeek_event_handler_invocations_total{endpoint="",name="file_state_remove"} 814 1684826824560 zeek_net_dropped_packets_total{endpoint=""} 0.000000 1684826824560 zeek_net_link_packets_total{endpoint=""} 19664.000000 1684826824560 zeek_net_received_bytes_total{endpoint=""} 1699891.000000 1684826824560 zeek_net_received_packets_total{endpoint=""} 9832.000000 1684826824560 ... zeek_log_writer_writes_total{endpoint="",filter_name="default",module="DNS",path="dns",stream="DNS::LOG",writer="Log::WRITER_ASCII"} 2 1684826824560 zeek_log_writer_writes_total{endpoint="",filter_name="default",module="HTTP",path="http",stream="HTTP::LOG",writer="Log::WRITER_ASCII"} 819 1684826824560 zeek_log_writer_writes_total{endpoint="",filter_name="default",module="Conn",path="conn",stream="Conn::LOG",writer="Log::WRITER_ASCII"} 547 1684826824560 zeek_log_writer_writes_total{endpoint="",filter_name="default",module="Files",path="files",stream="Files::LOG",writer="Log::WRITER_ASCII"} 814 1684826824560 ... zeek_log_stream_writes_total{endpoint="",module="DNS",stream="DNS::LOG"} 2 1684826824560 zeek_log_stream_writes_total{endpoint="",module="HTTP",stream="HTTP::LOG"} 819 1684826824560 zeek_log_stream_writes_total{endpoint="",module="Conn",stream="Conn::LOG"} 547 1684826824560 zeek_log_stream_writes_total{endpoint="",module="Files",stream="Files::LOG"} 814 1684826824560 zeek_active_sessions{endpoint="",protocol="tcp"} 0 1684829159305 ... zeek_total_sessions_total{endpoint="",protocol="tcp"} 45101 1684829159305 zeek_total_sessions_total{endpoint="",protocol="udp"} 39849 1684829159305 zeek_total_sessions_total{endpoint="",protocol="icmp"} 320 1684829159305 process_open_fds{endpoint=""} 62 1684826824560 process_cpu_seconds_total{endpoint=""} 1.950000 1684826824560 process_virtual_memory_bytes{endpoint=""} 1917345792 1684826824560 process_resident_memory_bytes{endpoint=""} 268935168 1684826824560 If you prefer to consume metrics via logs, the :file:`telemetry.log` (:zeek:see:`Telemetry::Info`) may work. Its format is a bit unusual, however. See the :ref:`framework-telemetry`'s documentation for more details about the log and how to add further metrics from your own Zeek scripts. .. _Prometheus server: https://prometheus.io/ :file:`stats.log` ----------------- The :file:`stats.log` is enabled when loading the :doc:`/scripts/policy/misc/stats.zeek` script. This is the default with the stock :file:`local.zeek` included with Zeek. This log provides stats about Zeek's operational behavior in a structured log format. See the :zeek:see:`Stats::Info` record documentation for a description of the individual fields. The default reporting interval is 5 minutes. It can make sense to reduce this interval for testing or during troubleshooting via ``redef Stats::report_interval=30sec``. Stats collection may have a non-negligible impact on performance and running, for example, every second may be detrimental. For historic reasons, this log contains delta values for ``pkts_proc``, ``bytes_recv``, ``events_proc``, ``tcp_conns``, etc. This can make it difficult to use the values as-is in metrics systems that expect counter metrics to continuously grow and compute rates or delta values on the fly. .. note:: If you're creating your own custom metrics or stats-like log, consider using absolute values for counter metrics. Relative values can always be derived from two absolute values. The inverse is not true. Popular metrics systems usually assume absolute counter values, too. Following an example of a :file:`stats.log` entry: .. code-block:: console $ zeek -C -i eth0 local Stats::report_interval=30sec LogAscii::use_json=T $ jq < stats.log ... { "ts": 1684828680.616951, "peer": "zeek", "mem": 344, "pkts_proc": 300000, "bytes_recv": 78092228, "pkts_dropped": 0, "pkts_link": 299609, "pkt_lag": 0.003422975540161133, "events_proc": 448422, "events_queued": 448422, "active_tcp_conns": 2279, "active_udp_conns": 2809, "active_icmp_conns": 96, "tcp_conns": 6747, "udp_conns": 5954, "icmp_conns": 48, "timers": 67510, "active_timers": 35086, "files": 8165, "active_files": 0, "dns_requests": 218, "active_dns_requests": 2, "reassem_tcp_size": 7816, "reassem_file_size": 0, "reassem_frag_size": 0, "reassem_unknown_size": 0 } :file:`prof.log` ---------------- The :file:`prof.log` provides aggregated information about Zeek's runtime status in a fairly non-structured text format. Likely future metrics will be added through the Telemetry framework mentioned above, but as of now it does contain information about queue sizes within the threading subsystem and other details that are not yet exposed otherwise. To enable :file:`prof.log`, load the :doc:`/scripts/policy/misc/profiling.zeek` script in :file:`local.zeek` or start Zeek with ``misc/profiling`` on the command-line: .. code-block:: console $ zeek -C -i eth0 misc/profiling The following provides an example of :file:`prof.log` content: .. code-block:: console $ cat prof.log 1684828232.344252 Comm: peers=0 stores=1 pending_queries=0 events_in=0 events_out=0 logs_in=0 logs_out=0 ids_in=0 ids_out=0 1684828262.344351 ------------------------ 1684828262.344351 Memory: total=406480K total_adj=149536K malloced: 0K 1684828262.344351 Run-time: user+sys=53.2 user=44.6 sys=8.6 real=631.1 1684828262.344351 Conns: total=84712 current=6759/6759 1684828262.344351 Conns: tcp=3847/3860 udp=2815/2883 icmp=97/98 1684828262.344351 TCP-States: Inact. Syn. SA Part. Est. Fin. Rst. 1684828262.344351 TCP-States:Inact. 1684828262.344351 TCP-States:Syn. 76 36 1684828262.344351 TCP-States:SA 1684828262.344351 TCP-States:Part. 1684828262.344351 TCP-States:Est. 652 2214 36 1684828262.344351 TCP-States:Fin. 753 1684828262.344351 TCP-States:Rst. 16 64 1684828262.344351 Connections expired due to inactivity: 2426 1684828262.344351 Timers: current=47708 max=47896 lag=0.00s 1684828262.344351 DNS_Mgr: requests=1596 successful=1596 failed=0 pending=0 cached_hosts=0 cached_addrs=1207 1684828262.344351 Triggers: total=4900 pending=0 1684828262.344351 ConnectionDeleteTimer = 905 1684828262.344351 ConnectionInactivityTimer = 6759 1684828262.344351 DNSExpireTimer = 1840 1684828262.344351 FileAnalysisInactivityTimer = 32836 1684828262.344351 ScheduleTimer = 11 1684828262.344351 TableValTimer = 34 1684828262.344351 TCPConnectionAttemptTimer = 166 1684828262.344351 TCPConnectionExpireTimer = 5156 1684828262.344351 ThreadHeartbeat = 1 1684828262.344351 Threads: current=21 1684828262.344351 dns/Log::WRITER_ASCII in=586 out=258 pending=0/0 (#queue r/w: in=586/586 out=258/258) 1684828262.344351 known_hosts/Log::WRITER_ASCII in=475 out=258 pending=0/0 (#queue r/w: in=475/475 out=258/258) 1684828262.344351 software/Log::WRITER_ASCII in=478 out=258 pending=0/0 (#queue r/w: in=478/478 out=258/258) ... 1684828262.344351 files/Log::WRITER_ASCII in=483 out=258 pending=0/0 (#queue r/w: in=483/483 out=258/258) 1684828262.344351 http/Log::WRITER_ASCII in=483 out=258 pending=0/0 (#queue r/w: in=483/483 out=258/258) 1684828262.344351 weird/Log::WRITER_ASCII in=260 out=257 pending=0/0 (#queue r/w: in=260/260 out=257/257) 1684828262.344351 conn/Log::WRITER_ASCII in=486 out=257 pending=0/0 (#queue r/w: in=486/486 out=257/257)