mirror of
https://github.com/zeek/zeek.git
synced 2025-10-05 16:18:19 +00:00
logging/writers/ascii: shadow files: Add fsync() before rename()
We're using shadow files for log rotation on systems with ext4 running Linux 4.19. We've observed zero-length shadow files in the logger's working directory after a power outage. This leads to a broken/stuck logger process, because empty shadow files are considered invalid and the process exits:

    error: failed to process leftover log 'conn.log.gz': Found leftover log, 'conn.log.gz', but the associated shadow file, '.shadow.conn.log.gz', required to process it is invalid

PR #1137 introduced atomic renaming of shadow files and was supposed to handle this. However, after more investigation, the rename() has to be preceded by an fsync() in order to avoid zero-length files in the presence of hard crashes or power failures. This is generally operating system and filesystem dependent, but should not hurt to add. The performance impact is likely negligible due to the low frequency and limited number of log streams.

This has happened to others, too. Some references around this issue:

* https://stackoverflow.com/questions/7433057/is-rename-without-fsync-safe
* https://unix.stackexchange.com/questions/464382/which-filesystems-require-fsync-for-crash-safety-when-replacing-an-existing-fi
* https://bugzilla.kernel.org/show_bug.cgi?id=15910

Reproducer

This issue was reproduced artificially on Linux using the sysrq-trigger functionality to hard-reset the system shortly after a .shadow file was renamed to its final destination, with the following script watching for .shadow.conn.log.gz:

    #!/bin/bash
    set -eu

    dir=/data/logger-01/

    # Allow everything via /proc/sysrq-trigger
    echo "1" > /proc/sys/kernel/sysrq

    inotifywait -m -e MOVED_TO --format '%e %w%f' "${dir}" | while read -r line; do
        if echo "${line}" | grep -q '^MOVED_TO .*/.shadow.conn.log.gz$'; then
            echo "RESET: $line"
            sleep 4
            # Trigger a hard-reset without sync/unmount
            echo "b" > /proc/sysrq-trigger
        fi
    done

This quite reliably (4 out of 4 times) yielded a system with zero-length shadow files and a broken logger after it came back online:

    $ ls -lha /data/logger-01/.shadow.*
    -rw-r--r-- 1 bro bro 0 Oct 14 02:26 .shadow.conn.log.gz
    -rw-r--r-- 1 bro bro 0 Oct 14 02:26 .shadow.dns.log.gz
    -rw-r--r-- 1 bro bro 0 Oct 14 02:26 .shadow.files.log.gz

After this change, while running the reproducer, the shadow files always contained content after a hard-reset.

Rework with util::safe_fsync helper
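
The fsync()-before-rename() ordering described in the commit message can be illustrated with a minimal sketch. This is not the code from this commit or from Zeek's ASCII writer: the function name `write_then_rename` and its parameters are hypothetical, and it assumes a POSIX system. It also does not fsync the containing directory, which the commit message does not discuss.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <string>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (not Zeek's implementation): write `data` to
// `tmp_path`, flush it to stable storage, then rename it to `final_path`.
// Returns true on success.
bool write_then_rename(const std::string& tmp_path, const std::string& final_path,
                       const std::string& data) {
    assert(tmp_path != final_path);

    int fd = open(tmp_path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return false;

    // Write everything, retrying on short writes and EINTR.
    size_t off = 0;
    while (off < data.size()) {
        ssize_t n = write(fd, data.data() + off, data.size() - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return false;
        }
        off += static_cast<size_t>(n);
    }

    // The step this commit is about: flush the file's contents *before*
    // the rename, so a crash shortly after the rename should not leave a
    // zero-length file under the final name.
    int rc;
    do {
        rc = fsync(fd);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0) {
        close(fd);
        return false;
    }

    if (close(fd) < 0)
        return false;

    return rename(tmp_path.c_str(), final_path.c_str()) == 0;
}
```

A caller would pass a temporary name and the final shadow file name in the same directory, so that the rename stays on one filesystem.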
parent
4d48272adb
commit
dc6e21d6ae
3 changed files with 25 additions and 1 deletions
@@ -461,6 +461,10 @@ extern bool safe_write(int fd, const char* data, int len);

 // Same as safe_write(), but for pwrite().
 extern bool safe_pwrite(int fd, const unsigned char* data, size_t len, size_t offset);

+// Like fsync() but handles interrupted system calls by retrying and
+// aborts on unrecoverable errors.
+extern void safe_fsync(int fd);
+
 // Wraps close(2) to emit error messages and abort on unrecoverable errors.
 extern void safe_close(int fd);

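
The hunk above only shows the declaration of `safe_fsync()`. As a rough illustration of what its comment describes (retry on interrupted system calls, abort on unrecoverable errors), a helper along these lines is conceivable. This is a sketch under assumptions, not the implementation from this commit: the name `safe_fsync_sketch` is hypothetical, it treats every error other than EINTR as unrecoverable, and it reports to stderr rather than through Zeek's own error reporting.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>

#include <unistd.h>

// Hypothetical sketch of an fsync() wrapper; not the code from this commit.
void safe_fsync_sketch(int fd) {
    assert(fd >= 0);

    int rc;
    do {
        // fsync() may fail with EINTR if a signal arrives; just retry.
        rc = fsync(fd);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0) {
        // Assumption: any other error is treated as unrecoverable.
        std::fprintf(stderr, "fsync(%d) failed: %s\n", fd, std::strerror(errno));
        std::abort();
    }
}
```

Returning `void` and aborting on failure mirrors the declared signature: callers cannot ignore a failed flush, at the cost of terminating the process.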