[systemd-devel] journald reliability

Zbigniew Jędrzejewski-Szmek zbyszek at in.waw.pl
Thu Oct 18 07:10:23 PDT 2012


Hi,

I've been playing with journald and journalctl, and have a few
thoughts about making the logs more reliable.

First problem: it would be nice to tie more messages from systemd to
units, i.e. use more structured logging. I'll follow up with a patch
which does that, and I'm wondering if this is the right approach.

Unfortunately, currently not all messages can be usefully attached by
journald to processes, because the system unit information is lost
after the last process exits. This sucks, because sometimes the most
important messages are the last ones. E.g.
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=665782 is about a
process failing to start. 'systemctl status' will probably not be able
to show this message, because by the time journald tries to query
cgroup information, it is too late.

But systemd has/had this information and could be queried for it. So I
propose to add a cache of recently dead PIDs in systemd, that could be
queried by journald. I'm not sure about the details, but probably
keeping the information for a few seconds should be enough. This would
lead to more reliable logging for messages close to the end of the
program.

Second problem: some messages are lost because of the small depth of
the unix datagram queue length. By default net.unix.max_dgram_qlen
appears to be 10, which is really really low. Messages get dropped
silently once this queue is full. I think that the goal should be
almost never loose messages. In extreme cases, if the logging system
if full or broken, messages must be lost, but otherwise this should
not happen. Otherwise things become very confusing for the user:
didn't this service start without giving any reason, or were just the
messages dropped?

Overflowing the queue can also be done on purpose by a rogue program:
send enough messages and break logging for legitimate processes.
Journald does rate-limiting, but only after receiving the messages,
so this does not help.

TODO contains an entry for this, suggesting (a) bumping sysctl,
(b) adding sockopt.

(a) would certainly help, but has the downside that this might
negatively influence other daemons. Also, it does not really solve the
problem, because if journald is not keeping up, the queue will
overflow anyway.

(b) would be nice, but will take time to get the kernel modified. And
again, increasing the queue depth can only be a part of a solution.

As an alternative to in-kernel queuing, I tried to implement a message
queue in the application. Messages are queued when the socket is busy,
and are resent using sendmmsg on next log statement. This adds
queueing that is under control of the application and helps if the
socket is _temporarily_ overloaded. To provide reliable logging
however, the application must block (or wait) at some point, until the
socket is free again. I'll post the patch as RFC, I'm not sure if it
is really useful.

I guess that some applications could simply block, but in case of e.g.
systemd itself it is not possible. journald should process the
messages faster. Should journald run reniced to higher priority? I
think that the problem is worse with journald than with syslog,
because journald tries to do more things: parse the message, query
cgroup information, forward the message to syslog, store it on disk.
syslog only does the last one, so necessarily must be more efficient.

Also, it seems that journald could be more efficient too: running
test-log (which queues sends two messages to joural before quitting)
show the following activity by journald:

epoll_wait(7, {?} 0x7fff4348ad40, 1, -1) = 1
ioctl(4, FIONREAD, [164])               = 0
recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"PRIORITY=6\nSYSLOG_FACILITY=3\nCOD"..., 2315}], msg_controllen=136, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 164
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=19981, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = -1 ESRCH (No such process)
sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=350, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = 56
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/comm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
readlink("/proc/19981/exe", 0x1008e80, 99) = -1 ENOENT (No such file or directory)
open("/proc/19981/cmdline", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
open("/proc/19981/sessionid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
open("/proc/19981/loginuid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/500/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
uname({sys="Linux", node="fedora-15", ...}) = 0
clock_gettime(CLOCK_REALTIME, {1350560576, 588798333}) = 0
clock_gettime(CLOCK_MONOTONIC, {184373, 816089357}) = 0
ftruncate(11, 29777920)                 = 0
ioctl(4, FIONREAD, [237])               = 0
recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"PRIORITY=6\nSYSLOG_FACILITY=3\nCOD"..., 2315}], msg_controllen=136, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 237
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=19981, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = -1 ESRCH (No such process)
sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=350, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = 56
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/comm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
readlink("/proc/19981/exe", 0x1008e80, 99) = -1 ENOENT (No such file or directory)
open("/proc/19981/cmdline", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
open("/proc/19981/sessionid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
open("/proc/19981/loginuid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/500/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
uname({sys="Linux", node="fedora-15", ...}) = 0
clock_gettime(CLOCK_REALTIME, {1350560576, 594694729}) = 0
clock_gettime(CLOCK_MONOTONIC, {184373, 821970285}) = 0
ftruncate(11, 29777920)                 = 0
ioctl(4, FIONREAD, [0])                 = 0
recvmsg(4, 0x7fff4348ab60, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(7, 

It seems that since the first sendmsg returning -ESRCH journald can
expect the the process it is trying to query is gone, and there is no
need to check /proc/19981/cgroup, /proc/19981/comm, etc. Would be
nice to short-circuit this. Is there a legitimate reason why some of
those files in /proc would be available, but not others?

Zbyszek


More information about the systemd-devel mailing list