[systemd-devel] journald reliability

Thu Oct 18 10:02:28 PDT 2012

On Thu, 18.10.12 16:10, Zbigniew Jędrzejewski-Szmek (zbyszek at in.waw.pl) wrote:

> Hi,
> 
> I've been playing with journald and journalctl, and have a few
> thoughts about making the logs more reliable.
> 
> First problem: it would be nice to tie more messages from systemd to
> units, i.e. use more structured logging. I'll follow up with a patch
> which does that, and I'm wondering if this is the right approach.

Yes, this is definitely desirable.

> Unfortunately, currently not all messages can be usefully attached by
> journald to processes, because the system unit information is lost
> after the last process exits. This sucks, because sometimes the most
> important messages are the last ones. E.g.
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=665782 is about a
> process failing to start. 'systemctl status' will probably not be able
> to show this message, because by the time journald tries to query
> cgroup information, it is too late.

This is a real problem.

> But systemd has/had this information and could be queried for it. So I
> propose to add a cache of recently dead PIDs in systemd, that could be
> queried by journald. I'm not sure about the details, but probably
> keeping the information for a few seconds should be enough. This would
> lead to more reliable logging for messages close to the end of the
> program.

This wouldn't really work I fear, since systemd only really tracks the
main and control PIds of a cgroup, and there might be much more that
log.

I think the only real way to solve this cleanly is to extend the kernel
to provide SCM_CREDENTIAL and SCM_SECURITY-like auxiliary messages that
carries the information we need. More specifically I am hoping that we
can get SCM_AUDIT (for the loginuid + sessionid of a process),
SCM_EXECINFO (for exe, comm, cmdline), SCM_CGROUP (for cgroup
membership). With that we should have all data we need in a safe and
secure way.

> Second problem: some messages are lost because of the small depth of
> the unix datagram queue length. By default net.unix.max_dgram_qlen
> appears to be 10, which is really really low. Messages get dropped
> silently once this queue is full. 

Note that this is only true for systemd's own log.c message. For the
stuff written with sd_printv we actually block on the socket.

> I think that the goal should be
> almost never loose messages. In extreme cases, if the logging system
> if full or broken, messages must be lost, but otherwise this should
> not happen. Otherwise things become very confusing for the user:
> didn't this service start without giving any reason, or were just the
> messages dropped?
>
> TODO contains an entry for this, suggesting (a) bumping sysctl,
> (b) adding sockopt.

This (and the SCM_xyz issue mentioned above) have been on our plumbers
wishlist for a while.

> (a) would certainly help, but has the downside that this might
> negatively influence other daemons. Also, it does not really solve the
> problem, because if journald is not keeping up, the queue will
> overflow anyway.

Yeah, this is not really desirable indeed.

> 
> (b) would be nice, but will take time to get the kernel modified. And
> again, increasing the queue depth can only be a part of a solution.
> 
> As an alternative to in-kernel queuing, I tried to implement a message
> queue in the application. Messages are queued when the socket is busy,
> and are resent using sendmmsg on next log statement. This adds
> queueing that is under control of the application and helps if the
> socket is _temporarily_ overloaded. To provide reliable logging
> however, the application must block (or wait) at some point, until the
> socket is free again. I'll post the patch as RFC, I'm not sure if it
> is really useful.

Hmm, but how would the app queue be that much better than an in-kernel
socket queue? Either way eventually messages need to be dropped. And the
kernel queue is maintained anyway, so I'd just make use of that.

> I guess that some applications could simply block, but in case of e.g.
> systemd itself it is not possible. journald should process the
> messages faster. Should journald run reniced to higher priority? I
> think that the problem is worse with journald than with syslog,
> because journald tries to do more things: parse the message, query
> cgroup information, forward the message to syslog, store it on disk.
> syslog only does the last one, so necessarily must be more efficient.

I remember that maemo used to have issues with syslog blocking. They
basically ran into a priority inversion problem, where RT threads ended
up waiting for the non-RT syslog. I think large qlen are a pretty good
solution for this.

> Also, it seems that journald could be more efficient too: running
> test-log (which queues sends two messages to joural before quitting)
> show the following activity by journald:
> 
> epoll_wait(7, {?} 0x7fff4348ad40, 1, -1) = 1
> ioctl(4, FIONREAD, [164])               = 0
> recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"PRIORITY=6\nSYSLOG_FACILITY=3\nCOD"..., 2315}], msg_controllen=136, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 164
> open("/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=19981, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = -1 ESRCH (No such process)
> sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=350, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = 56
> open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> open("/proc/19981/comm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> readlink("/proc/19981/exe", 0x1008e80, 99) = -1 ENOENT (No such file or directory)
> open("/proc/19981/cmdline", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
> capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
> open("/proc/19981/sessionid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
> capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
> open("/proc/19981/loginuid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> open("/proc/500/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> uname({sys="Linux", node="fedora-15", ...}) = 0
> clock_gettime(CLOCK_REALTIME, {1350560576, 588798333}) = 0
> clock_gettime(CLOCK_MONOTONIC, {184373, 816089357}) = 0
> ftruncate(11, 29777920)                 = 0
> ioctl(4, FIONREAD, [237])               = 0
> recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"PRIORITY=6\nSYSLOG_FACILITY=3\nCOD"..., 2315}], msg_controllen=136, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 237
> open("/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=19981, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = -1 ESRCH (No such process)
> sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=350, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = 56
> open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> open("/proc/19981/comm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> readlink("/proc/19981/exe", 0x1008e80, 99) = -1 ENOENT (No such file or directory)
> open("/proc/19981/cmdline", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
> capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
> open("/proc/19981/sessionid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
> capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
> open("/proc/19981/loginuid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> open("/proc/500/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
> uname({sys="Linux", node="fedora-15", ...}) = 0
> clock_gettime(CLOCK_REALTIME, {1350560576, 594694729}) = 0
> clock_gettime(CLOCK_MONOTONIC, {184373, 821970285}) = 0
> ftruncate(11, 29777920)                 = 0
> ioctl(4, FIONREAD, [0])                 = 0
> recvmsg(4, 0x7fff4348ab60, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
> epoll_wait(7, 
> 
> It seems that since the first sendmsg returning -ESRCH journald can
> expect the the process it is trying to query is gone, and there is no
> need to check /proc/19981/cgroup, /proc/19981/comm, etc. Would be
> nice to short-circuit this. Is there a legitimate reason why some of
> those files in /proc would be available, but not others?

This would go away if we had SCM_xyz as mentioned above. The big
advantage would be that we'd always have the data from the precise point
the message was queued.

Optimizing the ENOENTs away feels like optimizing the error path,
i.e. polishing failure. I am pretty sure the right fix is to get the
kernel extended to be nicer to us...

Lennart

-- 
Lennart Poettering - Red Hat, Inc.