[systemd-devel] default service restart action?

Wed May 11 17:07:00 UTC 2016

On Wed, 11.05.16 11:27, Brian Kroth (bpkroth at gmail.com) wrote:

> Hi all, I'm in the midst of steeping myself in systemd docs as I prepare to
> face lift a slew of services for Debian Jessie updates.
> 
> As I read through things I'm starting to think through a number of new ways
> I could potentially reorganize some of our services, which is cool. With my
> ideas though I think I'm finding a few gaps in either my understanding or
> systemd capabilities, so I wanted to send a few questions to the list.
> Hopefully this is the right place.
> 
> The first should hopefully be a bit of a softball:
> 
> With .service units one can specify OnFailure and other sorts of restart
> behaviors, including thresholds and backoffs for when to stop retrying and
> what to do then. Essentially a lightweight service problem escalation
> procedure.
> 
> However, in reading systemd-system.conf, I don't see any way to specify
> something like DefaultOnFailure behavior for what to do on failure, perhaps
> after some simple restart attempts, for all services.  Seems like it can
> only be done on a per unit basis, no?

That is correct, yes.

> Ideally, I'd like to be able to do something very simply like, declare
> if any service fails to restart itself or does so too often and enters a
> hard failure state, then systemd should (attempt to) fire off an
> escalation procedure unit like send a passive check status to Nagios or
> send an email, accepting that such procedures may depend upon network
> connectivity which may or may not be available (so maybe there's some
> circular dependency issues to work through in such a scenario, but I
> presume systemd already has facilities for handling that case, maybe via
> OnFailureJobMode= settings).
> 
> Thoughts?

That sounds like it goes towards service monitoring?

I figure our theory there was that monitoring systems should probably
keep an eye on the journal stream, where events about these issues are
logged. These log entries are recognizable by their message ID and
carry both human-readable and structured metadata that lets you know
what's going on. Our plan was originally to
then add a concept of "activation-by-log-event" to systemd, so that
you could activate some service each time a log event of a certain
kind happens. However, we never got around to actually hacking that
up; it's still on the TODO list.
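
Until something like that exists, a monitoring agent can do the
filtering itself. The following is only a rough sketch, not anything
systemd ships: it assumes the entries arrive one JSON object per line
(the format "journalctl -o json" emits) and that failures carry the
MESSAGE_ID shown below. That ID, the function name failed_units() and
the "<unknown>" placeholder are assumptions made for illustration;
check the ID against sd-messages.h on your system before relying on it.

```python
import json
from typing import Iterable, Iterator, Tuple

# Assumption: the message ID attached to "unit failed" journal entries.
# Verify against systemd's sd-messages.h for the version you run.
UNIT_FAILED_ID = "be02cf6855d2428ba40df7e9d022f03d"


def failed_units(
    lines: Iterable[str], message_id: str = UNIT_FAILED_ID
) -> Iterator[Tuple[str, str]]:
    """Yield (unit, message) for each journal entry matching message_id.

    `lines` is any iterable of strings, one JSON-encoded entry per line.
    Blank lines and lines that are not JSON objects are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partial or garbled lines
        if not isinstance(entry, dict):
            continue
        if entry.get("MESSAGE_ID") != message_id:
            continue
        # UNIT and MESSAGE are the structured fields used here; either
        # may be absent, so fall back to placeholders.
        unit = str(entry.get("UNIT", "<unknown>"))
        message = str(entry.get("MESSAGE", ""))
        yield unit, message
```

One way to use it would be to pipe "journalctl -f -o json" into a
small script that passes sys.stdin to failed_units() and hands each
result to whatever notifies Nagios or sends the mail. Because the
journal is read as a stream, a burst of failures is simply processed
one entry after another.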

I think OnFailure= and stuff are pretty useful for some things, but
for the monitoring case such journal-based logic would be nicer,
because it copes better with events triggered in quick succession and
during early boot, as the processing of these can happen serially and
asynchronously... Also, it would allow much nicer filtering for any
kind of event on the system, and we wouldn't have to hook up every
kind of failure of each service with an OnFailure=-style dependency.

So yeah, I think we should have better support for what you are trying
to do, but I think the best way to do that is by delivering the
activate-by-log-message feature after all...

Lennart

-- 
Lennart Poettering, Red Hat