[systemd-devel] Support for large applications

Avi Kivity avi at scylladb.com
Wed Mar 16 16:16:30 UTC 2016


On 02/18/2016 08:28 PM, Lennart Poettering wrote:
> On Wed, 17.02.16 14:35, Avi Kivity (avi at scylladb.com) wrote:
>
>> We are using systemd to supervise our NoSQL database and are generally
>> happy.
> Thank you for the feedback! We are always interested in good feedback
> like yours.
>
>> A few things will help even more:
>>
>> 1. log core dumps immediately rather than after the dump completes
>>
>> A database will often consume all memory on the machine; dumping 120GB can
>> take a lot of time, especially if compression is enabled. As the situation
>> is now, there is a period of time where it is impossible to know what is
>> happening.
>>
>> (I saw that 229 improves core dumps, but did not see this
>> specifically)
> With 229 the coredump hook will collect a bit of information and then
> pass things off (including the pipe the coredump is streamed in on) to
> a mini service that then processes the crash, extracts the stacktrace
> and writes it to disk. This means you should see the coredump
> processing as a normal service in "systemctl" and "systemd-cgtop" and
> similar tools. You should see normal logs about this service being
> started now, and you can do resource management on it.

What I'm most worried about is whether a non-expert user will be able to 
tell what happened.

>
>> 2. parallel compression of core dumps
>>
>> As well as consuming all of memory, we also consume all CPUs.  Once we dump
>> core we may as well use those cores for compressing the huge dump.
> We get the stuff via a pipe from the kernel. I am not sure whether gz
> or lz4 can distribute work on multiple CPUs if the data is flowing in
> strictly sequentially and there's no random access to the input data.
>
> But if the compressors support that then we should definitely make use
> of it!

They don't.  But this command line works as expected:

time dd if=/dev/zero bs=1M count=1000 | pigz --stdout | wc -c
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.678507 s, 1.5 GB/s
1144017

real    0m0.685s
user    0m9.879s
sys    0m1.412s


Note real vs. user: pigz spread the compression across all cores, so
wall-clock time is a small fraction of the total CPU time consumed.
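For contrast, a single-threaded baseline of the same pipeline (assuming
gzip is installed) keeps real roughly equal to user, since gzip cannot
spread work from a sequential pipe across cores:

```shell
# Single-threaded baseline: gzip processes the pipe on one core,
# so wall-clock (real) time roughly tracks CPU (user) time.
time dd if=/dev/zero bs=1M count=1000 | gzip --stdout | wc -c
```

pigz gets its parallelism by compressing independent blocks (128 KB by
default) from the stream on worker threads, which is why it speeds up
even when the input arrives strictly sequentially, as it does on the
kernel's coredump pipe.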

>
>> 3. watchdog during startup
>>
>> Sometimes we need to perform expensive operations during startup (log
>> replay, rebuild from network replica) before we can start serving. Rather
>> than configure a huge start timeout, I'd prefer to have the service report
>> progress to systemd so that it knows that startup is still in
>> progress.
> Interesting. How precisely would you suggest this should look? I mean,
> you say "report progress" – does this mean you want a textual string
> like "STATUS=" in sd_notify(), which you already have?

I have (and use) it, but it does not affect the startup timeout logic, 
AFAICT.

>   Or do
> you mean behaviour like the existing "WATCHDOG=1" logic, i.e. that
> start-up is aborted if the keep-alive messages are missing?

This is what I want.

>
> I think adding a WatchdogMode= setting that optionally requires
> regular WATCHDOG=1 notifications even in the start and stop phases of
> a service certainly makes sense, if that's what you are asking for.

It is.
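
To make the request concrete, here is a hypothetical unit sketch.
WatchdogMode= is only a proposal in this thread and does not exist in
systemd 229; the option name, its value, and the /usr/bin/mydb path are
all assumptions:

```ini
[Service]
Type=notify
WatchdogSec=30s
# Hypothetical, proposed option: also require WATCHDOG=1 keep-alives
# while the service is activating (e.g. during log replay), not only
# after it has reported READY=1.
WatchdogMode=startup
# During startup the daemon would call sd_notify(0, "WATCHDOG=1") at
# least every 15s (half of WatchdogSec), then sd_notify(0, "READY=1")
# once it is actually serving.
ExecStart=/usr/bin/mydb
```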

>
>> Hope this is useful,
> Yes, it is! Thanks!
>
> Lennart
>


