[systemd-devel] Odp: Re: BUG: several bugs in core/main.c (v218)

Wed Jan 28 14:14:13 PST 2015

On Mon, 26.01.15 00:33, Tomasz Pawlak (tomazzi at wp.pl) wrote:

First of all, I've disabled automatic replies from systemd-devel, because my e-mail account was getting flooded with hundreds of e-mails. As a side effect, I wasn't informed about this reply.

> You are right, but it's not as simple as it may look at first sight:
> 
> 1. If we allow the process to continue without sig handlers
> installed, then results can be just catastrophic: kernel panic with
> all the services launched -> broken transactions, half-written
> records/files, etc -> total mess, corrupted or lost data etc.  So,
> since successful installation of the sig handlers is one of the
> most critical parts of initialisation, it is actually safer to just
> quit. This is just a critical fault (and is currently completely
> ignored).

Hmm? No. If PID 1 dies, then either the kernel halts PID 1 or we do.

---->
Well, apparently the above is not as clear as I would like it to be, so here's an explanation:
What you're saying is systemd's point of view: no clients crashed, so everything is fine.
But it's not. If the init system fails, then, for example, services that were never activated can also cause failures in client applications, and what's worse, you can't predict how it fails. If it crashes the kernel, or just some part of it (like a device driver), then the results are unpredictable at best.

Therefore I say: if something as important as sigaction fails, it should be treated as a sign that there's something really wrong with the system, and there's no point in continuing, except perhaps for trying to write a log entry.
-----<

> 2. Another thing is, that those signals are not equivalently
> important, f.e. SIGABRT can be raised by thousands of lines of code in
> this project (by abort()), so it is much more likely that assertion
> checking will prevent segfaults, throwing SIGABRT instead. This
> means that SIGABRT is actually far more probable than SIGSEGV.  This
> in turn leads to simple solution: the process should unconditionally
> exit if the handler for SIGABRT has failed to install, but with other
> sig handlers failed, we may take a risk and continue.  In any case,
> such situation should be logged as soon as possible.  Ignoring this
> is just asking for catastrophe.

The only thing you can do to recover from SIGABRT or SIGSEGV is
reexec()ing yourself from the sig handler. That's something the kernel
doesn't allow for PID 1 however...

It's illusory to believe that you could just do some magic, and
return from SIGSEGV and continue running your program. You
cannot. SIGSEGV is more often than not an indication of memory
corruption, and if that happens, there's no way to bring back the
memory to a state where things are good again, because memory doesn't
tell you if it's in a good or bad state.

--->
Yes and no.
What you say is an "old school" point of view.
The key problem with critical signals in the classic approach is that the program doesn't have any useful information about the context in which the signal was triggered (essentially because siginfo doesn't provide any useful information from the runtime point of view).
There are generally 3 kinds of segfaults: a failed read, a failed write, and a failed access to protected memory (whether it is protected by the MMU or by other means does not matter here).

Only the segfaults caused by write operations are hard (but not impossible) to recover from, because such operations can damage heap data, especially the heap chunk/fragmentation metadata, which will cause errors during later dynamic heap allocations.

However, when the program knows the context of the signal, it is rather simple to fix this at runtime, but several additional conditions need to be fulfilled:
The program must not fail in the main thread (which is only a matter of careful design), and all the functionality of the program must be split into threads running with custom stacks. Then it is possible to restart any part of the program, using the fact that TLS variables are mapped to physically different memory areas.

Of course, this is just a quick reply, and it would need a far more detailed explanation, but I simply have no time for this.

However, this (among some other things) is why I think that systemd's exception handling leaves much to be desired: the policy is to just quit on any error.
---<

> 3. SIGFPE: how often the code uses FPU? -> I mean, that handler for
> this sig can be dynamically installed/uninstalled when needed,
> probably only on a thread level, not for the whole process. This
> will allow to completely safely report failed sigaction by assertion
> checking.

SIGFPE is also triggered by integer divisions by zero (yeah, the name
is misleading). 

Catching SIGFPE, SIGSEGV, SIGABRT and so on are for software problems
that we don't expect. If we expected them then we could certainly
handle them in a nicer way than getting a signal thrown...
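
To make the point about the misleading name concrete, here is a small sketch, assuming Linux on x86, where an integer division by zero traps and the kernel delivers SIGFPE. Other architectures, such as AArch64, do not trap on it, and the C standard leaves division by zero undefined, so this is a demonstration only and not a recommended way to handle errors. guarded_div() and fpe_handler() are names made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf fpe_escape;

/* Jump back to the sigsetjmp() point instead of returning into the
 * instruction that trapped. */
static void fpe_handler(int signo) {
    (void) signo;
    siglongjmp(fpe_escape, 1);
}

/* Hypothetical helper: returns 0 and stores a / b in *out, or returns -1
 * if the handler could not be installed or the division raised SIGFPE. */
static int guarded_div(int a, int b, int *out) {
    struct sigaction sa, old;
    volatile int divisor = b;   /* volatile: force a real division at run time */
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = fpe_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGFPE, &sa, &old) < 0)
        return -1;

    if (sigsetjmp(fpe_escape, 1) == 0) {
        *out = a / divisor;     /* no floating point involved here */
        rc = 0;
    } else {
        rc = -1;                /* reached via siglongjmp() from the handler */
    }

    sigaction(SIGFPE, &old, NULL);   /* restore the previous disposition */
    return rc;
}
```

On x86, guarded_div(1, 0, &r) is expected to take the SIGFPE path even though the operands are plain integers.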

--->
Of course you are right. As an excuse, I can say that I don't work only with the Linux kernel, and it was 2:00 am when I was writing this. I feel really stupid reading it again.

Nevertheless, the point was that not necessarily all the handlers are needed at initialisation time...
---<

> 4. So, sigaction_many() should be removed (also because it is a
> vararg function, which is a rather bad idea), and a function for

Ahum? vararg is bad now? I must have missed that memo. Why would it be
bad? Do you write C code without printf() (which is varargs)?

--->
Well, that was a side note, but if you want an explanation, here it is:
You are using a vararg function only to handle a *preprocessor symbol*, which is constant at runtime. If you really need a function whose implementation depends on a preprocessor symbol, then it would be simpler to just use a macro...

Nevertheless, any number of standard (non-realtime) signals can be passed as a single argument, like this:
(1 << SIGSEGV) | (1 << SIGABRT), i.e. as a (u)int32_t argument.
---<
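
A minimal sketch of the single-argument idea, assuming one bit per signal number (1 << signo), which only works for the standard signals numbered 1..31. SIG_BIT and sig_mask_to_list() are hypothetical names introduced for this example; nothing like them is claimed to exist in systemd.

```c
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* One bit per standard signal; only valid for signal numbers 1..31. */
#define SIG_BIT(sig) (UINT32_C(1) << (sig))

/* Hypothetical helper: expand a mask into a list of signal numbers in
 * ascending order. `out` must have room for 31 entries. Returns how many
 * entries were written. */
static size_t sig_mask_to_list(uint32_t mask, int *out) {
    size_t n = 0;

    for (int sig = 1; sig < 32; sig++)
        if (mask & SIG_BIT(sig))
            out[n++] = sig;
    return n;
}
```

A caller would build the argument as SIG_BIT(SIGSEGV) | SIG_BIT(SIGABRT) and the callee would walk the resulting list, calling sigaction() once per entry.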

> registering one sig handler at a time should be used. Then, we can
> tell (log) which signals were not registered by sigaction, and take
> a conscious decision about what to do next.
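
As a rough sketch of the proposal quoted above (one sigaction() call per signal, with each failure logged by name), something like the following could be used. The struct, the function name and the "critical" flag are assumptions made for illustration; this is not the code in core/main.c, and what the caller does with the result is a policy question.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical table entry: which signal, its name for logging, and
 * whether the caller wants to treat an installation failure as fatal. */
struct crash_signal {
    int signo;
    const char *name;
    bool critical;
};

/* Install `handler` for each entry separately so that every failure can be
 * logged by name. Returns the number of failures; *critical_failed is set
 * when at least one entry marked critical could not be installed. */
static int install_crash_handlers(const struct crash_signal *sigs, size_t n,
                                  void (*handler)(int), bool *critical_failed) {
    struct sigaction sa;
    int failures = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    *critical_failed = false;

    for (size_t i = 0; i < n; i++) {
        if (sigaction(sigs[i].signo, &sa, NULL) == 0)
            continue;
        fprintf(stderr, "sigaction(%s) failed: %s\n",
                sigs[i].name, strerror(errno));
        failures++;
        if (sigs[i].critical)
            *critical_failed = true;
    }
    return failures;
}
```

The caller can then either ignore the result and proceed, or bail out when *critical_failed is set.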

We actually want to handle failure of installing these crash handlers
all the same way: by mostly ignoring them, and proceeding.

--->
I won't comment on this here.
---<

Lennart