[systemd-devel] Significant performance loss caused by commit a65f06b: journal: return -ECHILD after a fork

Mon Jul 10 12:04:59 UTC 2017

On Mon, 10.07.17 21:51, Michael Chapman (mike at very.puzzling.org) wrote:

> > This all stems from my experiences with PulseAudio back in the day:
> > People do not grok the effect of fork(): it only duplicates the
> > invoking thread, not any other threads of the process, moreover all
> > data structures are copied as they are, and that's a time bomb really:
> > consider one of our context objects is being used by one thread at the
> > moment another thread invokes fork(): the thread using the object is
> > busy making changes to the object, rearranging some data structure (for
> > example, rehashing a hash table, because it hit its fill limit) and
> > suchlike. Now the fork() happens while it is doing that: the data
> > structure will be copied in its half-written, half-updated status quo,
> > and in the child process there's no thread that could finish what has
> > been started, nor is there a way to roll back the changes that
> > are in progress.
> [...]
> 
> Thanks, that really does clear things up.
> 
> It's a pity glibc doesn't provide an equivalent for pthread_atfork() outside
> of the pthread library. Having a notification that a fork has just occurred
> would allow us to do the PID caching ourselves.

Well, pthread_atfork() is probably more a source of problems than a solution
for them.
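
For reference, the PID caching idea mentioned above could be sketched
roughly as follows. This is only an illustration: getpid_cached() and
its helpers are hypothetical names, not anything from the journal
code, and the sketch assumes the fork goes through libc's fork() so
that the atfork handlers actually run.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical cache; 0 means "nothing cached yet". */
static _Atomic pid_t cached_pid;
static pthread_once_t atfork_once = PTHREAD_ONCE_INIT;

/* Runs in the child right after fork(): forget the parent's PID. */
static void reset_cached_pid(void) {
    atomic_store(&cached_pid, 0);
}

static void install_atfork_handler(void) {
    /* No prepare or parent handler, only a child handler. */
    pthread_atfork(NULL, NULL, reset_cached_pid);
}

/* Hypothetical replacement for getpid() that avoids repeated syscalls. */
pid_t getpid_cached(void) {
    pid_t pid;

    pthread_once(&atfork_once, install_atfork_handler);

    pid = atomic_load(&cached_pid);
    if (pid == 0) {
        pid = getpid();
        atomic_store(&cached_pid, pid);
    }
    return pid;
}
```

A child created by invoking the clone syscall directly would not run
the handler and would keep seeing the parent's PID.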

Mutexes and fork() do not mix well: if a thread has acquired a mutex
right before another thread calls fork(), then in the child that thread
ceases to exist but the mutex remains locked. Now, you could use
pthread_atfork() to unlock it, but that really works only in trivial
cases, with trivial data structures, and otherwise creates ABBA
deadlock problems and similar. I mean, mutexes are supposed to make
pieces of code atomic from the outside view: but if you duplicate a
process without the thread that holds the mutex, the critical section
will appear aborted to the outside, and that's quite far from
"atomic"...
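
A minimal sketch of the stuck-mutex case, assuming Linux with glibc;
the names (demo_lock, child_sees_locked_mutex() and so on) are made up
for illustration. A worker thread takes the mutex, the main thread
forks while it is held, and the child probes the copied mutex.
Strictly speaking POSIX only allows async-signal-safe calls in the
child of a multithreaded process, so the trylock in the child is for
demonstration only.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t lock_held;   /* posted by the worker once it owns demo_lock */
static sem_t may_release; /* posted by the main thread after fork() */

static void *worker(void *arg) {
    (void) arg;
    pthread_mutex_lock(&demo_lock);
    sem_post(&lock_held);
    sem_wait(&may_release);      /* keep the mutex held across the fork() */
    pthread_mutex_unlock(&demo_lock);
    return NULL;
}

/* Returns 1 if the child found the mutex locked, 0 if not, -1 on error. */
int child_sees_locked_mutex(void) {
    pthread_t thread;
    pid_t pid;
    int status;

    if (sem_init(&lock_held, 0, 0) < 0 || sem_init(&may_release, 0, 0) < 0)
        return -1;
    if (pthread_create(&thread, NULL, worker, NULL) != 0)
        return -1;

    sem_wait(&lock_held);        /* the worker owns demo_lock from here on */

    pid = fork();
    if (pid == 0) {
        /* Child: the worker thread does not exist here, but the copied
         * mutex still looks locked and nobody is left to unlock it. */
        _exit(pthread_mutex_trylock(&demo_lock) == EBUSY ? 1 : 0);
    }

    /* Parent: let the worker finish and clean up. */
    sem_post(&may_release);
    pthread_join(thread, NULL);
    sem_destroy(&lock_held);
    sem_destroy(&may_release);

    if (pid < 0 || waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Under these assumptions the function should return 1: a blocking
pthread_mutex_lock() in the child would simply hang forever.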

> Of course, there's still a problem with people calling the clone syscall
> directly... but I think once people start doing that we have to trust them
> to know what they're doing.

Yes: if you invoke clone() directly, you should really invoke execve()
soon afterwards too, and in the time between these two syscalls you
should not invoke getpid() and should limit yourself to known safe
calls.
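
As a rough sketch of that pattern, assuming Linux on an architecture
where the raw clone syscall takes the flags as its first argument
(x86-64 and aarch64 do, some others do not): everything is prepared
before the clone, and the child calls nothing but execve() and
_exit(). The helper name spawn_via_raw_clone() is hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: runs argv[0] in a child created with a raw clone
 * syscall. Returns the child's exit status, or -1 on error. */
int spawn_via_raw_clone(char *const argv[]) {
    int status;

    /* With only SIGCHLD as flags and no new stack this behaves like
     * fork(), but it bypasses libc's fork() wrapper, so no atfork
     * handlers run and nothing libc caches gets refreshed.
     * ASSUMPTION: flags-first argument order (x86-64, aarch64). */
    long pid = syscall(SYS_clone, (unsigned long) SIGCHLD, 0UL, 0UL, 0UL, 0UL);

    if (pid == 0) {
        /* Child: no getpid(), no malloc(), no stdio. Only execve() and,
         * if that fails, _exit(); both are async-signal-safe. */
        execve(argv[0], argv, environ);
        _exit(127);
    }
    if (pid < 0)
        return -1;

    if (waitpid((pid_t) pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, calling it with { "/bin/sh", "-c", "exit 7", NULL } should
return 7 on a system where /bin/sh exists.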

Lennart

-- 
Lennart Poettering, Red Hat