[systemd-devel] [PATCH 1/3] shared: add generic IPC barrier

Mon Jul 14 00:25:59 PDT 2014

On Sun, Jul 13, 2014 at 5:37 PM, David Herrmann <dh.herrmann at gmail.com> wrote:
> Hi
>
> On Sun, Jul 13, 2014 at 1:30 PM, Tom Gundersen <teg at jklm.no> wrote:
>> Couple of random nitpicks below.
>>
>> On Sun, Jul 13, 2014 at 12:37 PM, David Herrmann <dh.herrmann at gmail.com> wrote:
>>> The "Barrier" object is a simple inter-process barrier implementation. It
>>> allows placing synchronization points and waiting for the other side to
>>> reach it. Additionally, it has an abortion-mechanism as second-layer
>>> synchronization to send abortion-events asynchronously to the other side.
>>>
>>> The API is usually used to synchronize processes during fork(). However,
>>> it can be extended to pass state through execve() so you could synchronize
>>> beyond execve().
>>>
>>> Usually, it's used like this (error-handling replaced by assert() for
>>> simplicity):
>>>
>>>     Barrier b;
>>>
>>>     r = barrier_init(&b);
>>>     assert_se(r >= 0);
>>>
>>>     pid = fork();
>>>     assert_se(pid >= 0);
>>>     if (pid == 0) {
>>>             barrier_set_role(&b, BARRIER_CHILD);
>>>
>>>             ...do child post-setup...
>>>             if (CHILD_SETUP_FAILED)
>>>                        exit(1);
>>>             ...child setup done...
>>>
>>>             barrier_place(&b);
>>>             if (!barrier_sync(&b)) {
>>>                     /* parent setup failed */
>>>                     exit(1);
>>>             }
>>>
>>>             barrier_destroy(&b); /* redundant as execve() and exit() imply this */
>>>
>>>             /* parent & child setup successful */
>>>             execve(...);
>>>     }
>>>
>>>     barrier_set_role(&b, BARRIER_PARENT);
>>>
>>>     ...do parent post-setup...
>>>     if (PARENT_SETUP_FAILED) {
>>>             barrier_abort(&b);          /* send abortion event */
>>>             barrier_wait_abortion(&b);  /* wait for child to abort (exit() implies abortion) */
>>>             barrier_destroy(&b);
>>>            ...bail out...
>>>     }
>>>     ...parent setup done...
>>>
>>>     barrier_place(&b);
>>>     if (!barrier_sync(&b)) {
>>>             ...child setup failed... ;
>>>             barrier_destroy(&b);
>>>             ...bail out...
>>>     }
>>>
>>>     barrier_destroy(&b);
>>>
>>>     ...child setup successfull...
>>>
>>> This is the most basic API. Using barrier_place() to place barriers and
>>> barrier_sync() to perform a full synchronization between both processes.
>>> barrier_abort() places an abortion barrier which superceeds any other
>>> barriers, exit() (or barrier_destroy()) places an abortion-barrier that
>>> queues behind existing barriers (thus *not* replacing existing barriers
>>> unlike barrier_abort()).
>>>
>>> This example uses hard-synchronization with wait_abortion(), sync() and
>>> friends. These are all optional. Barriers are highly dynamic and can be
>>> used for one-way synchronization or even no synchronization at all
>>> (postponing it for later). The sync() call performs a full two-way
>>> synchronization.
>>>
>>> The API is documented and should be fairly self-explanatory. A test-suite
>>> shows some special semantics regarding abortion, wait_next() and exit().
>>>
>>> Internally, barriers use two eventfds and a pipe. The pipe is used to
>>> detect exit()s of the remote side as eventfds do not allow that. The
>>> eventfds are used to place barriers, one for each side. Barriers itself
>>> are numbered, but the numbers are reused once both sides reached the same
>>> barrier, thus you cannot address barriers by the index. Moreover, the
>>> numbering is implicit and we only store a counter. This makes the
>>> implementation itself very lightweight, which is probably negligible
>>> considering that we need 3 FDs for a barrier..
>>>
>>> Last but not least: This barrier implementation is quite heavy. It's
>>> definitely not meant for fast IPC synchronization. However, it's very easy
>>> to use. And given the *HUGE* overhead of fork(), the barrier-overhead
>>> should be negligible.
>>> ---
>>>  .gitignore              |   1 +
>>>  Makefile.am             |   9 +
>>>  src/shared/barrier.c    | 442 ++++++++++++++++++++++++++++++++++++++++++++++
>>>  src/shared/barrier.h    |  97 ++++++++++
>>>  src/test/test-barrier.c | 460 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>  5 files changed, 1009 insertions(+)
>>>  create mode 100644 src/shared/barrier.c
>>>  create mode 100644 src/shared/barrier.h
>>>  create mode 100644 src/test/test-barrier.c
>>>
>>> diff --git a/.gitignore b/.gitignore
>>> index 31cd8f8..5289f0e 100644
>>> --- a/.gitignore
>>> +++ b/.gitignore
>>> @@ -123,6 +123,7 @@
>>>  /tags
>>>  /test-architecture
>>>  /test-async
>>> +/test-barrier
>>>  /test-boot-timestamp
>>>  /test-bus-chat
>>>  /test-bus-cleanup
>>> diff --git a/Makefile.am b/Makefile.am
>>> index 2b1484f..039a83e 100644
>>> --- a/Makefile.am
>>> +++ b/Makefile.am
>>> @@ -828,6 +828,8 @@ libsystemd_shared_la_SOURCES = \
>>>         src/shared/login-shared.h \
>>>         src/shared/ring.c \
>>>         src/shared/ring.h \
>>> +       src/shared/barrier.c \
>>> +       src/shared/barrier.h \
>>>         src/shared/async.c \
>>>         src/shared/async.h \
>>>         src/shared/eventfd-util.c \
>>> @@ -1238,6 +1240,7 @@ tests += \
>>>         test-ellipsize \
>>>         test-util \
>>>         test-ring \
>>> +       test-barrier \
>>>         test-tmpfiles \
>>>         test-namespace \
>>>         test-date \
>>> @@ -1408,6 +1411,12 @@ test_ring_SOURCES = \
>>>  test_ring_LDADD = \
>>>         libsystemd-core.la
>>>
>>> +test_barrier_SOURCES = \
>>> +       src/test/test-barrier.c
>>> +
>>> +test_barrier_LDADD = \
>>> +       libsystemd-core.la
>>> +
>>>  test_tmpfiles_SOURCES = \
>>>         src/test/test-tmpfiles.c
>>>
>>> diff --git a/src/shared/barrier.c b/src/shared/barrier.c
>>> new file mode 100644
>>> index 0000000..e7b4ead
>>> --- /dev/null
>>> +++ b/src/shared/barrier.c
>>> @@ -0,0 +1,442 @@
>>> +/*-*- Mode: C; c-basic-offset: 8; indent-tabs-mode: nil -*-*/
>>> +
>>> +/***
>>> +  This file is part of systemd.
>>> +
>>> +  Copyright 2014 David Herrmann <dh.herrmann at gmail.com>
>>> +
>>> +  systemd is free software; you can redistribute it and/or modify it
>>> +  under the terms of the GNU Lesser General Public License as published by
>>> +  the Free Software Foundation; either version 2.1 of the License, or
>>> +  (at your option) any later version.
>>> +
>>> +  systemd is distributed in the hope that it will be useful, but
>>> +  WITHOUT ANY WARRANTY; without even the implied warranty of
>>> +  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>>> +  Lesser General Public License for more details.
>>> +
>>> +  You should have received a copy of the GNU Lesser General Public License
>>> +  along with systemd; If not, see <http://www.gnu.org/licenses/>.
>>> +***/
>>> +
>>> +#include <errno.h>
>>> +#include <fcntl.h>
>>> +#include <limits.h>
>>> +#include <poll.h>
>>> +#include <stdbool.h>
>>> +#include <stdint.h>
>>> +#include <stdio.h>
>>> +#include <stdlib.h>
>>> +#include <string.h>
>>> +#include <sys/eventfd.h>
>>> +#include <sys/types.h>
>>> +#include <unistd.h>
>>> +
>>> +#include "barrier.h"
>>> +#include "macro.h"
>>> +#include "util.h"
>>> +
>>> +/**
>>> + * Barriers
>>> + * This barrier implementation provides a simple synchronization method based
>>> + * on file-descriptors that can safely be used between threads and processes. A
>>> + * barrier object contains 2 shared counters based on eventfd. Both processes
>>> + * can now place barriers and wait for the other end to reach a random or
>>> + * specific barrier.
>>> + * Barriers are numbered, so you can either wait for the other end to reach any
>>> + * barrier or the last barrier that you placed. This way, you can use barriers
>>> + * for one-way *and* full synchronization. Note that even-though barriers are
>>> + * numbered, these numbers are internal and recycled once both sides reached the
>>> + * same barrier (implemented as a simple signed counter). It is thus not
>>> + * possible to address barriers by their ID.
>>> + *
>>> + * Barrier-API: Both ends can place as many barriers via barrier_place() as
>>> + * they want and each pair of barriers on both sides will be implicitly linked.
>>> + * Each side can use the barrier_wait/sync_*() family of calls to wait for the
>>> + * other side to place a specific barrier. barrier_wait_next() waits until the
>>> + * other side calls barrier_place(). No links between the barriers are
>>> + * considered and this simply serves as most basic asynchronous barrier.
>>> + * barrier_sync_next() is like barrier_wait_next() and waits for the other side
>>> + * to place their next barrier via barrier_place(). However, it only waits for
>>> + * barriers that are linked to a barrier we already placed. If the other side
>>> + * already placed more barriers than we did, barrier_sync_next() returns
>>> + * immediately.
>>> + * barrier_sync() extends barrier_sync_next() and waits until the other end
>>> + * placed as many barriers via barrier_place() as we did. If they already placed
>>> + * as many as we did (or more), it returns immediately.
>>> + *
>>> + * Additionally to basic barriers, an abortion event is available.
>>> + * barrier_abort() places an abortion event that cannot be undone. An abortion
>>> + * immediately cancels all placed barriers and replaces them. Any running and
>>> + * following wait/sync call besides barrier_wait_abortion() will immediately
>>> + * return false on both sides (otherwise, they always return true).
>>> + * barrier_abort() can be called multiple times on both ends and will be a
>>> + * no-op if already called on this side.
>>> + * barrier_wait_abortion() can be used to wait for the other side to call
>>> + * barrier_abort() and is the only wait/sync call that does not return
>>> + * immediately if we aborted outself. It only returns once the other side
>>> + * called barrier_abort().
>>> + *
>>> + * Barriers can be used for in-process and inter-process synchronization.
>>> + * However, for in-process synchronization you could just use mutexes.
>>> + * Therefore, main target is IPC and we require both sides to *not* share the FD
>>> + * table. If that's given, barriers provide target tracking: If the remote side
>>> + * exit()s, an abortion event is implicitly queued on the other side. This way,
>>> + * a sync/wait call will be woken up if the remote side crashed or exited
>>> + * unexpectedly. However, note that these abortion events are only queued if the
>>> + * barrier-queue has been drained. Therefore, it is safe to place a barrier and
>>> + * exit. The other side can safely wait on the barrier even though the exit
>>> + * queued an abortion event. Usually, the abortion event would overwrite the
>>> + * barrier, however, that's not true for exit-abortion events. Those are only
>>> + * queued if the barrier-queue is drained (thus, the receiving side has placed
>>> + * more barriers than the remote side).
>>> + */
>>> +
>>> +/**
>>> + * barrier_init() - Initialize a barrier object
>>> + * @obj: barrier to initialize
>>> + *
>>> + * This initializes a barrier object. The caller is responsible of allocating
>>> + * the memory and keeping it valid. The memory does not have to be zeroed
>>> + * beforehand.
>>> + * Two eventfd objects are allocated for each barrier. If allocation fails, an
>>> + * error is returned.
>>> + *
>>> + * If this function fails, the barrier is reset to an invalid state so it is
>>> + * safe to call barrier_destroy() on the object regardless whether the
>>> + * initialization succeeded or not.
>>> + *
>>> + * The caller is responsible to destroy the object via barrier_destroy() before
>>> + * releasing the underlying memory.
>>> + *
>>> + * Returns: 0 on success, negative error code on failure.
>>> + */
>>> +int barrier_init(Barrier *obj) {
>>> +        _barrier_destroy_ptr_ Barrier *b = NULL;
>>> +        int r;
>>> +
>>> +        assert_return(obj, -EINVAL);
>>> +
>>> +        b = obj;
>>> +        zero(*b);
>>> +
>>> +        b->me = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
>>> +        if (b->me < 0)
>>> +                return -errno;
>>> +
>>> +        b->them = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
>>> +        if (b->them < 0)
>>> +                return -errno;
>>> +
>>> +        r = pipe2(b->pipe, O_CLOEXEC | O_NONBLOCK);
>>> +        if (r < 0)
>>> +                return -errno;
>>> +
>>> +        b = NULL;
>>> +        return 0;
>>
>> Usually we don't mangle obj unless the call succeeds. You could just
>> assign the eventfds and pipe to local variables and then
>> zero+initialize obj once you know it all worked.
>
> This was part of the API. It guarantees that you can run
> barrier_destroy() on the object regardless whether barrier_init() was
> successful or not. A simple memzero() doesn't work with barriers as
> they have embedded FDs. And given that barriers are not dynamically
> allocated (but embedded), I thought this was a handy feature.

Hm, shouldn't we better make sure that you can call barrier_destroy()
even if barrier_init() was not called yet (I'm thinking of the case
where you use the cleanup macro and something else fails before you
reach barrier_init())? If you just (contrary to the current comment)
require that the object sholud be zeroed before calling barrier_init()
on it that should "just work", no? If I read correctly, this is
already what is being done in the users you ported.

> Anyhow, I changed barrier_destroy() to do nothing if all is 0 (b->me
> can never be the same as b->them, so it's invalid state). Therefore, a
> memzero() now works just fine and there's no need to keep that
> behavior. I can change this to:

Yup, sounds good.

> Barrier b = { };
>
> ...init...
>
> memcpy(obj, &b, sizeof(*obj));
> return 0;
>
>>> +}
>>> +
>>> +/**
>>> + * barrier_destroy() - Destroy a barrier object
>>> + * @b: barrier to destroy or NULL
>>> + *
>>> + * This destroys a barrier object that has previously been initialized via
>>> + * barrier_destroy(). The object is released and reset to invalid state.
>>
>> barrier_init()?
>
> Whoops, nice catch.
>
>>> + * Therefore, it is safe to call barrier_destroy() multiple times or even if
>>> + * barrier_init() failed. However, you must not call barrier_destroy() if you
>>> + * never called barrier_init() on the object before.
>>> + *
>>> + * It is safe to initialize a barrier via zero() / memset(.., 0, ...). Even
>>> + * though it has embedded FDs, barrier_destroy() can deal with zeroed objects
>>> + * just fine.
>>> + *
>>> + * If @b is NULL, this is a no-op.
>>> + */
>>> +void barrier_destroy(Barrier *b) {
>>> +        if (!b)
>>> +                return;
>>> +
>>> +        /* @me and @them cannot be both FD 0. Lets be pedantic and check the
>>> +         * pipes and barriers, too. If all are 0, the object was zero()ed and
>>> +         * is invalid. This allows users to use zero(barrier) to reset the
>>> +         * backing memory. */
>>> +        if (b->me == 0 &&
>>> +            b->them == 0 &&
>>> +            b->pipe[0] == 0 &&
>>> +            b->pipe[1] == 0 &&
>>> +            b->barriers == 0)
>>> +                return;
>>> +
>>> +        b->me = safe_close(b->me);
>>> +        b->them = safe_close(b->them);
>>> +        b->pipe[0] = safe_close(b->pipe[0]);
>>> +        b->pipe[1] = safe_close(b->pipe[1]);
>>> +        b->barriers = 0;
>>> +}
>>> +
>>> +/**
>>> + * barrier_set_role() - Set the local role of the barrier
>>> + * @b: barrier to operate on
>>> + * @role: role to set on the barrier
>>> + *
>>> + * This sets the roles on a barrier object. This is needed to know which
>>> + * side of the barrier you're on. Usually, the parent creates the barrier via
>>> + * barrier_init() and then calls fork() or clone(). Therefore, the FDs are
>>> + * duplicated and the child retains the same barrier object.
>>> + *
>>> + * Both sides need to call barrier_set_role() after fork() or clone() are done.
>>> + * If this is not done, barriers will not work correctly.
>>> + *
>>> + * Note that barriers could be supported without fork() or clone(). However,
>>> + * this is currently not needed so it hasn't been implemented.
>>> + */
>>> +void barrier_set_role(Barrier *b, unsigned int role) {
>>> +        int fd;
>>> +
>>> +        assert(b);
>>> +        assert(role == BARRIER_PARENT || role == BARRIER_CHILD);
>>> +        /* make sure this is only called once */
>>> +        assert(b->pipe[1] >= 0 && b->pipe[1] >= 0);
>>> +
>>> +        if (role == BARRIER_PARENT) {
>>> +                b->pipe[1] = safe_close(b->pipe[1]);
>>> +        } else {
>>> +                b->pipe[0] = safe_close(b->pipe[0]);
>>> +
>>> +                /* swap me/them for childs */
>>
>> children
>
> Gnah, I will never get that right. Fixed.
>
>>> +                fd = b->me;
>>> +                b->me = b->them;
>>> +                b->them = fd;
>>> +        }
>>> +}
>>> +
>>> +/* places barrier; returns false if we aborted, otherwise true */
>>> +static bool barrier_write(Barrier *b, uint64_t buf) {
>>> +        ssize_t len;
>>> +
>>> +        /* prevent new sync-points if we already aborted */
>>> +        if (barrier_i_aborted(b))
>>> +                return false;
>>> +
>>> +        do {
>>> +                len = write(b->me, &buf, sizeof(buf));
>>> +        } while (len < 0 && (errno == EAGAIN || errno == EINTR));
>>> +
>>> +        if (len != sizeof(buf))
>>> +                goto error;
>>> +
>>> +        /* lock if we aborted */
>>> +        if (buf >= (uint64_t)BARRIER_ABORTION) {
>>> +                if (barrier_they_aborted(b))
>>> +                        b->barriers = BARRIER_WE_ABORTED;
>>> +                else
>>> +                        b->barriers = BARRIER_I_ABORTED;
>>> +        } else if (!barrier_is_aborted(b)) {
>>> +                b->barriers += buf;
>>> +        }
>>> +
>>> +        return !barrier_i_aborted(b);
>>> +
>>> +error:
>>> +        /* If there is an unexpected error, we have to make this fatal. There
>>> +         * is no way we can recover from sync-errors. Therefore, we close the
>>> +         * pipe-ends and treat this as abortion. The other end will notice the
>>> +         * pipe-close and treat it as abortion, too. */
>>> +
>>> +        b->pipe[0] = safe_close(b->pipe[0]);
>>> +        b->pipe[1] = safe_close(b->pipe[1]);
>>> +        b->barriers = BARRIER_WE_ABORTED;
>>> +        return false;
>>> +}
>>> +
>>> +/* waits for barriers; returns false if they aborted, otherwise true */
>>> +static bool barrier_read(Barrier *b, int64_t comp) {
>>> +        uint64_t buf;
>>> +        ssize_t len;
>>> +        struct pollfd pfd[2] = { };
>>> +        int r;
>>> +
>>> +        if (barrier_they_aborted(b))
>>> +                return false;
>>> +
>>> +        while (b->barriers > comp) {
>>> +                pfd[0].fd = (b->pipe[0] >= 0) ? b->pipe[0] : b->pipe[1];
>>> +                pfd[0].events = POLLHUP;
>>> +                pfd[0].revents = 0;
>>> +                pfd[1].fd = b->them;
>>> +                pfd[1].events = POLLIN;
>>> +                pfd[1].revents = 0;
>>> +
>>> +                r = poll(pfd, 2, -1);
>>> +                if (r < 0 && (errno == EAGAIN || errno == EINTR))
>>> +                        continue;
>>> +                else if (r < 0)
>>> +                        goto error;
>>> +
>>> +                if (pfd[1].revents) {
>>> +                        /* events on @them signal us new data */
>>> +                        len = read(b->them, &buf, sizeof(buf));
>>> +                        if (len < 0 && (errno == EAGAIN || errno == EINTR))
>>> +                                continue;
>>> +
>>> +                        if (len != sizeof(buf))
>>> +                                goto error;
>>> +                } else if (pfd[0].revents & (POLLHUP | POLLERR | POLLNVAL)) {
>>> +                        /* POLLHUP on the pipe tells us the other side exited.
>>> +                         * We treat this as implicit abortion. But we only
>>> +                         * handle it if there's no event on the eventfd. This
>>> +                         * guarantees that exit-abortions do not overwrite real
>>> +                         * barriers. */
>>> +                        buf = BARRIER_ABORTION;
>>> +                }
>>> +
>>> +                /* lock if they aborted */
>>> +                if (buf >= (uint64_t)BARRIER_ABORTION) {
>>> +                        if (barrier_i_aborted(b))
>>> +                                b->barriers = BARRIER_WE_ABORTED;
>>> +                        else
>>> +                                b->barriers = BARRIER_THEY_ABORTED;
>>> +                } else if (!barrier_is_aborted(b)) {
>>> +                        b->barriers -= buf;
>>> +                }
>>> +        }
>>> +
>>> +        return !barrier_they_aborted(b);
>>> +
>>> +error:
>>> +        /* If there is an unexpected error, we have to make this fatal. There
>>> +         * is no way we can recover from sync-errors. Therefore, we close the
>>> +         * pipe-ends and treat this as abortion. The other end will notice the
>>> +         * pipe-close and treat it as abortion, too. */
>>> +
>>> +        b->pipe[0] = safe_close(b->pipe[0]);
>>> +        b->pipe[1] = safe_close(b->pipe[1]);
>>> +        b->barriers = BARRIER_WE_ABORTED;
>>> +        return false;
>>> +}
>>> +
>>> +/**
>>> + * barrier_place() - Place a new barrier
>>> + * @b: barrier object
>>> + *
>>> + * This places a new barrier on the barrier object. If either side already
>>> + * aborted, this is a no-op and returns "false". Otherwise, the barrier is
>>> + * placed and this returns "true".
>>> + *
>>> + * Returns: true if barrier was placed, false if either side aborted.
>>> + */
>>> +bool barrier_place(Barrier *b) {
>>> +        assert(b);
>>> +
>>> +        if (barrier_is_aborted(b))
>>> +                return false;
>>> +
>>> +        barrier_write(b, BARRIER_SINGLE);
>>> +        return true;
>>> +}
>>> +
>>> +/**
>>> + * barrier_abort() - Abort the synchronization
>>> + * @b: barrier object to abort
>>> + *
>>> + * This aborts the barrier-synchronization. If barrier_abort() was already
>>> + * called on this side, this is a no-op. Otherwise, the barrier is put into the
>>> + * ABORT-state and will stay there. The other side is notified about the
>>> + * abortion. Any following attempt to place normal barriers or to wait on normal
>>> + * barriers will return immediately as "false".
>>> + *
>>> + * You can wait for the other side to call barrier_abort(), too. Use
>>> + * barrier_wait_abortion() for that.
>>> + *
>>> + * Returns: false if the other side already aborted, true otherwise.
>>> + */
>>> +bool barrier_abort(Barrier *b) {
>>> +        assert(b);
>>> +
>>> +        barrier_write(b, BARRIER_ABORTION);
>>> +        return !barrier_they_aborted(b);
>>> +}
>>> +
>>> +/**
>>> + * barrier_wait_next() - Wait for the next barrier of the other side
>>> + * @b: barrier to operate on
>>> + *
>>> + * This waits until the other side places its next barrier. This is independent
>>> + * of any barrier-links and just waits for any next barrier of the other side.
>>> + *
>>> + * If either side aborted, this returns false.
>>> + *
>>> + * Returns: false if either side aborted, true otherwise.
>>> + */
>>> +bool barrier_wait_next(Barrier *b) {
>>> +        assert(b);
>>> +
>>> +        if (barrier_is_aborted(b))
>>> +                return false;
>>> +
>>> +        barrier_read(b, b->barriers - 1);
>>> +        return !barrier_is_aborted(b);
>>> +}
>>> +
>>> +/**
>>> + * barrier_wait_abortion() - Wait for the other side to abort
>>> + * @b: barrier to operate on
>>> + *
>>> + * This waits until the other side called barrier_abort(). This can be called
>>> + * regardless whether the local side already called barrier_abort() or not.
>>> + *
>>> + * If the other side has already aborted, this returns immediately.
>>> + *
>>> + * Returns: false if the local side aborted, true otherwise.
>>> + */
>>> +bool barrier_wait_abortion(Barrier *b) {
>>> +        assert(b);
>>> +
>>> +        barrier_read(b, BARRIER_THEY_ABORTED);
>>> +        return !barrier_i_aborted(b);
>>> +}
>>> +
>>> +/**
>>> + * barrier_sync_next() - Wait for the other side to place a next linked barrier
>>> + * @b: barrier to operate on
>>> + *
>>> + * This is like barrier_wait_next() and waits for the other side to call
>>> + * barrier_place(). However, this only waits for linked barriers. That means, if
>>> + * the other side already placed more barriers than (or as much as) we did, this
>>> + * returns immediately instead of waiting.
>>> + *
>>> + * If either side aborted, this returns false.
>>> + *
>>> + * Returns: false if either side aborted, true otherwise.
>>> + */
>>> +bool barrier_sync_next(Barrier *b) {
>>> +        assert(b);
>>> +
>>> +        if (barrier_is_aborted(b))
>>> +                return false;
>>> +
>>> +        barrier_read(b, MAX((int64_t)0, b->barriers - 1));
>>> +        return !barrier_is_aborted(b);
>>> +}
>>> +
>>> +/**
>>> + * barrier_sync() - Wait for the other side to place as many barriers as we did
>>> + * @b: barrier to operate on
>>> + *
>>> + * This is like barrier_sync_next() but waits for the other side to call
>>> + * barrier_place() as often as we did (in total). If they already placed as much
>>> + * as we did (or more), this returns immediately instead of waiting.
>>> + *
>>> + * If either side aborted, this returns false.
>>> + *
>>> + * Returns: false if either side aborted, true otherwise.
>>> + */
>>> +bool barrier_sync(Barrier *b) {
>>> +        assert(b);
>>> +
>>> +        if (barrier_is_aborted(b))
>>> +                return false;
>>> +
>>> +        barrier_read(b, 0);
>>> +        return !barrier_is_aborted(b);
>>> +}
>>> diff --git a/src/shared/barrier.h b/src/shared/barrier.h
>>> new file mode 100644
>>> index 0000000..2d98e0d
>>> --- /dev/null
>>> +++ b/src/shared/barrier.h
>>> @@ -0,0 +1,97 @@
>>> +/*-*- Mode: C; c-basic-offset: 8; indent-tabs-mode: nil -*-*/
>>> +
>>> +#pragma once
>>> +
>>> +/***
>>> +  This file is part of systemd.
>>> +
>>> +  Copyright 2014 David Herrmann <dh.herrmann at gmail.com>
>>> +
>>> +  systemd is free software; you can redistribute it and/or modify it
>>> +  under the terms of the GNU Lesser General Public License as published by
>>> +  the Free Software Foundation; either version 2.1 of the License, or
>>> +  (at your option) any later version.
>>> +
>>> +  systemd is distributed in the hope that it will be useful, but
>>> +  WITHOUT ANY WARRANTY; without even the implied warranty of
>>> +  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>>> +  Lesser General Public License for more details.
>>> +
>>> +  You should have received a copy of the GNU Lesser General Public License
>>> +  along with systemd; If not, see <http://www.gnu.org/licenses/>.
>>> +***/
>>> +
>>> +#include <errno.h>
>>> +#include <inttypes.h>
>>> +#include <stdlib.h>
>>> +#include <string.h>
>>> +#include <sys/types.h>
>>> +
>>> +#include "macro.h"
>>> +#include "util.h"
>>> +
>>> +/* See source file for an API description. */
>>> +
>>> +typedef struct Barrier Barrier;
>>> +
>>> +enum {
>>> +        BARRIER_SINGLE                  = 1LL,
>>> +        BARRIER_ABORTION                = INT64_MAX,
>>> +
>>> +        /* bias values to store state; keep @WE < @THEY < @I */
>>> +        BARRIER_BIAS                    = INT64_MIN,
>>> +        BARRIER_WE_ABORTED              = BARRIER_BIAS + 1LL,
>>> +        BARRIER_THEY_ABORTED            = BARRIER_BIAS + 2LL,
>>> +        BARRIER_I_ABORTED               = BARRIER_BIAS + 3LL,
>>> +};
>>> +
>>> +enum {
>>> +        BARRIER_PARENT,
>>> +        BARRIER_CHILD,
>>> +};
>>> +
>>> +struct Barrier {
>>> +        int me;
>>> +        int them;
>>
>> Maybe use us/them (and not we/they) consistently everywhere?
>
> Hm. No idea how that should work. The member-names are dative case,
> the enums are nominative case. I doesn't make much sense to me to use
> the same for both.

Right, that's fine.

> I use 1st person singular + 3rd person plural (me+them and I+they).

This is what confused me. Why mix first and third person? I would have
said us+them and we+they. Obviously that means you'd have to rename
BARRIER_WE_ABORTED as you suggested below. This is safely into
bikeshed territory though, so feel free to ignore.

> The only exception is BARRIER_WE_ABORTED, but that is a combination of
> BARRIER_THEY_ABORTED and BARRIER_I_ABORTED (therefore, 'we'). I don't
> know how I could make it more consistent? Did you just mix up the
> BARRIER_WE_ABORTED constant? I could rename it to
> BARRIER_EVERYONE_ABORTED to avoid any confusion.

Yeah, sounds clearer.

>>> +        int pipe[2];
>>> +        int64_t barriers;
>>> +};
>>> +
>>> +int barrier_init(Barrier *obj);
>>> +void barrier_destroy(Barrier *b);
>>> +
>>> +#define _barrier_destroy_ _cleanup_(barrier_destroy_on_stack)
>>> +static inline void barrier_destroy_on_stack(Barrier *b) { if (b) barrier_destroy(b); }
>>> +#define _barrier_destroy_ptr_ _cleanup_(barrier_destroyp)
>>> +DEFINE_TRIVIAL_CLEANUP_FUNC(Barrier*, barrier_destroy);
>>> +
>>> +void barrier_set_role(Barrier *b, unsigned int role);
>
> Regarding your idea to drop that function:
> I can add this to all the other functions:
>
> if (b->pid > 0) {
>         barrier_set_role(b, b->pid != getpid());
>         b->pid = 0;
> }
>
> ..and then initialize b->pid to getpid(). This would cause any
> barrier_*() function to implicitly call barrier_set_role() and we
> could make the function an implementation detail.
>
> I'm still not sure I like this. Yeah, it does remove the explicit
> function-call, but at the same time it adds internal magic that is not
> really obvious from the outside. Imagine you have this code:
>
> int foobar(Barrier *b) {
>         pid_t pid;
>
>         foobar_init_A(b);
>
>         pid = fork();
>         if (pid < 0)
>                 return -errno;
>
>         if (pid > 0) {
>                 barrier_set_role(b, BARRIER_CHILD);
>                 foobar_init_B(b);
>
>                 if (!barrier_place_and_sync(b))
>                         _exit(1);
>
>                 execve(...);
>         }
>
>         barrier_set_role(b, BARRIER_PARENT);
>         foobar_init_C(b);
>
>         if (!barrier_place_and_sync(b)) {
>                 waitpid(pid, NULL, 0);
>                 return -EIO;
>         }
>
>         return 0;
> }
>
> Now I'm not saying this is a beauty. But one nice thing about the
> barrier API is, any of the foobar_init_*() functions can call
> barrier_abort() on failure and just return. There's no need to forward
> the error-code (even though you could do that here). But even if the
> caller continues, the barrier_place_and_sync() call will always fail
> later on.
>
> Now if we add the magic getpid() handling, then we would have to make
> sure foobar_init_A() never uses the barrier. Easy to fix in this code,
> but might get ugly if the order of these foobar_init_*() calls is
> dynamic and each of them can be called pre-fork, as-child or
> as-parent.
>
> Yes, there might not be a reason to use the barrier API before
> forking, but there's also no reason to call free() on a NULL pointer.
> It's just a convenience API. And it's a safety net we can rely on.
>
> Last but not least: If we hard-code getpid(), we are incompatible to
> clone(CLONE_THREAD). If we hard-code gettid(), we might break
> clone(CLONE_THREAD | CLONE_FILES) setups where two threads access the
> barrier to sync against a fork()ed child. Ok, so far we only have real
> fork() use-cases, but I hate limiting the API.
>
> Long story short: I'm still not sure I like this shortcut.. But maybe
> that's just me being pedantic..

Thanks for the explanation. As long as calling barriers before fork is
a valid thing to do (regardless of beauty), I agree that we should
keep the explicit role calls (no point in trying to hide things if the
user anyway needs to know what is going on in order to use the API
correctly).

-t