[HarfBuzz] Fwd: On atomic ops and text stack thread-safety

Behdad Esfahbod behdad at behdad.org
Thu Aug 2 06:42:26 UTC 2018


Since I've been hacking on the atomic stuff for the past few days (it's
over!) and Werner asked, I wrote this.  Might be of interest to others.

---------- Forwarded message ----------
From: Behdad Esfahbod <behdad at behdad.org>
Date: Wed, Aug 1, 2018 at 10:16 PM
Subject: On atomic ops and text stack thread-safety
To: Werner LEMBERG <wl at gnu.org>


Ok let's see if I can explain this succinctly.

For what I'm doing, performance is definitely of utmost importance, but
only while maintaining correctness; portability has been a huge part of
it as well.

There are different issues at play.  Let's discuss them in three parts:

  1. Atomic operations in presence of multiple threads,
  2. Memory ordering in regard to compiler and CPU optimizations,
  3. Memory / cache coherency in presence of multiple CPUs.


== Atomic operations ==

By default, read and write operations on properly-aligned char, short,
int, and long are atomic on all current CPUs; C11 and C++11 guarantee
this for their atomic types (_Atomic / std::atomic), while for plain
types it's a property of the hardware rather than the standard.  What we
mean by atomic is that, e.g., if we have "static int x = 0" and one
thread sets it to 42, other threads reading this same variable will see
either 0 or 42, but no other value.  That's atomicity of read / write
(also known as load / store).

Now, when we get to more involved operations, namely, "read-modify-write",
atomicity is not guaranteed.  Take FT_Reference_Library:

  FT_EXPORT_DEF( FT_Error )
  FT_Reference_Library( FT_Library  library )
  {
    if ( !library )
      return FT_THROW( Invalid_Library_Handle );

    library->refcount++;

    return FT_Err_Ok;
  }

As long as there's only one thread of execution, this code is fine and
correct.  The compiler and CPU make sure that what gets executed has the
same semantics as what you wrote.  But what if there are multiple
threads involved?  Things get out of hand and the programmer needs to
intervene.  In this case, "refcount++" is a read-modify-write operation.
That is, in machine code / pseudo-assembly, it involves:

  load r, refcount
  inc r, 1
  store refcount, r

Now, what happens if the first thread calling this function performs the
load, reading, e.g., a refcount of 1, but before performing the rest of
the operations it's pre-empted, and a second thread runs the same code,
also reading a refcount of 1, incrementing it, and storing 2?  The first
thread then resumes, incrementing r to 2 and storing it into refcount.
What just happened is that we started with a refcount of 1, two threads
each incremented it by one, but as a result we got 2 stored in refcount
instead of 3.

That's the first issue.  To resolve it, we need an atomic "fetch_and_add"
operation.  That is, one that does the "load, inc, store" atomically
without another thread interrupting.  That's exactly the kind of thing that
atomic operations in CPUs, compilers, and standard libraries provide.
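
For illustration, here's how the refcount increment could look using
C11's atomic_fetch_add (a sketch, not actual FreeType code; the
reference() name is made up):

  #include <stdatomic.h>

  static _Atomic int refcount = 1;

  void reference (void)
  {
    /* load, increment, and store happen as one indivisible operation,
       so two threads incrementing concurrently can never lose an
       update */
    atomic_fetch_add (&refcount, 1);
  }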

You can read more at Linux's documentation:

  https://github.com/torvalds/linux/blob/master/Documentation/core-api/atomic_ops.rst

== Memory ordering and optimizations ==

Imagine this pseudo code:

static int x = 0;
static bool ready = false;

One thread running:

  x = expensive_computation_producing_42 ();
  ready = true;

Other thread:

  if (ready)
    use (x);

Now, as discussed, we already know that reads and writes of int and bool
types *are* atomic.  So *that* is not the issue here.  You would expect,
however, that if the second thread gets into the body of the "if", then
the value of x it reads must be 42.  But that's not what will always
happen, because of legitimate compiler and CPU optimizations.

In particular, the compiler "is within its rights" to reorder the
assignments of x and ready in the first thread.  The CPU is within its
rights to execute them out of order, or in parallel.  These freedoms /
rights allow for tons of optimizations / parallelism.  They are allowed
because they make no observable difference to single-threaded programs;
the program results are always the same.  But if you have another thread
reading those variables, this can wreak havoc.  What's worse, the second
thread's compiler / CPU is also within its rights to, e.g., read "x"
before / in parallel to reading "ready".  To a single-threaded program
these do NOT make a visible difference, but they allow for faster code.

Imagine the code above being transformed into machine code like this:

Thread 1:

  ready = true;
  x = expensive_computation_producing_42 ();

Other thread:

  r = x;
  if (ready)
    use (r);

Obviously there's a race condition there now.  We might be reading the
old value of x and using it just because "ready" is true.  That's a
problem.

To fix this problem, we use what are called memory barriers.  There are
many types of them, offering different guarantees.  For now, let's just
discuss what's known as a full memory barrier, or simply a memory
barrier.  What a memory barrier does, in this context, is make sure the
compiler / CPU do NOT reorder reads / writes across it.  I.e., a read /
write after the memory barrier will be executed *after* a read / write
before the memory barrier.

This, in general, is called memory-ordering.  Many atomic operations
come with memory-ordering guarantees built in.  C11 / C++11 provide a
full set of atomic operations with selectable memory-ordering
guarantees:

  https://en.cppreference.com/w/c/atomic
  https://en.cppreference.com/w/cpp/atomic
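
For example, here's a sketch of the x / ready example above written
with C11 atomics (the producer / consumer names are mine; the other
names come from the pseudo code).  The plain atomic_store / atomic_load
default to sequentially-consistent ordering, which forbids exactly the
reorderings described above:

  #include <stdatomic.h>
  #include <stdbool.h>

  extern int expensive_computation_producing_42 (void);
  extern void use (int value);

  static int x = 0;
  static _Atomic bool ready = false;

  /* Thread 1 */
  void producer (void)
  {
    x = expensive_computation_producing_42 ();
    atomic_store (&ready, true);  /* the store to x cannot sink below this */
  }

  /* Thread 2 */
  void consumer (void)
  {
    if (atomic_load (&ready))     /* the read of x cannot hoist above this */
      use (x);                    /* guaranteed to see 42 */
  }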

The release-acquire memory-ordering sequence is of special interest to us.
Read:

  https://en.cppreference.com/w/c/atomic/memory_order

It takes a while to fully digest that.  For full-on deep understanding,
read the first third of:

  https://www.kernel.org/doc/Documentation/memory-barriers.txt

Though maybe read that last one after you finish the third part of this
message.


== Multiple CPUs ==

Ok, so let's consider the case of multiple CPUs.  While it's natural to
expect that a store A performed on CPU 1 before a store B will be
visible in that same order on CPU 2, some architectures, namely DEC
Alpha, do not guarantee this.

Let's see what this means in practice.  Imagine one thread initializes some
data that other threads are going to consume:

static void *P = NULL;

Thread 1 on CPU 1:

  void *X = malloc (...);
  setup (X);
  memory_barrier ();
  P = X;

Thread 2 on CPU 2:

  void *Y = P;
  memory_barrier ();
  if (Y)
    Y->use_it ();

The memory_barrier()s exist to address the optimization issues discussed
earlier.  Now, it would be a pity if thread 2 reads P into Y and it's
non-NULL, but the memory pointed to by Y is NOT yet visible to CPU 2.
This can happen.

Fortunately, memory barriers, when correctly used, also make sure that
this situation does NOT happen.  That's what acquire-release semantics
are about.  In particular, if the threads do:

Thread 1 on CPU 1:

  void *X = malloc (...);
  setup (X);
  release_barrier ();
  P = X;

Thread 2 on CPU 2:

  void *Y = P;
  acquire_barrier ();
  if (Y)
    Y->use_it ();

Then it's guaranteed that if Y sees the value of P pointing to X, then
any stores done on CPU 1 before the release_barrier() call are also
visible to CPU 2 after the acquire_barrier() call.  This is exactly what
we want.
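
In C11 terms, the release / acquire pair above maps directly onto
atomic_store_explicit and atomic_load_explicit.  A sketch (object_t,
publish, and consume are stand-ins I made up; setup and use_it come
from the pseudo code):

  #include <stdatomic.h>
  #include <stdlib.h>

  typedef struct { int data; } object_t;

  extern void setup (object_t *obj);
  extern void use_it (object_t *obj);

  static _Atomic(object_t *) P = NULL;

  /* Thread 1 on CPU 1 */
  void publish (void)
  {
    object_t *X = malloc (sizeof (object_t));
    setup (X);
    /* release: every store above becomes visible to any thread that
       later acquires P and sees X */
    atomic_store_explicit (&P, X, memory_order_release);
  }

  /* Thread 2 on CPU 2 */
  void consume (void)
  {
    /* acquire: if this load sees X, all of setup()'s stores are
       visible too */
    object_t *Y = atomic_load_explicit (&P, memory_order_acquire);
    if (Y)
      use_it (Y);
  }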

For full details of different types of barriers, read first third of:

  https://www.kernel.org/doc/Documentation/memory-barriers.txt

But just know that a read barrier provides stronger guarantees than an
acquire barrier, a write barrier provides stronger guarantees than a
release barrier, and a full memory barrier provides stronger guarantees
than either a read or a write barrier.


That's the gist of it.  Hope it makes sense.  In another message I'll
discuss how I use these in HarfBuzz, FontConfig, Cairo, Pango, etc., to
provide lock-free thread-safety in the text stack.  FreeType, indeed,
needs some work.

Cheers,

behdad



On Wed, Aug 1, 2018 at 9:01 PM, Behdad Esfahbod <behdad at behdad.org> wrote:

> Hi Werner,
>
> Glad you asked.  Yes, there's work to be done in FreeType as well.  Give
> me some time to write down my thoughts.
>
> Cheers,
> b
>
> On Wed, Aug 1, 2018 at 12:29 AM, Werner LEMBERG <wl at gnu.org> wrote:
>
>>
>> Behdad,
>>
>>
>> >  typedef int hb_atomic_int_impl_t;
>>
>> I always wondered what this `atomic' stuff is good for.  It seems that
>> you are working on a very low level, almost mimicking CPU opcodes.  Is
>> this for speed?  Or for thread safety?  And do you think that
>> something similar to your work would be beneficial for FreeType also?
>>
>>
>>     Werner
>>
>
>
>
> --
> behdad
> http://behdad.org/
>



-- 
behdad
http://behdad.org/


