[Bug 750397] CRITICAL: Race condition in GstBus

GStreamer (GNOME Bugzilla) bugzilla at gnome.org
Sat Jun 25 08:00:57 UTC 2016


https://bugzilla.gnome.org/show_bug.cgi?id=750397

--- Comment #32 from Matt Gruenke <mgruenke at tycoint.com> ---
(In reply to Sebastian Dröge (slomo) from comment #30)
> (In reply to Matt Gruenke from comment #29)
> > (In reply to Sebastian Dröge (slomo) from comment #28)
> > > I think except for GstBufferPool (bug #767979), this is all fixed now.
> > 
> > I disagree.  I think your diff solves nothing, as you're acquiring the mutex
> > too late.  The entire problem comes down to an emergent discrepancy between
> > set->control_pending and the pipe occupancy.
> > 
> > If a mutex were to solve this problem, you'd need to lock it before
> > testing/modifying set->control_pending.  And then, what would be the point
> > of set->control_pending_locked?
> 
> The point of the control_pending_locked is to actually detect this
> discrepancy properly. control_pending is always correct at the time the
> atomic operation happens, after that things go out of sync. So we now have
> additionally a locked counter that basically is the "fill level" of the
> socket.

But that's not the problem.  The problem is inconsistency between
set->control_pending and the socket.  Adding another counter doesn't help. 
Again, if you want to solve this with a mutex, you need to lock it before
testing/modifying set->control_pending and release it after
WAKE_EVENT()/RELEASE_EVENT().

set->control_pending did a fine job of tracking the fill level of the socket -
the issue was a race condition that your mutex is being acquired too late to
prevent.


> By this we make sure that always the socket contains exactly one byte when
> it should, and none otherwise. Which is needed especially for the Windows
> part, as there you only have an on/off switch basically.

The socketpair is also being used as an on/off switch, since it should contain
only 0 or 1 bytes.


> > But, I'm not sure even that would fix the problem, so long as gstpoll.c
> > contains the following:
> > 
> >     fcntl (control_sock[0], F_SETFL, O_NONBLOCK);
> >     fcntl (control_sock[1], F_SETFL, O_NONBLOCK);
> > 
> > Because, even if a mutex is locked before set->control_pending is
> > tested/modified, is the pipe guaranteed to be atomic?  In other words, if
> > one thread reads the set->control_read_fd.fd immediately after another
> > thread's write has returned, is the reader guaranteed to get the byte?  On
> > all platforms that support socketpair()?  If not, no mutex would completely
> > avoid the need to make it blocking.
> 
> GstPoll API requires that a read() only happens after a write() has
> returned, just that GstBus ignored that condition. Now we always first wait
> on the socket with poll() before doing anything else (in GstBus).

Even requiring the write to complete before the read won't necessarily
guarantee that the read succeeds.  What guarantees are there on the propagation
time between a write to control_write_fd.fd and when the byte can be read from
control_read_fd.fd?

I sense you're reluctant to remove O_NONBLOCK.  Do you think it's needed, or
even beneficial?  Why?


> Making the socket blocking is not going to solve anything here additional to
> that. And it won't work on Windows because there the signalling is
> inherently non-blocking.

Actually, because Events are synchronization primitives, and this one is
created with bManualReset=TRUE, it needn't be so.  In the patch I just
submitted, RELEASE_EVENT() now blocks, on both platforms.


> Can you give me an example sequence that fails with my patch? This one works
> in both variants that are now possible due to the mutex.

Just as without the patch:

0. set->control_pending is initially 0.
1. Writer increments set->control_pending
2. Reader decrements set->control_pending
3. Reader gets the mutex, finds that set->control_pending_locked is
   still 0, and skips the read.
4. Writer gets the mutex, increments set->control_pending_locked, and
   performs the write.

Now, set->control_pending is 0, but set->control_pending_locked is 1, and a
byte is waiting in set->control_read_fd.fd that will never be read, due to the
skew between the two counters.  Consequently, the fd/Event will remain
signaled, causing select()/WSAEventSelect() to always return immediately.
This is equivalent to the original bug.


> > As I said, I've got probably thousands of pipeline years of runtime on this
> > fix, and probably hundreds of thousands of pipeline setups & teardowns. 
> > Given how quickly I encountered the original problem (within days and maybe
> > a few hundred pipeline setups/teardows), any inadequacies of my fix
> > should've become apparent, by now.
> 
> Nobody noticed that GstBus was broken like this for more than 7 years (if my
> GIT archaeology was correct), and even now that we know the problem it only
> happens for a few people. I don't think "it works for me" is a good argument
> in this case :)

I don't know when it was introduced, but we didn't encounter the bug in
gstreamer-0.10.25, which we’d been using until last year.  When we upgraded to
1.2.4, we noticed it quite easily.

We use gst_bus_timed_pop(), though I haven't investigated whether it occurs
with a GMainLoop.  Since the bug is intermittent and merely causes excessive
CPU utilization (at worst), it's not nearly as noticeable or problematic as
bugs which result in segfaults, connection loss, or data corruption.
