[gst-devel] Multi-threaded queueing locking issues

zaheer at grid9.net
Mon Mar 19 15:51:49 CET 2001


On Mon, 19 Mar 2001, Erik Walthinsen wrote:

> OK, I think I've got the locking code to work properly, and even be quite
> efficient.  This is accomplished with the use of interleaved signals and
> cond-waits.  You have something like this:
> 
> thread A:				thread B:
> g_mutex_lock(lock);			g_mutex_lock(lock);
> g_cond_signal(cond);			g_cond_wait(cond,lock);
> g_cond_wait(cond,lock);		g_cond_signal(cond);
> g_mutex_unlock(lock);			g_mutex_unlock(lock);

The above is good: holding the lock across the signal/wait pair makes the
whole handshake atomic.

> 
> This is the best way to do interlocking sync points, afaict.  Thanks to
> matth for teaching me about it!  How exactly it works is left as an
> exercise for the reader <g>

What happens in thread A is that the lock is taken, a signal is sent
(notifying the other thread of some change), and the thread then sleeps
until it receives a signal back (the lock is released while it waits);
finally it unlocks the lock.

In thread B, the lock is taken, the thread sleeps until it receives a
signal, then sends a signal back and unlocks the lock.
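
For concreteness, here is a minimal compilable sketch of that handshake.
It uses the modern GLib API (static GMutex/GCond, g_thread_new) rather
than the calls of the day, and the g_usleep is a crude stand-in for the
ordering problem discussed next:

  #include <glib.h>

  static GMutex lock;
  static GCond  cond;

  /* thread B: wait for A's signal, then answer it */
  static gpointer thread_b (gpointer data)
  {
    g_mutex_lock (&lock);
    g_cond_wait (&cond, &lock);   /* lock is released while sleeping */
    g_cond_signal (&cond);        /* wake A, which is waiting by now */
    g_mutex_unlock (&lock);
    return NULL;
  }

  int main (void)
  {
    GThread *b = g_thread_new ("b", thread_b, NULL);
    g_usleep (100 * 1000);        /* crude: let B reach its cond-wait */

    /* thread A: signal B, then wait for B's answer */
    g_mutex_lock (&lock);
    g_cond_signal (&cond);        /* lost if B is not yet waiting! */
    g_cond_wait (&cond, &lock);
    g_mutex_unlock (&lock);

    g_thread_join (b);
    return 0;
  }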

The problem with this approach is that a signal on a condition variable is
only seen if some thread is already waiting on that condition variable; a
signal with no waiter is simply lost.

So if thread B has not yet reached its cond-wait in the code given above
when thread A signals, a deadlock WILL occur.  I think you are trying to
address this in some of the paragraphs below.
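
For reference, the standard cure is to pair each direction of the
handshake with a protected flag and loop on it; a signal sent before the
other side is waiting then leaves the flag set and is not lost.  A
sketch, with flag names of my own:

  static gboolean a_signalled, b_signalled;

  /* thread A */
  g_mutex_lock (&lock);
  a_signalled = TRUE;
  g_cond_signal (&cond);
  while (!b_signalled)
    g_cond_wait (&cond, &lock);
  g_mutex_unlock (&lock);

  /* thread B */
  g_mutex_lock (&lock);
  while (!a_signalled)
    g_cond_wait (&cond, &lock);
  b_signalled = TRUE;
  g_cond_signal (&cond);
  g_mutex_unlock (&lock);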

> 
> Anyway, that's one problem down, a few more to tackle.  Specifically, I've
> run into the fact that if you have any queues in your thread, they are
> generally going to be stuck waiting on a condition variable for the other
> side of the queue to put something in or take something out.
> 
> When a change_state happens on the thread, it must tell the thread to stop
> spinning and change the state of all the children.  This is done by simply
> clearing the SPINNING flag on the thread and waiting for it to change
> state and signal back.  This only works if the while(SPINNING) loop has a
> chance to exit, which means that the bin_iterate() function must exit at
> some point.
> 
> When you have queues on thread boundaries, you're going to end up stuck in
> a cond-wait such that the iterate() will never exit.  This is a
> significant problem, because this puts us in deadlock.
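
In rough outline, the loop being described looks like this (a sketch;
THREAD_SPINNING and the state_changed/cond fields are illustrative names,
not necessarily the real 0.x identifiers):

  /* the thread's body: spin until the parent clears the flag */
  while (GST_FLAG_IS_SET (thread, THREAD_SPINNING))
    gst_bin_iterate (GST_BIN (thread));  /* must eventually return */

  /* change the children's state, then acknowledge to the parent */
  g_mutex_lock (&thread->lock);
  thread->state_changed = TRUE;
  g_cond_signal (&thread->cond);
  g_mutex_unlock (&thread->lock);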
> 
> The solution I've developed in my head is this:  after setting the state
> to !SPINNING, do a cond_timed_wait with some reasonable, configurable
> timeout.  If this timed-wait times out, we could assume that the thread is
> stuck somewhere.  This can be clarified by having the elements that might
> block set a flag on the thread during the event that might block, so we
> know whether it's blocking or just taking a long time.
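
Concretely, with modern GLib's g_cond_wait_until() the parent side might
look like this (a sketch; THREAD_MAY_BLOCK is the hypothetical "might
block" flag just described):

  gboolean stuck = FALSE;
  gint64 deadline = g_get_monotonic_time ()
                    + timeout_ms * G_TIME_SPAN_MILLISECOND;

  g_mutex_lock (&thread->lock);
  GST_FLAG_UNSET (thread, THREAD_SPINNING);
  while (!thread->state_changed) {
    if (!g_cond_wait_until (&thread->cond, &thread->lock, deadline)) {
      /* timed out: blocked for real, or merely slow? */
      stuck = GST_FLAG_IS_SET (thread, THREAD_MAY_BLOCK);
      break;
    }
  }
  g_mutex_unlock (&thread->lock);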

I don't think the above is elegant.

> 
> When we find ourselves stuck with a blocked thread, we have to somehow
> unblock it in a way that doesn't cause significant pain.  My thought would
> be to fire a signal() at it, which the thread would catch with a custom
> signal handler.  This handler would simply cause a cothread_switch() back
> to the 0th thread, which puts it right back into the middle of
> bin_iterate().  A flag can tell iterate() to stop everything and return
> from that point.
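
In outline, the proposal is something like the following;
cothread_switch()/cothread_main() are GStreamer's internal cothread
calls, and everything else here is illustrative:

  /* made visible to the handler somehow; illustrative */
  static cothread_context *ctx;

  static void unblock_handler (int sig)
  {
    /* abandon the interrupted cond-wait and jump back to
       cothread 0, i.e. into the middle of bin_iterate() */
    cothread_switch (cothread_main (ctx));
  }

  /* parent side, once the timed wait above has expired: */
  signal (SIGUSR2, unblock_handler);           /* install once */
  pthread_kill (thread->pthread_id, SIGUSR2);  /* poke the thread */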

Again, it doesn't seem elegant, IMO.

> 
> Now, there are quite a few concerns here.  First of all, this signal is
> going to come right in the middle of the queue's cond_wait.  From my quick
> look through the linuxthreads code and minimal understanding of POSIX
> signals, I really can't say whether this is a problem or not.  I'd guess
> not.

Mixing POSIX signals and threads is generally not a good idea; it also
hampers portability to non-POSIX OSes and to broken POSIX
implementations.


> 
> Next, doing a cothread switch in the middle of this mangled context could
> be very scary.  I've noticed sigsetjmp(3), which claims to save the
> blocked-signals state.  This seems apropos, since the linuxthreads code
> for a suspend is:
> 
>   sigdelset(&mask, __pthread_sig_restart); /* Unblock the restart signal */
>   do {
>     sigsuspend(&mask);                   /* Wait for signal */
>   } ...
> 
> If setjmp/longjmp mangle the blocked-signals list, this could cause
> significant problems.  If the restart signal (SIGUSR1) ends up blocked
> when we switch back from the setjmp to cothread 0, we'll be stuck in the
> cond_wait forever (possibly even beyond parent death, ick).
> Experimentation is needed to determine if we have to do
> sigsetjmp/siglongjmp in these cases.  If so, we need to determine the
> overhead of sigsetjmp/siglongjmp every time vs. checking to see if we need
> to use it for each jump.
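
The sigsetjmp(3) behaviour in question is easy to check in isolation:
with a non-zero savesigs argument the blocked-signal mask is saved at
sigsetjmp() and restored by siglongjmp():

  #include <setjmp.h>
  #include <signal.h>
  #include <stdio.h>

  static sigjmp_buf env;

  int main (void)
  {
    sigset_t set;
    sigemptyset (&set);
    sigaddset (&set, SIGUSR1);

    if (sigsetjmp (env, 1) == 0) {              /* 1 => save the mask */
      sigprocmask (SIG_BLOCK, &set, NULL);      /* block SIGUSR1 ...  */
      siglongjmp (env, 1);                      /* ... then jump back */
    }

    /* the saved mask was restored: SIGUSR1 is unblocked again */
    sigprocmask (SIG_SETMASK, NULL, &set);
    printf ("SIGUSR1 blocked: %d\n", sigismember (&set, SIGUSR1));
    return 0;
  }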
> 
> Next comes the question of what happens when we try to perform a
> cond-wait in the thread-interlock code in cothread 0 while we have an
> interrupted cond-wait sitting there waiting for the queue to signal.  Is
> the thread cond-wait going to somehow trigger the queue's cond-wait, and
> if so I really don't want to think about the mayhem that will cause when
> it comes back on the wrong stack.  I'm guessing it will work though,
> because linuxthreads uses queues internally, and afaik the signal handler
> installed for the thread's cond-wait will always trigger and remove itself
> before the thread's cond-wait even has a chance to pop back to the top of
> the signal-handler stack.

It is not defined which thread will wake up when a condition variable is
signalled while more than one thread is waiting on it.
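
Which is why every waiter must re-check its own predicate in a loop; a
wakeup meant for the other waiter (or a spurious one) is then harmless.
A minimal sketch:

  /* each waiter loops on its own condition, so a wakeup meant
     for someone else simply puts it back to sleep */
  g_mutex_lock (&lock);
  while (!my_predicate)
    g_cond_wait (&cond, &lock);
  g_mutex_unlock (&lock);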


Unfortunately, my head's not up to looking through the next few paragraphs
now :)

Once I have had a few drops of caffeine and got a deadline out of the way,
I will look through the rest of the paragraphs.

Zaheer

> 
> Then, when we restart the thread, we want to jump right back where we came
> from, which would be the signal handler that interrupted the queue's
> cond-wait in the first place.  I would assume that the signal context
> would cleanly unwrap, and the code would end up back in the middle of the
> sigsuspend in the code shown above.  It may do a cycle through the
> do/while, but that's what it's there for.
> 
> The problem is: what if the other side of the queue signals while the
> thread is shut down?  First thing is that it's going to trigger the
> cond-wait signal handler that actually belongs to the thread at that
> point.  This isn't such a big deal from the thread's point of view, since
> it's just a spurious wakeup.  It'll go back to sleep since it wasn't woken
> up for the reason it was waiting for.  The side-effect of this is that the
> signal that the queue was waiting for to unblock itself gets lost.  The
> other side of the queue has no idea that this is the case, and since the
> queue is no longer either empty or full, it won't *ever* signal again.
> Oops.
> 
> That means that somehow we have to work in an interlock in queue
> signaling, such that the signaling side knows whether the waiting side has
> actually woken up.  So we end up with a protected boolean that the waiting
> side sets before it goes to sleep, and unsets as it wakes up.  The
> signaling side would keep signaling until it sees that this boolean has
> been cleared, indicating that the waiting side got what it wanted.
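
In GLib terms the interlock would be roughly this (names mine); the
signalling loop is where the trouble described next comes from:

  /* waiting side */
  g_mutex_lock (&q->lock);
  q->waiting = TRUE;
  while (queue_is_empty (q))
    g_cond_wait (&q->cond, &q->lock);
  q->waiting = FALSE;
  g_mutex_unlock (&q->lock);

  /* signalling side: re-signal until the waiter is known to be up */
  g_mutex_lock (&q->lock);
  while (q->waiting) {
    g_cond_signal (&q->cond);
    g_mutex_unlock (&q->lock);   /* give the waiter a chance to run */
    g_thread_yield ();
    g_mutex_lock (&q->lock);
  }
  g_mutex_unlock (&q->lock);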
> 
> This fails quite rapidly, since when the thread is sitting waiting to
> start spinning again, it'll be doing cond-waits.  Since the signaling side
> of the queue presumably could signal during this, and it does so with the
> same SIGUSR1 that the thread's parent would, it would spuriously awaken
> the thread.  Since the signal succeeded, but the 'waiting' boolean doesn't
> get reset, the signaling side of the queue tries again.  And again.  And
> again.  The machine grinds to a halt with massive switching and
> mis-signaling.
> 
> One solution is for the signaling side of the queue to wait for some amount
> of time after each failed signal.  This would have the effect of
> limiting, if not eliminating, the spurious wakeups.  Another solution is
> to make the queue smarter, and if the signal comes back without the
> 'waiting' boolean having been cleared, it checks the state of the queue
> before trying again.
> 
> Hrm, that's an interesting problem that extends outside the scope of this
> issue, but may also solve it.  The queue only has one state variable,
> since it's only a single element.  This also means that there's only one
> parent, which could either be the thread or its parent.  This element
> state could be sufficient to keep the signaling of the queue from
> happening when it might get lost.
> 
> If the queue is inside the thread, it will get interrupted, and set to
> !PLAYING.  If the other side of the queue gets called at that point and
> has reason to signal, it can simply check the state of the element.  It
> would still want to check the 'waiting' boolean after signaling, in case
> there's a race (or some other locking can be done).  The problem then is
> what to wait on?  One possibility is to just give up and finish the push or
> pull and leave the 'waiting' boolean set.  This would mean that every push
> or pull after that would cause another attempt to wake up the other half
> of the thread.  This isn't so bad, necessarily.
> 
> If the queue is outside the thread in question, we have a harder problem.
> The state of the queue won't change when the thread leaves PLAYING, but at
> least the queue can still check the state of the thread in a very
> roundabout way by checking the state of the element peered to the other
> side of the queue.  Hmmmm...
> 
> You'll notice that I never suggested that we simply trick the queue into
> waking up and just go from there.  We have to interrupt it in the middle
> for one simple reason: the queue isn't the only thing that might block.
> Anything that talks to the kernel could theoretically block.  This mostly
> includes things that go over the network, or even to disk in some cases
> (though I'd rather assume that disk accesses are reasonably bounded).
> Unless we also have a way of dealing with these kinds of elements and
> interrupting their reads or writes, we can't special-case the queue.  And
> besides, that starts to put a lot more smarts into these kinds of
> elements.
> 
> Another option might be to require that anything that might block be
> written with a timeout on any block-able call.  This could work, but again
> puts the burden on the plugin-writer to check with the scheduler (or
> something, the problem is: what exactly?) and see if it should continue or
> punt, and if it punts, how does it punt?
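
For the fd-reading case that might look like a poll()-guarded read; what
"punt" means on timeout is exactly the open question (a sketch, names
mine):

  #include <poll.h>
  #include <unistd.h>

  /* read with a bound on how long we may block */
  static ssize_t timed_read (int fd, void *buf, size_t len, int timeout_ms)
  {
    struct pollfd pfd = { fd, POLLIN, 0 };

    if (poll (&pfd, 1, timeout_ms) <= 0)
      return -1;            /* timed out or failed: caller must punt */
    return read (fd, buf, len);
  }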
> 
> Anyway, I need to go to sleep now.  If anyone can follow that whole mess
> and has comments, please write them up.  If I'm making stupid assumptions,
> tell me <g>
> 
>       Erik Walthinsen <omega at temple-baptist.com> - System Administrator
>         __
>        /  \                GStreamer - The only way to stream!
>       |    | M E G A        ***** http://gstreamer.net/ *****
>       _\  /_
> 
