[gst-devel] Multi-threaded queueing locking issues

Matt Howell matth at ridgerun.com
Tue Mar 20 20:54:06 CET 2001


note comments inline...

> > OK, I think I've got the locking code to work properly, and even be quite
> > efficient.  This is accomplished with the use of interleaved signals and
> > cond-waits.  You have something like this:
> >
> > thread A:                             thread B:
> > g_mutex_lock(lock);                   g_mutex_lock(lock);
> > g_cond_signal(cond);                  g_cond_wait(cond,lock);
> > g_cond_wait(cond,lock);                       g_cond_signal(cond);
> > g_mutex_unlock(lock);                 g_mutex_unlock(lock);
>
> The above is good, makes it all atomic.
>
> >
> > This is the best way to do interlocking sync points, afaict.  Thanks to
> > matth for teaching me about it!  How exactly it works is left as an
> > exercise for the reader <g>
>
> What happens in thread A is that the lock is acquired, a signal is sent
> (notifying of some change in something), then it sleeps until it receives
> a signal (the lock is released while it waits for the signal), and then
> it unlocks the lock.
>
> In thread B, the lock is acquired, then it sleeps until it receives a
> signal, then it sends a signal out and unlocks the lock.
>
> The problem with this approach is that a signal on a condition variable
> is only seen if some thread is already waiting on that variable.
>
> So if thread B is not in the piece of code given above when thread A
> signals then a deadlock WILL occur.  I think you are trying to address it
> in some of the paragraphs below.

you're right: if B isn't waiting on cond when A signals cond, the signal is
lost and you get out of sync.  you can only use this pattern when you know
someone's waiting.  e.g.,

thread A (creator):
{ g_mutex_lock(lock);
  g_thread_create(B);
  g_cond_wait(cond, lock); // wait for thread B to get started up;
                           // releases lock while waiting
  // B has started
  g_mutex_unlock(lock);
  // do A stuff
}

thread B (created):
{ g_mutex_lock(lock); // blocks thread B until thread A releases lock in cond_wait
  // startup stuff
  g_cond_signal(cond); // signal that B has done its startup... this works only
                       // when you know someone is already waiting
  g_mutex_unlock(lock); // now A can return from g_cond_wait(cond, lock)

  while (...) // do B stuff
}
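
a more robust variant (just a sketch -- the b_started flag is my invention)
works no matter who reaches the sync point first: the waiter only sleeps while
the flag is unset, and B sets the flag before signaling, so the wakeup can't
be lost:

gboolean b_started = FALSE; // protected by lock

thread A (creator):
{ g_mutex_lock(lock);
  g_thread_create(B);
  while (!b_started)         // also guards against spurious wakeups
    g_cond_wait(cond, lock); // releases lock while sleeping
  g_mutex_unlock(lock);
  // do A stuff
}

thread B (created):
{ // startup stuff
  g_mutex_lock(lock);
  b_started = TRUE;    // record the event before signaling,
  g_cond_signal(cond); // so a signal sent before A waits isn't lost
  g_mutex_unlock(lock);
  while (...) // do B stuff
}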

the deadlock problem in the gstreamer queue code is something different.
(or at least that's what it looks like to me ;-)
it's not that 2 threads are out of sync with their wait / signal stuff.
instead, signal needs to be called more often, and queue_get() and
queue_chain() need to check for a pending state change inside their
while(cond not true, wait) loops and break out.

> > Anyway, that's one problem down, a few more to tackle.  Specifically, I've
> > run into the fact that if you have any queues in your thread, they are
> > generally going to be stuck waiting on a condition variable for the other
> > side of the queue to put something in or take something out.
> >
> > When a change_state happens on the thread, it must tell the thread to stop
> > spinning and change the state of all the children.  This is done by simply
> > clearing the SPINNING flag on the thread and waiting for it to change
> > state and signal back.  This only works if the while(SPINNING) loop has a
> > chance to exit, which means that the bin_iterate() function must exit at
> > some point.
> >
> > When you have queues on thread boundaries, you're going to end up stuck in
> > a cond-wait such that the iterate() will never exit.  This is a
> > significant problem, because this puts us in deadlock.
> >
> > The solution I've developed in my head is this:  after setting the state
> > to !SPINNING, do a cond_timed_wait with some reasonable, configurable
> > timeout.  If this timed-wait times out, we could assume that the thread is
> > stuck somewhere.  This can be clarified by having the elements that might
> > block set a flag on the thread during the event that might block, so we
> > know whether it's blocking or just taking a long time.
>
> I don't think the above is elegant.

i agree.  you shouldn't have to use the timed-wait stuff in this case.  also,
you don't want to use system kill/signal to fix thread scheduling.

i was thinking something like this: (pseudo-code :)
gstpad.c::queue_get() - to the while loop, add
  // somehow, get pendingState
  if (pendingState == PAUSED) { GST_UNLOCK(lock); return NULL; }

gstpad.c::queue_chain() - to the while loop, add the same thing
  if (pendingState == PAUSED) { GST_UNLOCK(lock); return NULL; }

gstpad.c::change_state() - add
  if (GST_STATE_PENDING (element) == GST_STATE_PAUSED)
  {
    GST_LOCK(queue);
    g_cond_signal(emptycond); // wake up waiting queue_get()'s, if any
    g_cond_signal(fullcond);  // wake up waiting queue_chain()'s, if any
    GST_UNLOCK(queue);
  }

gstscheduler.c::chain_wrapper() - i'm not sure it's necessary, but probably add:
  if (pendingState == PAUSED) break; // stop calling pad_pull()'s

there may be other places in the gstscheduler.c code that need similar checks.
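
fleshed out a bit, queue_get() might look like this (just a sketch -- i'm
inventing the member names (queue->level, queue->emptycond, queue->lock) and
the dequeue_buffer() helper, so adjust to whatever gstpad.c actually has):

static GstBuffer *
queue_get (GstQueue *queue)
{
  GstBuffer *buf;

  GST_LOCK (queue);
  while (queue->level == 0) { // empty: wait for queue_chain() to fill us
    if (GST_STATE_PENDING (GST_ELEMENT (queue)) == GST_STATE_PAUSED) {
      GST_UNLOCK (queue);     // pending state change: bail out so the
      return NULL;            // while(SPINNING) loop can exit
    }
    g_cond_wait (queue->emptycond, queue->lock);
  }
  buf = dequeue_buffer (queue);    // hypothetical helper
  g_cond_signal (queue->fullcond); // wake a blocked queue_chain(), if any
  GST_UNLOCK (queue);
  return buf;
}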

thoughts?

matt.


> > When we find ourselves stuck with a blocked thread, we have to somehow
> > unblock it in a way that doesn't cause significant pain.  My thought would
> > be to fire a signal() at it, which the thread would catch with a custom
> > signal handler.  This handler would simply cause a cothread_switch() back
> > to the 0th thread, which puts it right back into the middle of
> > bin_iterate().  A flag can tell iterate() to stop everything and return
> > from that point.
>
> Again, it doesn't seem elegant IMO.
>
> >
> > Now, there are quite a few concerns here.  First of all, this signal is
> > going to come right in the middle of the queue's cond_wait.  From my quick
> > look through the linuxthreads code and minimal understanding of POSIX
> > signals, I really can't say whether this is a problem or not.  I'd guess
> > not.
>
> Mixing POSIX signals and threads is generally not a good idea; it also
> hampers portability to non-POSIX OSs or to broken POSIX
> implementations.
>
> >
> > Next, doing a cothread switch in the middle of this mangled context could
> > be very scary.  I've noticed sigsetjmp(3), which claims to save the
> > blocked-signals state.  This seems apropos, since the linuxthreads code
> > for a suspend is:
> >
> >   sigdelset(&mask, __pthread_sig_restart); /* Unblock the restart signal */
> >   do {
> >     sigsuspend(&mask);                   /* Wait for signal */
> >   } ...
> >
> > If setjmp/longjmp mangle the blocked-signals list, this could cause
> > significant problems.  If the restart signal (SIGUSR1) ends up blocked
> > when we switch back from the setjmp to cothread 0, we'll be stuck in the
> > cond_wait forever (possibly even beyond parent death, ick).
> > Experimentation is needed to determine if we have to do
> > sigsetjmp/siglongjmp in these cases.  If so, we need to determine the
> > overhead of sigsetjmp/siglongjmp every time vs. checking to see if we need
> > to use it for each jump.
> >
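(side note on sigsetjmp: a minimal sketch of the save-mask behavior -- env and
the surrounding code are hypothetical:

  #include <setjmp.h>

  sigjmp_buf env;
  if (sigsetjmp (env, 1) == 0) { /* nonzero 2nd arg: save the signal mask */
    /* ... switch to another cothread ... */
  }
  /* elsewhere: siglongjmp (env, 1) jumps back here and restores the
     saved mask, so a blocked SIGUSR1 shouldn't stay blocked by accident */

so siglongjmp should restore the blocked-signals state, at the cost of a
sigprocmask call per jump.)
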
> > Next comes the question of what happens when we try to perform a
> > cond-wait in the thread-interlock code in cothread 0 while we have an
> > interrupted cond-wait sitting there waiting for the queue to signal.  Is
> > the thread cond-wait going to somehow trigger the queue's cond-wait, and
> > if so I really don't want to think about the mayhem that will cause when
> > it comes back on the wrong stack.  I'm guessing it will work though,
> > because linuxthreads uses queues internally, and afaik the signal handler
> > installed for the thread's cond-wait will always trigger and remove itself
> > before the thread's cond-wait even has a chance to pop back to the top of
> > the signal-handler stack.
>
> It is not defined which thread will wake up if more than one thread is
> waiting on a condition variable.
>
> Unfortunately, my head's not up to looking through the next few paragraphs
> now :)
>
> Once I have had a few drops of caffeine and a deadline out of the way,
> then I will look through the rest of the paragraphs.
>
> Zaheer
>
> >
> > Then, when we restart the thread, we want to jump right back where we came
> > from, which would be the signal handler that interrupted the queue's
> > cond-wait in the first place.  I would assume that the signal context
> > would cleanly unwrap, and the code would end up back in the middle of the
> > sigsuspend in the code shown above.  It may do a cycle through the
> > do/while, but that's what it's there for.
> >
> > The problem is: what if the other side of the queue signals while the
> > thread is shut down?  First thing is that it's going to trigger the
> > cond-wait signal handler that actually belongs to the thread at that
> > point.  This isn't such a big deal from the thread's point of view, since
> > it's just a spurious wakeup.  It'll go back to sleep since it wasn't woken
> > up for the reason it was waiting for.  The side-effect of this is that the
> > signal the queue was waiting on to unblock itself gets lost.  The
> > other side of the queue has no idea that this is the case, and since the
> > queue is no longer either empty or full, it won't *ever* signal again.
> > Oops.
> >
> > That means that somehow we have to work in an interlock in queue
> > signaling, such that the signaling side knows whether the waiting side has
> > actually woken up.  So we end up with a protected boolean that the waiting
> > side sets before it goes to sleep, and unsets as it wakes up.  The
> > signaling side would keep signaling until it sees that this boolean has
> > been cleared, indicating that the waiting side got what it wanted.
> >
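(a concrete rendering of that interlock, just a sketch with invented names
(queue->waiting, qlock, qcond) -- and note the very next paragraph explains
why it still falls over:

  /* waiting side (queue_get or queue_chain) */
  g_mutex_lock (qlock);
  queue->waiting = TRUE;      /* announce: about to sleep */
  g_cond_wait (qcond, qlock); /* releases qlock while asleep */
  queue->waiting = FALSE;     /* announce: actually woke up */
  g_mutex_unlock (qlock);

  /* signaling side */
  g_mutex_lock (qlock);
  while (queue->waiting) {    /* retry until the waiter confirms */
    g_cond_signal (qcond);
    g_mutex_unlock (qlock);   /* give the waiter a chance to run */
    g_mutex_lock (qlock);
  }
  g_mutex_unlock (qlock);
)
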
> > This fails quite rapidly, since when the thread is sitting waiting to
> > start spinning again, it'll be doing cond-waits.  Since the signaling side
> > of the queue presumably could signal during this, and it does so with the
> > same SIGUSR1 that the thread's parent would, it would spuriously awaken
> > the thread.  Since the signal succeeded, but the 'waiting' boolean doesn't
> > get reset, the signaling side of the queue tries again.  And again.  And
> > again.  The machine grinds to a halt with massive switching and
> > mis-signaling.
> >
> > One solution is for the signaling side of the queue to wait for some amount
> > of time after each failed signal.  This would have the effect of
> > limiting, if not eliminating, the spurious wakeups.  Another solution is
> > to make the queue smarter, and if the signal comes back without the
> > 'waiting' boolean having been cleared, it checks the state of the queue
> > before trying again.
> >
> > Hrm, that's an interesting problem that extends outside the scope of this
> > issue, but may also solve it.  The queue only has one state variable,
> > since it's only a single element.  This also means that there's only one
> > parent, which could either be the thread or its parent.  This element
> > state could be sufficient to keep the signaling of the queue from
> > happening when it might get lost.
> >
> > If the queue is inside the thread, it will get interrupted, and set to
> > !PLAYING.  If the other side of the queue gets called at that point and
> > has reason to signal, it can simply check the state of the element.  It
> > would still want to check the 'waiting' boolean after signaling, in case
> > there's a race (or some other locking can be done).  The problem then is
> > what to wait on?  One possibility is to just give up and finish the push or
> > pull and leave the 'waiting' boolean set.  This would mean that every push
> > or pull after that would cause another attempt to wake up the other half
> > of the thread.  This isn't so bad, necessarily.
> >
> > If the queue is outside the thread in question, we have a harder problem.
> > The state of the queue won't change when the thread leaves PLAYING, but at
> > least the queue can still check the state of the thread in a very
> > roundabout way by checking the state of the element peered to the other
> > side of the queue.  Hmmmm...
> >
> > You'll notice that I never suggested that we simply trick the queue into
> > waking up and just go from there.  We have to interrupt it in the middle
> > for one simple reason: the queue isn't the only thing that might block.
> > Anything that talks to the kernel could theoretically block.  This mostly
> > includes things that go over the network, or even to disk in some cases
> > (though I'd rather assume that disk accesses are reasonably bounded).
> > Unless we also have a way of dealing with these kinds of elements and
> > interrupting their reads or writes, we can't special-case the queue.  And
> > besides, that starts to put a lot more smarts into these kinds of
> > elements.
> >
> > Another option might be to require that anything that might block be
> > written with a timeout on any block-able call.  This could work, but again
> > puts the burden on the plugin-writer to check with the scheduler (or
> > something, the problem is: what exactly?) and see if it should continue or
> > punt, and if it punts, how does it punt?
> >
> > Anyway, I need to go to sleep now.  If anyone can follow that whole mess
> > and has comments, please write them up.  If I'm making stupid assumptions,
> > tell me <g>
> >
> >       Erik Walthinsen <omega at temple-baptist.com> - System Administrator
> >         __
> >        /  \                GStreamer - The only way to stream!
> >       |    | M E G A        ***** http://gstreamer.net/ *****
> >       _\  /_