[gst-devel] Multi-threaded queueing locking issues
Matt Howell
matth at ridgerun.com
Tue Mar 20 20:54:06 CET 2001
note comments inline...
> > OK, I think I've got the locking code to work properly, and even be quite
> > efficient. This is accomplished with the use of interleaved signals and
> > cond-waits. You have something like this:
> >
> >   thread A:                  thread B:
> >   g_mutex_lock(lock);        g_mutex_lock(lock);
> >   g_cond_signal(cond);       g_cond_wait(cond,lock);
> >   g_cond_wait(cond,lock);    g_cond_signal(cond);
> >   g_mutex_unlock(lock);      g_mutex_unlock(lock);
>
> The above is good, makes it all atomic.
>
> >
> > This is the best way to do interlocking sync points, afaict. Thanks to
> > matth for teaching me about it! How exactly it works is left as an
> > exercise for the reader <g>
>
> What happens in thread A is that the lock is acquired, a signal is sent
> (notifying of some change in something), and then the thread sleeps until
> it receives a signal (while waiting for a signal, the lock is released),
> and then it unlocks the lock.
>
> In thread B, the lock is acquired, then the thread sleeps until it receives
> a signal, then sends a signal of its own and unlocks the lock.
>
> The problem with this approach is that a signal on a condition variable
> is only caught if some thread is already waiting on it.
>
> So if thread B is not yet in the piece of code given above when thread A
> signals, then a deadlock WILL occur. I think you are trying to address this
> in some of the paragraphs below.
you're right: if B isn't waiting on cond, and A signals cond, you'll get out
of sync. you can only use it when you know someone's waiting. e.g.,
thread A (creator):
{
    g_mutex_lock(lock);
    g_thread_create(B);
    g_cond_wait(cond, lock);  // wait for thread B to get started up
                              // (cond_wait releases the lock while waiting)
    // B has started
    g_mutex_unlock(lock);
    // do A stuff
}

thread B (created):
{
    g_mutex_lock(lock);       // blocks thread B until thread A calls wait
    // startup stuff
    g_cond_signal(cond);      // signal that B has done startup... this works
                              // only when you know someone is waiting
    g_mutex_unlock(lock);     // now A can return from wait(cond, lock)
    while (...)               // do B stuff
}
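for concreteness, here's that startup handshake as a compilable sketch, in
plain pthreads rather than glib so it's self-contained. the 'started' flag
and the function names are mine, and the flag is an addition that guards
against spurious wakeups:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int started = 0;            /* predicate for the cond_wait */

    static void *worker (void *arg)    /* thread B (created) */
    {
      pthread_mutex_lock (&lock);      /* blocks until A is in cond_wait */
      /* ... startup stuff ... */
      started = 1;
      pthread_cond_signal (&cond);     /* safe: we know A is waiting */
      pthread_mutex_unlock (&lock);    /* now A can return from cond_wait */
      /* ... do B stuff ... */
      return NULL;
    }

    int main (void)                    /* thread A (creator) */
    {
      pthread_t tid;

      pthread_mutex_lock (&lock);      /* take the lock *before* creating B */
      pthread_create (&tid, NULL, worker, NULL);
      while (!started)                 /* wait for B to get started up */
        pthread_cond_wait (&cond, &lock);
      pthread_mutex_unlock (&lock);
      /* ... do A stuff ... */
      pthread_join (tid, NULL);
      return 0;
    }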
the deadlock problem in gstreamer queue code is something different.
(or at least that's what it looks like to me ;-)
it's not that 2 threads are out of sync with their wait / signal stuff.
instead, signal needs to be called more often, and queue_get() and
queue_chain() need to check for a pending state change in their
while(cond not true, wait) loops and break out.
> > Anyway, that's one problem down, a few more to tackle. Specifically, I've
> > run into the fact that if you have any queues in your thread, they are
> > generally going to be stuck waiting on a condition variable for the other
> > side of the queue to put something in or take something out.
> >
> > When a change_state happens on the thread, it must tell the thread to stop
> > spinning and change the state of all the children. This is done by simply
> > clearing the SPINNING flag on the thread and waiting for it to change
> > state and signal back. This only works if the while(SPINNING) loop has a
> > chance to exit, which means that the bin_iterate() function must exit at
> > some point.
> >
> > When you have queues on thread boundaries, you're going to end up stuck in
> > a cond-wait such that the iterate() will never exit. This is a
> > significant problem, because this puts us in deadlock.
> >
> > The solution I've developed in my head is this: after setting the state
> > to !SPINNING, do a cond_timed_wait with some reasonable, configurable
> > timeout. If this timed-wait times out, we could assume that the thread is
> > stuck somewhere. This can be clarified by having the elements that might
> > block set a flag on the thread during the event that might block, so we
> > know whether it's blocking or just taking a long time.
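(inline note: the timed-wait described above maps onto
pthread_cond_timedwait(3). a rough sketch -- wait_for_ack() and the
millisecond timeout parameter are made up for illustration:)

    #include <pthread.h>
    #include <sys/time.h>

    /* wait for the thread to acknowledge the state change, but give up
       after a configurable timeout; returns ETIMEDOUT if it never does,
       in which case we'd assume the thread is stuck somewhere */
    static int
    wait_for_ack (pthread_cond_t *cond, pthread_mutex_t *lock, long timeout_ms)
    {
      struct timeval  now;
      struct timespec abstime;

      gettimeofday (&now, NULL);
      abstime.tv_sec  = now.tv_sec + timeout_ms / 1000;
      abstime.tv_nsec = now.tv_usec * 1000 + (timeout_ms % 1000) * 1000000;
      if (abstime.tv_nsec >= 1000000000) {
        abstime.tv_sec++;
        abstime.tv_nsec -= 1000000000;
      }
      return pthread_cond_timedwait (cond, lock, &abstime);
    }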
>
> I don't think the above is elegant.
i agree. you shouldn't have to use the timed-wait stuff in this case. also,
you don't want to use system kill/signal to fix thread scheduling.
i was thinking something like this: (pseudo-code :)
gstpad.c::queue_get() - to the while loop, add:

    // somehow, get pendingState
    if (pendingState == PAUSED) { GST_UNLOCK(lock); return NULL; }

gstpad.c::queue_chain() - to the while loop, add the same thing:

    if (pendingState == PAUSED) { GST_UNLOCK(lock); return NULL; }

gstpad.c::change_state() - add:

    if (GST_STATE_PENDING (element) == GST_STATE_PAUSED)
    {
        GST_LOCK(queue);
        signal(emptycond);  // wake up waiting queue_get()'s, if any
        signal(fullcond);   // wake up waiting queue_chain()'s, if any
        GST_UNLOCK(queue);
    }

gstscheduler.c::chain_wrapper() - i'm not sure it's necessary, but probably add:

    if (pendingState == PAUSED) break;  // stop calling pad_pull()'s

there may be other places in the gstscheduler.c code where we'd need to do
something similar.
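to make that concrete, a compilable pthreads model of the idea -- the Queue
struct, the field names, and queue_pause() are all illustrative, not the
actual gstreamer code:

    #include <pthread.h>
    #include <stddef.h>

    typedef struct {
      pthread_mutex_t lock;
      pthread_cond_t  emptycond;      /* signalled when data arrives */
      pthread_cond_t  fullcond;       /* signalled when space frees up */
      int             level;          /* number of buffers queued */
      int             pending_paused; /* a PAUSED state change is pending */
    } Queue;

    void *queue_get (Queue *q)
    {
      void *buf = NULL;

      pthread_mutex_lock (&q->lock);
      while (q->level == 0) {
        /* the fix: bail out instead of sleeping through a state change */
        if (q->pending_paused) {
          pthread_mutex_unlock (&q->lock);
          return NULL;
        }
        pthread_cond_wait (&q->emptycond, &q->lock);
      }
      /* ... dequeue a buffer into buf, decrement level ... */
      pthread_cond_signal (&q->fullcond);  /* there's space again */
      pthread_mutex_unlock (&q->lock);
      return buf;
    }

    void queue_pause (Queue *q)          /* the change_state() side */
    {
      pthread_mutex_lock (&q->lock);
      q->pending_paused = 1;
      /* wake up both sides so their while loops can see the flag */
      pthread_cond_signal (&q->emptycond);
      pthread_cond_signal (&q->fullcond);
      pthread_mutex_unlock (&q->lock);
    }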
thoughts?
matt.
> > When we find ourselves stuck with a blocked thread, we have to somehow
> > unblock it in a way that doesn't cause significant pain. My thought would
> > be to fire a signal() at it, which the thread would catch with a custom
> > signal handler. This handler would simply cause a cothread_switch() back
> > to the 0th thread, which puts it right back into the middle of
> > bin_iterate(). A flag can tell iterate() to stop everything and return
> > from that point.
>
> Again, it doesn't seem elegant IMO.
>
> >
> > Now, there are quite a few concerns here. First of all, this signal is
> > going to come right in the middle of the queue's cond_wait. From my quick
> > look through the linuxthreads code and minimal understanding of POSIX
> > signals, I really can't say whether this is a problem or not. I'd guess
> > not.
>
> Mixing POSIX signals and threads is generally not a good idea; it also
> hampers portability to non-POSIX OSes or broken POSIX implementations.
>
> >
> > Next, doing a cothread switch in the middle of this mangled context could
> > be very scary. I've noticed sigsetjmp(3), which claims to save the
> > blocked-signals state. This seems apropos, since the linuxthreads code
> > for a suspend is:
> >
> >   sigdelset(&mask, __pthread_sig_restart); /* Unblock the restart signal */
> >   do {
> >     sigsuspend(&mask);                      /* Wait for signal */
> >   } ...
> >
> > If setjmp/longjmp mangle the blocked-signals list, this could cause
> > significant problems. If the restart signal (SIGUSR1) ends up blocked
> > when we switch back from the setjmp to cothread 0, we'll be stuck in the
> > cond_wait forever (possibly even beyond parent death, ick).
> > Experimentation is needed to determine if we have to do
> > sigsetjmp/siglongjmp in these cases. If so, we need to determine the
> > overhead of sigsetjmp/siglongjmp every time vs. checking to see if we need
> > to use it for each jump.
> >
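(inline note: sigsetjmp's second argument is what controls this -- with a
nonzero savemask it saves the blocked-signal mask, and siglongjmp restores
it. a minimal illustration, nothing gstreamer-specific:)

    #include <setjmp.h>

    static sigjmp_buf env;

    void switch_point (void)
    {
      /* sigsetjmp(env, 1) saves the current blocked-signal mask in env;
         siglongjmp(env, 1) restores it on the way back.  that's exactly
         the property the cothread switch needs so the restart signal
         (SIGUSR1 in linuxthreads) doesn't stay blocked after the jump. */
      if (sigsetjmp (env, 1) == 0) {
        /* first return: mask saved, go do the cothread switch */
      } else {
        /* came back via siglongjmp: blocked-signal mask is restored */
      }
    }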
> > Next comes the question of what happens when we try to perform a
> > cond-wait in the thread-interlock code in cothread 0 while we have an
> > interrupted cond-wait sitting there waiting for the queue to signal. Is
> > the thread cond-wait going to somehow trigger the queue's cond-wait, and
> > if so I really don't want to think about the mayhem that will cause when
> > it comes back on the wrong stack. I'm guessing it will work though,
> > because linuxthreads uses queues internally, and afaik the signal handler
> > installed for the thread's cond-wait will always trigger and remove itself
> > before the thread's cond-wait even has a chance to pop back to the top of
> > the signal-handler stack.
>
> It is not defined which thread will wake up if more than one thread is
> waiting on a condition variable.
>
> Unfortunately, my head's not up to looking through the next few paragraphs
> now :)
>
> Once I have had a few drops of caffeine and a deadline out of the way,
> then I will look through the rest of the paragraphs.
>
> Zaheer
>
> >
> > Then, when we restart the thread, we want to jump right back where we came
> > from, which would be the signal handler that interrupted the queue's
> > cond-wait in the first place. I would assume that the signal context
> > would cleanly unwrap, and the code would end up back in the middle of the
> > sigsuspend in the code shown above. It may do a cycle through the
> > do/while, but that's what it's there for.
> >
> > The problem is: what if the other side of the queue signals while the
> > thread is shut down? First thing is that it's going to trigger the
> > cond-wait signal handler that actually belongs to the thread at that
> > point. This isn't such a big deal from the thread's point of view, since
> > it's just a spurious wakeup. It'll go back to sleep since it wasn't woken
> > up for the reason it was waiting for. The side-effect of this is that the
> > signal that the queue was waiting for to unblock itself gets lost. The
> > other side of the queue has no idea that this is the case, and since the
> > queue is no longer either empty or full, it won't *ever* signal again.
> > Oops.
> >
> > That means that somehow we have to work in an interlock in queue
> > signaling, such that the signaling side knows whether the waiting side has
> > actually woken up. So we end up with a protected boolean that the waiting
> > side sets before it goes to sleep, and unsets as it wakes up. The
> > signaling side would keep signaling until it sees that this boolean has
> > been cleared, indicating that the waiting side got what it wanted.
> >
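(inline note: the 'waiting' boolean protocol described above would look
roughly like this; all names are illustrative:)

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int waiting = 0;
    static int data_ready = 0;

    void waiter (void)
    {
      pthread_mutex_lock (&lock);
      waiting = 1;                   /* announce we're going to sleep */
      while (!data_ready)
        pthread_cond_wait (&cond, &lock);
      waiting = 0;                   /* announce the signal got through */
      pthread_mutex_unlock (&lock);
    }

    void signaler (void)
    {
      for (;;) {
        pthread_mutex_lock (&lock);
        data_ready = 1;
        pthread_cond_signal (&cond);
        if (!waiting) {              /* waiter saw it and cleared the flag */
          pthread_mutex_unlock (&lock);
          break;
        }
        pthread_mutex_unlock (&lock);
        /* otherwise keep signaling until 'waiting' clears -- this retry
           loop is exactly what goes wrong in the scenario described next */
      }
    }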
> > This fails quite rapidly, since when the thread is sitting waiting to
> > start spinning again, it'll be doing cond-waits. Since the signaling side
> > of the queue presumably could signal during this, and it does so with the
> > same SIGUSR1 that the thread's parent would, it would spuriously awaken
> > the thread. Since the signal succeeded, but the 'waiting' boolean doesn't
> > get reset, the signaling side of the queue tries again. And again. And
> > again. The machine grinds to a halt with massive switching and
> > mis-signaling.
> >
> > One solution is for the signaling side of the queue to wait for some amount
> > of time after each failed signal. This would have the effect of
> > limiting, if not eliminating, the spurious wakeups. Another solution is
> > to make the queue smarter, and if the signal comes back without the
> > 'waiting' boolean having been cleared, it checks the state of the queue
> > before trying again.
> >
> > Hrm, that's an interesting problem, one that extends outside the scope of
> > this issue but may also solve it. The queue only has one state variable,
> > since it's only a single element. This also means that there's only one
> > parent, which could either be the thread or its parent. This element
> > state could be sufficient to keep the signaling of the queue from
> > happening when it might get lost.
> >
> > If the queue is inside the thread, it will get interrupted, and set to
> > !PLAYING. If the other side of the queue gets called at that point and
> > has reason to signal, it can simply check the state of the element. It
> > would still want to check the 'waiting' boolean after signaling, in case
> > there's a race (or some other locking can be done). The problem then is
> > what to wait on? One possibility is to just give up and finish the push or
> > pull and leave the 'waiting' boolean set. This would mean that every push
> > or pull after that would cause another attempt to wake up the other half
> > of the thread. This isn't so bad, necessarily.
> >
> > If the queue is outside the thread in question, we have a harder problem.
> > The state of the queue won't change when the thread leaves PLAYING, but at
> > least the queue can still check the state of the thread in a very
> > roundabout way by checking the state of the element peered to the other
> > side of the queue. Hmmmm...
> >
> > You'll notice that I never suggested that we simply trick the queue into
> > waking up and just go from there. We have to interrupt it in the middle
> > for one simple reason: the queue isn't the only thing that might block.
> > Anything that talks to the kernel could theoretically block. This mostly
> > includes things that go over the network, or even to disk in some cases
> > (though I'd rather assume that disk accesses are reasonably bounded).
> > Unless we also have a way of dealing with these kinds of elements and
> > interrupting their reads or writes, we can't special-case the queue. And
> > besides, that starts to put a lot more smarts into these kinds of
> > elements.
> >
> > Another option might be to require that anything that might block be
> > written with a timeout on any block-able call. This could work, but again
> > puts the burden on the plugin-writer to check with the scheduler (or
> > something, the problem is: what exactly?) and see if it should continue or
> > punt, and if it punts, how does it punt?
> >
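(inline note: a sketch of what "a timeout on any block-able call" might look
like for an element reading from a file descriptor. the 100 ms cap is
arbitrary and should_punt() is a made-up scheduler hook, not a real API:)

    #include <sys/select.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    extern int should_punt (void);   /* hypothetical scheduler check */

    ssize_t bounded_read (int fd, void *buf, size_t len)
    {
      for (;;) {
        fd_set fds;
        struct timeval tv = { 0, 100000 };  /* cap each block at 100 ms */

        FD_ZERO (&fds);
        FD_SET (fd, &fds);
        if (select (fd + 1, &fds, NULL, NULL, &tv) > 0)
          return read (fd, buf, len);       /* data ready: won't block */
        if (should_punt ())                 /* timed out: ask scheduler */
          return -1;
      }
    }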
> > Anyway, I need to go to sleep now. If anyone can follow that whole mess
> > and has comments, please write them up. If I'm making stupid assumptions,
> > tell me <g>
> >
> > Erik Walthinsen <omega at temple-baptist.com> - System Administrator
> > __
> > / \ GStreamer - The only way to stream!
> > | | M E G A ***** http://gstreamer.net/ *****
> > _\ /_