[gst-devel] Multi-threaded queueing locking issues
zaheer at grid9.net
Mon Mar 19 15:51:49 CET 2001
On Mon, 19 Mar 2001, Erik Walthinsen wrote:
> OK, I think I've got the locking code to work properly, and even be quite
> efficient. This is accomplished with the use of interleaved signals and
> cond-waits. You have something like this:
>
> thread A:                     thread B:
>   g_mutex_lock(lock);           g_mutex_lock(lock);
>   g_cond_signal(cond);          g_cond_wait(cond,lock);
>   g_cond_wait(cond,lock);       g_cond_signal(cond);
>   g_mutex_unlock(lock);         g_mutex_unlock(lock);
The above is good; it makes the whole handshake atomic.
>
> This is the best way to do interlocking sync points, afaict. Thanks to
> matth for teaching me about it! How exactly it works is left as an
> exercise for the reader <g>
In thread A, the lock is taken, a signal is sent (notifying the other
thread of some change), and then the thread sleeps until it receives a
signal back (while it waits, the lock is released); it then unlocks the
lock.
In thread B, the lock is taken, the thread sleeps until it receives a
signal, and it then sends a signal back and unlocks the lock.
The problem with this approach is that a signal on a condition variable
is only caught if some thread is already waiting on it.
So if thread B has not yet reached the cond_wait in the code above when
thread A signals, the signal is lost and a deadlock WILL occur. I think
you are trying to address this in some of the paragraphs below.
>
> Anyway, that's one problem down, a few more to tackle. Specifically, I've
> run into the fact that if you have any queues in your thread, they are
> generally going to be stuck waiting on a condition variable for the other
> side of the queue to put something in or take something out.
>
> When a change_state happens on the thread, it must tell the thread to stop
> spinning and change the state of all the children. This is done by simply
> clearing the SPINNING flag on the thread and waiting for it to change
> state and signal back. This only works if the while(SPINNING) loop has a
> chance to exit, which means that the bin_iterate() function must exit at
> some point.
>
> When you have queues on thread boundaries, you're going to end up stuck in
> a cond-wait such that the iterate() will never exit. This is a
> significant problem, because this puts us in deadlock.
>
> The solution I've developed in my head is this: after setting the state
> to !SPINNING, do a cond_timed_wait with some reasonable, configurable
> timeout. If this timed-wait times out, we could assume that the thread is
> stuck somewhere. This can be clarified by having the elements that might
> block set a flag on the thread during the event that might block, so we
> know whether it's blocking or just taking a long time.
I don't think the above is elegant.
>
> When we find ourselves stuck with a blocked thread, we have to somehow
> unblock it in a way that doesn't cause significant pain. My thought would
> be to fire a signal() at it, which the thread would catch with a custom
> signal handler. This handler would simply cause a cothread_switch() back
> to the 0th thread, which puts it right back into the middle of
> bin_iterate(). A flag can tell iterate() to stop everything and return
> from that point.
Again, it doesn't seem elegant, IMO.
>
> Now, there are quite a few concerns here. First of all, this signal is
> going to come right in the middle of the queue's cond_wait. From my quick
> look through the linuxthreads code and minimal understanding of POSIX
> signals, I really can't say whether this is a problem or not. I'd guess
> not.
Mixing POSIX signals and threads is generally not a good idea; it also
hampers portability to non-POSIX OSes or to broken POSIX
implementations.
>
> Next, doing a cothread switch in the middle of this mangled context could
> be very scary. I've noticed sigsetjmp(3), which claims to save the
> blocked-signals state. This seems apropos, since the linuxthreads code
> for a suspend is:
>
> sigdelset(&mask, __pthread_sig_restart); /* Unblock the restart signal */
> do {
> sigsuspend(&mask); /* Wait for signal */
> } ...
>
> If setjmp/longjmp mangle the blocked-signals list, this could cause
> significant problems. If the restart signal (SIGUSR1) ends up blocked
> when we switch back from the setjmp to cothread 0, we'll be stuck in the
> cond_wait forever (possibly even beyond parent death, ick).
> Experimentation is needed to determine if we have to do
> sigsetjmp/siglongjmp in these cases. If so, we need to determine the
> overhead of sigsetjmp/siglongjmp every time vs. checking to see if we need
> to use it for each jump.
>
> Next comes the question of what happens when we try to perform a
> cond-wait in the thread-interlock code in cothread 0 while we have an
> interrupted cond-wait sitting there waiting for the queue to signal. Is
> the thread cond-wait going to somehow trigger the queue's cond-wait, and
> if so I really don't want to think about the mayhem that will cause when
> it comes back on the wrong stack. I'm guessing it will work though,
> because linuxthreads uses queues internally, and afaik the signal handler
> installed for the thread's cond-wait will always trigger and remove itself
> before the thread's cond-wait even has a chance to pop back to the top of
> the signal-handler stack.
It is not defined which thread will wake up if more than one thread is
waiting on a condition variable.
Unfortunately, my head's not up to looking through the next few
paragraphs now :)
Once I have had a few drops of caffeine and a deadline out of the way,
I will look through the rest of the paragraphs.
Zaheer
>
> Then, when we restart the thread, we want to jump right back where we came
> from, which would be the signal handler that interrupted the queue's
> cond-wait in the first place. I would assume that the signal context
> would cleanly unwrap, and the code would end up back in the middle of the
> sigsuspend in the code shown above. It may do a cycle through the
> do/while, but that's what it's there for.
>
> The problem is: what if the other side of the queue signals while the
> thread is shut down? First thing is that it's going to trigger the
> cond-wait signal handler that actually belongs to the thread at that
> point. This isn't such a big deal from the thread's point of view, since
> it's just a spurious wakeup. It'll go back to sleep since it wasn't woken
> up for the reason it was waiting for. The side-effect of this is that the
> signal that the queue was waiting for to unblock itself gets lost. The
> other side of the queue has no idea that this is the case, and since the
> queue is no longer either empty or full, it won't *ever* signal again.
> Oops.
>
> That means that somehow we have to work in an interlock in queue
> signaling, such that the signaling side knows whether the waiting side has
> actually woken up. So we end up with a protected boolean that the waiting
> side sets before it goes to sleep, and unsets as it wakes up. The
> signaling side would keep signaling until it sees that this boolean has
> been cleared, indicating that the waiting side got what it wanted.
>
> This fails quite rapidly, since when the thread is sitting waiting to
> start spinning again, it'll be doing cond-waits. Since the signaling side
> of the queue presumably could signal during this, and it does so with the
> same SIGUSR1 that the thread's parent would, it would spuriously awaken
> the thread. Since the signal succeeded, but the 'waiting' boolean doesn't
> get reset, the signaling side of the queue tries again. And again. And
> again. The machine grinds to a halt with massive switching and
> mis-signaling.
>
> One solution is for the signaling side of the queue to wait for some amount
> of time after each failed signal. This would have the effect of
> limiting, if not eliminating, the spurious wakeups. Another solution is
> to make the queue smarter, and if the signal comes back without the
> 'waiting' boolean having been cleared, it checks the state of the queue
> before trying again.
>
> Hrm, that's an interesting problem, that extends outside the scope of this
> issue, but may also solve it. The queue only has one state variable,
> since it's only a single element. This also means that there's only one
> parent, which could either be the thread or its parent. This element
> state could be sufficient to keep the signaling of the queue from
> happening when it might get lost.
>
> If the queue is inside the thread, it will get interrupted, and set to
> !PLAYING. If the other side of the queue gets called at that point and
> has reason to signal, it can simply check the state of the element. It
> would still want to check the 'waiting' boolean after signaling, in case
> there's a race (or some other locking can be done). The problem then is
> what to wait on? One possibility is to just give up and finish the push or
> pull and leave the 'waiting' boolean set. This would mean that every push
> or pull after that would cause another attempt to wake up the other half
> of the thread. This isn't so bad, necessarily.
>
> If the queue is outside the thread in question, we have a harder problem.
> The state of the queue won't change when the thread leaves PLAYING, but at
> least the queue can still check the state of the thread in a very
> roundabout way by checking the state of the element peered to the other
> side of the queue. Hmmmm...
>
> You'll notice that I never suggested that we simply trick the queue into
> waking up and just go from there. We have to interrupt it in the middle
> for one simple reason: the queue isn't the only thing that might block.
> Anything that talks to the kernel could theoretically block. This mostly
> includes things that go over the network, or even to disk in some cases
> (though I'd rather assume that disk accesses are reasonably bounded).
> Unless we also have a way of dealing with these kinds of elements and
> interrupting their reads or writes, we can't special-case the queue. And
> besides, that starts to put a lot more smarts into these kinds of
> elements.
>
> Another option might be to require that anything that might block be
> written with a timeout on any block-able call. This could work, but again
> puts the burden on the plugin-writer to check with the scheduler (or
> something, the problem is: what exactly?) and see if it should continue or
> punt, and if it punts, how does it punt?
>
> Anyway, I need to go to sleep now. If anyone can follow that whole mess
> and has comments, please write them up. If I'm making stupid assumptions,
> tell me <g>
>
> Erik Walthinsen <omega at temple-baptist.com> - System Administrator
> GStreamer - The only way to stream! http://gstreamer.net/
> _______________________________________________
> gstreamer-devel mailing list
> gstreamer-devel at lists.sourceforge.net
> http://lists.sourceforge.net/lists/listinfo/gstreamer-devel