[Intel-gfx] [PATCH] drm/i915: fix wait_for_pending_flips vs gpu hang deadlock

Sun Sep 8 14:24:28 CEST 2013

On Sun, Sep 08, 2013 at 02:12:35PM +0200, Daniel Vetter wrote:
> On Sun, Sep 08, 2013 at 12:56:22PM +0100, Chris Wilson wrote:
> > On Sun, Sep 08, 2013 at 01:30:17PM +0200, Daniel Vetter wrote:
> [Snip]
> 
> I'll frob the comments a bit as suggested.
> 
> > > +		 * doesn't deadlock trying to grab various locks. We must do
> > > +		 * this for _all_ gpu waiters that might hold locks that the
> > > +		 * reset work needs to acquire.
> > >  		 */
> > >  		for_each_ring(ring, dev_priv, i)
> > >  			wake_up_all(&ring->irq_queue);
> > > +
> > > +		wake_up_all(&dev_priv->pending_flip_queue);
> > 
> > i915_error_wake_up();
> > 
> > or i915_error_wake_up_mutexes()? (Though this is slightly misleading,
> > but it is the intent.) Not 100% convinced that making using a common
> > function is right semantically...
> 
> Yeah, I'm also a bit unhappy how unstructured this magic dance is.
> Semantically I think the two wake-ups aren't the same:
> 
> - The first wakeups (after we've set the pending reset flag) is just to
>   get lock holders to bail out and drop their locks.
> 
> - The second set of wakupes (after the reset has completed) is to signal
>   waiters that a state change has happened. I'm not really sure whether we
>   need this second set of wake-ups (with the exceptoin of
>   gpu_erro.reset_queue of cours) since waiters shouldn't be able to miss
>   the reset pending state. And if they block after the reset has completed
>   they should see the consistent post-reset state and nothing else.
> 
>   Hm, on second thought we need the wake-ups to serve as an implicit
>   barrier.
> 
> Another thing I've tried to ponder is whether we could enlist the help of
> lockdep somehow to mark up these critical sections. But I dit not see a
> way to do this so that lockdep would complain if we'd miss a wakup for a
> special lock ...

To elaborate a bit on why I think lockdep would be suitable: The entire
time the pending reset flag is set is in a way a special critical section.
Lockdep doesn't support acquire/release from different threads, but we
could easily fix that by shuffling a few things around.

Blocking on the gpu otoh is like a read-lock on that gpu reset critical
section. So to make locking sound we need to make sure we always have
trylock semantics.

One way to do this would be by using a specila wait queue in addition to
all the ones we already have just to signal gpu resets. And a special
macro like i915_wait_event that adds the process to both waitqueues and
encapsulates the reset pending handling.

With that lockdep would complain if we acquire a new lock in the reset
handler or if we add a new gpu waiter. But atm lockdep doesn correctly
report read vs. write lock deadlocks, so meh.

Also there's kinda no way to ensure waiters don't use one of the plaine
wait_event macros. So I think overall not worth the effort.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch