[Nouveau] deadlock possiblity introduced by "drm/nouveau: use drm_mm in preference to custom code doing the same thing"

Mon Jul 26 18:23:14 PDT 2010

On Mon, Jul 26, 2010 at 07:15:47PM +0200, Francisco Jerez wrote:
> Marcin Slusarz <marcin.slusarz at gmail.com> writes:
> 
> > On Sun, Jul 11, 2010 at 11:02:12AM +1000, Ben Skeggs wrote:
> >> On Sun, 2010-07-11 at 01:24 +0200, Marcin Slusarz wrote:
> >> > Hi
> >> > 
> >> > Patch "drm/nouveau: use drm_mm in preference to custom code doing the same thing"
> >> > in nouveau tree introduced new deadlock possibility, for which lockdep complains loudly:
> >> > 
> >> > (...)
> >> > 
> >> Hey,
> >> 
> >> Thanks for the report, I'll look at this more during the week.
> >> 
> >> > Deadlock scenario looks like this:
> >> > CPU1                                        CPU2
> >> > nouveau code calls some drm_mm.c
> >> > function which takes mm->unused_lock
> >> > 
> >> >                                             nouveau_channel_free disables irqs and takes dev_priv->context_switch_lock
> >> >                                                             calls nv50_graph_destroy_context which
> >> >                                                             (... backtrace above)
> >> >                                                             calls drm_mm_put_block which tries to take mm->unused_lock (spins)
> >> > nouveau interrupt raises
> >> > 
> >> > nouveau_irq_handler tries to take
> >> > dev_priv->context_switch_lock (spins)
> >> > 
> >> > deadlock
> >> It's important to note that the drm_mm referenced eventually by
> >> nv50_graph_destroy_context is per-channel on the card, so for the
> >> deadlock to happen it'd have to be multiple threads from a single
> >> process, one thread creating/destroying and object on the channel while
> >> another was trying to destroy the channel.
> >> 
> 
> Yeah, and that situation is impossible ATM because those functions are
> called with the BKL held.

Yeah, but BKL is on the way out of kernel.

> >> > 
> >> > Possible solutions:
> >> > - reverting "drm/nouveau: use drm_mm in preference to custom code doing the same thing"
> >> > - disabling interrupts before calling drm_mm functions - unmaintainable and still
> >> >   deadlockable in multicard setups (nouveau and eg radeon)
> >> Agreed it's unmaintainable, but, as mentioned above, the relevant locks
> >> can't be touched by radeon.
> >> 
> >> > - making mm->unused_lock HARDIRQ-safe (patch below) - simple but with slight overhead
> >> I'll look more during the week, there's other solutions to be explored.
> >
> > So, did you find other solution?
> 
> Some random ideas:
> 
>  - Make context_switch_lock HARDIRQ-unsafe. To avoid racing with the IRQ
>    handler we'd have to disable interrupt dispatch on the card before
>    taking context_switch_lock (i.e. at channel creation and destruction
>    time), and the interrupt control registers would have to be protected
>    with a IRQ safe spinlock.

Pretty heavy weight...

>  - Split the current destroy_context() hooks in two halves, the first
>    one would be in charge of the PFIFO/PGRAPH-poking tasks (e.g.
>    disable_context()), and the second one would take care of releasing
>    the allocated resources (and it wouldn't need locking).

Doable, but it might complicate this code significantly...

I didn't dig into this, but maybe adding lockdep annotation will be enough?

Marcin