[Nouveau] [PATCH v2 4/4] drm/nouveau: gpu lockup recovery

Ben Skeggs bskeggs at redhat.com
Wed May 2 06:48:13 PDT 2012


On Wed, 2012-05-02 at 15:33 +0200, Martin Peres wrote:
> On 02/05/2012 13:28, Ben Skeggs wrote:
> > Right, again, I don't disagree :)  I think we can improve a lot on the
> > big-hammer-suspend-the-gpu solution though, and instead reset only the
> > faulting engine.  It's (in theory) almost possible for us to do now, but
> > I have a couple of reworks to areas related to this pending (basically,
> > making the various driver subsystems more independent), which should be
> > ready soon.  This'll go a long way to making it very easy to reset a
> > single engine, and likely result in *far* faster recovery from hangs.
> Hey,
> 
> What about kicking a channel that put the card in a bad state? Wouldn't 
> that be possible?
> 
> This way, we don't lose the context of other channels and only the 
> application that hung the card will be exited.
That's pretty much the idea.  The trouble comes in where PFIFO will hang
waiting for the stuck engine to report that it's done (e.g. it will wait
for PGRAPH to go "I've finished unloading my context now" after it's
told PGRAPH to do so).

That's why it's important to be able to (preferably) un-stick the stuck
engine (usually handling the appropriate interrupts properly will
achieve this), and failing that, reset it and lose the context for just
that channel.
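
To make the escalation order concrete, here is a minimal sketch in
plain C.  It is a toy model, not nouveau code: every type and function
name below is hypothetical, and the "hardware" is just a pair of flags.
It only illustrates the policy described above: try to un-stick the
engine first, and only if that fails reset that one engine and drop the
faulting channel, leaving other channels alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names come from the nouveau driver. */
enum recovery_result {
    RECOVERY_UNSTUCK,      /* engine resumed, no context lost */
    RECOVERY_ENGINE_RESET  /* engine reset, faulting channel's context lost */
};

struct channel {
    int id;
    bool alive;            /* false once its context has been dropped */
};

struct engine {
    const char *name;
    bool stuck;
    /* Assumption for the model: would servicing the pending
     * interrupts be enough to get the engine moving again? */
    bool irq_handling_unsticks;
};

/* Preferred path: handle pending interrupts and see if the engine recovers. */
static bool engine_try_unstick(struct engine *eng)
{
    if (eng->irq_handling_unsticks)
        eng->stuck = false;
    return !eng->stuck;
}

/* Fallback path: reset this engine only, not the whole GPU. */
static void engine_reset(struct engine *eng)
{
    eng->stuck = false;
}

/* Escalate from un-sticking to a single-engine reset.  Only the
 * faulting channel is marked dead; other channels are never touched. */
static enum recovery_result recover_engine(struct engine *eng,
                                           struct channel *faulting)
{
    assert(eng != NULL && faulting != NULL);

    if (engine_try_unstick(eng))
        return RECOVERY_UNSTUCK;

    engine_reset(eng);
    faulting->alive = false;
    return RECOVERY_ENGINE_RESET;
}
```

In this model the caller would look at the return value to decide
whether the owning application needs to be told its channel is gone.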

The work I'm doing at the moment will, among other nice things, make
handling all of this a lot nicer.  It should also be nice and speedy in
comparison to the suspend/resume option: we won't have to evict all
buffers from VRAM without accel, which can take quite a while (not to
mention that it might not even be possible to get to the VRAM not mapped
into the FB BAR on earlier chipsets if accel dies).

> 
> I wonder how pfifo handles commands sent to a non-existing channel, but 
> I'm sure it shouldn't hang or anything.
It can't happen anyway: if we had destroyed the fifo context for a
channel, we wouldn't still be telling it to execute commands :)

> 
> Anyway, if it is not possible to kick only one channel, then what 
> about kicking all channels, re-POSTing the card and using KMS to output 
> the lockup report (and send a notification of the report through udev 
> and store the report in a sysfs file)?
> 
> Let's not try to be perfect, let us just be able to do better bug reports.
I'm still skeptical about how useful any kind of generic "lockup report"
can possibly be, beyond kernel logs.  However, as part of the work I'm
doing, there may be some additional information available via
debugfs.  I don't want to elaborate on this too much until I wrap
my head around what exactly I want to achieve, but I'll give you a
heads-up once I do :)

Ben.

> 
> Martin