[Nouveau] [RFC PATCH v2 0/5] More explicit pushbuf error handling

Wed Sep 2 03:01:09 PDT 2015

On 01/09/15 16:26, Ben Skeggs wrote:
> On 31 August 2015 at 21:38, Konsta Hölttä <kholtta at nvidia.com> wrote:
>> Hi there,
>>
>> Resending these now that they've had some more polish and testing, and I heard
>> that Ben's vacation is over :-)
>>
>> These patches work as a starting point for more explicit error mechanisms and
>> better robustness. At the moment, when a job hangs or faults, it seems that
>> nouveau doesn't quite know how to handle the situation and often results in a
>> hang. Some of these situations would require either completely resetting the
>> gpu, and/or a complex path for only recovering the broken channel.
> Yes, this has been an issue that's been on my "really nice to
> implement" list for quite a long time now, and only gotten around to
> getting it partially done (as you can see).  Thanks for looking at
> this!
>
>> To start, I worked on support for letting userspace know what exactly happened.
>> Proper recovery would come later. The "error notifier" in the first patch is a
>> simple shared buffer between kernel and userspace. Its error codes match
>> nvgpu's. Alternatively, the status could be queried with an ioctl, but that
>> would be considerably more heavyweight. I'd like to know if the event mechanism
>> is meant for these kinds of events at all (engines notify errors upwards to the
>> drm layer). Another alternative would probably be to register the same buffer
>> to all necessary engines separately in nvif method calls? Or register it to
>> just one (e.g., fifo) and get that engine when errors happen in others (e.g.,
>> gr)? And drm handles probably wouldn't fit there? Please comment on this; I
>> wrote this before understanding the mthd mechanism.
> For the moment, I've just got a couple of high-level
> comments/questions on these patches so far before we worry about
> nit-picking the details:
>
> Is there any real need to have a full-blown notifier-style object to
> pass this info back to userspace?  The DRM has an event mechanism
> where userspace could poll on its fd to get events, which is what I'd
> assume you'd be doing inside of fence waits with the error notifier?
> NVIF events are actually hooked up to this mechanism already, so
> userspace could directly request them and cut quite a lot out of that
> first patch.  But, the NVKM-level passing of these events back over
> NVIF is pretty much where I was going with the design too.  So, good
> to see you've done that :)
I hadn't heard about that event mechanism yet - do you mean just 
drm_poll() and ->event_wait that is woken up in usif_notify()? That 
doesn't have any values that could be passed to userland. Is there a way 
for that too already? The motivation for the current approach is that 
it's lightweight, not requiring any kernel calls (and is compatible with 
nvidia gl drivers...)

I don't actually know all the details how higher-level parts of the 
driver use the buffer :( just assuming things atm. I'd guess that the 
buffer would be read after each submit (after a fence wait has finished, 
which is a separate thing) if checking the status is required. My 
understanding is that at least the desktop driver has (or used to have) 
a shared buffer for lots of common stuff and this would be part of it 
(hence the offset field), which would complicate the whole driver stack 
if this were done differently. In Tegra's libdrm-equivalent this is 
initialized on channel creation, so could be in fact part of channel 
init args in nouveau?

I considered also that the buffer could be just one page and allocated 
in the kernel, but as said, our current drivers want to be able to pass 
a separate buffer there.

>
>> Additionally, priority and timeout management for separate channels in flight
>> on the gpu is added in two patches. Neither is exactly what the name says, but
>> the effect is the same, and this is what nvgpu does currently. Those two
>> patches call the fifo channel object's methods directly from userspace, so a
>> hack is added in the nvif path to accept that. The objects are NEW'd from
>> kernel space, so calling from userspace isn't allowed, as it appears. How
>> should this be managed in a clean way?
> Is there any requirement to be able to adjust the priority of
> in-flight channels?  Or would it be satisfactory to have these as
> arguments to the channel creation functions?  Either way is fine, the
> latter is preferred if there's no use case for adjusting it afterwards
> though.
Maybe not exactly, but there is a technical requirement forced by nvidia 
userspace driver that doesn't enforce this to be done at creation time, 
but allows it at any time. I looked at that option too when starting to 
write these patches and yes, that (specifying at init time) would've 
simplified stuff.

It's certainly possible that the timeout/priority is set only later in 
userspace after finding out that this is e.g. a webgl process requiring 
to finish faster. That's one plausible usecase.

>
> I'm attempting to get some better channel handling (you might have
> noticed the current stuff the kernel uses is a bit fragile, it's
> evolved over a long time and several generations of channel classes)
> in place in time for 4.4, part of which will likely involve giving
> userspace better control over the arguments that get passed at channel
> creation time.
The current stuff is at least not trivial to grasp initially :) I'd love 
to also help to write documentation for lots of things if time permits - 
I've been taking some notes for myself, but a general nvkm glossary and 
general reasoning about how things are done the way they are would 
definitely help new folks to get to know the codebase (wasn't easy for 
me). The problem is that I only could write about how things are done, 
not why. Getting to know all that would require lots of discussion (and 
time).

>
> As for the permissions problem, that can be resolved fairly easily I
> think, I'll play with a few ideas and see what I come up with.

Good to hear, thanks! A bit related question to this: what exactly 
determines the level of control kernel vs userspace have for the 
channels? Management rights for nouveau_bo's? Looks like some stuff of 
nouveau_chan.c could live in userspace too.

Cheers,
Konsta

>
> Thanks,
> Ben.
>
>> Also, since nouveau often hangs on errors, the userspace hangs too (waiting on
>> a fence). The final patch attempts to fix this in a couple of specific error
>> paths to forcibly update all fences to be finished. I'd like to hear how that
>> would be handled properly - consider the patch just a proof-of-concept and
>> sample of what would be necessary.
>>
>> I don't expect the patches to be accepted as-is - as a newbie, I'd appreciate
>> any high-level comments on if I've understood anything, especially the event
>> and nvif/method mechanisms (I use the latter from userspace with a hack
>> constructed from the perfmon branch seen here earlier into nvidia's internal
>> libdrm-equivalent). The fence-forcing thing is something that is necessary with
>> the error notifiers (at least with our userspace that waits really long or
>> infinitely on fences). I'm working specifically on Tegra and don't know much
>> about the desktop's userspace details, so I may be biased in some areas.
>>
>> I'd be happy to write sample tests on e.g. libdrm for the new methods once the
>> kernel patches would get to a good shape, if that's required for accepting new
>> features. I tested these to work as a proof-of-concept on Jetson TK1, and the
>> code is adapted from the latest nvgpu.
>>
>> The patches can also be found in http://github.com/sooda/nouveau and are based
>> on a version of gnurou/staging.
>>
>> Thanks!
>> Konsta (sooda in IRC)
>>
>> Konsta Hölttä (5):
>>    notify channel errors to userspace
>>    don't verify route == owner in nvkm ioctl
>>    gk104: channel priority/timeslice support
>>    gk104: channel timeout detection
>>    HACK force fences updated on error
>>
>>   drm/nouveau/include/nvif/class.h       |  20 ++++
>>   drm/nouveau/include/nvif/event.h       |  12 +++
>>   drm/nouveau/include/nvkm/engine/fifo.h |   5 +-
>>   drm/nouveau/nouveau_chan.c             |  95 +++++++++++++++++++
>>   drm/nouveau/nouveau_chan.h             |  10 ++
>>   drm/nouveau/nouveau_drm.c              |   1 +
>>   drm/nouveau/nouveau_fence.c            |  13 ++-
>>   drm/nouveau/nouveau_gem.c              |  29 ++++++
>>   drm/nouveau/nouveau_gem.h              |   2 +
>>   drm/nouveau/nvkm/core/ioctl.c          |   9 +-
>>   drm/nouveau/nvkm/engine/fifo/base.c    |  56 ++++++++++-
>>   drm/nouveau/nvkm/engine/fifo/gf100.c   |   2 +-
>>   drm/nouveau/nvkm/engine/fifo/gk104.c   | 166 ++++++++++++++++++++++++++++++---
>>   drm/nouveau/nvkm/engine/fifo/nv04.c    |   2 +-
>>   drm/nouveau/nvkm/engine/gr/gf100.c     |   5 +
>>   drm/nouveau/uapi/drm/nouveau_drm.h     |  13 +++
>>   16 files changed, 415 insertions(+), 25 deletions(-)
>>
>> --
>> 2.1.4
>>
>> _______________________________________________
>> Nouveau mailing list
>> Nouveau at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/nouveau
> --
> To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html