[Nouveau] [RFC PATCH v2 0/5] More explicit pushbuf error handling
Konsta Hölttä
kholtta at nvidia.com
Wed Sep 2 03:13:49 PDT 2015
On 02/09/15 13:01, Konsta Hölttä wrote:
>
>
> On 01/09/15 16:26, Ben Skeggs wrote:
>> On 31 August 2015 at 21:38, Konsta Hölttä <kholtta at nvidia.com> wrote:
>>> Hi there,
>>>
>>> Resending these now that they've had some more polish and testing,
>>> and I heard
>>> that Ben's vacation is over :-)
>>>
>>> These patches are a starting point for more explicit error mechanisms
>>> and better robustness. At the moment, when a job hangs or faults,
>>> nouveau doesn't quite seem to know how to handle the situation, and
>>> the result is often a hang. Recovering from some of these situations
>>> would require either completely resetting the gpu or a complex path
>>> for recovering only the broken channel.
>> Yes, this has been on my "really nice to implement" list for quite a
>> long time now, and I've only gotten around to getting it partially
>> done (as you can see). Thanks for looking at this!
>>
>>> To start, I worked on support for letting userspace know what exactly
>>> happened; proper recovery would come later. The "error notifier" in
>>> the first patch is a simple shared buffer between kernel and
>>> userspace. Its error codes match nvgpu's. Alternatively, the status
>>> could be queried with an ioctl, but that would be considerably more
>>> heavyweight. I'd like to know whether the event mechanism is meant
>>> for these kinds of events at all (engines notify errors upwards to
>>> the drm layer). Another alternative would probably be to register the
>>> same buffer with each of the necessary engines separately via nvif
>>> method calls? Or register it with just one (e.g., fifo) and reach
>>> that engine when errors happen in others (e.g., gr)? And drm handles
>>> probably wouldn't fit there? Please comment on this; I wrote it
>>> before understanding the mthd mechanism.
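To make the above concrete, the layout I have in mind is roughly the
following (an illustrative sketch only; these field names are not the
actual struct from the patch, and handle_channel_error is a
placeholder):

    #include <linux/types.h>

    /* sketch of the shared notifier; patch 1 has the real layout, whose
     * error codes match nvgpu's */
    struct error_notifier {
        __u64 timestamp; /* when the error was recorded */
        __u32 info32;    /* error code */
        __u16 info16;    /* engine-specific extra detail */
        __u16 status;    /* nonzero once an error has been written */
    };

    /* userspace checks it with a plain memory read, no syscall needed;
     * handle_channel_error() is a placeholder for the app's reaction */
    static void check_errors(volatile struct error_notifier *n)
    {
        if (n->status)
            handle_channel_error(n->info32);
    }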
>> For the moment, I've just got a couple of high-level
>> comments/questions on these patches before we worry about nit-picking
>> the details:
>>
>> Is there any real need to have a full-blown notifier-style object to
>> pass this info back to userspace? The DRM has an event mechanism
>> where userspace could poll on its fd to get events, which is what I'd
>> assume you'd be doing inside of fence waits with the error notifier?
>> NVIF events are actually hooked up to this mechanism already, so
>> userspace could directly request them and cut quite a lot out of that
>> first patch. But, the NVKM-level passing of these events back over
>> NVIF is pretty much where I was going with the design too. So, good
>> to see you've done that :)
> I hadn't heard about that event mechanism yet - do you mean just
> drm_poll() and the ->event_wait that is woken up in usif_notify()?
> That doesn't have any values that could be passed to userland. Is
> there already a way to do that as well? The motivation for the current
> approach is that it's lightweight, requiring no kernel calls (and it's
> compatible with the nvidia gl drivers...)
Ah, to answer myself: that'd be drm_read(). Doh. IIUC, that applies to
nouveau's notify objects created from userspace? (route usif)
The poll and read calls would still have more overhead than a plain
memory read in userspace, and they wouldn't be compatible with the
nvidia userspace drivers. Might be doable with heavy refactoring and
some measurement of whether the overhead is significant - I'm not
familiar with that high-level part yet.
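For comparison, consuming these through the DRM event mechanism would
look roughly like this in userspace (a sketch; the event type and its
payload are invented here, not an existing uapi):

    #include <poll.h>
    #include <unistd.h>
    #include <drm/drm.h>    /* struct drm_event */

    /* made-up event type, purely for this sketch */
    #define HYPOTHETICAL_CHANNEL_ERROR 0x80000001

    static void wait_for_channel_error(int drm_fd)
    {
        struct pollfd pfd = { .fd = drm_fd, .events = POLLIN };

        if (poll(&pfd, 1, -1) > 0) {
            char buf[128];
            /* lands in drm_read(), which copies out queued events */
            ssize_t len = read(drm_fd, buf, sizeof(buf));
            struct drm_event *e = (struct drm_event *)buf;

            if (len >= (ssize_t)sizeof(*e) &&
                e->type == HYPOTHETICAL_CHANNEL_ERROR)
                ; /* decode the payload following the event header */
        }
    }

That's two syscalls per check versus zero with the shared buffer, which
is what I mean by the overhead.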
Konsta
>
> I don't actually know all the details of how the higher-level parts
> of the driver use the buffer :( just assuming things atm. I'd guess
> that the buffer would be read after each submit (after a fence wait
> has finished, which is a separate thing) if checking the status is
> required. My understanding is that at least the desktop driver has (or
> used to have) a shared buffer for lots of common stuff and this would
> be part of it (hence the offset field); doing it differently would
> complicate the whole driver stack. In Tegra's libdrm-equivalent this
> is initialized on channel creation, so it could in fact be part of the
> channel init args in nouveau?
>
> I also considered making the buffer just one page, allocated in the
> kernel, but as said, our current drivers want to be able to pass a
> separate buffer there.
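If this moved to channel creation time, I'd picture something like two
extra fields in the channel-alloc ioctl args (purely hypothetical, just
to illustrate; the first two fields are nouveau's existing ones):

    /* hypothetical init-time registration; not what the patches do */
    struct hypothetical_channel_alloc {
        __u32 fb_ctxdma_handle;      /* existing nouveau field */
        __u32 tt_ctxdma_handle;      /* existing nouveau field */
        __u32 error_notifier_handle; /* new: GEM handle of the buffer */
        __u32 error_notifier_offset; /* new: where the notifier sits in
                                      * the buffer, since our userspace
                                      * packs it among other per-channel
                                      * data */
    };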
>
>>
>>> Additionally, priority and timeout management for separate channels
>>> in flight on the gpu is added in two patches. Neither does exactly
>>> what its name says, but the effect is the same, and this is what
>>> nvgpu does currently. Those two patches call the fifo channel
>>> object's methods directly from userspace, so a hack is added in the
>>> nvif path to accept that. The objects are NEW'd from kernel space, so
>>> it appears that calling them from userspace isn't normally allowed.
>>> How should this be managed in a clean way?
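(For reference, the call shape I mean is roughly the following; the
method id and the args struct are invented for illustration, the real
definitions are in patches 3 and 4:)

    /* illustrative only: HYPOTHETICAL_SET_PRIORITY and the args struct
     * are made up; see the actual patches for the real ones */
    struct hypothetical_set_priority_v0 {
        __u8 version;
        __u8 priority; /* selects the channel's timeslice length */
        __u8 pad[6];
    } args = { .version = 0, .priority = 2 /* "high" */ };

    /* chan_object is the channel's nvif object; from userspace this
     * goes through the nvif ioctl path, which is where the route ==
     * owner check relaxed in patch 2 would otherwise reject it */
    int ret = nvif_object_mthd(chan_object, HYPOTHETICAL_SET_PRIORITY,
                               &args, sizeof(args));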
>> Is there any requirement to be able to adjust the priority of
>> in-flight channels? Or would it be satisfactory to have these as
>> arguments to the channel creation functions? Either way is fine, the
>> latter is preferred if there's no use case for adjusting it afterwards
>> though.
> Maybe not strictly, but there is a practical requirement imposed by
> the nvidia userspace driver: it doesn't restrict this to creation
> time, but allows it at any time. I looked at that option too when
> starting to write these patches and yes, specifying this at init time
> would have simplified things.
>
> It's certainly possible that the timeout/priority is set only later in
> userspace, after finding out that this is e.g. a webgl process that
> needs to finish faster. That's one plausible use case.
>
>>
>> I'm attempting to get some better channel handling (you might have
>> noticed the current stuff the kernel uses is a bit fragile; it's
>> evolved over a long time and several generations of channel classes)
>> in place in time for 4.4, part of which will likely involve giving
>> userspace better control over the arguments that get passed at
>> channel creation time.
> The current stuff is at least not trivial to grasp initially :) I'd
> also love to help write documentation for lots of things if time
> permits - I've been taking some notes for myself, but a general nvkm
> glossary and some general reasoning about why things are done the way
> they are would definitely help new folks get to know the codebase (it
> wasn't easy for me). The problem is that I could only write about how
> things are done, not why; getting to know all that would require lots
> of discussion (and time).
>
>>
>> As for the permissions problem, that can be resolved fairly easily I
>> think; I'll play with a few ideas and see what I come up with.
>
> Good to hear, thanks! A somewhat related question: what exactly
> determines the level of control kernel vs userspace have over the
> channels? Management rights for nouveau_bo's? It looks like some of
> nouveau_chan.c could live in userspace too.
>
> Cheers,
> Konsta
>
>>
>> Thanks,
>> Ben.
>>
>>> Also, since nouveau often hangs on errors, userspace hangs too
>>> (waiting on a fence). The final patch attempts to fix this in a
>>> couple of specific error paths by forcibly updating all fences to be
>>> finished. I'd like to hear how that should be handled properly -
>>> consider the patch just a proof-of-concept and a sample of what would
>>> be necessary.
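The idea, roughly sketched (names are illustrative here, not nouveau's
actual fence internals):

    /* a sketch of the approach, not the patch verbatim: on a fatal
     * channel error, pretend everything emitted so far has completed,
     * so pending fence waits wake up instead of blocking forever */
    static void force_fences_done(struct channel_fence_state *fctx)
    {
        fctx->completed_seq = fctx->emitted_seq; /* hypothetical fields */
        signal_completed_fences(fctx);           /* usual signalling path */
    }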
>>>
>>> I don't expect the patches to be accepted as-is - as a newbie, I'd
>>> appreciate any high-level comments on whether I've understood things
>>> correctly, especially the event and nvif/method mechanisms (I use the
>>> latter from userspace with a hack built from the perfmon branch seen
>>> here earlier, wired into nvidia's internal libdrm-equivalent). The
>>> fence-forcing thing is something that is necessary with the error
>>> notifiers (at least with our userspace, which waits very long or
>>> infinitely on fences). I'm working specifically on Tegra and don't
>>> know much about the desktop's userspace details, so I may be biased
>>> in some areas.
>>>
>>> I'd be happy to write sample tests in e.g. libdrm for the new methods
>>> once the kernel patches get into good shape, if that's required for
>>> accepting new features. I tested that these work as a proof of
>>> concept on a Jetson TK1, and the code is adapted from the latest
>>> nvgpu.
>>>
>>> The patches can also be found at http://github.com/sooda/nouveau and
>>> are based on a version of gnurou/staging.
>>>
>>> Thanks!
>>> Konsta (sooda in IRC)
>>>
>>> Konsta Hölttä (5):
>>> notify channel errors to userspace
>>> don't verify route == owner in nvkm ioctl
>>> gk104: channel priority/timeslice support
>>> gk104: channel timeout detection
>>> HACK force fences updated on error
>>>
>>> drm/nouveau/include/nvif/class.h | 20 ++++
>>> drm/nouveau/include/nvif/event.h | 12 +++
>>> drm/nouveau/include/nvkm/engine/fifo.h | 5 +-
>>> drm/nouveau/nouveau_chan.c | 95 +++++++++++++++++++
>>> drm/nouveau/nouveau_chan.h | 10 ++
>>> drm/nouveau/nouveau_drm.c | 1 +
>>> drm/nouveau/nouveau_fence.c | 13 ++-
>>> drm/nouveau/nouveau_gem.c | 29 ++++++
>>> drm/nouveau/nouveau_gem.h | 2 +
>>> drm/nouveau/nvkm/core/ioctl.c | 9 +-
>>> drm/nouveau/nvkm/engine/fifo/base.c | 56 ++++++++++-
>>> drm/nouveau/nvkm/engine/fifo/gf100.c | 2 +-
>>> drm/nouveau/nvkm/engine/fifo/gk104.c | 166 ++++++++++++++++++++++++++++++---
>>> drm/nouveau/nvkm/engine/fifo/nv04.c | 2 +-
>>> drm/nouveau/nvkm/engine/gr/gf100.c | 5 +
>>> drm/nouveau/uapi/drm/nouveau_drm.h | 13 +++
>>> 16 files changed, 415 insertions(+), 25 deletions(-)
>>>
>>> --
>>> 2.1.4
>>>