[Nouveau] [RFC PATCH v2 0/5] More explicit pushbuf error handling
Konsta Hölttä
kholtta at nvidia.com
Wed Sep 2 03:13:49 PDT 2015
On 02/09/15 13:01, Konsta Hölttä wrote:
>
>
> On 01/09/15 16:26, Ben Skeggs wrote:
>> On 31 August 2015 at 21:38, Konsta Hölttä <kholtta at nvidia.com> wrote:
>>> Hi there,
>>>
>>> Resending these now that they've had some more polish and testing,
>>> and I heard
>>> that Ben's vacation is over :-)
>>>
>>> These patches are a starting point for more explicit error mechanisms
>>> and better robustness. At the moment, when a job hangs or faults,
>>> nouveau doesn't quite seem to know how to handle the situation, and
>>> the result is often a hang. Recovering from some of these situations
>>> would require either completely resetting the gpu or a complex path
>>> for recovering only the broken channel.
>> Yes, this has been on my "really nice to implement" list for quite a
>> long time now, and I've only gotten around to getting it partially
>> done (as you can see). Thanks for looking at this!
>>
>>> To start, I worked on support for letting userspace know what exactly
>>> happened; proper recovery would come later. The "error notifier" in
>>> the first patch is a simple shared buffer between kernel and
>>> userspace. Its error codes match nvgpu's. Alternatively, the status
>>> could be queried with an ioctl, but that would be considerably more
>>> heavyweight. I'd like to know whether the event mechanism is meant
>>> for these kinds of events at all (engines notify errors upwards to
>>> the drm layer). Another alternative would probably be to register the
>>> same buffer with each of the necessary engines separately via nvif
>>> method calls? Or register it with just one (e.g., fifo) and reach
>>> that engine when errors happen in others (e.g., gr)? And drm handles
>>> probably wouldn't fit there? Please comment on this; I wrote it
>>> before understanding the mthd mechanism.
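To make the above concrete, the layout I have in mind is roughly the
following (an illustrative sketch only; these field names are not the
actual struct from the patch, and handle_channel_error is a
placeholder):

    #include <linux/types.h>

    /* sketch of the shared notifier; patch 1 has the real layout, whose
     * error codes match nvgpu's */
    struct error_notifier {
        __u64 timestamp; /* when the error was recorded */
        __u32 info32;    /* error code */
        __u16 info16;    /* engine-specific extra detail */
        __u16 status;    /* nonzero once an error has been written */
    };

    /* userspace checks it with a plain memory read, no syscall needed;
     * handle_channel_error() is a placeholder for the app's reaction */
    static void check_errors(volatile struct error_notifier *n)
    {
        if (n->status)
            handle_channel_error(n->info32);
    }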
>> For the moment, I've just got a couple of high-level
>> comments/questions on these patches before we worry about nit-picking
>> the details:
>>
>> Is there any real need to have a full-blown notifier-style object to
>> pass this info back to userspace? The DRM has an event mechanism
>> where userspace could poll on its fd to get events, which is what I'd
>> assume you'd be doing inside of fence waits with the error notifier?
>> NVIF events are actually hooked up to this mechanism already, so
>> userspace could directly request them and cut quite a lot out of that
>> first patch. But, the NVKM-level passing of these events back over
>> NVIF is pretty much where I was going with the design too. So, good
>> to see you've done that :)
> I hadn't heard about that event mechanism yet - do you mean just
> drm_poll() and the ->event_wait that is woken up in usif_notify()?
> That doesn't have any values that could be passed to userland. Is
> there already a way to do that as well? The motivation for the current
> approach is that it's lightweight, requiring no kernel calls (and it's
> compatible with the nvidia gl drivers...)
Ah, to answer myself: that'd be drm_read(). Doh. IIUC, that applies to
nouveau's notify objects created from userspace? (route usif)
The poll and read calls would still have more overhead than a plain
memory read in userspace, and they wouldn't be compatible with the
nvidia userspace drivers. Might be doable with heavy refactoring and
some measurement of whether the overhead is significant - I'm not
familiar with that high-level part yet.
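For comparison, consuming these through the DRM event mechanism would
look roughly like this in userspace (a sketch; the event type and its
payload are invented here, not an existing uapi):

    #include <poll.h>
    #include <unistd.h>
    #include <drm/drm.h>    /* struct drm_event */

    /* made-up event type, purely for this sketch */
    #define HYPOTHETICAL_CHANNEL_ERROR 0x80000001

    static void wait_for_channel_error(int drm_fd)
    {
        struct pollfd pfd = { .fd = drm_fd, .events = POLLIN };

        if (poll(&pfd, 1, -1) > 0) {
            char buf[128];
            /* lands in drm_read(), which copies out queued events */
            ssize_t len = read(drm_fd, buf, sizeof(buf));
            struct drm_event *e = (struct drm_event *)buf;

            if (len >= (ssize_t)sizeof(*e) &&
                e->type == HYPOTHETICAL_CHANNEL_ERROR)
                ; /* decode the payload following the event header */
        }
    }

That's two syscalls per check versus zero with the shared buffer, which
is what I mean by the overhead.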
Konsta
>
> I don't actually know all the details of how the higher-level parts
> of the driver use the buffer :( just assuming things atm. I'd guess
> that the buffer would be read after each submit (after a fence wait
> has finished, which is a separate thing) if checking the status is
> required. My understanding is that at least the desktop driver has (or
> used to have) a shared buffer for lots of common stuff and this would
> be part of it (hence the offset field); doing it differently would
> complicate the whole driver stack. In Tegra's libdrm-equivalent this
> is initialized on channel creation, so it could in fact be part of the
> channel init args in nouveau?
>
> I also considered making the buffer just one page, allocated in the
> kernel, but as said, our current drivers want to be able to pass a
> separate buffer there.
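If this moved to channel creation time, I'd picture something like two
extra fields in the channel-alloc ioctl args (purely hypothetical, just
to illustrate; the first two fields are nouveau's existing ones):

    /* hypothetical init-time registration; not what the patches do */
    struct hypothetical_channel_alloc {
        __u32 fb_ctxdma_handle;      /* existing nouveau field */
        __u32 tt_ctxdma_handle;      /* existing nouveau field */
        __u32 error_notifier_handle; /* new: GEM handle of the buffer */
        __u32 error_notifier_offset; /* new: where the notifier sits in
                                      * the buffer, since our userspace
                                      * packs it among other per-channel
                                      * data */
    };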
>
>>
>>> Additionally, priority and timeout management for separate channels
>>> in flight on the gpu is added in two patches. Neither does exactly
>>> what its name says, but the effect is the same, and this is what
>>> nvgpu does currently. Those two patches call the fifo channel
>>> object's methods directly from userspace, so a hack is added in the
>>> nvif path to accept that. The objects are NEW'd from kernel space, so
>>> it appears that calling them from userspace isn't normally allowed.
>>> How should this be managed in a clean way?
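(For reference, the call shape I mean is roughly the following; the
method id and the args struct are invented for illustration, the real
definitions are in patches 3 and 4:)

    /* illustrative only: HYPOTHETICAL_SET_PRIORITY and the args struct
     * are made up; see the actual patches for the real ones */
    struct hypothetical_set_priority_v0 {
        __u8 version;
        __u8 priority; /* selects the channel's timeslice length */
        __u8 pad[6];
    } args = { .version = 0, .priority = 2 /* "high" */ };

    /* chan_object is the channel's nvif object; from userspace this
     * goes through the nvif ioctl path, which is where the route ==
     * owner check relaxed in patch 2 would otherwise reject it */
    int ret = nvif_object_mthd(chan_object, HYPOTHETICAL_SET_PRIORITY,
                               &args, sizeof(args));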
>> Is there any requirement to be able to adjust the priority of
>> in-flight channels? Or would it be satisfactory to have these as
>> arguments to the channel creation functions? Either way is fine, the
>> latter is preferred if there's no use case for adjusting it afterwards
>> though.
> Maybe not strictly, but there is a practical requirement imposed by
> the nvidia userspace driver: it doesn't restrict this to creation
> time, but allows it at any time. I looked at that option too when
> starting to write these patches and yes, specifying this at init time
> would have simplified things.
>
> It's certainly possible that the timeout/priority is set only later in
> userspace, after finding out that this is e.g. a webgl process that
> needs to finish faster. That's one plausible use case.
>
>>
>> I'm attempting to get some better channel handling (you might have
>> noticed the current stuff the kernel uses is a bit fragile; it's
>> evolved over a long time and several generations of channel classes)
>> in place in time for 4.4, part of which will likely involve giving
>> userspace better control over the arguments that get passed at
>> channel creation time.
> The current stuff is at least not trivial to grasp initially :) I'd
> also love to help write documentation for lots of things if time
> permits - I've been taking some notes for myself, but a general nvkm
> glossary and some general reasoning about why things are done the way
> they are would definitely help new folks get to know the codebase (it
> wasn't easy for me). The problem is that I could only write about how
> things are done, not why; getting to know all that would require lots
> of discussion (and time).
>
>>
>> As for the permissions problem, that can be resolved fairly easily I
>> think; I'll play with a few ideas and see what I come up with.
>
> Good to hear, thanks! A somewhat related question: what exactly
> determines the level of control kernel vs userspace have over the
> channels? Management rights for nouveau_bo's? It looks like some of
> nouveau_chan.c could live in userspace too.
>
> Cheers,
> Konsta
>
>>
>> Thanks,
>> Ben.
>>
>>> Also, since nouveau often hangs on errors, userspace hangs too
>>> (waiting on a fence). The final patch attempts to fix this in a
>>> couple of specific error paths by forcibly updating all fences to be
>>> finished. I'd like to hear how that should be handled properly -
>>> consider the patch just a proof-of-concept and a sample of what would
>>> be necessary.
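The idea, roughly sketched (names are illustrative here, not nouveau's
actual fence internals):

    /* a sketch of the approach, not the patch verbatim: on a fatal
     * channel error, pretend everything emitted so far has completed,
     * so pending fence waits wake up instead of blocking forever */
    static void force_fences_done(struct channel_fence_state *fctx)
    {
        fctx->completed_seq = fctx->emitted_seq; /* hypothetical fields */
        signal_completed_fences(fctx);           /* usual signalling path */
    }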
>>>
>>> I don't expect the patches to be accepted as-is - as a newbie, I'd
>>> appreciate any high-level comments on whether I've understood things
>>> correctly, especially the event and nvif/method mechanisms (I use the
>>> latter from userspace with a hack built from the perfmon branch seen
>>> here earlier, wired into nvidia's internal libdrm-equivalent). The
>>> fence-forcing thing is something that is necessary with the error
>>> notifiers (at least with our userspace, which waits very long or
>>> infinitely on fences). I'm working specifically on Tegra and don't
>>> know much about the desktop's userspace details, so I may be biased
>>> in some areas.
>>>
>>> I'd be happy to write sample tests in e.g. libdrm for the new methods
>>> once the kernel patches get into good shape, if that's required for
>>> accepting new features. I tested that these work as a proof of
>>> concept on a Jetson TK1, and the code is adapted from the latest
>>> nvgpu.
>>>
>>> The patches can also be found at http://github.com/sooda/nouveau and
>>> are based on a version of gnurou/staging.
>>>
>>> Thanks!
>>> Konsta (sooda in IRC)
>>>
>>> Konsta Hölttä (5):
>>> notify channel errors to userspace
>>> don't verify route == owner in nvkm ioctl
>>> gk104: channel priority/timeslice support
>>> gk104: channel timeout detection
>>> HACK force fences updated on error
>>>
>>> drm/nouveau/include/nvif/class.h | 20 ++++
>>> drm/nouveau/include/nvif/event.h | 12 +++
>>> drm/nouveau/include/nvkm/engine/fifo.h | 5 +-
>>> drm/nouveau/nouveau_chan.c | 95 +++++++++++++++++++
>>> drm/nouveau/nouveau_chan.h | 10 ++
>>> drm/nouveau/nouveau_drm.c | 1 +
>>> drm/nouveau/nouveau_fence.c | 13 ++-
>>> drm/nouveau/nouveau_gem.c | 29 ++++++
>>> drm/nouveau/nouveau_gem.h | 2 +
>>> drm/nouveau/nvkm/core/ioctl.c | 9 +-
>>> drm/nouveau/nvkm/engine/fifo/base.c | 56 ++++++++++-
>>> drm/nouveau/nvkm/engine/fifo/gf100.c | 2 +-
>>> drm/nouveau/nvkm/engine/fifo/gk104.c | 166 ++++++++++++++++++++++++++++++---
>>> drm/nouveau/nvkm/engine/fifo/nv04.c | 2 +-
>>> drm/nouveau/nvkm/engine/gr/gf100.c | 5 +
>>> drm/nouveau/uapi/drm/nouveau_drm.h | 13 +++
>>> 16 files changed, 415 insertions(+), 25 deletions(-)
>>>
>>> --
>>> 2.1.4
>>>