On Fri, Mar 18, 2022 at 08:12:54AM -0700, Rob Clark wrote:
On Fri, Mar 18, 2022 at 12:42 AM Christian König christian.koenig@amd.com wrote:
Am 17.03.22 um 18:31 schrieb Rob Clark:
On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter daniel@ffwll.ch wrote:
[SNIP]
(At some point, I'd like to use scheduler for the replay, and actually use drm_sched_stop()/etc.. but last time I looked there were still some sched bugs in that area which prevented me from deleting a bunch of code ;-))
Not sure about your hw, but at least on Intel, replaying tends to just result in follow-on fun. And that holds even more so the more complex a workload is. This is why vk just dies immediately and does not try to replay anything, offloading it to the app. Same with ARB robustness. Afaik it's really only media and classic GL which insist that the driver stack somehow recover.
At least for us, each submit must be self-contained (ie. not rely on previous GPU hw state), so in practice replay works out pretty well. The worst case is that subsequent submits from the same process fail as well (if they depended on something that the crashing submit failed to write back to memory).. but in that case they just crash too and we move on to the next one. The recent gens (a5xx+ at least) are pretty good about quickly detecting problems and giving us an error irq.
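(As an illustration of the policy described above -- drop only the guilty job and re-emit everything queued behind it, which is safe precisely because each submit is self-contained -- here is a minimal userspace model. All names are hypothetical; this is not the actual msm driver code:)

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the recovery policy: each submit is
 * self-contained, so after a fault we skip the guilty job and
 * re-emit every not-yet-retired job behind it. */

struct submit {
	int id;
	bool retired;	/* completed on the (simulated) ring */
};

/* Replay the ring after a fault. Returns the number of submits
 * re-emitted; the guilty one is simply dropped on the floor. */
static int recover_ring(struct submit *ring, size_t n, int guilty_id)
{
	int replayed = 0;

	for (size_t i = 0; i < n; i++) {
		if (ring[i].retired)
			continue;		/* finished before the hang */
		if (ring[i].id == guilty_id)
			continue;		/* drop the faulting job */
		ring[i].retired = true;		/* re-emit: self-contained, so safe */
		replayed++;
	}
	return replayed;
}
```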
Well I absolutely agree with Daniel.
The whole replay thing AMD did in the scheduler is an absolute mess and should probably be killed with fire.
I strongly recommend not making the same mistake in other drivers.
If you want to have some replay feature then please make it driver specific and don't use anything from the infrastructure in the DRM scheduler.
hmm, perhaps I was not clear, but I'm only talking about re-emitting jobs *following* the faulting one (which could be from other contexts, etc).. not trying to restart the faulting job.
You absolutely can drop jobs on the floor, this is what both anv and iris expect. They use what we call a non-recoverable context, meaning when any gpu hang happens and the context is affected (whether as the guilty one, or because it was a multi-engine reset and it was victimized) we kill it entirely. No replaying, and any further execbuf ioctl fails with -EIO.
Userspace then gets to sort out the mess, which for vk is VK_ERROR_DEVICE_LOST, for robust gl it's the same, and for non-robust gl iris re-creates a pile of things.
Anything in-between _is_ dropped on the floor completely.
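(A minimal model of that non-recoverable-context policy -- once a context is involved in a reset, guilty or victim, it is banned and every further submit fails with -EIO. Names are made up for illustration; this is not the actual i915 code:)

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model: a context that was affected by a gpu hang is
 * banned; no replay, userspace sorts out the mess. */

struct gpu_ctx {
	bool banned;
};

static void ctx_mark_hung(struct gpu_ctx *ctx)
{
	ctx->banned = true;	/* guilty or victimized: kill it entirely */
}

static int ctx_submit(struct gpu_ctx *ctx)
{
	if (ctx->banned)
		return -EIO;	/* mirrors the execbuf behaviour described */
	return 0;		/* submit accepted */
}
```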
Also note that this is obviously uapi; if you have a userspace which expects contexts to survive, then replaying makes some sense.
You *absolutely* need to replay jobs following the faulting one, they could be from unrelated contexts/processes. You can't just drop them on the floor.
Currently it is all driver specific, but I wanted to delete a lot of code and move to using scheduler to handle faults/timeouts (but blocked on that until [1] is resolved)
Yeah for the drivers where the uapi is "you can safely replay after a hang, and you're supposed to", then sharing the code is ofc a good idea.
Just wanted to make it clear that this is only one of many uapi flavours you can pick from, dropping it all on the floor is a perfectly legit approach :-) And imo it's the more robust one, and also better fits with newer APIs like GL_ARB_robustness or vk.
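(Sketch of the userspace side of that contract: the robust-API flavours just forward the loss to the app. The mapping function and fallback value here are hypothetical; VK_ERROR_DEVICE_LOST is redefined locally rather than pulled from vulkan_core.h:)

```c
#include <errno.h>

/* Value matches vulkan_core.h; redefined here so the sketch is
 * self-contained. */
#define VK_SUCCESS		0
#define VK_ERROR_DEVICE_LOST	(-4)

/* Hypothetical translation of a failed submit ioctl into the
 * API-level error a vk driver would hand back to the app. */
static int submit_to_vk_result(int kernel_err)
{
	if (kernel_err == -EIO)
		return VK_ERROR_DEVICE_LOST;	/* context reset/banned */
	if (kernel_err < 0)
		return -1;	/* some other failure, hypothetical mapping */
	return VK_SUCCESS;
}
```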
Cheers, Daniel
[1] https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-...
BR, -R
Thanks, Christian.
BR, -R
And recovering from a mess in userspace is a lot simpler than trying to pull off the same magic in the kernel. Plus it also helps with a few of the dma_fence rules, which is a nice bonus. -Daniel