[PATCH v2 1/2] drm: Add GPU reset sysfs event

Rob Clark robdclark at gmail.com
Wed Mar 23 17:30:00 UTC 2022


On Wed, Mar 23, 2022 at 8:14 AM Daniel Vetter <daniel at ffwll.ch> wrote:
>
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel at fooishbar.org> wrote:
> >
> > Hi,
> >
> > On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark at gmail.com> wrote:
> > > On Mon, Mar 21, 2022 at 2:30 AM Christian König
> > > <christian.koenig at amd.com> wrote:
> > > > Well you can, it just means that their contexts are lost as well.
> > >
> > > Which is rather inconvenient when deqp-egl reset tests, for example,
> > > take down your compositor ;-)
> >
> > Yeah. Or anything WebGL.
> >
> > System-wide collateral damage is definitely a non-starter. If that
> > means that the userspace driver has to do what iris does and ensure
> > everything's recreated and resubmitted, that works too, just as long
> > as the response to 'my adblocker didn't detect a crypto miner ad'  is
> > something better than 'shoot the entire user session'.
>
> Not sure where that idea came from, I thought at least I made it clear
> that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> which should die without recovery attempt.
>
> The entire discussion here is who should be responsible for replay and
> at least if you can decide the uapi, then punting that entirely to
> userspace is a good approach.
>
> Ofc it'd be nice if the collateral damage is limited, i.e. requests
> not currently on the gpu, or on different engines and all that
> shouldn't be nuked, if possible.
>
> Also ofc since msm uapi is that the kernel tries to recover there's
> not much we can do there, contexts cannot be shot. But still trying to
> replay them as much as possible feels a bit like overkill.

It would perhaps have been nice for older gens, which don't (yet)
have per-process pgtables, to have gone with userspace replay
(although that would require a lot more tracking in userspace than
what is done currently). Fortunately those older gens don't use
"state objects", which could potentially be corrupted, but instead
re-emit state in the cmdstream, so there is a lot less possibility
for bad collateral damage.  (On all the gens we also use gpu
read-only buffers whenever the gpu does not need to be able to write
them.)

For newer stuff, the process isolation works pretty well.  In fact we
recently changed MSM_PARAM_FAULTS to only report faults/hangs in the
same address space, so the compositor is not even aware (and doesn't
need to be aware).
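
To illustrate the idea (a toy model only, not the actual msm driver
code; the struct and function names below are hypothetical), a fault
is counted against the address space of the context that caused it,
and a query only sees faults from its own address space:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model; these are NOT the msm driver's real structures. */
struct addr_space {
	uint32_t faults;        /* faults/hangs attributed to this address space */
};

struct gpu_model {
	uint32_t global_faults; /* every fault the GPU has seen */
};

struct ctx {
	struct gpu_model *gpu;
	struct addr_space *aspace; /* NULL models gens without per-process pgtables */
};

/* Record a fault caused by context c. */
static void record_fault(struct ctx *c)
{
	c->gpu->global_faults++;
	if (c->aspace)
		c->aspace->faults++;
}

/*
 * What a faults query would report to context c in this model: only
 * its own address space when isolated, otherwise the global count.
 */
static uint32_t query_faults(const struct ctx *c)
{
	return c->aspace ? c->aspace->faults : c->gpu->global_faults;
}
```

In this model a compositor with its own address space keeps seeing
zero faults while some other process hangs the gpu, whereas contexts
without a private address space all see the same global count.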

BR,
-R

> -Daniel
>
> > Cheers,
> > Daniel
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

