[PATCH v2 1/2] drm: Add GPU reset sysfs event

Marek Olšák maraeo at gmail.com
Tue Mar 29 16:25:55 UTC 2022


I don't know what iris does, but I would guess that the same problems as
with AMD GPUs apply, making GPU resets very fragile.

Marek

On Tue., Mar. 29, 2022, 08:14 Christian König, <christian.koenig at amd.com>
wrote:

> My main question is what does the iris driver do better than radeonsi when
> the client doesn't support the robustness extension?
>
> From Daniel's description it sounds like they have at least a partial
> recovery mechanism in place.
>
> Apart from that I completely agree with what you said below.
>
> Christian.
>
> Am 26.03.22 um 01:53 schrieb Olsak, Marek:
>
> [AMD Official Use Only]
>
> amdgpu has 2 resets: soft reset and hard reset.
>
> The soft reset is able to recover from an infinite loop and even some GPU
> hangs due to bad shaders or bad states. The soft reset uses a signal that
> kills all currently-running shaders of a certain process (VM context),
> which unblocks the graphics pipeline, so draws and command buffers finish,
> but not correctly. This can then cause a hard hang if the shader was
> supposed to signal work completion through a shader store instruction and a
> non-shader consumer is waiting for it (skipping the store instruction by
> killing the shader won't signal the work, and thus the consumer will be
> stuck, requiring a hard reset).
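
The shader store case above can be modelled in a few lines of C. This is
only a sketch under stated assumptions: `fence_model`, `run_shader` and
`consumer_wait` are hypothetical names made up for illustration, not amdgpu
or hardware interfaces, and the bounded poll stands in for a consumer that
would wait forever on real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a completion value in memory; not an amdgpu struct. */
struct fence_model {
    uint32_t value; /* last sequence number written by the producer */
};

/*
 * Models a shader whose final instruction is the store that signals work
 * completion. If the soft reset kills the shader, the store never happens.
 */
static void run_shader(struct fence_model *f, uint32_t seqno, bool killed)
{
    assert(f != NULL);
    if (killed)
        return; /* store instruction skipped */
    f->value = seqno;
}

/*
 * Models a non-shader consumer polling for the store. Returns true once the
 * value is visible, false if it is still missing after max_polls attempts
 * (assumption: the real consumer has no such limit and simply stays stuck).
 */
static bool consumer_wait(const struct fence_model *f, uint32_t seqno,
                          unsigned max_polls)
{
    assert(f != NULL);
    for (unsigned i = 0; i < max_polls; i++) {
        if (f->value >= seqno)
            return true;
    }
    return false;
}
```

With `killed` set, `consumer_wait` never observes the store, which is the
situation that escalates a soft reset into a hard reset.
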
>
> The hard reset can recover from other hangs, which is great, but it may
> use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> contents, but we should assume that any process that had running jobs on
> the GPU during a GPU reset has its memory resources in an inconsistent
> state, and thus following command buffers can cause another GPU hang. The
> shader store example above is enough to cause another hard hang due to
> incorrect content in memory resources, which can contain synchronization
> primitives that are used internally by the hardware.
>
> Asking the driver to replay a command buffer that caused a hang is a sure
> way to hang it again. Unrelated processes can be affected due to lost VRAM
> or the misfortune of using the GPU while the GPU hang occurred. The window
> system should recreate GPU resources and redraw everything without
> affecting applications. If apps use GL, they should do the same. Processes
> that can't recover by redrawing content can be terminated or left alone,
> but they shouldn't be allowed to submit work to the GPU anymore.
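
As a rough sketch, the per-process policy in the paragraph above could look
like the following. All names here (`proc_state`, `reset_action`,
`action_after_reset`) are hypothetical and exist only for illustration;
nothing like this is in the patch series, and how a process would report
that it can redraw is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome for one process after a GPU reset. */
enum reset_action {
    KEEP_RUNNING,        /* process was not affected by the reset */
    RECREATE_AND_REDRAW, /* recreate GPU resources and redraw everything */
    BLOCK_SUBMISSIONS    /* cannot recover: no more work on the GPU */
};

/* Hypothetical per-process facts gathered at reset time. */
struct proc_state {
    bool had_jobs_on_gpu; /* was using the GPU when the hang occurred */
    bool lost_vram;       /* had resources in VRAM that was erased */
    bool can_redraw;      /* e.g. window system or a GL app that redraws */
};

static enum reset_action action_after_reset(const struct proc_state *p)
{
    assert(p != NULL);
    /* Unaffected processes carry on as before. */
    if (!p->had_jobs_on_gpu && !p->lost_vram)
        return KEEP_RUNNING;
    /* Affected memory is treated as inconsistent, so nothing is replayed. */
    if (p->can_redraw)
        return RECREATE_AND_REDRAW;
    /* Terminated or left alone, but never allowed to submit again. */
    return BLOCK_SUBMISSIONS;
}
```

Note that there is deliberately no "replay the old command buffer" action,
since replaying the work that hung is expected to hang again.
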
>
> dEQP only exercises the soft reset. I think WebGL is only able to trigger
> a soft reset at this point, but Vulkan can also trigger a hard reset.
>
> Marek
> ------------------------------
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* March 23, 2022 11:25
> *To:* Daniel Vetter <daniel at ffwll.ch>; Daniel Stone
> <daniel at fooishbar.org>; Olsak, Marek <Marek.Olsak at amd.com>;
> Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> *Cc:* Rob Clark <robdclark at gmail.com>; Rob Clark
> <robdclark at chromium.org>; Sharma, Shashank <Shashank.Sharma at amd.com>;
> Christian König <ckoenig.leichtzumerken at gmail.com>; Somalapuram,
> Amaranath <Amaranath.Somalapuram at amd.com>; Abhinav Kumar
> <quic_abhinavk at quicinc.com>; dri-devel
> <dri-devel at lists.freedesktop.org>; amd-gfx list
> <amd-gfx at lists.freedesktop.org>; Deucher, Alexander
> <Alexander.Deucher at amd.com>; Shashank Sharma
> <contactshashanksharma at gmail.com>
> *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
>
> [Adding Marek and Andrey as well]
>
> Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> > On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel at fooishbar.org> wrote:
> >> Hi,
> >>
> >> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark at gmail.com> wrote:
> >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> >>> <christian.koenig at amd.com> wrote:
> >>>> Well you can, it just means that their contexts are lost as well.
> >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> >>> take down your compositor ;-)
> >> Yeah. Or anything WebGL.
> >>
> >> System-wide collateral damage is definitely a non-starter. If that
> >> means that the userspace driver has to do what iris does and ensure
> >> everything's recreated and resubmitted, that works too, just as long
> >> as the response to 'my adblocker didn't detect a crypto miner ad' is
> >> something better than 'shoot the entire user session'.
> > Not sure where that idea came from, I thought at least I made it clear
> > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > which should die without recovery attempt.
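
The split described above can be written down as a small table in C. This
is a sketch only: the enum and function names are hypothetical, and the
mapping is just a restatement of the paragraph. For reference, robust GL
clients learn about a lost context through glGetGraphicsResetStatusARB, and
Vulkan clients through VK_ERROR_DEVICE_LOST.

```c
#include <assert.h>

/* Hypothetical classification of a client by the API it uses. */
enum client_kind {
    CLIENT_LEGACY_GL, /* GL without the robustness extension */
    CLIENT_ROBUST_GL, /* GL context created with ARB_robustness */
    CLIENT_VULKAN
};

/* Hypothetical outcome when a client's context is lost. */
enum lost_context_policy {
    DRIVER_RECOVERS,   /* userspace driver has to recover transparently */
    REPORT_LOSS_TO_APP /* context dies, the application is told about it */
};

static enum lost_context_policy policy_for(enum client_kind kind)
{
    switch (kind) {
    case CLIENT_LEGACY_GL:
        return DRIVER_RECOVERS;
    case CLIENT_ROBUST_GL:
    case CLIENT_VULKAN:
        return REPORT_LOSS_TO_APP;
    }
    /* Not reachable for valid enum values. */
    assert(!"unknown client kind");
    return REPORT_LOSS_TO_APP;
}
```
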
> >
> > The entire discussion here is who should be responsible for replay and
> > at least if you can decide the uapi, then punting that entirely to
> > userspace is a good approach.
>
> Yes, completely agree. We have the approach of re-submitting things in
> the kernel and that failed quite miserably.
>
> In other words, currently a GPU reset has something like a 99% chance of
> taking down your whole desktop.
>
> Daniel, can you briefly explain what exactly iris does when a lost
> context is detected without gl robustness?
>
> It sounds like you guys got that working quite well.
>
> Thanks,
> Christian.
>
> >
> > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > not currently on the gpu, or on different engines and all that
> > shouldn't be nuked, if possible.
> >
> > Also ofc since msm uapi is that the kernel tries to recover there's
> > not much we can do there, contexts cannot be shot. But still trying to
> > replay them as much as possible feels a bit like overkill.
> > -Daniel
> >
> >> Cheers,
> >> Daniel
> >
> >
>
>
>