<div dir="auto">I don't know what iris does, but I would guess that the same problems as with AMD GPUs apply, making GPU resets very fragile.<div dir="auto"><br></div><div dir="auto">Marek</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue., Mar. 29, 2022, 08:14 Christian König, <<a href="mailto:christian.koenig@amd.com">christian.koenig@amd.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
My main question is: what does the iris driver do better than radeonsi
when the client doesn't support the robustness extension?<br>
<br>
From Daniel's description it sounds like they have at least a partial
recovery mechanism in place.<br>
<br>
Apart from that I completely agree with what you said below.<br>
<br>
Christian.<br>
<br>
<div>On 26.03.22 at 01:53, Olsak,
Marek wrote:<br>
</div>
<blockquote type="cite">
<p style="font-family:Arial;font-size:10pt;color:#0000ff;margin:5pt" align="Left">
[AMD Official Use Only]<br>
</p>
<br>
<div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
amdgpu has 2 resets: soft reset and hard reset.<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
The soft reset is able to recover from an infinite loop and
even some GPU hangs due to bad shaders or bad states. The soft
reset uses a signal that kills all currently-running shaders
of a certain process (VM context), which unblocks the graphics
pipeline, so draws and command buffers finish but are not
executed correctly. This can then cause a hard hang if the shader was
supposed to signal work completion through a shader store
instruction and a non-shader consumer is waiting for it
(skipping the store instruction by killing the shader won't
signal the work, and thus the consumer will be stuck,
requiring a hard reset).<br>
</div>
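<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
The escalation from soft reset to hard reset described above can be sketched as a toy model. This is only an illustration under stated assumptions: <code>Job</code>, <code>run_shader</code> and <code>recover</code> are hypothetical names, not amdgpu code, and the model ignores everything except the skipped store.<br>
</div>
<pre>
```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical model of one GPU submission (not an amdgpu structure)."""
    hangs: bool                    # shader never finishes, e.g. an infinite loop
    consumer_waits_on_store: bool  # a non-shader consumer waits for the final store
    store_done: bool = False       # set once the completion store has executed


def run_shader(job):
    """Run the shader; the completion store only happens if it does not hang."""
    if not job.hangs:
        job.store_done = True
    return not job.hangs


def recover(job):
    """Return the reset steps needed to get the pipeline moving again."""
    steps = []
    if run_shader(job):
        return steps
    # Soft reset: kill the running shader. The pipeline is unblocked,
    # but the store instruction that was skipped never executes.
    steps.append("soft reset")
    if job.consumer_waits_on_store and not job.store_done:
        # The consumer is stuck waiting for a store that will never happen.
        steps.append("hard reset")
    return steps
```
</pre>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
In this model a hanging job with no waiting consumer needs only the soft reset, while the same job with a consumer waiting on the store also needs the hard reset.<br>
</div>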
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-size:12pt">The hard reset can recover from
other hangs, which is great, but it may use a PCI reset,
which erases VRAM on dGPUs. APUs don't lose memory contents,
but we should assume that any process that had running jobs
on the GPU during a GPU reset has its memory resources in an
inconsistent state, and thus following command buffers can
cause another GPU hang. The shader store example above is
enough to cause another hard hang due to incorrect content
in memory resources, which can contain synchronization
primitives that are used internally by the hardware.</span><br>
<span style="font-size:12pt"></span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-size:12pt">Asking the driver to replay a
command buffer that caused a hang is a sure way to hang it
again. Unrelated processes can be affected due to lost VRAM
or the misfortune of using the GPU while the GPU hang
occurred. The window system should recreate GPU resources
and redraw everything without affecting applications. If
apps use GL, they should do the same. Processes that can't
recover by redrawing content can be terminated or left
alone, but they shouldn't be allowed to submit work to the
GPU anymore.</span><br>
<span style="font-size:12pt"></span><br>
</div>
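<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
A minimal sketch of that policy, assuming a hard reset that invalidates every context: nothing is replayed, contexts that can redraw get a fresh context, and the rest are refused further submissions. <code>Context</code>, <code>ContextLost</code> and <code>handle_reset</code> are hypothetical names for illustration, not kernel or Mesa interfaces.<br>
</div>
<pre>
```python
class ContextLost(Exception):
    """Raised when a context that lost its state tries to submit work."""


class Context:
    """Hypothetical per-process GPU context; not a real kernel or Mesa object."""

    def __init__(self, name, can_redraw):
        self.name = name
        self.can_redraw = can_redraw  # True if the owner can recreate and redraw
        self.lost = False
        self.submitted = []

    def submit(self, work):
        # A lost context is never allowed to reach the GPU again.
        if self.lost:
            raise ContextLost(self.name)
        self.submitted.append(work)


def handle_reset(contexts):
    """Mark every context as lost and return fresh ones for owners that can redraw.

    Old command buffers are deliberately not replayed: the fresh contexts
    start empty and their owners must recreate resources and redraw.
    """
    fresh = {}
    for ctx in contexts:
        ctx.lost = True
        if ctx.can_redraw:
            fresh[ctx.name] = Context(ctx.name, can_redraw=True)
    return fresh
```
</pre>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
A compositor created with <code>can_redraw=True</code> gets a new, empty context from <code>handle_reset</code>, while a context created with <code>can_redraw=False</code> raises <code>ContextLost</code> on its next <code>submit</code>.<br>
</div>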
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
dEQP only exercises the soft reset. I think WebGL is only able
to trigger a soft reset at this point, but Vulkan can also
trigger a hard reset.<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Marek<br>
</div>
<hr style="display:inline-block;width:98%">
<div id="m_7185412060794202476divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b>
Koenig, Christian <a href="mailto:Christian.Koenig@amd.com" target="_blank" rel="noreferrer"><Christian.Koenig@amd.com></a><br>
<b>Sent:</b> March 23, 2022 11:25<br>
<b>To:</b> Daniel Vetter <a href="mailto:daniel@ffwll.ch" target="_blank" rel="noreferrer"><daniel@ffwll.ch></a>; Daniel
Stone <a href="mailto:daniel@fooishbar.org" target="_blank" rel="noreferrer"><daniel@fooishbar.org></a>; Olsak, Marek
<a href="mailto:Marek.Olsak@amd.com" target="_blank" rel="noreferrer"><Marek.Olsak@amd.com></a>; Grodzovsky, Andrey
<a href="mailto:Andrey.Grodzovsky@amd.com" target="_blank" rel="noreferrer"><Andrey.Grodzovsky@amd.com></a><br>
<b>Cc:</b> Rob Clark <a href="mailto:robdclark@gmail.com" target="_blank" rel="noreferrer"><robdclark@gmail.com></a>; Rob Clark
<a href="mailto:robdclark@chromium.org" target="_blank" rel="noreferrer"><robdclark@chromium.org></a>; Sharma, Shashank
<a href="mailto:Shashank.Sharma@amd.com" target="_blank" rel="noreferrer"><Shashank.Sharma@amd.com></a>; Christian König
<a href="mailto:ckoenig.leichtzumerken@gmail.com" target="_blank" rel="noreferrer"><ckoenig.leichtzumerken@gmail.com></a>; Somalapuram,
Amaranath <a href="mailto:Amaranath.Somalapuram@amd.com" target="_blank" rel="noreferrer"><Amaranath.Somalapuram@amd.com></a>; Abhinav
Kumar <a href="mailto:quic_abhinavk@quicinc.com" target="_blank" rel="noreferrer"><quic_abhinavk@quicinc.com></a>; dri-devel
<a href="mailto:dri-devel@lists.freedesktop.org" target="_blank" rel="noreferrer"><dri-devel@lists.freedesktop.org></a>; amd-gfx list
<a href="mailto:amd-gfx@lists.freedesktop.org" target="_blank" rel="noreferrer"><amd-gfx@lists.freedesktop.org></a>; Deucher, Alexander
<a href="mailto:Alexander.Deucher@amd.com" target="_blank" rel="noreferrer"><Alexander.Deucher@amd.com></a>; Shashank Sharma
<a href="mailto:contactshashanksharma@gmail.com" target="_blank" rel="noreferrer"><contactshashanksharma@gmail.com></a><br>
<b>Subject:</b> Re: [PATCH v2 1/2] drm: Add GPU reset sysfs
event</font>
<div> </div>
</div>
<div><font size="2"><span style="font-size:11pt">
<div>[Adding Marek and Andrey as well]<br>
<br>
On 23.03.22 at 16:14, Daniel Vetter wrote:<br>
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone
<a href="mailto:daniel@fooishbar.org" target="_blank" rel="noreferrer"><daniel@fooishbar.org></a> wrote:<br>
>> Hi,<br>
>><br>
>> On Mon, 21 Mar 2022 at 16:02, Rob Clark
<a href="mailto:robdclark@gmail.com" target="_blank" rel="noreferrer"><robdclark@gmail.com></a> wrote:<br>
>>> On Mon, Mar 21, 2022 at 2:30 AM Christian
König<br>
>>> <a href="mailto:christian.koenig@amd.com" target="_blank" rel="noreferrer"><christian.koenig@amd.com></a> wrote:<br>
>>>> Well you can, it just means that their
contexts are lost as well.<br>
>>> Which is rather inconvenient when deqp-egl
reset tests, for example,<br>
>>> take down your compositor ;-)<br>
>> Yeah. Or anything WebGL.<br>
>><br>
>> System-wide collateral damage is definitely a
non-starter. If that<br>
>> means that the userspace driver has to do what
iris does and ensure<br>
>> everything's recreated and resubmitted, that
works too, just as long<br>
>> as the response to 'my adblocker didn't detect
a crypto miner ad' is<br>
>> something better than 'shoot the entire user
session'.<br>
> Not sure where that idea came from, I thought at
least I made it clear<br>
> that legacy gl _has_ to recover. It's only vk and
arb_robustness gl<br>
> which should die without recovery attempt.<br>
><br>
> The entire discussion here is who should be
responsible for replay and<br>
> at least if you can decide the uapi, then punting
that entirely to<br>
> userspace is a good approach.<br>
<br>
Yes, completely agree. We have the approach of
re-submitting things in <br>
the kernel, and that has failed quite miserably.<br>
<br>
In other words, currently a GPU reset has something like
a 99% chance of <br>
taking down your whole desktop.<br>
<br>
Daniel, can you briefly explain what exactly iris does
when a lost <br>
context is detected without GL robustness?<br>
<br>
It sounds like you guys got that working quite well.<br>
<br>
Thanks,<br>
Christian.<br>
<br>
><br>
> Ofc it'd be nice if the collateral damage is
limited, i.e. requests<br>
> not currently on the gpu, or on different engines
and all that<br>
> shouldn't be nuked, if possible.<br>
><br>
> Also ofc since msm uapi is that the kernel tries to
recover there's<br>
> not much we can do there, contexts cannot be shot.
But still trying to<br>
> replay them as much as possible feels a bit like
overkill.<br>
> -Daniel<br>
><br>
>> Cheers,<br>
>> Daniel<br>
><br>
><br>
<br>
</div>
</span></font></div>
</div>
</blockquote>
<br>
</div>
</blockquote></div>