[PATCH] drm/amdgpu: Mark contexts guilty for any reset type

Marek Olšák maraeo at gmail.com
Wed Apr 26 15:52:14 UTC 2023


Perhaps I should clarify this. There are GL and Vulkan features that, if an
app uses them and its shaders are killed, will cause the next IB to hang. One
of them is Draw Indirect: if a shader is killed before storing the vertex
count and instance count in memory, the next draw will hang with high
probability. No such app can be allowed to continue executing after a reset.
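
To make the Draw Indirect case concrete, here is a minimal CPU-side sketch
in C. It is only an illustration, not driver or hardware code: the struct
mirrors the field layout of Vulkan's VkDrawIndirectCommand, while
STALE_WORD, fill_stale(), producer_shader() and draw_work_items() are
hypothetical names invented for this example, and the stale value is an
assumption about what uninitialized memory might hold.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same field layout as Vulkan's VkDrawIndirectCommand. */
struct draw_indirect_args {
    uint32_t vertex_count;
    uint32_t instance_count;
    uint32_t first_vertex;
    uint32_t first_instance;
};

/* Hypothetical stand-in for whatever the buffer held before the write. */
#define STALE_WORD 0xFFFFFFFFu

/* Put the argument buffer into its "not yet written" state. */
static void fill_stale(struct draw_indirect_args *args)
{
    args->vertex_count = STALE_WORD;
    args->instance_count = STALE_WORD;
    args->first_vertex = STALE_WORD;
    args->first_instance = STALE_WORD;
}

/*
 * Hypothetical producer shader: stores the draw arguments unless it was
 * killed by a reset before reaching the store.
 */
static void producer_shader(struct draw_indirect_args *args,
                            bool killed_by_reset)
{
    if (killed_by_reset)
        return; /* the store never happens; the buffer stays stale */

    args->vertex_count = 3;
    args->instance_count = 1;
    args->first_vertex = 0;
    args->first_instance = 0;
}

/* Vertex invocations the following indirect draw would ask for. */
static uint64_t draw_work_items(const struct draw_indirect_args *args)
{
    return (uint64_t)args->vertex_count * args->instance_count;
}
```

With the producer intact, the following draw requests 3 vertex invocations.
With the producer killed, it requests 0xFFFFFFFF * 0xFFFFFFFF of them from
the stale buffer. The sketch shows only the size of the bogus draw; it does
not model why the hardware hangs.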

Marek

On Wed, Apr 26, 2023 at 5:51 AM Michel Dänzer <michel.daenzer at mailbox.org>
wrote:

> On 4/25/23 21:11, Marek Olšák wrote:
> > The last 3 comments in this thread contain arguments that are false and
> > were specifically pointed out as false 6 comments ago: Soft resets are
> > just as fatal as hard resets. There is nothing better about soft resets.
> > If the VRAM is lost completely, that's a different story, and if the hard
> > reset is 100% unreliable, that's also a different story, but other than
> > those two outliers, there is no difference between the two from the
> > user's point of view. Both can repeatedly hang if you don't prevent the
> > app that caused the hang from using the GPU even if the app is not
> > robust. The robustness context type doesn't matter here. By definition,
> > no guilty app can continue after a reset, and no innocent apps affected
> > by a reset can continue either because those can now hang too. That's how
> > destructive all resets are. Personal anecdotes that the soft reset is
> > better are just that, anecdotes.
>
> You're trying to frame the situation as black or white, but reality is
> shades of grey.
>
>
> There's a similar situation with kernel Oopsen: In principle it's not safe
> to continue executing the kernel after it hits an Oops, since it might be
> in an inconsistent state, which could result in any kind of misbehaviour.
> Still, the default behaviour is to continue executing, and in most cases it
> turns out fine. Users who cannot accept the residual risk can choose to
> make the kernel panic when it hits an Oops (either via CONFIG_PANIC_ON_OOPS
> at build time, or via oops=panic on the kernel command line). A kernel
> panic means that the machine basically freezes from a user PoV, which would
> be worse as the default behaviour for most users (because it would e.g.
> incur a higher risk of losing filesystem data).
>
>
> --
> Earthling Michel Dänzer            |                  https://redhat.com
> Libre software enthusiast          |         Mesa and Xwayland developer
>
>