[PATCH] drm/amdgpu: Mark contexts guilty for any reset type

Tue Apr 25 19:11:47 UTC 2023

The last 3 comments in this thread contain arguments that are false and
were specifically pointed out as false 6 comments ago. To repeat: soft resets
are just as fatal as hard resets. There is nothing better about soft resets.
If the
VRAM is lost completely, that's a different story, and if the hard reset is
100% unreliable, that's also a different story, but other than those two
outliers, there is no difference between the two from the user's point of
view. Both can hang repeatedly if you don't prevent the app that caused the
hang from using the GPU, even if the app is not robust. The robustness context
type doesn't matter here. By definition, no guilty app can continue after a
reset, and no innocent apps affected by a reset can continue either because
those can now hang too. That's how destructive all resets are. Personal
anecdotes that the soft reset is better are just that, anecdotes.
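
As a rough illustration of that policy, here is a toy model in C. It is not
amdgpu code: every type and function name below is hypothetical, and it only
sketches the rule that the reset type and the robustness flag do not change
whether a context may keep submitting.

```c
#include <stdbool.h>

/* Hypothetical reset types; not the amdgpu definitions. */
enum reset_type { RESET_SOFT, RESET_HARD };

/* Hypothetical per-context status after a reset. */
enum ctx_status {
    CTX_OK,        /* never involved in a reset */
    CTX_GUILTY,    /* caused the hang */
    CTX_AFFECTED   /* innocent, but its work was hit by the reset */
};

/* Hypothetical context state. */
struct gpu_ctx {
    bool robust;            /* userspace opted into robustness */
    enum ctx_status status;
};

/*
 * Called for every context touched by a reset. The reset type is
 * deliberately ignored: soft and hard resets are treated the same.
 */
void ctx_on_reset(struct gpu_ctx *ctx, enum reset_type type, bool caused_hang)
{
    (void)type;
    ctx->status = caused_hang ? CTX_GUILTY : CTX_AFFECTED;
}

/*
 * Submission check. The robust flag is deliberately not consulted:
 * any context marked by a reset is blocked from further submissions.
 */
bool ctx_can_submit(const struct gpu_ctx *ctx)
{
    return ctx->status == CTX_OK;
}
```

In this model a non-robust app that caused a soft reset is blocked exactly
like a robust app affected by a hard reset.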

Marek

On Tue, Apr 25, 2023, 08:44 Christian König <christian.koenig at amd.com>
wrote:

> On 25.04.23 at 14:14, Michel Dänzer wrote:
> > On 4/25/23 14:08, Christian König wrote:
> >> Well, signaling that something happened is not the question. We do
> >> this for both soft and hard resets.
> >>
> >> The question is whether errors result in blocking further submissions
> >> with the same context or not.
> >>
> >> In case of a hard reset and potential loss of state we have to kill
> >> the context, otherwise a follow-up submission would just lock up the
> >> hardware once more.
> >>
> >> In case of a soft reset I think we can keep the context alive; this
> >> way even applications without robustness handling can keep working.
> >>
> >> You potentially still get some corruption, but at least your
> >> compositor is not killed.
> > Right, and if there is corruption, the user can restart the session.
> >
> >
> > Maybe a possible compromise could be making soft resets fatal if user
> > space enabled robustness for the context, and non-fatal if not.
>
> Well that should already be mostly the case. If an application has
> enabled robustness it should notice that something went wrong and act
> appropriately.
>
> The only thing we need to handle is applications without robustness in
> case of a hard reset; otherwise they will trigger a reset over and over
> again.
>
> Christian.
>
> >
> >
> >> On 25.04.23 at 13:07, Marek Olšák wrote:
> >>> That supposedly depends on the compositor. There may be compositors
> >>> for very specific cases (e.g. Steam Deck) that handle resets very
> >>> well, and those would like to be properly notified of all resets
> >>> because that's how they get the best outcome, e.g. no corruption. A
> >>> soft reset that is unhandled by userspace may result in persistent
> >>> corruption.
> >
>
>