[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Mon Jan 15 16:19:36 UTC 2024

On 15.01.24 16:43, Joshua Ashton wrote:
>
>
> On 1/15/24 15:25, Michel Dänzer wrote:
>> On 2024-01-15 14:17, Christian König wrote:
>>> Am 15.01.24 um 12:37 schrieb Joshua Ashton:
>>>> On 1/15/24 09:40, Christian König wrote:
>>>>> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
>>>>>
>>>>>> Without this feedback, the application may keep pushing through the
>>>>>> soft recoveries, continually hanging the system with jobs that time out.
>>>>>
>>>>> Well, that is intentional behavior. Marek is voting for making
>>>>> soft recovered errors fatal as well while Michel is voting for
>>>>> better ignoring them.
>>>>>
>>>>> I'm not really sure what to do. If you guys think that soft
>>>>> recovered hangs should be fatal as well then we can certainly do
>>>>> this.
>>
>> A possible compromise might be making soft resets fatal if they
>> happen repeatedly (within a certain period of time?).
>>
>
> No, no and no. Aside from introducing issues by side effects not
> surfacing and all of the stuff I mentioned about descriptor buffers,
> bda, draw indirect and stuff just resulting in more faults and hangs...
>
> You are proposing we throw out every promise we made to an application
> on the API contract level because it "might work". That's just wrong!
>
> Let me put this in explicit terms: What you are proposing is in direct
> violation of the GL and Vulkan specification.
>
> You can't just choose to break these contracts because you think it
> 'might' be a better user experience.

Is the original issue that motivated soft resets to be non-fatal even an
issue anymore?

If I read that old thread correctly, the rationale for that was that
assigning guilt to a context was more broken than not doing it, because
the compositor/Xwayland process would also crash despite being unrelated
to the hang.
With Joshua's Mesa fixes, this is not the case anymore, so I don't think
keeping soft resets non-fatal provides any benefit to the user experience.
The potential detriments to user experience have been outlined multiple
times in this thread already.

(I suppose if the compositor itself faults it might still bring down a
session, but I've literally never seen that, and it's not like a
compositor triggering segfaults on CPU stays alive either.)

>
>>
>>>> They have to be!
>>>>
>>>> As Marek and I have pointed out, applications that hang or fault
>>>> will just hang or fault again, especially when they use things like
>>>> draw indirect, buffer device address, descriptor buffers, etc.
>>>
>>> Ok, well then I now have two people (Marek and you) saying that soft
>>> recovery should be fatal while Michel is saying that soft recovery
>>> being non-fatal improves stability for him :)
>>
>> That's not quite what I wrote before.
>>
>> I pointed out that my GNOME session has survived a soft reset without
>> issues[0] on multiple occasions, whereas Marek's proposal at the time
>> would have kicked me back to the login screen every time. That is a
>> greater-than-zero chance of survival versus effectively zero.
>
> The correct thing for GNOME/Mutter to do is to simply re-create its
> context, reimport its DMABUFs, etc.
>
> The fact that it survives and keeps soldiering on with whatever side
> effects missed is purely coincidental and not valid API usage.
>
> If you want such behaviour for hangs for Mutter, you should propose a
> GL/VK extension for it, but I really doubt that will get anywhere.
>
> - Joshie 🐸✨
>
>>
>> [0] Except for Firefox unnecessarily falling back to software
>> rendering, which is a side note, not the main point.
>>
>>
>