[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Michel Dänzer michel at daenzer.net
Mon Jan 15 15:25:12 UTC 2024


On 2024-01-15 14:17, Christian König wrote:
> Am 15.01.24 um 12:37 schrieb Joshua Ashton:
>> On 1/15/24 09:40, Christian König wrote:
>>> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
>>>
>>>> Without this feedback, the application may keep pushing through the soft
>>>> recoveries, continually hanging the system with jobs that time out.
>>>
>>> Well, that is intentional behavior. Marek is voting for making soft recovered errors fatal as well, while Michel would rather ignore them.
>>>
>>> I'm not really sure what to do. If you guys think that soft recovered hangs should be fatal as well then we can certainly do this.

A possible compromise might be making soft resets fatal if they happen repeatedly (within a certain period of time?).
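
To make that idea concrete, here is a minimal userspace sketch, assuming a simple fixed-window counter per context. It is not amdgpu code: the struct, the function names, the millisecond timestamps and the window/threshold values are all hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-context bookkeeping; not an existing amdgpu structure. */
struct soft_reset_tracker {
	uint64_t window_ms;     /* length of the observation window */
	unsigned int threshold; /* soft resets tolerated within one window */
	uint64_t window_start;  /* time of the first reset in the current window */
	unsigned int count;     /* soft resets seen in the current window */
};

static void tracker_init(struct soft_reset_tracker *t,
			 uint64_t window_ms, unsigned int threshold)
{
	assert(threshold > 0);
	t->window_ms = window_ms;
	t->threshold = threshold;
	t->window_start = 0;
	t->count = 0;
}

/*
 * Record a soft reset that happened at now_ms (assumed monotonic).
 * Returns true once the context has exceeded the tolerated number of
 * soft resets within the current window, i.e. when it would be
 * treated as guilty.
 */
static bool tracker_note_soft_reset(struct soft_reset_tracker *t,
				    uint64_t now_ms)
{
	/* Start a fresh window on the first reset or after the old one expired. */
	if (t->count == 0 || now_ms - t->window_start >= t->window_ms) {
		t->window_start = now_ms;
		t->count = 0;
	}

	t->count++;
	return t->count > t->threshold;
}
```

With a window of 1000 ms and a threshold of 2, the first two soft resets inside the window would be survivable and only the third would mark the context guilty; an isolated soft reset long after that would start a new window and be tolerated again.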


>> They have to be!
>>
>> As Marek and I have pointed out, applications that hang or fault will just hang or fault again, especially when they use things like draw indirect, buffer device address, descriptor buffers, etc.
> 
> Ok, well then I now have two people (Marek and you) saying that soft recovery should be fatal while Michel is saying that soft recovery being non-fatal improves stability for him :)

That's not quite what I wrote before.

I pointed out that my GNOME session has survived a soft reset without issues[0] on multiple occasions, whereas Marek's proposal at the time would have kicked me back to the login screen every time. That is a greater-than-zero versus an effectively zero chance of survival.

[0] Except for Firefox unnecessarily falling back to software rendering, which is a side note, not the main point.


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer


