[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Mon Jan 15 18:04:31 UTC 2024

On 1/15/24 17:09, Michel Dänzer wrote:
> On 2024-01-15 17:46, Joshua Ashton wrote:
>> On 1/15/24 16:34, Michel Dänzer wrote:
>>> On 2024-01-15 17:19, Friedrich Vock wrote:
>>>> On 15.01.24 16:43, Joshua Ashton wrote:
>>>>> On 1/15/24 15:25, Michel Dänzer wrote:
>>>>>> On 2024-01-15 14:17, Christian König wrote:
>>>>>>> Am 15.01.24 um 12:37 schrieb Joshua Ashton:
>>>>>>>> On 1/15/24 09:40, Christian König wrote:
>>>>>>>>> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
>>>>>>>>>
>>>>>>>>>> Without this feedback, the application may keep pushing through
>>>>>>>>>> the soft
>>>>>>>>>> recoveries, continually hanging the system with jobs that timeout.
>>>>>>>>>
>>>>>>>>> Well, that is intentional behavior. Marek is voting for making
>>>>>>>>> soft recovered errors fatal as well while Michel is voting for
>>>>>>>>> better ignoring them.
>>>>>>>>>
>>>>>>>>> I'm not really sure what to do. If you guys think that soft
>>>>>>>>> recovered hangs should be fatal as well then we can certainly do
>>>>>>>>> this.
>>>>>>
>>>>>> A possible compromise might be making soft resets fatal if they
>>>>>> happen repeatedly (within a certain period of time?).
>>>>>
>>>>> No, no and no. Aside from introducing issues by side effects not
>>>>> surfacing and all of the stuff I mentioned about descriptor buffers,
>>>>> bda, draw indirect and stuff just resulting in more faults and hangs...
>>>>>
>>>>> You are proposing we throw out every promise we made to an application
>>>>> on the API contract level because it "might work". That's just wrong!
>>>>>
>>>>> Let me put this in explicit terms: What you are proposing is in direct
>>>>> violation of the GL and Vulkan specification.
>>>>>
>>>>> You can't just chose to break these contracts because you think it
>>>>> 'might' be a better user experience.
>>>>
>>>> Is the original issue that motivated soft resets to be non-fatal even an
>>>> issue anymore?
>>>>
>>>> If I read that old thread correctly, the rationale for that was that
>>>> assigning guilt to a context was more broken than not doing it, because
>>>> the compositor/Xwayland process would also crash despite being unrelated
>>>> to the hang.
>>>> With Joshua's Mesa fixes, this is not the case anymore, so I don't think
>>>> keeping soft resets non-fatal provides any benefit to the user experience.
>>>> The potential detriments to user experience have been outlined multiple
>>>> times in this thread already.
>>>>
>>>> (I suppose if the compositor itself faults it might still bring down a
>>>> session, but I've literally never seen that, and it's not like a
>>>> compositor triggering segfaults on CPU stays alive either.)
>>>
>>> That's indeed what happened for me, multiple times. And each time the session continued running fine for days after the soft reset.
>>>
>>> But apparently my experience isn't valid somehow, and I should have been forced to log in again to please the GL gods...
>>>
>>>
>>> Conversely, I can't remember hitting a case where an app kept running into soft resets. It's almost as if different people may have different experiences! ;)
>>
>> Your anecdote of whatever application coincidentally managing to soldier through being hung is really not relevant.
> 
> But yours is, got it.

Yes, as what I am stating is backed by the specification for the APIs we 
are using.
You previously said things are not black and white, but it very 
explicitly is -- we have specifications for a reason!

Your app just-so happening to survive a command buffer being ignored is 
not a useful or valid addition to this discussion at all.

- Joshie 🐸✨

> 
> 
>> It looks like Mutter already has some handling for GL robustness anyway...
> 
> That's just for suspend/resume with the nvidia driver. It can't recover from GPU hangs yet.
> 
>