[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Mon Jan 15 19:30:28 UTC 2024

On 1/15/24 19:19, Christian König wrote:
> Am 15.01.24 um 20:13 schrieb Joshua Ashton:
>> On 1/15/24 18:53, Christian König wrote:
>>> Am 15.01.24 um 19:35 schrieb Joshua Ashton:
>>>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>>>>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
>>>>> <friedrich.vock at gmx.de <mailto:friedrich.vock at gmx.de>> wrote:
>>>>>
>>>>>     Re-sending as plaintext, sorry about that
>>>>>
>>>>>     On 15.01.24 18:54, Michel Dänzer wrote:
>>>>>      > On 2024-01-15 18:26, Friedrich Vock wrote:
>>>>>      >> [snip]
>>>>>      >> The fundamental problem here is that not telling 
>>>>> applications that
>>>>>      >> something went wrong when you just canceled their work 
>>>>> midway is an
>>>>>      >> out-of-spec hack.
>>>>>      >> When there is a report of real-world apps breaking because of
>>>>>     that hack,
>>>>>      >> reports of different apps working (even if it's convenient 
>>>>> that they
>>>>>      >> work) doesn't justify keeping the broken code.
>>>>>      > If the breaking apps hit multiple soft resets in a row, I've 
>>>>> laid
>>>>>     out a pragmatic solution which covers both cases.
>>>>>     Hitting soft reset every time is the lucky path. Once GPU work is
>>>>>     interrupted out of nowhere, all bets are off and it might as well
>>>>>     trigger a full system hang next time. No hang recovery should 
>>>>> be able to
>>>>>     cause that under any circumstance.
>>>>>
>>>>>
>>>>> I think the more insidious situation is no further hangs but wrong 
>>>>> results because we skipped some work. That we skipped work may e.g. 
>>>>> result in some texture not being uploaded or some GPGPU work not 
>>>>> being done and causing further errors downstream (say if a game is 
>>>>> doing AI/physics on the GPU not to say anything of actual GPGPU 
>>>>> work one might be doing like AI)
>>>>
>>>> Even worse if this is compute on eg. OpenCL for something 
>>>> science/math/whatever related, or training a model.
>>>>
>>>> You could randomly just get invalid/wrong results without even knowing!
>>>
>>> Well on the kernel side we do provide an API to query the result of a 
>>> submission. That includes canceling submissions with a soft recovery.
>>>
>>> What we just doesn't do is to prevent further submissions from this 
>>> application. E.g. enforcing that the application is punished for bad 
>>> behavior.
>>
>> You do prevent future submissions for regular resets though: Those 
>> increase karma which sets ctx->guilty, and if ctx->guilty then 
>> -ECANCELED is returned for a submission.
>>
>> ctx->guilty is never true for soft recovery though, as it doesn't 
>> increase karma, which is the problem this patch is trying to solve.
>>
>> By the submission result query API, I you assume you mean checking the 
>> submission fence error somehow? That doesn't seem very ergonomic for a 
>> Vulkan driver compared to the simple solution which is to just mark it 
>> as guilty with what already exists...
> 
> Well as I said the guilty handling is broken for quite a number of reasons.
> 
> What we can do rather trivially is changing this code in 
> amdgpu_job_prepare_job():
> 
>          /* Ignore soft recovered fences here */
>          r = drm_sched_entity_error(s_entity);
>          if (r && r != -ENODATA)
>                  goto error;
> 
> This will bubble up errors from soft recoveries into the entity as well 
> and makes sure that further submissions are rejected.

That makes sense to do, but at least for GL_EXT_robustness, that will 
not tell the app that it was guilty.

We could always return UNKNOWN_CONTEXT_RESET_EXT in that case, I guess.

I am not super sure what is wrong/egrigious about the ctx->guilty 
handling, is there a previous email thread explaining that?

- Joshie 🐸✨

> 
> Regards,
> Christian.
> 
>>
>> - Joshie 🐸✨
>>
>>>
>>>>
>>>> Now imagine this is VulkanSC displaying something in the car 
>>>> dashboard, or some medical device doing some compute work to show 
>>>> something on a graph...
>>>>
>>>> I am not saying you should be doing any of that with RADV + AMDGPU, 
>>>> but it's just food for thought... :-)
>>>>
>>>> As I have been saying, you simply cannot just violate API contracts 
>>>> like this, it's flatout wrong.
>>>
>>> Yeah, completely agree to that.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> - Joshie 🐸✨
>>>>
>>>>>
>>>>>      >
>>>>>      >
>>>>>      >> If mutter needs to be robust against faults it caused 
>>>>> itself, it
>>>>>     should be robust
>>>>>      >> against GPU resets.
>>>>>      > It's unlikely that the hangs I've seen were caused by mutter
>>>>>     itself, more likely Mesa or amdgpu.
>>>>>      >
>>>>>      > Anyway, this will happen at some point, the reality is it 
>>>>> hasn't
>>>>>     yet though.
>>>>>      >
>>>>>      >
>>>>>
>>>
>>
>> - Joshie 🐸✨
>