[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Mon Jan 15 19:19:52 UTC 2024

Am 15.01.24 um 20:13 schrieb Joshua Ashton:
> On 1/15/24 18:53, Christian König wrote:
>> Am 15.01.24 um 19:35 schrieb Joshua Ashton:
>>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>>>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
>>>> <friedrich.vock at gmx.de <mailto:friedrich.vock at gmx.de>> wrote:
>>>>
>>>>     Re-sending as plaintext, sorry about that
>>>>
>>>>     On 15.01.24 18:54, Michel Dänzer wrote:
>>>>      > On 2024-01-15 18:26, Friedrich Vock wrote:
>>>>      >> [snip]
>>>>      >> The fundamental problem here is that not telling 
>>>> applications that
>>>>      >> something went wrong when you just canceled their work 
>>>> midway is an
>>>>      >> out-of-spec hack.
>>>>      >> When there is a report of real-world apps breaking because of
>>>>     that hack,
>>>>      >> reports of different apps working (even if it's convenient 
>>>> that they
>>>>      >> work) doesn't justify keeping the broken code.
>>>>      > If the breaking apps hit multiple soft resets in a row, I've 
>>>> laid
>>>>     out a pragmatic solution which covers both cases.
>>>>     Hitting soft reset every time is the lucky path. Once GPU work is
>>>>     interrupted out of nowhere, all bets are off and it might as well
>>>>     trigger a full system hang next time. No hang recovery should 
>>>> be able to
>>>>     cause that under any circumstance.
>>>>
>>>>
>>>> I think the more insidious situation is no further hangs but wrong 
>>>> results because we skipped some work. That we skipped work may e.g. 
>>>> result in some texture not being uploaded or some GPGPU work not 
>>>> being done and causing further errors downstream (say if a game is 
>>>> doing AI/physics on the GPU not to say anything of actual GPGPU 
>>>> work one might be doing like AI)
>>>
>>> Even worse if this is compute on eg. OpenCL for something 
>>> science/math/whatever related, or training a model.
>>>
>>> You could randomly just get invalid/wrong results without even knowing!
>>
>> Well on the kernel side we do provide an API to query the result of a 
>> submission. That includes canceling submissions with a soft recovery.
>>
>> What we just doesn't do is to prevent further submissions from this 
>> application. E.g. enforcing that the application is punished for bad 
>> behavior.
>
> You do prevent future submissions for regular resets though: Those 
> increase karma which sets ctx->guilty, and if ctx->guilty then 
> -ECANCELED is returned for a submission.
>
> ctx->guilty is never true for soft recovery though, as it doesn't 
> increase karma, which is the problem this patch is trying to solve.
>
> By the submission result query API, I you assume you mean checking the 
> submission fence error somehow? That doesn't seem very ergonomic for a 
> Vulkan driver compared to the simple solution which is to just mark it 
> as guilty with what already exists...

Well as I said the guilty handling is broken for quite a number of reasons.

What we can do rather trivially is changing this code in 
amdgpu_job_prepare_job():

         /* Ignore soft recovered fences here */
         r = drm_sched_entity_error(s_entity);
         if (r && r != -ENODATA)
                 goto error;

This will bubble up errors from soft recoveries into the entity as well 
and makes sure that further submissions are rejected.

Regards,
Christian.

>
> - Joshie 🐸✨
>
>>
>>>
>>> Now imagine this is VulkanSC displaying something in the car 
>>> dashboard, or some medical device doing some compute work to show 
>>> something on a graph...
>>>
>>> I am not saying you should be doing any of that with RADV + AMDGPU, 
>>> but it's just food for thought... :-)
>>>
>>> As I have been saying, you simply cannot just violate API contracts 
>>> like this, it's flatout wrong.
>>
>> Yeah, completely agree to that.
>>
>> Regards,
>> Christian.
>>
>>>
>>> - Joshie 🐸✨
>>>
>>>>
>>>>      >
>>>>      >
>>>>      >> If mutter needs to be robust against faults it caused 
>>>> itself, it
>>>>     should be robust
>>>>      >> against GPU resets.
>>>>      > It's unlikely that the hangs I've seen were caused by mutter
>>>>     itself, more likely Mesa or amdgpu.
>>>>      >
>>>>      > Anyway, this will happen at some point, the reality is it 
>>>> hasn't
>>>>     yet though.
>>>>      >
>>>>      >
>>>>
>>
>
> - Joshie 🐸✨