[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Mon Jan 15 19:57:10 UTC 2024

Am 15.01.24 um 20:30 schrieb Joshua Ashton:
> On 1/15/24 19:19, Christian König wrote:
>> Am 15.01.24 um 20:13 schrieb Joshua Ashton:
>>> On 1/15/24 18:53, Christian König wrote:
>>>> Am 15.01.24 um 19:35 schrieb Joshua Ashton:
>>>>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>>>>>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
>>>>>> <friedrich.vock at gmx.de <mailto:friedrich.vock at gmx.de>> wrote:
>>>>>>
>>>>>>     Re-sending as plaintext, sorry about that
>>>>>>
>>>>>>     On 15.01.24 18:54, Michel Dänzer wrote:
>>>>>>      > On 2024-01-15 18:26, Friedrich Vock wrote:
>>>>>>      >> [snip]
>>>>>>      >> The fundamental problem here is that not telling 
>>>>>> applications that
>>>>>>      >> something went wrong when you just canceled their work 
>>>>>> midway is an
>>>>>>      >> out-of-spec hack.
>>>>>>      >> When there is a report of real-world apps breaking 
>>>>>> because of
>>>>>>     that hack,
>>>>>>      >> reports of different apps working (even if it's 
>>>>>> convenient that they
>>>>>>      >> work) doesn't justify keeping the broken code.
>>>>>>      > If the breaking apps hit multiple soft resets in a row, 
>>>>>> I've laid
>>>>>>     out a pragmatic solution which covers both cases.
>>>>>>     Hitting soft reset every time is the lucky path. Once GPU 
>>>>>> work is
>>>>>>     interrupted out of nowhere, all bets are off and it might as 
>>>>>> well
>>>>>>     trigger a full system hang next time. No hang recovery should 
>>>>>> be able to
>>>>>>     cause that under any circumstance.
>>>>>>
>>>>>>
>>>>>> I think the more insidious situation is no further hangs but 
>>>>>> wrong results because we skipped some work. That we skipped work 
>>>>>> may e.g. result in some texture not being uploaded or some GPGPU 
>>>>>> work not being done and causing further errors downstream (say if 
>>>>>> a game is doing AI/physics on the GPU not to say anything of 
>>>>>> actual GPGPU work one might be doing like AI)
>>>>>
>>>>> Even worse if this is compute on eg. OpenCL for something 
>>>>> science/math/whatever related, or training a model.
>>>>>
>>>>> You could randomly just get invalid/wrong results without even 
>>>>> knowing!
>>>>
>>>> Well on the kernel side we do provide an API to query the result of 
>>>> a submission. That includes canceling submissions with a soft 
>>>> recovery.
>>>>
>>>> What we just doesn't do is to prevent further submissions from this 
>>>> application. E.g. enforcing that the application is punished for 
>>>> bad behavior.
>>>
>>> You do prevent future submissions for regular resets though: Those 
>>> increase karma which sets ctx->guilty, and if ctx->guilty then 
>>> -ECANCELED is returned for a submission.
>>>
>>> ctx->guilty is never true for soft recovery though, as it doesn't 
>>> increase karma, which is the problem this patch is trying to solve.
>>>
>>> By the submission result query API, I you assume you mean checking 
>>> the submission fence error somehow? That doesn't seem very ergonomic 
>>> for a Vulkan driver compared to the simple solution which is to just 
>>> mark it as guilty with what already exists...
>>
>> Well as I said the guilty handling is broken for quite a number of 
>> reasons.
>>
>> What we can do rather trivially is changing this code in 
>> amdgpu_job_prepare_job():
>>
>>          /* Ignore soft recovered fences here */
>>          r = drm_sched_entity_error(s_entity);
>>          if (r && r != -ENODATA)
>>                  goto error;
>>
>> This will bubble up errors from soft recoveries into the entity as 
>> well and makes sure that further submissions are rejected.
>
> That makes sense to do, but at least for GL_EXT_robustness, that will 
> not tell the app that it was guilty.

No, it clearly gets that signaled. We should probably replace the guilty 
atomic with a calls to drm_sched_entity_error().

It's just that this isn't what Marek and I had in mind for this, 
basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or 
AMDGPU_CTX_OP_QUERY_STATE2.

Instead just look at the return value of the CS or query fence result IOCTL.

When you get an -ENODATA you have been guilty of causing a soft 
recovery, when you get an -ETIME you are guilty of causing a timeout 
which had to be hard recovered. When you get an -ECANCELED you are an 
innocent victim of a hard recovery somebody else caused.

What we haven't defined yet is an error code for loosing VRAM, but that 
should be trivial to do.

Regards,
Christian.

>
> We could always return UNKNOWN_CONTEXT_RESET_EXT in that case, I guess.
>
> I am not super sure what is wrong/egrigious about the ctx->guilty 
> handling, is there a previous email thread explaining that?
>
> - Joshie 🐸✨
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> - Joshie 🐸✨
>>>
>>>>
>>>>>
>>>>> Now imagine this is VulkanSC displaying something in the car 
>>>>> dashboard, or some medical device doing some compute work to show 
>>>>> something on a graph...
>>>>>
>>>>> I am not saying you should be doing any of that with RADV + 
>>>>> AMDGPU, but it's just food for thought... :-)
>>>>>
>>>>> As I have been saying, you simply cannot just violate API 
>>>>> contracts like this, it's flatout wrong.
>>>>
>>>> Yeah, completely agree to that.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> - Joshie 🐸✨
>>>>>
>>>>>>
>>>>>>      >
>>>>>>      >
>>>>>>      >> If mutter needs to be robust against faults it caused 
>>>>>> itself, it
>>>>>>     should be robust
>>>>>>      >> against GPU resets.
>>>>>>      > It's unlikely that the hangs I've seen were caused by mutter
>>>>>>     itself, more likely Mesa or amdgpu.
>>>>>>      >
>>>>>>      > Anyway, this will happen at some point, the reality is it 
>>>>>> hasn't
>>>>>>     yet though.
>>>>>>      >
>>>>>>      >
>>>>>>
>>>>
>>>
>>> - Joshie 🐸✨
>>