[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Mon Jan 15 20:04:23 UTC 2024

On 1/15/24 19:57, Christian König wrote:
> Am 15.01.24 um 20:30 schrieb Joshua Ashton:
>> On 1/15/24 19:19, Christian König wrote:
>>> Am 15.01.24 um 20:13 schrieb Joshua Ashton:
>>>> On 1/15/24 18:53, Christian König wrote:
>>>>> Am 15.01.24 um 19:35 schrieb Joshua Ashton:
>>>>>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>>>>>>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
>>>>>>> <friedrich.vock at gmx.de <mailto:friedrich.vock at gmx.de>> wrote:
>>>>>>>
>>>>>>>     Re-sending as plaintext, sorry about that
>>>>>>>
>>>>>>>     On 15.01.24 18:54, Michel Dänzer wrote:
>>>>>>>      > On 2024-01-15 18:26, Friedrich Vock wrote:
>>>>>>>      >> [snip]
>>>>>>>      >> The fundamental problem here is that not telling 
>>>>>>> applications that
>>>>>>>      >> something went wrong when you just canceled their work 
>>>>>>> midway is an
>>>>>>>      >> out-of-spec hack.
>>>>>>>      >> When there is a report of real-world apps breaking 
>>>>>>> because of
>>>>>>>     that hack,
>>>>>>>      >> reports of different apps working (even if it's 
>>>>>>> convenient that they
>>>>>>>      >> work) doesn't justify keeping the broken code.
>>>>>>>      > If the breaking apps hit multiple soft resets in a row, 
>>>>>>> I've laid
>>>>>>>     out a pragmatic solution which covers both cases.
>>>>>>>     Hitting soft reset every time is the lucky path. Once GPU 
>>>>>>> work is
>>>>>>>     interrupted out of nowhere, all bets are off and it might as 
>>>>>>> well
>>>>>>>     trigger a full system hang next time. No hang recovery should 
>>>>>>> be able to
>>>>>>>     cause that under any circumstance.
>>>>>>>
>>>>>>>
>>>>>>> I think the more insidious situation is no further hangs but 
>>>>>>> wrong results because we skipped some work. That we skipped work 
>>>>>>> may e.g. result in some texture not being uploaded or some GPGPU 
>>>>>>> work not being done and causing further errors downstream (say if 
>>>>>>> a game is doing AI/physics on the GPU not to say anything of 
>>>>>>> actual GPGPU work one might be doing like AI)
>>>>>>
>>>>>> Even worse if this is compute on eg. OpenCL for something 
>>>>>> science/math/whatever related, or training a model.
>>>>>>
>>>>>> You could randomly just get invalid/wrong results without even 
>>>>>> knowing!
>>>>>
>>>>> Well on the kernel side we do provide an API to query the result of 
>>>>> a submission. That includes canceling submissions with a soft 
>>>>> recovery.
>>>>>
>>>>> What we just doesn't do is to prevent further submissions from this 
>>>>> application. E.g. enforcing that the application is punished for 
>>>>> bad behavior.
>>>>
>>>> You do prevent future submissions for regular resets though: Those 
>>>> increase karma which sets ctx->guilty, and if ctx->guilty then 
>>>> -ECANCELED is returned for a submission.
>>>>
>>>> ctx->guilty is never true for soft recovery though, as it doesn't 
>>>> increase karma, which is the problem this patch is trying to solve.
>>>>
>>>> By the submission result query API, I you assume you mean checking 
>>>> the submission fence error somehow? That doesn't seem very ergonomic 
>>>> for a Vulkan driver compared to the simple solution which is to just 
>>>> mark it as guilty with what already exists...
>>>
>>> Well as I said the guilty handling is broken for quite a number of 
>>> reasons.
>>>
>>> What we can do rather trivially is changing this code in 
>>> amdgpu_job_prepare_job():
>>>
>>>          /* Ignore soft recovered fences here */
>>>          r = drm_sched_entity_error(s_entity);
>>>          if (r && r != -ENODATA)
>>>                  goto error;
>>>
>>> This will bubble up errors from soft recoveries into the entity as 
>>> well and makes sure that further submissions are rejected.
>>
>> That makes sense to do, but at least for GL_EXT_robustness, that will 
>> not tell the app that it was guilty.
> 
> No, it clearly gets that signaled. We should probably replace the guilty 
> atomic with a calls to drm_sched_entity_error().
> 
> It's just that this isn't what Marek and I had in mind for this, 
> basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or 
> AMDGPU_CTX_OP_QUERY_STATE2.
> 
> Instead just look at the return value of the CS or query fence result 
> IOCTL.
> 
> When you get an -ENODATA you have been guilty of causing a soft 
> recovery, when you get an -ETIME you are guilty of causing a timeout 
> which had to be hard recovered. When you get an -ECANCELED you are an 
> innocent victim of a hard recovery somebody else caused.
> 
> What we haven't defined yet is an error code for loosing VRAM, but that 
> should be trivial to do.

Thanks for the info, I will test things out here and likely send a patch 
to change if (r && r != -ENODATA) -> if (r) if things work out.

- Joshie 🐸✨

> 
> Regards,
> Christian.
> 
>>
>> We could always return UNKNOWN_CONTEXT_RESET_EXT in that case, I guess.
>>
>> I am not super sure what is wrong/egrigious about the ctx->guilty 
>> handling, is there a previous email thread explaining that?
>>
>> - Joshie 🐸✨
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> - Joshie 🐸✨
>>>>
>>>>>
>>>>>>
>>>>>> Now imagine this is VulkanSC displaying something in the car 
>>>>>> dashboard, or some medical device doing some compute work to show 
>>>>>> something on a graph...
>>>>>>
>>>>>> I am not saying you should be doing any of that with RADV + 
>>>>>> AMDGPU, but it's just food for thought... :-)
>>>>>>
>>>>>> As I have been saying, you simply cannot just violate API 
>>>>>> contracts like this, it's flatout wrong.
>>>>>
>>>>> Yeah, completely agree to that.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> - Joshie 🐸✨
>>>>>>
>>>>>>>
>>>>>>>      >
>>>>>>>      >
>>>>>>>      >> If mutter needs to be robust against faults it caused 
>>>>>>> itself, it
>>>>>>>     should be robust
>>>>>>>      >> against GPU resets.
>>>>>>>      > It's unlikely that the hangs I've seen were caused by mutter
>>>>>>>     itself, more likely Mesa or amdgpu.
>>>>>>>      >
>>>>>>>      > Anyway, this will happen at some point, the reality is it 
>>>>>>> hasn't
>>>>>>>     yet though.
>>>>>>>      >
>>>>>>>      >
>>>>>>>
>>>>>
>>>>
>>>> - Joshie 🐸✨
>>>
>