[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Tue Jan 16 12:24:27 UTC 2024

On 1/16/24 07:47, Christian König wrote:
> Am 16.01.24 um 01:05 schrieb Marek Olšák:
>> On Mon, Jan 15, 2024 at 3:06 PM Christian König
>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>> Am 15.01.24 um 20:30 schrieb Joshua Ashton:
>>>> On 1/15/24 19:19, Christian König wrote:
>>>>> Am 15.01.24 um 20:13 schrieb Joshua Ashton:
>>>>>> On 1/15/24 18:53, Christian König wrote:
>>>>>>> Am 15.01.24 um 19:35 schrieb Joshua Ashton:
>>>>>>>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>>>>>>>>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
>>>>>>>>> <friedrich.vock at gmx.de <mailto:friedrich.vock at gmx.de>> wrote:
>>>>>>>>>
>>>>>>>>>      Re-sending as plaintext, sorry about that
>>>>>>>>>
>>>>>>>>>      On 15.01.24 18:54, Michel Dänzer wrote:
>>>>>>>>>       > On 2024-01-15 18:26, Friedrich Vock wrote:
>>>>>>>>>       >> [snip]
>>>>>>>>>       >> The fundamental problem here is that not telling
>>>>>>>>> applications that
>>>>>>>>>       >> something went wrong when you just canceled their work
>>>>>>>>> midway is an
>>>>>>>>>       >> out-of-spec hack.
>>>>>>>>>       >> When there is a report of real-world apps breaking
>>>>>>>>> because of
>>>>>>>>>      that hack,
>>>>>>>>>       >> reports of different apps working (even if it's
>>>>>>>>> convenient that they
>>>>>>>>>       >> work) doesn't justify keeping the broken code.
>>>>>>>>>       > If the breaking apps hit multiple soft resets in a row,
>>>>>>>>> I've laid
>>>>>>>>>      out a pragmatic solution which covers both cases.
>>>>>>>>>      Hitting soft reset every time is the lucky path. Once GPU
>>>>>>>>> work is
>>>>>>>>>      interrupted out of nowhere, all bets are off and it might as
>>>>>>>>> well
>>>>>>>>>      trigger a full system hang next time. No hang recovery should
>>>>>>>>> be able to
>>>>>>>>>      cause that under any circumstance.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think the more insidious situation is no further hangs but
>>>>>>>>> wrong results because we skipped some work. That we skipped work
>>>>>>>>> may e.g. result in some texture not being uploaded or some GPGPU
>>>>>>>>> work not being done and causing further errors downstream (say if
>>>>>>>>> a game is doing AI/physics on the GPU not to say anything of
>>>>>>>>> actual GPGPU work one might be doing like AI)
>>>>>>>> Even worse if this is compute on eg. OpenCL for something
>>>>>>>> science/math/whatever related, or training a model.
>>>>>>>>
>>>>>>>> You could randomly just get invalid/wrong results without even
>>>>>>>> knowing!
>>>>>>> Well on the kernel side we do provide an API to query the result of
>>>>>>> a submission. That includes canceling submissions with a soft
>>>>>>> recovery.
>>>>>>>
>>>>>>> What we just doesn't do is to prevent further submissions from this
>>>>>>> application. E.g. enforcing that the application is punished for
>>>>>>> bad behavior.
>>>>>> You do prevent future submissions for regular resets though: Those
>>>>>> increase karma which sets ctx->guilty, and if ctx->guilty then
>>>>>> -ECANCELED is returned for a submission.
>>>>>>
>>>>>> ctx->guilty is never true for soft recovery though, as it doesn't
>>>>>> increase karma, which is the problem this patch is trying to solve.
>>>>>>
>>>>>> By the submission result query API, I you assume you mean checking
>>>>>> the submission fence error somehow? That doesn't seem very ergonomic
>>>>>> for a Vulkan driver compared to the simple solution which is to just
>>>>>> mark it as guilty with what already exists...
>>>>> Well as I said the guilty handling is broken for quite a number of
>>>>> reasons.
>>>>>
>>>>> What we can do rather trivially is changing this code in
>>>>> amdgpu_job_prepare_job():
>>>>>
>>>>>           /* Ignore soft recovered fences here */
>>>>>           r = drm_sched_entity_error(s_entity);
>>>>>           if (r && r != -ENODATA)
>>>>>                   goto error;
>>>>>
>>>>> This will bubble up errors from soft recoveries into the entity as
>>>>> well and makes sure that further submissions are rejected.
>>>> That makes sense to do, but at least for GL_EXT_robustness, that will
>>>> not tell the app that it was guilty.
>>> No, it clearly gets that signaled. We should probably replace the guilty
>>> atomic with a calls to drm_sched_entity_error().
>>>
>>> It's just that this isn't what Marek and I had in mind for this,
>>> basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
>>> AMDGPU_CTX_OP_QUERY_STATE2.
>>>
>>> Instead just look at the return value of the CS or query fence result 
>>> IOCTL.
>>>
>>> When you get an -ENODATA you have been guilty of causing a soft
>>> recovery, when you get an -ETIME you are guilty of causing a timeout
>>> which had to be hard recovered. When you get an -ECANCELED you are an
>>> innocent victim of a hard recovery somebody else caused.
>>>
>>> What we haven't defined yet is an error code for loosing VRAM, but that
>>> should be trivial to do.
>> So far we have implemented the GPU reset and soft reset, but we
>> haven't done anything to have a robust system recovery. Under the
>> current system, things can easily keep hanging indefinitely because
>> nothing prevents that.
>>
>> The reset status query should stay. Robust apps will use it to tell
>> when they should recreate their context and resources even if they
>> don't submit anything. Let's fully trust robust apps here. In the
>> future we might change our mind about that, but for now, let's just
>> focus on API conformance, and later we can change it as long as we
>> stay API conformant.
>>
>> Non-robust apps must be terminated when they hang or are innocent but
>> affected. Their existence is a security and usability problem and a
>> source of frustrations for users. 100% guaranteed system recovery is
>> impossible if they continue to live.
>>
>> IBs should be rejected for all guilty and affected innocent contexts
>> unconditionally, both robust and non-robust ones, by the kernel.
>> Userspace only forwards the reset status to apps for robust contexts
>> and doesn't do anything else, but userspace may decide to terminate
>> the process if any non-robust context is affected.
> 
> Yeah, that absolutely works for me.
> 
> Going to adjust the implementation accordingly.

Awesome, please CC me know when you have something.

In the short-term I have changed if (r && r != -ENODATA) to if (r) and 
that seems to work nicely for me.

- Joshie 🐸✨

> 
> Christian.
> 
>>
>>
>> Marek
>