[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Christian König ckoenig.leichtzumerken at gmail.com
Tue Jan 16 07:47:01 UTC 2024


On 16.01.24 at 01:05, Marek Olšák wrote:
> On Mon, Jan 15, 2024 at 3:06 PM Christian König
> <ckoenig.leichtzumerken at gmail.com> wrote:
>> On 15.01.24 at 20:30, Joshua Ashton wrote:
>>> On 1/15/24 19:19, Christian König wrote:
>>>> On 15.01.24 at 20:13, Joshua Ashton wrote:
>>>>> On 1/15/24 18:53, Christian König wrote:
>>>>>> On 15.01.24 at 19:35, Joshua Ashton wrote:
>>>>>>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>>>>>>>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
>>>>>>>> <friedrich.vock at gmx.de> wrote:
>>>>>>>>
>>>>>>>>      Re-sending as plaintext, sorry about that
>>>>>>>>
>>>>>>>>      On 15.01.24 18:54, Michel Dänzer wrote:
>>>>>>>>       > On 2024-01-15 18:26, Friedrich Vock wrote:
>>>>>>>>       >> [snip]
>>>>>>>>       >> The fundamental problem here is that not telling
>>>>>>>>       >> applications that something went wrong when you just
>>>>>>>>       >> canceled their work midway is an out-of-spec hack.
>>>>>>>>       >> When there is a report of real-world apps breaking
>>>>>>>>       >> because of that hack, reports of different apps working
>>>>>>>>       >> (even if it's convenient that they work) don't justify
>>>>>>>>       >> keeping the broken code.
>>>>>>>>       > If the breaking apps hit multiple soft resets in a row,
>>>>>>>>       > I've laid out a pragmatic solution which covers both cases.
>>>>>>>>      Hitting soft reset every time is the lucky path. Once GPU
>>>>>>>>      work is interrupted out of nowhere, all bets are off and it
>>>>>>>>      might as well trigger a full system hang next time. No hang
>>>>>>>>      recovery should be able to cause that under any circumstance.
>>>>>>>>
>>>>>>>>
>>>>>>>> I think the more insidious situation is no further hangs but
>>>>>>>> wrong results because we skipped some work. The skipped work may
>>>>>>>> e.g. result in some texture not being uploaded or some GPGPU
>>>>>>>> work not being done, causing further errors downstream (say if
>>>>>>>> a game is doing AI/physics on the GPU, to say nothing of actual
>>>>>>>> GPGPU work one might be doing, like AI).
>>>>>>> Even worse if this is compute on e.g. OpenCL for something
>>>>>>> science/math/whatever related, or training a model.
>>>>>>>
>>>>>>> You could randomly just get invalid/wrong results without even
>>>>>>> knowing!
>>>>>> Well, on the kernel side we do provide an API to query the result
>>>>>> of a submission. That includes submissions canceled by a soft
>>>>>> recovery.
>>>>>>
>>>>>> What we just don't do is prevent further submissions from this
>>>>>> application, i.e. enforce that the application is punished for
>>>>>> bad behavior.
>>>>> You do prevent future submissions for regular resets though: those
>>>>> increase karma, which sets ctx->guilty, and if ctx->guilty is set
>>>>> then -ECANCELED is returned for a submission.
>>>>>
>>>>> ctx->guilty is never true for soft recovery though, as it doesn't
>>>>> increase karma, which is the problem this patch is trying to solve.
>>>>>
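>>>>> As a simplified sketch (not the exact amdgpu code), that check on
>>>>> submission looks roughly like:
>>>>>
>>>>>           /* reject new work from a context marked guilty */
>>>>>           if (atomic_read(&ctx->guilty))
>>>>>                   return -ECANCELED;
>>>>>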
>>>>> By the submission result query API, I assume you mean checking
>>>>> the submission fence error somehow? That doesn't seem very ergonomic
>>>>> for a Vulkan driver compared to the simple solution, which is to just
>>>>> mark it as guilty with what already exists...
>>>> Well, as I said, the guilty handling is broken for quite a number of
>>>> reasons.
>>>>
>>>> What we can do rather trivially is changing this code in
>>>> amdgpu_job_prepare_job():
>>>>
>>>>           /* Ignore soft recovered fences here */
>>>>           r = drm_sched_entity_error(s_entity);
>>>>           if (r && r != -ENODATA)
>>>>                   goto error;
>>>>
>>>> This will bubble up errors from soft recoveries into the entity as
>>>> well and make sure that further submissions are rejected.
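>>>>
>>>> As a rough sketch (untested, purely illustrative), the adjusted check
>>>> would simply stop filtering out -ENODATA:
>>>>
>>>>           /* Treat soft-recovered fences as fatal errors, too */
>>>>           r = drm_sched_entity_error(s_entity);
>>>>           if (r)
>>>>                   goto error;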
>>> That makes sense to do, but at least for GL_EXT_robustness, that will
>>> not tell the app that it was guilty.
>> No, it clearly gets that signaled. We should probably replace the guilty
>> atomic with calls to drm_sched_entity_error().
>>
>> It's just that this isn't what Marek and I had in mind for this;
>> basically, completely forget about AMDGPU_CTX_OP_QUERY_STATE or
>> AMDGPU_CTX_OP_QUERY_STATE2.
>>
>> Instead just look at the return value of the CS or query fence result IOCTL.
>>
>> When you get an -ENODATA you have been guilty of causing a soft
>> recovery, when you get an -ETIME you are guilty of causing a timeout
>> which had to be hard recovered. When you get an -ECANCELED you are an
>> innocent victim of a hard recovery somebody else caused.
>>
>> What we haven't defined yet is an error code for losing VRAM, but that
>> should be trivial to do.
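>>
>> As a userspace-side sketch (purely illustrative; cs stands for a
>> filled-in union drm_amdgpu_cs), the CS IOCTL return value would then be
>> interpreted like this:
>>
>>         int r = drmCommandWriteRead(fd, DRM_AMDGPU_CS, &cs, sizeof(cs));
>>         if (r == -ENODATA)
>>                 guilty = true;     /* we caused a soft recovery */
>>         else if (r == -ETIME)
>>                 guilty = true;     /* we caused a timeout/hard recovery */
>>         else if (r == -ECANCELED)
>>                 innocent = true;   /* somebody else's reset hit us */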
> So far we have implemented the GPU reset and soft reset, but we
> haven't done anything to ensure robust system recovery. Under the
> current system, things can easily keep hanging indefinitely because
> nothing prevents that.
>
> The reset status query should stay. Robust apps will use it to tell
> when they should recreate their context and resources even if they
> don't submit anything. Let's fully trust robust apps here. In the
> future we might change our mind about that, but for now, let's just
> focus on API conformance, and later we can change it as long as we
> stay API conformant.
>
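> For example, a robust GL app could poll something along these lines
> (sketch using the ARB_robustness entry point;
> recreate_context_and_resources() is a hypothetical helper):
>
>         GLenum status = glGetGraphicsResetStatusARB();
>         if (status != GL_NO_ERROR)
>                 recreate_context_and_resources();
>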
> Non-robust apps must be terminated when they hang or are innocent but
> affected. Their existence is a security and usability problem and a
> source of frustration for users. 100% guaranteed system recovery is
> impossible if they continue to live.
>
> The kernel should unconditionally reject IBs for all guilty and
> affected innocent contexts, both robust and non-robust ones.
> Userspace only forwards the reset status to apps for robust contexts
> and doesn't do anything else, but userspace may decide to terminate
> the process if any non-robust context is affected.

Yeah, that absolutely works for me.

Going to adjust the implementation accordingly.

Christian.

>
>
> Marek


