[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Tue Jan 16 13:41:23 UTC 2024

On 1/16/24 12:24, Joshua Ashton wrote:
> 
> 
> On 1/16/24 07:47, Christian König wrote:
>> Am 16.01.24 um 01:05 schrieb Marek Olšák:
>>> On Mon, Jan 15, 2024 at 3:06 PM Christian König
>>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>> Am 15.01.24 um 20:30 schrieb Joshua Ashton:
>>>>> On 1/15/24 19:19, Christian König wrote:
>>>>>> Am 15.01.24 um 20:13 schrieb Joshua Ashton:
>>>>>>> On 1/15/24 18:53, Christian König wrote:
>>>>>>>> Am 15.01.24 um 19:35 schrieb Joshua Ashton:
>>>>>>>>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>>>>>>>>>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
>>>>>>>>>> <friedrich.vock at gmx.de <mailto:friedrich.vock at gmx.de>> wrote:
>>>>>>>>>>
>>>>>>>>>>      Re-sending as plaintext, sorry about that
>>>>>>>>>>
>>>>>>>>>>      On 15.01.24 18:54, Michel Dänzer wrote:
>>>>>>>>>>       > On 2024-01-15 18:26, Friedrich Vock wrote:
>>>>>>>>>>       >> [snip]
>>>>>>>>>>       >> The fundamental problem here is that not telling
>>>>>>>>>> applications that
>>>>>>>>>>       >> something went wrong when you just canceled their work
>>>>>>>>>> midway is an
>>>>>>>>>>       >> out-of-spec hack.
>>>>>>>>>>       >> When there is a report of real-world apps breaking
>>>>>>>>>> because of
>>>>>>>>>>      that hack,
>>>>>>>>>>       >> reports of different apps working (even if it's
>>>>>>>>>> convenient that they
>>>>>>>>>>       >> work) doesn't justify keeping the broken code.
>>>>>>>>>>       > If the breaking apps hit multiple soft resets in a row,
>>>>>>>>>> I've laid
>>>>>>>>>>      out a pragmatic solution which covers both cases.
>>>>>>>>>>      Hitting soft reset every time is the lucky path. Once GPU
>>>>>>>>>> work is
>>>>>>>>>>      interrupted out of nowhere, all bets are off and it might as
>>>>>>>>>> well
>>>>>>>>>>      trigger a full system hang next time. No hang recovery 
>>>>>>>>>> should
>>>>>>>>>> be able to
>>>>>>>>>>      cause that under any circumstance.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think the more insidious situation is no further hangs but
>>>>>>>>>> wrong results because we skipped some work. That we skipped work
>>>>>>>>>> may e.g. result in some texture not being uploaded or some GPGPU
>>>>>>>>>> work not being done and causing further errors downstream (say if
>>>>>>>>>> a game is doing AI/physics on the GPU not to say anything of
>>>>>>>>>> actual GPGPU work one might be doing like AI)
>>>>>>>>> Even worse if this is compute on eg. OpenCL for something
>>>>>>>>> science/math/whatever related, or training a model.
>>>>>>>>>
>>>>>>>>> You could randomly just get invalid/wrong results without even
>>>>>>>>> knowing!
>>>>>>>> Well on the kernel side we do provide an API to query the result of
>>>>>>>> a submission. That includes canceling submissions with a soft
>>>>>>>> recovery.
>>>>>>>>
>>>>>>>> What we just doesn't do is to prevent further submissions from this
>>>>>>>> application. E.g. enforcing that the application is punished for
>>>>>>>> bad behavior.
>>>>>>> You do prevent future submissions for regular resets though: Those
>>>>>>> increase karma which sets ctx->guilty, and if ctx->guilty then
>>>>>>> -ECANCELED is returned for a submission.
>>>>>>>
>>>>>>> ctx->guilty is never true for soft recovery though, as it doesn't
>>>>>>> increase karma, which is the problem this patch is trying to solve.
>>>>>>>
>>>>>>> By the submission result query API, I you assume you mean checking
>>>>>>> the submission fence error somehow? That doesn't seem very ergonomic
>>>>>>> for a Vulkan driver compared to the simple solution which is to just
>>>>>>> mark it as guilty with what already exists...
>>>>>> Well as I said the guilty handling is broken for quite a number of
>>>>>> reasons.
>>>>>>
>>>>>> What we can do rather trivially is changing this code in
>>>>>> amdgpu_job_prepare_job():
>>>>>>
>>>>>>           /* Ignore soft recovered fences here */
>>>>>>           r = drm_sched_entity_error(s_entity);
>>>>>>           if (r && r != -ENODATA)
>>>>>>                   goto error;
>>>>>>
>>>>>> This will bubble up errors from soft recoveries into the entity as
>>>>>> well and makes sure that further submissions are rejected.
>>>>> That makes sense to do, but at least for GL_EXT_robustness, that will
>>>>> not tell the app that it was guilty.
>>>> No, it clearly gets that signaled. We should probably replace the 
>>>> guilty
>>>> atomic with a calls to drm_sched_entity_error().
>>>>
>>>> It's just that this isn't what Marek and I had in mind for this,
>>>> basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
>>>> AMDGPU_CTX_OP_QUERY_STATE2.
>>>>
>>>> Instead just look at the return value of the CS or query fence 
>>>> result IOCTL.
>>>>
>>>> When you get an -ENODATA you have been guilty of causing a soft
>>>> recovery, when you get an -ETIME you are guilty of causing a timeout
>>>> which had to be hard recovered. When you get an -ECANCELED you are an
>>>> innocent victim of a hard recovery somebody else caused.
>>>>
>>>> What we haven't defined yet is an error code for loosing VRAM, but that
>>>> should be trivial to do.
>>> So far we have implemented the GPU reset and soft reset, but we
>>> haven't done anything to have a robust system recovery. Under the
>>> current system, things can easily keep hanging indefinitely because
>>> nothing prevents that.
>>>
>>> The reset status query should stay. Robust apps will use it to tell
>>> when they should recreate their context and resources even if they
>>> don't submit anything. Let's fully trust robust apps here. In the
>>> future we might change our mind about that, but for now, let's just
>>> focus on API conformance, and later we can change it as long as we
>>> stay API conformant.
>>>
>>> Non-robust apps must be terminated when they hang or are innocent but
>>> affected. Their existence is a security and usability problem and a
>>> source of frustrations for users. 100% guaranteed system recovery is
>>> impossible if they continue to live.
>>>
>>> IBs should be rejected for all guilty and affected innocent contexts
>>> unconditionally, both robust and non-robust ones, by the kernel.
>>> Userspace only forwards the reset status to apps for robust contexts
>>> and doesn't do anything else, but userspace may decide to terminate
>>> the process if any non-robust context is affected.
>>
>> Yeah, that absolutely works for me.
>>
>> Going to adjust the implementation accordingly.
> 
> Awesome, please CC me know when you have something.
> 
> In the short-term I have changed if (r && r != -ENODATA) to if (r) and 
> that seems to work nicely for me.

One problem with solely relying on the CS submission return value from 
userspace is cancelled syncobj waits.

For example, if we have an application with one thread that makes a 
submission, and then kicks off a vkWaitSemaphores to wait on a semaphore 
on another thread and that submission hangs, the syncobj relating to the 
vkWaitSemaphores should be signalled which is fine, but we need to 
return VK_ERROR_DEVICE_LOST if the context loss resulted in the signal 
for the VkSemaphore.

The way this was previously integrated was with the query thing, which 
as we have established does not provide correct information regarding 
soft recovery at the moment.

Unless you have an alternative for us to get some error out of the 
syncobj (eg. -ENODATA), then right now we still require the query.

I think fixing the -ENODATA reporting back for submit is a good step, 
but I believe we still need the query to report the same information as 
we would have gotten from that here.

Thanks

- Joshie 🐸✨

> 
> - Joshie 🐸✨
> 
>>
>> Christian.
>>
>>>
>>>
>>> Marek
>>
>