[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Tue Jan 16 17:31:01 UTC 2024

On 1/16/24 15:48, Michel Dänzer wrote:
> On 2024-01-16 14:44, Joshua Ashton wrote:
>> On 1/16/24 13:41, Joshua Ashton wrote:
>>> On 1/16/24 12:24, Joshua Ashton wrote:
>>>> On 1/16/24 07:47, Christian König wrote:
>>>>> Am 16.01.24 um 01:05 schrieb Marek Olšák:
>>>>>> On Mon, Jan 15, 2024 at 3:06 PM Christian König
>>>>>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>>>>> Am 15.01.24 um 20:30 schrieb Joshua Ashton:
>>>>>>>> On 1/15/24 19:19, Christian König wrote:
>>>>>>>>> Am 15.01.24 um 20:13 schrieb Joshua Ashton:
>>>>>>>>>> On 1/15/24 18:53, Christian König wrote:
>>>>>>>>>>> Am 15.01.24 um 19:35 schrieb Joshua Ashton:
>>>>>>>>>>>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>>>>>>>>>>>>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
>>>>>>>>>>>>> <friedrich.vock at gmx.de <mailto:friedrich.vock at gmx.de>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>       Re-sending as plaintext, sorry about that
>>>>>>>>>>>>>
>>>>>>>>>>>>>       On 15.01.24 18:54, Michel Dänzer wrote:
>>>>>>>>>>>>>        > On 2024-01-15 18:26, Friedrich Vock wrote:
>>>>>>>>>>>>>        >> [snip]
>>>>>>>>>>>>>        >> The fundamental problem here is that not telling
>>>>>>>>>>>>> applications that
>>>>>>>>>>>>>        >> something went wrong when you just canceled their work
>>>>>>>>>>>>> midway is an
>>>>>>>>>>>>>        >> out-of-spec hack.
>>>>>>>>>>>>>        >> When there is a report of real-world apps breaking
>>>>>>>>>>>>> because of
>>>>>>>>>>>>>       that hack,
>>>>>>>>>>>>>        >> reports of different apps working (even if it's
>>>>>>>>>>>>> convenient that they
>>>>>>>>>>>>>        >> work) doesn't justify keeping the broken code.
>>>>>>>>>>>>>        > If the breaking apps hit multiple soft resets in a row,
>>>>>>>>>>>>> I've laid
>>>>>>>>>>>>>       out a pragmatic solution which covers both cases.
>>>>>>>>>>>>>       Hitting soft reset every time is the lucky path. Once GPU
>>>>>>>>>>>>> work is
>>>>>>>>>>>>>       interrupted out of nowhere, all bets are off and it might as
>>>>>>>>>>>>> well
>>>>>>>>>>>>>       trigger a full system hang next time. No hang recovery should
>>>>>>>>>>>>> be able to
>>>>>>>>>>>>>       cause that under any circumstance.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think the more insidious situation is no further hangs but
>>>>>>>>>>>>> wrong results because we skipped some work. That we skipped work
>>>>>>>>>>>>> may e.g. result in some texture not being uploaded or some GPGPU
>>>>>>>>>>>>> work not being done and causing further errors downstream (say if
>>>>>>>>>>>>> a game is doing AI/physics on the GPU not to say anything of
>>>>>>>>>>>>> actual GPGPU work one might be doing like AI)
>>>>>>>>>>>> Even worse if this is compute on eg. OpenCL for something
>>>>>>>>>>>> science/math/whatever related, or training a model.
>>>>>>>>>>>>
>>>>>>>>>>>> You could randomly just get invalid/wrong results without even
>>>>>>>>>>>> knowing!
>>>>>>>>>>> Well on the kernel side we do provide an API to query the result of
>>>>>>>>>>> a submission. That includes canceling submissions with a soft
>>>>>>>>>>> recovery.
>>>>>>>>>>>
>>>>>>>>>>> What we just doesn't do is to prevent further submissions from this
>>>>>>>>>>> application. E.g. enforcing that the application is punished for
>>>>>>>>>>> bad behavior.
>>>>>>>>>> You do prevent future submissions for regular resets though: Those
>>>>>>>>>> increase karma which sets ctx->guilty, and if ctx->guilty then
>>>>>>>>>> -ECANCELED is returned for a submission.
>>>>>>>>>>
>>>>>>>>>> ctx->guilty is never true for soft recovery though, as it doesn't
>>>>>>>>>> increase karma, which is the problem this patch is trying to solve.
>>>>>>>>>>
>>>>>>>>>> By the submission result query API, I you assume you mean checking
>>>>>>>>>> the submission fence error somehow? That doesn't seem very ergonomic
>>>>>>>>>> for a Vulkan driver compared to the simple solution which is to just
>>>>>>>>>> mark it as guilty with what already exists...
>>>>>>>>> Well as I said the guilty handling is broken for quite a number of
>>>>>>>>> reasons.
>>>>>>>>>
>>>>>>>>> What we can do rather trivially is changing this code in
>>>>>>>>> amdgpu_job_prepare_job():
>>>>>>>>>
>>>>>>>>>            /* Ignore soft recovered fences here */
>>>>>>>>>            r = drm_sched_entity_error(s_entity);
>>>>>>>>>            if (r && r != -ENODATA)
>>>>>>>>>                    goto error;
>>>>>>>>>
>>>>>>>>> This will bubble up errors from soft recoveries into the entity as
>>>>>>>>> well and makes sure that further submissions are rejected.
>>>>>>>> That makes sense to do, but at least for GL_EXT_robustness, that will
>>>>>>>> not tell the app that it was guilty.
>>>>>>> No, it clearly gets that signaled. We should probably replace the guilty
>>>>>>> atomic with a calls to drm_sched_entity_error().
>>>>>>>
>>>>>>> It's just that this isn't what Marek and I had in mind for this,
>>>>>>> basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
>>>>>>> AMDGPU_CTX_OP_QUERY_STATE2.
>>>>>>>
>>>>>>> Instead just look at the return value of the CS or query fence result IOCTL.
>>>>>>>
>>>>>>> When you get an -ENODATA you have been guilty of causing a soft
>>>>>>> recovery, when you get an -ETIME you are guilty of causing a timeout
>>>>>>> which had to be hard recovered. When you get an -ECANCELED you are an
>>>>>>> innocent victim of a hard recovery somebody else caused.
>>>>>>>
>>>>>>> What we haven't defined yet is an error code for loosing VRAM, but that
>>>>>>> should be trivial to do.
>>>>>> So far we have implemented the GPU reset and soft reset, but we
>>>>>> haven't done anything to have a robust system recovery. Under the
>>>>>> current system, things can easily keep hanging indefinitely because
>>>>>> nothing prevents that.
>>>>>>
>>>>>> The reset status query should stay. Robust apps will use it to tell
>>>>>> when they should recreate their context and resources even if they
>>>>>> don't submit anything. Let's fully trust robust apps here. In the
>>>>>> future we might change our mind about that, but for now, let's just
>>>>>> focus on API conformance, and later we can change it as long as we
>>>>>> stay API conformant.
>>>>>>
>>>>>> Non-robust apps must be terminated when they hang or are innocent but
>>>>>> affected. Their existence is a security and usability problem and a
>>>>>> source of frustrations for users. 100% guaranteed system recovery is
>>>>>> impossible if they continue to live.
>>>>>>
>>>>>> IBs should be rejected for all guilty and affected innocent contexts
>>>>>> unconditionally, both robust and non-robust ones, by the kernel.
>>>>>> Userspace only forwards the reset status to apps for robust contexts
>>>>>> and doesn't do anything else, but userspace may decide to terminate
>>>>>> the process if any non-robust context is affected.
>>>>>
>>>>> Yeah, that absolutely works for me.
>>>>>
>>>>> Going to adjust the implementation accordingly.
>>>>
>>>> Awesome, please CC me know when you have something.
>>>>
>>>> In the short-term I have changed if (r && r != -ENODATA) to if (r) and that seems to work nicely for me.
>>>
>>> One problem with solely relying on the CS submission return value from userspace is cancelled syncobj waits.
>>>
>>> For example, if we have an application with one thread that makes a submission, and then kicks off a vkWaitSemaphores to wait on a semaphore on another thread and that submission hangs, the syncobj relating to the vkWaitSemaphores should be signalled which is fine, but we need to return VK_ERROR_DEVICE_LOST if the context loss resulted in the signal for the VkSemaphore.
>>>
>>> The way this was previously integrated was with the query thing, which as we have established does not provide correct information regarding soft recovery at the moment.
>>>
>>> Unless you have an alternative for us to get some error out of the syncobj (eg. -ENODATA), then right now we still require the query.
>>>
>>> I think fixing the -ENODATA reporting back for submit is a good step, but I believe we still need the query to report the same information as we would have gotten from that here.
>>
>> Hmmm, actually the spec states that VK_SUCCESS is valid in this situation:
>>
>> Commands that wait indefinitely for device execution (namely vkDeviceWaitIdle, vkQueueWaitIdle, vkWaitForFences with a maximum timeout, and vkGetQueryPoolResults with the VK_QUERY_RESULT_WAIT_BIT bit set in flags) must return in finite time even in the case of a lost device, and return either VK_SUCCESS or VK_ERROR_DEVICE_LOST.
> 
> That could be read as "Return VK_SUCCESS on success, VK_ERROR_DEVICE_LOST on device lost", couldn't it?

No, there are other places where this is more explicit.

"If device loss occurs (see <<devsandqueues-lost-device,Lost Device>>) 
before
the timeout has expired, fname:vkWaitForFences must: return in finite time
with either ename:VK_SUCCESS or ename:VK_ERROR_DEVICE_LOST."

"If device loss occurs (see <<devsandqueues-lost-device,Lost Device>>) 
before
the timeout has expired, fname:vkWaitSemaphores must: return in finite time
with either ename:VK_SUCCESS or ename:VK_ERROR_DEVICE_LOST."

- Joshie 🐸✨

> 
>