[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Joshua Ashton joshua at froggi.es
Tue Jan 16 13:44:14 UTC 2024



On 1/16/24 13:41, Joshua Ashton wrote:
> 
> 
> On 1/16/24 12:24, Joshua Ashton wrote:
>>
>>
>> On 1/16/24 07:47, Christian König wrote:
>>> Am 16.01.24 um 01:05 schrieb Marek Olšák:
>>>> On Mon, Jan 15, 2024 at 3:06 PM Christian König
>>>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>>> Am 15.01.24 um 20:30 schrieb Joshua Ashton:
>>>>>> On 1/15/24 19:19, Christian König wrote:
>>>>>>> Am 15.01.24 um 20:13 schrieb Joshua Ashton:
>>>>>>>> On 1/15/24 18:53, Christian König wrote:
>>>>>>>>> Am 15.01.24 um 19:35 schrieb Joshua Ashton:
>>>>>>>>>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>>>>>>>>>>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
>>>>>>>>>>> <friedrich.vock at gmx.de <mailto:friedrich.vock at gmx.de>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>      Re-sending as plaintext, sorry about that
>>>>>>>>>>>
>>>>>>>>>>>      On 15.01.24 18:54, Michel Dänzer wrote:
>>>>>>>>>>>       > On 2024-01-15 18:26, Friedrich Vock wrote:
>>>>>>>>>>>       >> [snip]
>>>>>>>>>>>       >> The fundamental problem here is that not telling
>>>>>>>>>>>       >> applications that something went wrong when you just
>>>>>>>>>>>       >> canceled their work midway is an out-of-spec hack.
>>>>>>>>>>>       >> When there is a report of real-world apps breaking
>>>>>>>>>>>       >> because of that hack, reports of different apps
>>>>>>>>>>>       >> working (even if it's convenient that they work)
>>>>>>>>>>>       >> don't justify keeping the broken code.
>>>>>>>>>>>       > If the breaking apps hit multiple soft resets in a
>>>>>>>>>>>       > row, I've laid out a pragmatic solution which covers
>>>>>>>>>>>       > both cases.
>>>>>>>>>>>      Hitting soft reset every time is the lucky path. Once
>>>>>>>>>>>      GPU work is interrupted out of nowhere, all bets are off
>>>>>>>>>>>      and it might as well trigger a full system hang next
>>>>>>>>>>>      time. No hang recovery should be able to cause that
>>>>>>>>>>>      under any circumstance.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think the more insidious situation is no further hangs but
>>>>>>>>>>> wrong results because we skipped some work. The skipped work
>>>>>>>>>>> may e.g. result in some texture not being uploaded or some
>>>>>>>>>>> GPGPU work not being done, causing further errors downstream
>>>>>>>>>>> (say, if a game is doing AI/physics on the GPU, to say
>>>>>>>>>>> nothing of actual GPGPU work one might be doing, like AI).
>>>>>>>>>> Even worse if this is compute on e.g. OpenCL for something
>>>>>>>>>> science/math/whatever related, or training a model.
>>>>>>>>>>
>>>>>>>>>> You could randomly just get invalid/wrong results without even
>>>>>>>>>> knowing!
>>>>>>>>> Well on the kernel side we do provide an API to query the 
>>>>>>>>> result of
>>>>>>>>> a submission. That includes canceling submissions with a soft
>>>>>>>>> recovery.
>>>>>>>>>
>>>>>>>>> What we just don't do is prevent further submissions from this
>>>>>>>>> application, e.g. enforcing that the application is punished
>>>>>>>>> for bad behavior.
>>>>>>>> You do prevent future submissions for regular resets though: Those
>>>>>>>> increase karma which sets ctx->guilty, and if ctx->guilty then
>>>>>>>> -ECANCELED is returned for a submission.
>>>>>>>>
>>>>>>>> ctx->guilty is never true for soft recovery though, as it doesn't
>>>>>>>> increase karma, which is the problem this patch is trying to solve.
>>>>>>>>
>>>>>>>> By the submission result query API, I assume you mean checking
>>>>>>>> the submission fence error somehow? That doesn't seem very
>>>>>>>> ergonomic for a Vulkan driver compared to the simple solution,
>>>>>>>> which is to just mark it as guilty with what already exists...
>>>>>>> Well as I said the guilty handling is broken for quite a number of
>>>>>>> reasons.
>>>>>>>
>>>>>>> What we can do rather trivially is changing this code in
>>>>>>> amdgpu_job_prepare_job():
>>>>>>>
>>>>>>>           /* Ignore soft recovered fences here */
>>>>>>>           r = drm_sched_entity_error(s_entity);
>>>>>>>           if (r && r != -ENODATA)
>>>>>>>                   goto error;
>>>>>>>
>>>>>>> This will bubble up errors from soft recoveries into the entity as
>>>>>>> well and makes sure that further submissions are rejected.
>>>>>> That makes sense to do, but at least for GL_EXT_robustness, that will
>>>>>> not tell the app that it was guilty.
>>>>> No, it clearly gets that signaled. We should probably replace the
>>>>> guilty atomic with a call to drm_sched_entity_error().
>>>>>
>>>>> It's just that this isn't what Marek and I had in mind for this,
>>>>> basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
>>>>> AMDGPU_CTX_OP_QUERY_STATE2.
>>>>>
>>>>> Instead just look at the return value of the CS or query fence 
>>>>> result IOCTL.
>>>>>
>>>>> When you get an -ENODATA you have been guilty of causing a soft
>>>>> recovery, when you get an -ETIME you are guilty of causing a timeout
>>>>> which had to be hard recovered. When you get an -ECANCELED you are an
>>>>> innocent victim of a hard recovery somebody else caused.
>>>>>
>>>>> What we haven't defined yet is an error code for losing VRAM, but
>>>>> that should be trivial to do.
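The error-code scheme described above can be sketched as a small
userspace helper. This is only an illustration of the proposed mapping;
the enum and function names are made up for this sketch and are not part
of any kernel or driver API:

```c
#include <errno.h>

/*
 * Illustrative classification of the errno returned by the CS or
 * fence-result ioctl, following the scheme proposed in the thread.
 * These names are hypothetical, not real amdgpu interfaces.
 */
enum ctx_reset_status {
	CTX_NO_RESET,
	CTX_GUILTY_SOFT_RECOVERY,   /* -ENODATA: caused a soft recovery */
	CTX_GUILTY_HARD_RECOVERY,   /* -ETIME: caused a timeout/hard reset */
	CTX_INNOCENT_HARD_RECOVERY, /* -ECANCELED: victim of someone else's reset */
};

static enum ctx_reset_status classify_cs_error(int err)
{
	switch (err) {
	case -ENODATA:
		return CTX_GUILTY_SOFT_RECOVERY;
	case -ETIME:
		return CTX_GUILTY_HARD_RECOVERY;
	case -ECANCELED:
		return CTX_INNOCENT_HARD_RECOVERY;
	default:
		return CTX_NO_RESET;
	}
}
```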
>>>> So far we have implemented the GPU reset and soft reset, but we
>>>> haven't done anything to have a robust system recovery. Under the
>>>> current system, things can easily keep hanging indefinitely because
>>>> nothing prevents that.
>>>>
>>>> The reset status query should stay. Robust apps will use it to tell
>>>> when they should recreate their context and resources even if they
>>>> don't submit anything. Let's fully trust robust apps here. In the
>>>> future we might change our mind about that, but for now, let's just
>>>> focus on API conformance, and later we can change it as long as we
>>>> stay API conformant.
>>>>
>>>> Non-robust apps must be terminated when they hang or are innocent but
>>>> affected. Their existence is a security and usability problem and a
>>>> source of frustrations for users. 100% guaranteed system recovery is
>>>> impossible if they continue to live.
>>>>
>>>> IBs should be rejected for all guilty and affected innocent contexts
>>>> unconditionally, both robust and non-robust ones, by the kernel.
>>>> Userspace only forwards the reset status to apps for robust contexts
>>>> and doesn't do anything else, but userspace may decide to terminate
>>>> the process if any non-robust context is affected.
>>>
>>> Yeah, that absolutely works for me.
>>>
>>> Going to adjust the implementation accordingly.
>>
>> Awesome, please CC me when you have something.
>>
>> In the short term I have changed if (r && r != -ENODATA) to if (r),
>> and that seems to work nicely for me.
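A sketch of what that short-term change does, modeled in userspace
rather than as the actual kernel code; the function names are invented
for illustration:

```c
#include <errno.h>

/*
 * Userspace model of the amdgpu_job_prepare_job() check discussed
 * above. With the old condition, an entity error of -ENODATA (soft
 * recovery) still let new jobs through; with the new condition it is
 * rejected like any other error. Names are illustrative only.
 */
static int prepare_check_old(int entity_error)
{
	if (entity_error && entity_error != -ENODATA)
		return entity_error;	/* reject the job */
	return 0;			/* accept */
}

static int prepare_check_new(int entity_error)
{
	if (entity_error)
		return entity_error;	/* reject, soft recovery included */
	return 0;
}
```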
> 
> One problem with solely relying on the CS submission return value from 
> userspace is cancelled syncobj waits.
> 
> For example, if we have an application with one thread that makes a
> submission and then kicks off a vkWaitSemaphores to wait on a semaphore
> on another thread, and that submission hangs, the syncobj relating to
> the vkWaitSemaphores should be signalled, which is fine, but we need to
> return VK_ERROR_DEVICE_LOST if the context loss resulted in the signal
> for the VkSemaphore.
> 
> The way this was previously integrated was with the query thing, which 
> as we have established does not provide correct information regarding 
> soft recovery at the moment.
> 
> Unless you have an alternative for us to get some error out of the 
> syncobj (eg. -ENODATA), then right now we still require the query.
> 
> I think fixing the -ENODATA reporting back for submit is a good step, 
> but I believe we still need the query to report the same information as 
> we would have gotten from that here.

Hmmm, actually the spec states that VK_SUCCESS is valid in this situation:

Commands that wait indefinitely for device execution (namely 
vkDeviceWaitIdle, vkQueueWaitIdle, vkWaitForFences with a maximum 
timeout, and vkGetQueryPoolResults with the VK_QUERY_RESULT_WAIT_BIT bit 
set in flags) must return in finite time even in the case of a lost 
device, and return either VK_SUCCESS or VK_ERROR_DEVICE_LOST.

...


Once a device is lost, command execution may fail, and certain commands 
that return a VkResult may return VK_ERROR_DEVICE_LOST.
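Under that reading, the wait path only needs a conservative device-lost
check after the wait completes. A minimal sketch of that spec language,
with the enum redeclared so the example is self-contained (values match
the Vulkan headers) and the function name standing in for real driver
internals:

```c
#include <stdbool.h>

/*
 * Once a device is lost, an indefinite wait must still return in finite
 * time, with either VK_SUCCESS or VK_ERROR_DEVICE_LOST. This helper
 * models that choice; it is a sketch, not the real driver code.
 */
typedef enum {
	VK_SUCCESS = 0,
	VK_ERROR_DEVICE_LOST = -4,
} VkResult;

/* Decide the result of a completed indefinite wait. */
static VkResult wait_result(bool device_lost)
{
	/* Returning VK_ERROR_DEVICE_LOST here is allowed, not required. */
	return device_lost ? VK_ERROR_DEVICE_LOST : VK_SUCCESS;
}
```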

- Joshie 🐸✨

> 
> Thanks
> 
> - Joshie 🐸✨
> 
>>
>> - Joshie 🐸✨
>>
>>>
>>> Christian.
>>>
>>>>
>>>>
>>>> Marek
>>>
>>


More information about the amd-gfx mailing list