[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path
Joshua Ashton
joshua at froggi.es
Tue Jan 16 13:48:40 UTC 2024
On 1/16/24 13:44, Joshua Ashton wrote:
>
>
> On 1/16/24 13:41, Joshua Ashton wrote:
>>
>>
>> On 1/16/24 12:24, Joshua Ashton wrote:
>>>
>>>
>>> On 1/16/24 07:47, Christian König wrote:
>>>> Am 16.01.24 um 01:05 schrieb Marek Olšák:
>>>>> On Mon, Jan 15, 2024 at 3:06 PM Christian König
>>>>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>>>> Am 15.01.24 um 20:30 schrieb Joshua Ashton:
>>>>>>> On 1/15/24 19:19, Christian König wrote:
>>>>>>>> Am 15.01.24 um 20:13 schrieb Joshua Ashton:
>>>>>>>>> On 1/15/24 18:53, Christian König wrote:
>>>>>>>>>> Am 15.01.24 um 19:35 schrieb Joshua Ashton:
>>>>>>>>>>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>>>>>>>>>>>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
>>>>>>>>>>>> <friedrich.vock at gmx.de <mailto:friedrich.vock at gmx.de>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Re-sending as plaintext, sorry about that
>>>>>>>>>>>>
>>>>>>>>>>>> On 15.01.24 18:54, Michel Dänzer wrote:
>>>>>>>>>>>> > On 2024-01-15 18:26, Friedrich Vock wrote:
>>>>>>>>>>>> >> [snip]
>>>>>>>>>>>> >> The fundamental problem here is that not telling
>>>>>>>>>>>> applications that
>>>>>>>>>>>> >> something went wrong when you just canceled their work
>>>>>>>>>>>> midway is an
>>>>>>>>>>>> >> out-of-spec hack.
>>>>>>>>>>>> >> When there is a report of real-world apps breaking
>>>>>>>>>>>> because of
>>>>>>>>>>>> that hack,
>>>>>>>>>>>> >> reports of different apps working (even if it's
>>>>>>>>>>>> convenient that they
>>>>>>>>>>>> >> work) doesn't justify keeping the broken code.
>>>>>>>>>>>> > If the breaking apps hit multiple soft resets in a row,
>>>>>>>>>>>> > I've laid
>>>>>>>>>>>> > out a pragmatic solution which covers both cases.
>>>>>>>>>>>> Hitting soft reset every time is the lucky path. Once GPU
>>>>>>>>>>>> work is
>>>>>>>>>>>> interrupted out of nowhere, all bets are off and it
>>>>>>>>>>>> might as
>>>>>>>>>>>> well
>>>>>>>>>>>> trigger a full system hang next time. No hang recovery
>>>>>>>>>>>> should
>>>>>>>>>>>> be able to
>>>>>>>>>>>> cause that under any circumstance.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I think the more insidious situation is no further hangs but
>>>>>>>>>>>> wrong results because we skipped some work. That we skipped
>>>>>>>>>>>> work
>>>>>>>>>>>> may e.g. result in some texture not being uploaded or some
>>>>>>>>>>>> GPGPU
>>>>>>>>>>>> work not being done and causing further errors downstream
>>>>>>>>>>>> (say, if a game is doing AI/physics on the GPU, to say nothing
>>>>>>>>>>>> of actual GPGPU work one might be doing, like AI)
>>>>>>>>>>> Even worse if this is compute on eg. OpenCL for something
>>>>>>>>>>> science/math/whatever related, or training a model.
>>>>>>>>>>>
>>>>>>>>>>> You could randomly just get invalid/wrong results without even
>>>>>>>>>>> knowing!
>>>>>>>>>> Well on the kernel side we do provide an API to query the
>>>>>>>>>> result of
>>>>>>>>>> a submission. That includes canceling submissions with a soft
>>>>>>>>>> recovery.
>>>>>>>>>>
>>>>>>>>>> What we just don't do is prevent further submissions from this
>>>>>>>>>> application, e.g. enforcing that the application is punished for
>>>>>>>>>> bad behavior.
>>>>>>>>> You do prevent future submissions for regular resets though: Those
>>>>>>>>> increase karma which sets ctx->guilty, and if ctx->guilty then
>>>>>>>>> -ECANCELED is returned for a submission.
>>>>>>>>>
>>>>>>>>> ctx->guilty is never true for soft recovery though, as it doesn't
>>>>>>>>> increase karma, which is the problem this patch is trying to
>>>>>>>>> solve.
>>>>>>>>>
>>>>>>>>> By the submission result query API, I assume you mean checking
>>>>>>>>> the submission fence error somehow? That doesn't seem very
>>>>>>>>> ergonomic
>>>>>>>>> for a Vulkan driver compared to the simple solution which is to
>>>>>>>>> just
>>>>>>>>> mark it as guilty with what already exists...
>>>>>>>> Well as I said the guilty handling is broken for quite a number of
>>>>>>>> reasons.
>>>>>>>>
>>>>>>>> What we can do rather trivially is changing this code in
>>>>>>>> amdgpu_job_prepare_job():
>>>>>>>>
>>>>>>>> /* Ignore soft recovered fences here */
>>>>>>>> r = drm_sched_entity_error(s_entity);
>>>>>>>> if (r && r != -ENODATA)
>>>>>>>> goto error;
>>>>>>>>
>>>>>>>> This will bubble up errors from soft recoveries into the entity as
>>>>>>>> well and makes sure that further submissions are rejected.
>>>>>>> That makes sense to do, but at least for GL_EXT_robustness, that
>>>>>>> will
>>>>>>> not tell the app that it was guilty.
>>>>>> No, it clearly gets that signaled. We should probably replace the
>>>>>> guilty
>>>>>> atomic with calls to drm_sched_entity_error().
>>>>>>
>>>>>> It's just that this isn't what Marek and I had in mind for this,
>>>>>> basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
>>>>>> AMDGPU_CTX_OP_QUERY_STATE2.
>>>>>>
>>>>>> Instead just look at the return value of the CS or query fence
>>>>>> result IOCTL.
>>>>>>
>>>>>> When you get an -ENODATA you have been guilty of causing a soft
>>>>>> recovery, when you get an -ETIME you are guilty of causing a timeout
>>>>>> which had to be hard recovered. When you get an -ECANCELED you are an
>>>>>> innocent victim of a hard recovery somebody else caused.
>>>>>>
>>>>>> What we haven't defined yet is an error code for losing VRAM, but
>>>>>> that
>>>>>> should be trivial to do.
>>>>> So far we have implemented the GPU reset and soft reset, but we
>>>>> haven't done anything to have a robust system recovery. Under the
>>>>> current system, things can easily keep hanging indefinitely because
>>>>> nothing prevents that.
>>>>>
>>>>> The reset status query should stay. Robust apps will use it to tell
>>>>> when they should recreate their context and resources even if they
>>>>> don't submit anything. Let's fully trust robust apps here. In the
>>>>> future we might change our mind about that, but for now, let's just
>>>>> focus on API conformance, and later we can change it as long as we
>>>>> stay API conformant.
>>>>>
>>>>> Non-robust apps must be terminated when they hang or are innocent but
>>>>> affected. Their existence is a security and usability problem and a
>>>>> source of frustrations for users. 100% guaranteed system recovery is
>>>>> impossible if they continue to live.
>>>>>
>>>>> IBs should be rejected for all guilty and affected innocent contexts
>>>>> unconditionally, both robust and non-robust ones, by the kernel.
>>>>> Userspace only forwards the reset status to apps for robust contexts
>>>>> and doesn't do anything else, but userspace may decide to terminate
>>>>> the process if any non-robust context is affected.
>>>>
>>>> Yeah, that absolutely works for me.
>>>>
>>>> Going to adjust the implementation accordingly.
>>>
>>> Awesome, please CC me when you have something.
>>>
>>> In the short term I have changed if (r && r != -ENODATA) to if (r)
>>> and that seems to work nicely for me.
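>>>
>>> For reference, that stopgap is just dropping the -ENODATA exception
>>> from the check quoted earlier in amdgpu_job_prepare_job, roughly:
>>>
>>>      /* Bubble up all entity errors, including soft recoveries */
>>>      r = drm_sched_entity_error(s_entity);
>>>      if (r)
>>>          goto error;
>>>
>>> (Not the full rework Christian described, just the minimal change.)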
>>
>> One problem with solely relying on the CS submission return value from
>> userspace is cancelled syncobj waits.
>>
>> For example, if we have an application with one thread that makes a
>> submission, and then kicks off a vkWaitSemaphores to wait on a
>> semaphore on another thread and that submission hangs, the syncobj
>> relating to the vkWaitSemaphores should be signalled, which is fine,
>> but we need to return VK_ERROR_DEVICE_LOST if the context loss
>> resulted in the signal for the VkSemaphore.
>>
>> The way this was previously integrated was with the query thing, which
>> as we have established does not provide correct information regarding
>> soft recovery at the moment.
>>
>> Unless you have an alternative for us to get some error out of the
>> syncobj (eg. -ENODATA), then right now we still require the query.
>>
>> I think fixing the -ENODATA reporting back for submit is a good step,
>> but I believe we still need the query to report the same information
>> as we would have gotten from that here.
>
> Hmmm, actually the spec states that VK_SUCCESS is valid in this situation:
>
> Commands that wait indefinitely for device execution (namely
> vkDeviceWaitIdle, vkQueueWaitIdle, vkWaitForFences with a maximum
> timeout, and vkGetQueryPoolResults with the VK_QUERY_RESULT_WAIT_BIT bit
> set in flags) must return in finite time even in the case of a lost
> device, and return either VK_SUCCESS or VK_ERROR_DEVICE_LOST.
>
> ...
>
>
> Once a device is lost, command execution may fail, and certain commands
> that return a VkResult may return VK_ERROR_DEVICE_LOST.
I guess for now, disregard my last email regarding us potentially
needing the query; it does seem that returning VK_SUCCESS is completely valid.
- Joshie 🐸✨
>
> - Joshie 🐸✨
>
>>
>> Thanks
>>
>> - Joshie 🐸✨
>>
>>>
>>> - Joshie 🐸✨
>>>
>>>>
>>>> Christian.
>>>>
>>>>>
>>>>>
>>>>> Marek
>>>>
>>>