[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Joshua Ashton joshua at froggi.es
Tue Jan 16 17:34:19 UTC 2024



On 1/16/24 15:05, Christian König wrote:
> Am 16.01.24 um 14:48 schrieb Joshua Ashton:
>>
>> [SNIP]
>>>>>> Going to adjust the implementation accordingly.
>>>>>
>>>>> Awesome, please CC me when you have something.
> 
> Sure, going to keep that in mind.
> 
>>>>>
>>>>> In the short term I have changed if (r && r != -ENODATA) to if (r), 
>>>>> and that seems to work nicely for me.
>>>>
>>>> One problem with relying solely on the CS submission return value in 
>>>> userspace is cancelled syncobj waits.
>>>>
>>>> For example, say an application has one thread that makes a 
>>>> submission and another thread that kicks off a vkWaitSemaphores to 
>>>> wait on a semaphore from that submission. If the submission hangs, 
>>>> the syncobj relating to the vkWaitSemaphores should still be 
>>>> signalled, which is fine, but we need to return VK_ERROR_DEVICE_LOST 
>>>> if the context loss is what caused the VkSemaphore to signal.
>>>>
>>>> The way this was previously integrated was with the query thing, 
>>>> which, as we have established, does not provide correct information 
>>>> regarding soft recovery at the moment.
>>>>
>>>> Unless you have an alternative for us to get some error out of the 
>>>> syncobj (e.g. -ENODATA), right now we still require the query.
>>>>
>>>> I think fixing the -ENODATA reporting for submit is a good step, 
>>>> but I believe we still need the query to report the same 
>>>> information as we would have gotten from the submit here.
> 
> Yeah, that was exactly one of the reasons why the guilty handling 
> approach didn't solve the problem sufficiently.
> 
> If I remember correctly, at least for the syncobjs used on Android you 
> can actually query the result of the execution after waiting for them 
> to signal.
> 
> So not only the issuer of a submission can get the result, but also 
> everybody waiting for it. In other words, Wayland, X, etc. can 
> implement graceful handling should an application send them nonsense.

That would be great. Right now the vkQueueSubmit for a hanging 
application ends up going through, as the Wayland commit has already 
been made and the fence gets signalled anyway.

This means we can get an image that is complete garbage from a hung app 
-- no DCC retile occurred for it, which is pretty funny: the scanout 
image looks like garbage, but in composite it displays whatever it last 
was. xD

Getting the fence error on the compositor side would be great, so we 
could disregard such images.
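
Roughly the kind of check a compositor could then do -- just a sketch, 
not code from any real compositor. It assumes the fence error actually 
gets forwarded as you describe, and that the compositor already holds 
the client's acquire fence as a sync_file fd. SYNC_IOC_FILE_INFO and 
struct sync_file_info are existing uapi from <linux/sync_file.h>; the 
error propagation into that status field is the part that's missing 
today:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/sync_file.h>

    /* Returns false if the acquire fence carries an error and the
     * buffer should not be presented. Assumes the fence error is
     * forwarded into the sync_file status, which is exactly what we
     * don't have yet. */
    static bool acquire_fence_is_usable(int fence_fd)
    {
            struct sync_file_info info;

            memset(&info, 0, sizeof(info));
            if (ioctl(fence_fd, SYNC_IOC_FILE_INFO, &info) < 0)
                    return true; /* can't tell, don't drop the frame */

            /* status: 1 = signalled, 0 = still active, < 0 = error */
            if (info.status < 0) {
                    fprintf(stderr, "acquire fence error %d, dropping commit\n",
                            info.status);
                    return false;
            }

            return true;
    }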

We could also do more useful device loss handling with that on the 
driver side.
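
On the driver side the mapping itself would be trivial -- again only a 
sketch with made-up names, not actual RADV code, assuming the CS ioctl 
fails once the context has been marked guilty (-ENODATA as in the 
if (r && r != -ENODATA) change above, or -ECANCELED; whichever the 
kernel ends up returning):

    #include <errno.h>
    #include <stdbool.h>

    #define VK_SUCCESS            0
    #define VK_ERROR_DEVICE_LOST  (-4)   /* value from vulkan_core.h */

    /* Map the (negative errno) return of the CS submit ioctl to a
     * VkResult-style code, latching device loss so that later submits
     * and waits report it too. Hypothetical helper, not RADV code. */
    static int map_cs_submit_result(int r, bool *device_lost)
    {
            if (r == -ENODATA || r == -ECANCELED) {
                    /* The kernel marked our context guilty / lost. */
                    *device_lost = true;
                    return VK_ERROR_DEVICE_LOST;
            }

            if (*device_lost)
                    return VK_ERROR_DEVICE_LOST;

            return r ? r : VK_SUCCESS;  /* other errors passed through */
    }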

Thanks!
- Joshie 🐸✨

> 
> I don't think we ever implemented something similar for drm_syncobj and 
> we might need error forwarding in the dma_fence_chain container, but 
> stuff like that is trivial to implement should the requirement for this 
> ever come up.
> 
>>>
>>> Hmmm, actually the spec states that VK_SUCCESS is valid in this 
>>> situation:
>>>
>>> Commands that wait indefinitely for device execution (namely 
>>> vkDeviceWaitIdle, vkQueueWaitIdle, vkWaitForFences with a maximum 
>>> timeout, and vkGetQueryPoolResults with the VK_QUERY_RESULT_WAIT_BIT 
>>> bit set in flags) must return in finite time even in the case of a 
>>> lost device, and return either VK_SUCCESS or VK_ERROR_DEVICE_LOST.
>>>
>>> ...
>>>
>>>
>>> Once a device is lost, command execution may fail, and certain 
>>> commands that return a VkResult may return VK_ERROR_DEVICE_LOST.
>>
>> I guess for now, disregard my last email regarding us potentially 
>> needing the query; it does seem that returning VK_SUCCESS is 
>> completely valid.
> 
> Well how you interpret the information the kernel gives you in userspace 
> is your thing. But we should at least make sure that the general 
> approach inside the kernel has a design which can handle those 
> requirements.
> 
> Christian.
> 
>>
>> - Joshie 🐸✨
>>
>>>
>>> - Joshie 🐸✨
>>>
>>>>
>>>> Thanks
>>>>
>>>> - Joshie 🐸✨
>>>>
>>>>>
>>>>> - Joshie 🐸✨
>>>>>
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Marek
>>>>>>
>>>>>
> 


