[RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

Thu May 4 06:43:42 UTC 2023

On 03.05.23 at 21:14, André Almeida wrote:
> On 03/05/2023 at 14:43, Timur Kristóf wrote:
>> Hi Felix,
>>
>> On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
>>> That's the worst-case scenario where you're debugging HW or FW
>>> issues.
>>> Those should be pretty rare post-bringup. But are there hangs caused
>>> by
>>> user mode driver or application bugs that are easier to debug and
>>> probably don't even require a GPU reset?
>>
>> There are many GPU hangs that gamers experience while playing. We have
>> dozens of open bug reports against RADV about GPU hangs on various GPU
>> generations. These usually fall into two categories:
>>
>> 1. When the hang always happens at the same point in a game. These are
>> painful to debug but manageable.
>> 2. "Random" hangs that happen to users over the course of playing a
>> game for several hours. It is absolute hell to try to even reproduce
>> let alone diagnose these issues, and this is what we would like to
>> improve.
>>
>> For these hard-to-diagnose problems, it is already a challenge to
>> determine whether the problem is in the kernel (e.g. setting wrong
>> voltages / frequencies), in userspace (e.g. missing some
>> synchronization), or even a game bug that we need to work around.
>>
>>> For example most VM faults can
>>> be handled without hanging the GPU. Similarly, a shader in an endless
>>> loop should not require a full GPU reset.
>>
>> This is actually not the case. AFAIK, André's test case was an app
>> that had an infinite loop in a shader.
>>
>
> This is the test app if anyone wants to try it out:
> https://github.com/andrealmeid/vulkan-triangle-v1. Just compile and run.
>
> The kernel calls amdgpu_ring_soft_recovery() when I run my example, 
> but I'm not sure what a soft recovery means here and if it's a full 
> GPU reset or not.

That's just "soft" recovery. In other words we send the SQ a command to 
kill a shader.

That usually works for shaders which contain an endless loop (which is 
the most common application bug), but unfortunately not for any other 
problem.

>
> But if we can at least trust the CP registers to dump information for
> soft resets, it would be some improvement over the current state, I think.

Especially for endless loops, the CP registers are completely useless.
The CP just prepares the draw commands and all the state, which is then
sent to the SQ for execution.

As Marek wrote, in the kernel we know which submission has timed out, but
we can't figure out where inside this submission we are.

>
>>>
>>> It's more complicated for graphics because of the more complex
>>> pipeline
>>> and the lack of CWSR. But it should still be possible to do some
>>> debugging without JTAG if the problem is in SW and not HW or FW. It's
>>> probably worth improving that debuggability without getting hung up on
>>> the worst case.
>>
>> I agree, and we welcome any constructive suggestion to improve the
>> situation. It seems like our idea doesn't work if the kernel can't give
>> us the information we need.
>>
>> How do we move forward?

As I said the best approach to figure out which draw command hangs is to 
sprinkle WRITE_DATA commands into your command stream.

That's not so much overhead, and at least Bas thinks that this is doable
in RADV with some changes.
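
To illustrate the idea, here is a rough simulation of the breadcrumb
approach. It is only a sketch: every name in it is hypothetical, nothing
here is a real RADV or amdgpu interface, and it assumes strictly in-order
execution. On real hardware draws are pipelined, so the marker writes
would likely need extra synchronization before the last value written
reliably points at the hanging draw.

```python
# Hypothetical simulation only; none of these names are real RADV/amdgpu APIs.
from dataclasses import dataclass


@dataclass
class WriteMarker:
    """Stands in for a WRITE_DATA packet that stores `value` to memory."""
    value: int


@dataclass
class Draw:
    name: str
    hangs: bool = False  # pretend this draw runs an endless-loop shader


def build_stream(draws):
    """Put a marker before each draw; marker i + 1 means 'draw i started'."""
    stream = []
    for i, draw in enumerate(draws):
        stream.append(WriteMarker(i + 1))
        stream.append(draw)
    return stream


def execute(stream, breadcrumb):
    """Run commands in order (an assumption). Return False on a hang."""
    for cmd in stream:
        if isinstance(cmd, WriteMarker):
            breadcrumb[0] = cmd.value  # the marker lands in the buffer
        elif cmd.hangs:
            return False  # submission times out here
    return True


def find_hung_draw(draws, breadcrumb):
    """Map the last marker value back to the draw that had started."""
    marker = breadcrumb[0]
    return draws[marker - 1] if marker else None


draws = [Draw("sky"), Draw("terrain"), Draw("water", hangs=True), Draw("ui")]
breadcrumb = [0]  # one-dword buffer that survives the hang
completed = execute(build_stream(draws), breadcrumb)
suspect = None if completed else find_hung_draw(draws, breadcrumb)
```

After the simulated timeout, the breadcrumb buffer holds the marker for the
last draw that started, which narrows the search from "somewhere in this
submission" to a single draw.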

For the kernel we can certainly implement devcoredump and allow writing 
out register values and other state when a problem happens.

Regards,
Christian.

>>
>> Best regards,
>> Timur
>>