[RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

André Almeida andrealmeid at igalia.com
Wed May 3 19:14:11 UTC 2023


On 03/05/2023 14:43, Timur Kristóf wrote:
> Hi Felix,
> 
> On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
>> That's the worst-case scenario where you're debugging HW or FW
>> issues.
>> Those should be pretty rare post-bringup. But are there hangs caused
>> by
>> user mode driver or application bugs that are easier to debug and
>> probably don't even require a GPU reset?
> 
> There are many GPU hangs that gamers experience while playing. We have
> dozens of open bug reports against RADV about GPU hangs on various GPU
> generations. These usually fall into two categories:
> 
> 1. When the hang always happens at the same point in a game. These are
> painful to debug but manageable.
> 2. "Random" hangs that happen to users over the course of playing a
> game for several hours. It is absolute hell to try to even reproduce
> let alone diagnose these issues, and this is what we would like to
> improve.
> 
> For these hard-to-diagnose problems, it is already a challenge to
> determine whether the problem is in the kernel (e.g. setting wrong
> voltages / frequencies) or in userspace (e.g. missing some
> synchronization); it can even be a game bug that we need to work around.
> 
>> For example most VM faults can
>> be handled without hanging the GPU. Similarly, a shader in an endless
>> loop should not require a full GPU reset.
> 
> This is actually not the case. AFAIK André's test case was an app that
> had an infinite loop in a shader.
> 

This is the test app, if anyone wants to try it out:
https://github.com/andrealmeid/vulkan-triangle-v1. Just compile and run.

The kernel calls amdgpu_ring_soft_recovery() when I run my example, but
I'm not sure what a soft recovery means here, or whether it amounts to a
full GPU reset.

But if we can at least trust the CP registers to dump information for
soft resets, I think that would be an improvement over the current state.

>>
>> It's more complicated for graphics because of the more complex pipeline
>> and the lack of CWSR. But it should still be possible to do some
>> debugging without JTAG if the problem is in SW and not HW or FW. It's
>> probably worth improving that debuggability without getting hung up on
>> the worst case.
> 
> I agree, and we welcome any constructive suggestion to improve the
> situation. It seems like our idea doesn't work if the kernel can't give
> us the information we need.
> 
> How do we move forward?
> 
> Best regards,
> Timur
> 

