Fw: [PATCH] drm/amdgpu: add amdgpu_timeout_ring_* file to debugfs

Christian König ckoenig.leichtzumerken at gmail.com
Thu Jun 15 09:14:28 UTC 2023


Am 15.06.23 um 10:55 schrieb Nicolai Hähnle:
> On Thu, Jun 15, 2023 at 9:47 AM Christian König
> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>> Uh, that's very dangerous what you do here and wouldn't work in a whole
>>>> bunch of cases.
>>> Please elaborate: *what* case doesn't work?
>> The core memory management can wait at any time for the GPU reset to finish.
>>
>> If we set the timeout to infinity we risk just deadlocking the kernel.
> Okay, thanks. I may have seen some aspect of this before in cases
> where GPU reset failed and left processes in an unkillable state.
>
> I'll be honest, I've seen my fair share of exotic GPU hangs and to put
> it mildly I'm not impressed by the kernel's handling of them.

Yeah, and I completely agree with you. The whole situation around that 
is just horrible.

> Obviously you know much more about the intricacies of kernel memory
> management than I do, but the fact that processes can end up in an
> unkillable state for *any* GPU-related reason feels to me like the
> result of a bad design decision somewhere.
>
> But anyway, I'm not even asking you to fix those problems. All I'm
> asking you is to let me do *my* job, part of which is to help prevent
> GPU hangs from happening in the first place. For that, I need useful
> debugging facilities -- and so do others.
>
>
>>> Again, being able to disable GPU recovery is a crucial debugging
>>> feature. We need to be able to inspect the live state of hung shaders,
>>> and we need to be able to single-step through shaders. All of that
>>> requires disabling GPU recovery.
>> Yeah, I'm perfectly aware of that. The problem is this is just *not*
>> supported on Linux for graphics shaders.
>>
>> What you can do is run the shader with something like CWSR enabled
>> (or a yet-to-be-invented graphics equivalent). Since we are debugging
>> the shader anyway, that should be possible, I think.
>>
>>> Forcing people to reboot just to be able to disable GPU recovery for
>>> debugging is developer hostile.
>> Well, I think you misunderstood me. The suggestion was even to force
>> them to re-compile the kernel driver to disable GPU recovery.
>>
>> Disabling GPU recovery is *not* something you can do and expect the
>> system to be stable.
>>
>> The only case we can do that is when we attach a JTAG debugger in an AMD
>> lab.
> You're being *completely* unreasonable here. Even Windows(!) allows
> disabling GPU recovery at runtime from software, and Windows is
> usually far more developer hostile than Linux in these things.
> Seriously, this level of hostility against developers coming from you
> is not okay.

Well, I'm not hostile toward developers, just realistic that this 
will lead to even more problems.

And I would rather avoid problems with end users than with developers, 
because the latter are usually the more skilled people.

As far as I can see, the fact that Windows allowed disabling GPU 
recovery was actually the source of the problem, and it is no argument 
for repeating the same mistake on Linux.

> Yes, it's a tool that has sharp edges. That is perfectly well
> understood. If we need to add warning labels then so be it. And if the
> details of *how* to change the timeout or disable GPU recovery at
> runtime should be changed, that too is fine. But it's an important
> tool. Can we please just move forward on this in a pragmatic fashion?

Yes, and because of this I'm rejecting this approach here.

Rebooting to disable GPU reset is perfectly fine and reasonable for a 
developer to do.

As I said, the requirement at the other extreme is to make this a 
compile-time-only option, and I'm trying to avoid that as well.

Regards,
Christian.

>
> Thanks,
> Nicolai
>
>
>> Regards,
>> Christian.
>>
>>> So again, if there really are cases where this "doesn't work" (and
>>> those cases aren't just that your desktop will freeze -- that part is
>>> intentional), then let's talk through it and see how to address them.
>>>
>>> Thanks,
>>> Nicolai
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 32 +++++++++++++++++++++++-
>>>>>     1 file changed, 31 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> index dc474b809604..32d223daa789 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> @@ -471,35 +471,65 @@ static ssize_t amdgpu_debugfs_ring_read(struct file *f, char __user *buf,
>>>>>
>>>>>          return result;
>>>>>     }
>>>>>
>>>>>     static const struct file_operations amdgpu_debugfs_ring_fops = {
>>>>>          .owner = THIS_MODULE,
>>>>>          .read = amdgpu_debugfs_ring_read,
>>>>>          .llseek = default_llseek
>>>>>     };
>>>>>
>>>>> +static int amdgpu_debugfs_timeout_ring_get(void *data, u64 *val) {
>>>>> +     struct amdgpu_ring *ring = data;
>>>>> +
>>>>> +     if (ring->sched.timeout == MAX_SCHEDULE_TIMEOUT)
>>>>> +             *val = 0;
>>>>> +     else
>>>>> +             *val = jiffies_to_msecs(ring->sched.timeout);
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +static int amdgpu_debugfs_timeout_ring_set(void *data, u64 val) {
>>>>> +     struct amdgpu_ring *ring = data;
>>>>> +
>>>>> +     if (val == 0)
>>>>> +             ring->sched.timeout = MAX_SCHEDULE_TIMEOUT;
>>>>> +     else
>>>>> +             ring->sched.timeout = msecs_to_jiffies(val);
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_timeout_ring_fops,
>>>>> +                      amdgpu_debugfs_timeout_ring_get,
>>>>> +                      amdgpu_debugfs_timeout_ring_set,
>>>>> +                      "%llu\n");
>>>>> +
>>>>>     #endif
>>>>>
>>>>>     void amdgpu_debugfs_ring_init(struct amdgpu_device *adev,
>>>>>                                struct amdgpu_ring *ring)
>>>>>     {
>>>>>     #if defined(CONFIG_DEBUG_FS)
>>>>>          struct drm_minor *minor = adev_to_drm(adev)->primary;
>>>>>          struct dentry *root = minor->debugfs_root;
>>>>> -     char name[32];
>>>>> +     char name[40];
>>>>>
>>>>>          sprintf(name, "amdgpu_ring_%s", ring->name);
>>>>>          debugfs_create_file_size(name, S_IFREG | S_IRUGO, root, ring,
>>>>>                                   &amdgpu_debugfs_ring_fops,
>>>>>                                   ring->ring_size + 12);
>>>>>
>>>>> +     sprintf(name, "amdgpu_timeout_ring_%s", ring->name);
>>>>> +     debugfs_create_file(name, S_IFREG | S_IRUGO | S_IWUSR, root, ring,
>>>>> +                         &amdgpu_debugfs_timeout_ring_fops);
>>>>>     #endif
>>>>>     }
>>>>>
>>>>>     /**
>>>>>      * amdgpu_ring_test_helper - tests ring and set sched readiness status
>>>>>      *
>>>>>      * @ring: ring to try the recovery on
>>>>>      *
>>>>>      * Tests ring and set sched readiness status
>>>>>      *
>


