Fw: [PATCH] drm/amdgpu: add amdgpu_timeout_ring_* file to debugfs
Christian König
ckoenig.leichtzumerken at gmail.com
Thu Jun 15 07:47:34 UTC 2023
On 14.06.23 at 21:20, Nicolai Hähnle wrote:
> Hi Christian,
>
>>> Report the per-ring timeout in milliseconds and allow users to adjust
>>> the timeout dynamically. This can be useful for debugging, e.g. to more
>>> easily test whether a submission genuinely hangs or is just taking very
>>> long, and to temporarily disable GPU recovery so that shader problems
>>> can be examined in detail, including single-stepping through shader
>>> code.
>>>
>>> It feels a bit questionable to access ring->sched.timeout without any
>>> locking -- under a C++ memory model it would technically be undefined
>>> behavior. But it's not like a lot can go wrong here in practice, and
>>> it's not clear to me what locking or atomics, if any, should be used.
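As a side note on the locking question: a minimal sketch, not part of the
patch and assuming plain READ_ONCE()/WRITE_ONCE() annotations are
acceptable here, of how the unlocked accesses could at least be made free
of torn loads and stores (this would sit in amdgpu_ring.c, which already
pulls in the relevant amdgpu headers):

/* Sketch only: annotate the unlocked accesses so the compiler cannot
 * tear or re-order the loads and stores of ring->sched.timeout. */
#include <linux/compiler.h>	/* READ_ONCE(), WRITE_ONCE() */
#include <linux/jiffies.h>	/* jiffies_to_msecs(), msecs_to_jiffies() */
#include <linux/sched.h>	/* MAX_SCHEDULE_TIMEOUT */

static int amdgpu_debugfs_timeout_ring_get(void *data, u64 *val)
{
	struct amdgpu_ring *ring = data;
	long timeout = READ_ONCE(ring->sched.timeout);

	*val = timeout == MAX_SCHEDULE_TIMEOUT ? 0 : jiffies_to_msecs(timeout);
	return 0;
}

static int amdgpu_debugfs_timeout_ring_set(void *data, u64 val)
{
	struct amdgpu_ring *ring = data;

	WRITE_ONCE(ring->sched.timeout,
		   val ? (long)msecs_to_jiffies(val) : MAX_SCHEDULE_TIMEOUT);
	return 0;
}

Whether that is enough for the scheduler's own readers of the field is a
separate question.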
>> Uh, what you are doing here is very dangerous and wouldn't work in a
>> whole bunch of cases.
> Please elaborate: *what* case doesn't work?
The core memory management can wait at any time for the GPU reset to finish.
If we set the timeout to infinity, we risk deadlocking the kernel.
See here as well: https://lpc.events/event/11/contributions/1115/
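To make that concrete, an illustrative sketch (the function below is
hypothetical, not code from the driver) of the kind of wait such a path
ends up doing; with the timeout set to infinity and a genuinely hung job,
nothing ever signals the fence and the wait never returns:

/* Illustrative only: a path that needs memory back blocks on a fence
 * that only job completion -- or the scheduler's timeout handling --
 * will ever signal. */
#include <linux/dma-fence.h>

static void example_wait_for_gpu(struct dma_fence *fence)
{
	/* Unbounded, uninterruptible wait: with recovery effectively
	 * disabled, a hung job means this never completes and whoever
	 * needed the memory is stuck behind it. */
	dma_fence_wait(fence, false);
}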
>
>
>> First of all, GPU recovery is part of normal operation and necessary for
>> system stability. So disabling GPU recovery is actually not a good idea
>> in the first place.
> That's a complete non-argument because the whole point of this is that
> it is a debugging feature. You're using this when the system as a
> whole (most likely a UMD component) is already broken in some way.
> Putting this in debugfs is not an accident.
>
>
>> We already discussed that we probably need to taint the kernel if we do
>> so, to indicate in crash logs that the system is no longer considered
>> stable. The problem was only that there was no agreement on how to do
>> this.
> I'd be happy to add kernel tainting if you tell me how.
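For reference, a sketch of what that could look like; TAINT_USER is an
assumption here, since no flag was ever agreed on:

/* Sketch only: taint the kernel the moment a debugfs write disables the
 * timeout, so crash logs show that GPU recovery was switched off.
 * TAINT_USER is an assumed choice of flag. */
#include <linux/kernel.h>	/* add_taint(), TAINT_USER, LOCKDEP_STILL_OK */

static int amdgpu_debugfs_timeout_ring_set(void *data, u64 val)
{
	struct amdgpu_ring *ring = data;

	if (val == 0) {
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
		ring->sched.timeout = MAX_SCHEDULE_TIMEOUT;
	} else {
		ring->sched.timeout = msecs_to_jiffies(val);
	}

	return 0;
}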
>
>
>> Since this now makes it even easier to disable GPU recovery, it's
>> probably not the right approach.
> Again, being able to disable GPU recovery is a crucial debugging
> feature. We need to be able to inspect the live state of hung shaders,
> and we need to be able to single-step through shaders. All of that
> requires disabling GPU recovery.
Yeah, I'm perfectly aware of that. The problem is that this is just *not*
supported on Linux for graphics shaders.
What you can do is run the shader with something like CWSR enabled (or a
yet-to-be-invented graphics equivalent). Since we are debugging the shader
anyway, that should be possible, I think.
> Forcing people to reboot just to be able to disable GPU recovery for
> debugging is developer-hostile.
Well, I think you misunderstood me. The suggestion was even to force
them to re-compile the kernel driver to disable GPU recovery.
Disabling GPU recovery is *not* something you can do and expect the
system to be stable.
The only case where we can do that is when we attach a JTAG debugger in
an AMD lab.
Regards,
Christian.
>
> So again, if there really are cases where this "doesn't work" (and
> those cases aren't just that your desktop will freeze -- that part is
> intentional), then let's talk through them and see how to address them.
>
> Thanks,
> Nicolai
>
>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 32 +++++++++++++++++++++++-
>>> 1 file changed, 31 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> index dc474b809604..32d223daa789 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> @@ -471,35 +471,65 @@ static ssize_t amdgpu_debugfs_ring_read(struct file *f, char __user *buf,
>>>
>>>  	return result;
>>>  }
>>>
>>>  static const struct file_operations amdgpu_debugfs_ring_fops = {
>>>  	.owner = THIS_MODULE,
>>>  	.read = amdgpu_debugfs_ring_read,
>>>  	.llseek = default_llseek
>>>  };
>>>
>>> +static int amdgpu_debugfs_timeout_ring_get(void *data, u64 *val) {
>>> +	struct amdgpu_ring *ring = data;
>>> +
>>> +	if (ring->sched.timeout == MAX_SCHEDULE_TIMEOUT)
>>> +		*val = 0;
>>> +	else
>>> +		*val = jiffies_to_msecs(ring->sched.timeout);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int amdgpu_debugfs_timeout_ring_set(void *data, u64 val) {
>>> +	struct amdgpu_ring *ring = data;
>>> +
>>> +	if (val == 0)
>>> +		ring->sched.timeout = MAX_SCHEDULE_TIMEOUT;
>>> +	else
>>> +		ring->sched.timeout = msecs_to_jiffies(val);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_timeout_ring_fops,
>>> +			 amdgpu_debugfs_timeout_ring_get,
>>> +			 amdgpu_debugfs_timeout_ring_set,
>>> +			 "%llu\n");
>>> +
>>>  #endif
>>>
>>>  void amdgpu_debugfs_ring_init(struct amdgpu_device *adev,
>>>  			      struct amdgpu_ring *ring)
>>>  {
>>>  #if defined(CONFIG_DEBUG_FS)
>>>  	struct drm_minor *minor = adev_to_drm(adev)->primary;
>>>  	struct dentry *root = minor->debugfs_root;
>>> -	char name[32];
>>> +	char name[40];
>>>
>>>  	sprintf(name, "amdgpu_ring_%s", ring->name);
>>>  	debugfs_create_file_size(name, S_IFREG | S_IRUGO, root, ring,
>>>  				 &amdgpu_debugfs_ring_fops,
>>>  				 ring->ring_size + 12);
>>>
>>> +	sprintf(name, "amdgpu_timeout_ring_%s", ring->name);
>>> +	debugfs_create_file(name, S_IFREG | S_IRUGO | S_IWUSR, root, ring,
>>> +			    &amdgpu_debugfs_timeout_ring_fops);
>>>  #endif
>>>  }
>>>
>>>  /**
>>>   * amdgpu_ring_test_helper - tests ring and set sched readiness status
>>>   *
>>>   * @ring: ring to try the recovery on
>>>   *
>>>   * Tests ring and set sched readiness status
>>>   *
>