Fw: [PATCH] drm/amdgpu: add amdgpu_timeout_ring_* file to debugfs
Christian König
ckoenig.leichtzumerken at gmail.com
Thu Jun 15 09:14:28 UTC 2023
On 15.06.23 10:55, Nicolai Hähnle wrote:
> On Thu, Jun 15, 2023 at 9:47 AM Christian König
> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>> Uh, what you're doing here is very dangerous and wouldn't work in a whole
>>>> bunch of cases.
>>> Please elaborate: *what* case doesn't work?
>> The core memory management can wait at any time for the GPU reset to finish.
>>
>> If we set the timeout to infinity we risk just deadlocking the kernel.
> Okay, thanks. I may have seen some aspect of this before in cases
> where GPU reset failed and left processes in an unkillable state.
>
> I'll be honest, I've seen my fair share of exotic GPU hangs and to put
> it mildly I'm not impressed by the kernel's handling of them.
Yeah, and I completely agree with you. The whole situation around that
is just horrible.
> Obviously you know much more about the intricacies of kernel memory
> management than I do, but the fact that processes can end up in an
> unkillable state for *any* GPU-related reason feels to me like the
> result of a bad design decision somewhere.
>
> But anyway, I'm not even asking you to fix those problems. All I'm
> asking you is to let me do *my* job, part of which is to help prevent
> GPU hangs from happening in the first place. For that, I need useful
> debugging facilities -- and so do others.
>
>
>>> Again, being able to disable GPU recovery is a crucial debugging
>>> feature. We need to be able to inspect the live state of hung shaders,
>>> and we need to be able to single-step through shaders. All of that
>>> requires disabling GPU recovery.
>> Yeah, I'm perfectly aware of that. The problem is this is just *not*
>> supported on Linux for graphics shaders.
>>
>> What you can do is to run the shader with something like CWSR enabled
>> (or a yet-to-be-invented graphics equivalent). Since we are debugging the
>> shader anyway that should be possible I think.
>>
>>> Forcing people to reboot just to be able to disable GPU recovery for
>>> debugging is developer hostile.
>> Well, I think you misunderstood me. The suggestion was even to force
>> them to re-compile the kernel driver to disable GPU recovery.
>>
>> Disabling GPU recovery is *not* something you can do and expect the
>> system to be stable.
>>
>> The only case we can do that is when we attach a JTAG debugger in an AMD
>> lab.
> You're being *completely* unreasonable here. Even Windows(!) allows
> disabling GPU recovery at runtime from software, and Windows is
> usually far more developer hostile than Linux in these things.
> Seriously, this level of hostility against developers coming from you
> is not okay.
Well, I'm not hostile towards developers, just realistic that this
will lead to even more problems.
And I would rather avoid problems with end users than with developers,
because the latter are usually the more skilled people.
As far as I can see, the fact that Windows allowed disabling GPU
recovery was actually the source of the problem, and it is no argument
for repeating the same mistake on Linux.
> Yes, it's a tool that has sharp edges. That is perfectly well
> understood. If we need to add warning labels then so be it. And if the
> details of *how* to change the timeout or disable GPU recovery at
> runtime should be changed, that too is fine. But it's an important
> tool. Can we please just move forward on this in a pragmatic fashion?
Yes, and because of this I'm rejecting this approach here.
Rebooting to disable GPU reset is perfectly fine and reasonable to do
for a developer.
As I said, the requirement I have at the other extreme is to make this a
compile-time-only option, and I'm trying to avoid that as well.
Regards,
Christian.
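
[For reference, the reboot-time knob alluded to here is presumably the
existing amdgpu.lockup_timeout module parameter; the semantics sketched
below are from memory of the parameter description and should be
verified with `modinfo amdgpu` on the kernel in question:]

```shell
# Set at boot on the kernel command line, or at module load time.
# A negative value is documented as "infinity timeout" (no job timeout,
# so no automatic recovery); 0 keeps the driver default.
#   amdgpu.lockup_timeout=-1
# Per-engine form on bare metal: [GFX,Compute,SDMA,Video]
#   amdgpu.lockup_timeout=-1,-1,-1,-1
# Or when (re)loading the module:
sudo modprobe amdgpu lockup_timeout=-1
```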
>
> Thanks,
> Nicolai
>
>
>> Regards,
>> Christian.
>>
>>> So again, if there really are cases where this "doesn't work" (and
>>> those cases aren't just that your desktop will freeze -- that part is
>>> intentional), then let's talk through it and see how to address them.
>>>
>>> Thanks,
>>> Nicolai
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 32 +++++++++++++++++++++++-
>>>>> 1 file changed, 31 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> index dc474b809604..32d223daa789 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> @@ -471,35 +471,65 @@ static ssize_t amdgpu_debugfs_ring_read(struct file *f, char __user *buf,
>>>>>
>>>>> return result;
>>>>> }
>>>>>
>>>>> static const struct file_operations amdgpu_debugfs_ring_fops = {
>>>>> .owner = THIS_MODULE,
>>>>> .read = amdgpu_debugfs_ring_read,
>>>>> .llseek = default_llseek
>>>>> };
>>>>>
>>>>> +static int amdgpu_debugfs_timeout_ring_get(void *data, u64 *val) {
>>>>> + struct amdgpu_ring *ring = data;
>>>>> +
>>>>> + if (ring->sched.timeout == MAX_SCHEDULE_TIMEOUT)
>>>>> + *val = 0;
>>>>> + else
>>>>> + *val = jiffies_to_msecs(ring->sched.timeout);
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static int amdgpu_debugfs_timeout_ring_set(void *data, u64 val) {
>>>>> + struct amdgpu_ring *ring = data;
>>>>> +
>>>>> + if (val == 0)
>>>>> + ring->sched.timeout = MAX_SCHEDULE_TIMEOUT;
>>>>> + else
>>>>> + ring->sched.timeout = msecs_to_jiffies(val);
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_timeout_ring_fops,
>>>>> + amdgpu_debugfs_timeout_ring_get,
>>>>> + amdgpu_debugfs_timeout_ring_set,
>>>>> + "%llu\n");
>>>>> +
>>>>> #endif
>>>>>
>>>>> void amdgpu_debugfs_ring_init(struct amdgpu_device *adev,
>>>>> struct amdgpu_ring *ring)
>>>>> {
>>>>> #if defined(CONFIG_DEBUG_FS)
>>>>> struct drm_minor *minor = adev_to_drm(adev)->primary;
>>>>> struct dentry *root = minor->debugfs_root;
>>>>> - char name[32];
>>>>> + char name[40];
>>>>>
>>>>> sprintf(name, "amdgpu_ring_%s", ring->name);
>>>>> debugfs_create_file_size(name, S_IFREG | S_IRUGO, root, ring,
>>>>> &amdgpu_debugfs_ring_fops,
>>>>> ring->ring_size + 12);
>>>>>
>>>>> + sprintf(name, "amdgpu_timeout_ring_%s", ring->name);
>>>>> + debugfs_create_file(name, S_IFREG | S_IRUGO | S_IWUSR, root, ring,
>>>>> + &amdgpu_debugfs_timeout_ring_fops);
>>>>> #endif
>>>>> }
>>>>>
>>>>> /**
>>>>> * amdgpu_ring_test_helper - tests ring and set sched readiness status
>>>>> *
>>>>> * @ring: ring to try the recovery on
>>>>> *
>>>>> * Tests ring and set sched readiness status
>>>>> *
>