[PATCH v2] drm/amdgpu: add ring timeout information in devcoredump

Tue Mar 5 13:20:35 UTC 2024

Am 05.03.24 um 14:17 schrieb Khatri, Sunil:
>
> On 3/5/2024 6:40 PM, Christian König wrote:
>> Am 05.03.24 um 12:58 schrieb Sunil Khatri:
>>> Add ring timeout related information in the amdgpu
>>> devcoredump file for debugging purposes.
>>>
>>> During the gpu recovery process the registered call
>>> is triggered and add the debug information in data
>>> file created by devcoredump framework under the
>>> directory /sys/class/devcoredump/devcdx/
>>>
>>> Signed-off-by: Sunil Khatri <sunil.khatri at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 +++++++++++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  2 ++
>>>   2 files changed, 17 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> index a59364e9b6ed..aa7fed59a0d5 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> @@ -196,6 +196,13 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
>>> offset, size_t count,
>>>                  coredump->reset_task_info.process_name,
>>>                  coredump->reset_task_info.pid);
>>>   +    if (coredump->ring_timeout) {
>>> +        drm_printf(&p, "\nRing timed out details\n");
>>> +        drm_printf(&p, "IP Type: %d Ring Name: %s \n",
>>> +                coredump->ring->funcs->type,
>>> +                coredump->ring->name);
>>> +    }
>>> +
>>>       if (coredump->reset_vram_lost)
>>>           drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>>>       if (coredump->adev->reset_info.num_regs) {
>>> @@ -220,6 +227,8 @@ void amdgpu_coredump(struct amdgpu_device *adev, 
>>> bool vram_lost,
>>>   {
>>>       struct amdgpu_coredump_info *coredump;
>>>       struct drm_device *dev = adev_to_drm(adev);
>>> +    struct amdgpu_job *job = reset_context->job;
>>> +    struct drm_sched_job *s_job;
>>>         coredump = kzalloc(sizeof(*coredump), GFP_NOWAIT);
>>>   @@ -228,6 +237,12 @@ void amdgpu_coredump(struct amdgpu_device 
>>> *adev, bool vram_lost,
>>>           return;
>>>       }
>>>   +    if (job) {
>>> +        s_job = &job->base;
>>> +        coredump->ring = to_amdgpu_ring(s_job->sched);
>>> +        coredump->ring_timeout = TRUE;
>>> +    }
>>> +
>>>       coredump->reset_vram_lost = vram_lost;
>>>         if (reset_context->job && reset_context->job->vm) {
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>> index 19899f6b9b2b..6d67001a1057 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>> @@ -97,6 +97,8 @@ struct amdgpu_coredump_info {
>>>       struct amdgpu_task_info         reset_task_info;
>>>       struct timespec64               reset_time;
>>>       bool                            reset_vram_lost;
>>> +    struct amdgpu_ring          *ring;
>>> +    bool                            ring_timeout;
>>
>> I think you can drop ring_timeout, just having ring as optional 
>> information should be enough.
>>
>> Apart from that looks pretty good I think.
>>
> - GPU reset could happen due to two possibilities atleast: 1. via 
> sysfs cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover there is no 
> timeout or page fault here. In this case we need information if 
> ringtimeout happened or not else it will try to print empty 
> information in devcoredump. Same goes for pagefault also in that case 
> also we need to see if recovery ran due to pagefault and then only add 
> that information.
>
> So to cover all use cases i added this parameter.

Yeah, but you don't need that parameter.

If a GPU reset is triggered by a ring timeout we note the ring causing 
this, otherwise the ring is simply NULL.

There is no need for a separate variable to note that a timeout happened.

Regards,
Christian.

>
>
> Thanks
> Sunil
>
>> Regards,
>> Christian.
>>
>>>   };
>>>   #endif
>>