[PATCH 3/4] drm/amdgpu: add reset register trace dump function

Christian König christian.koenig at amd.com
Mon Jan 24 17:07:22 UTC 2022



Am 24.01.22 um 18:00 schrieb Sharma, Shashank:
>
>
> On 1/24/2022 5:48 PM, Christian König wrote:
>> Am 24.01.22 um 17:45 schrieb Sharma, Shashank:
>>> Hello Christian,
>>> Thank for your comments, please fine mine inline:
>>>
>>> On 1/24/2022 8:15 AM, Christian König wrote:
>>>> Am 21.01.22 um 21:34 schrieb Sharma, Shashank:
>>>>> From 1c5c552eeddaffd9fb3e7d45ece1b2b28fccc575 Mon Sep 17 00:00:00 
>>>>> 2001
>>>>> From: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>>>>> Date: Fri, 21 Jan 2022 14:19:10 +0530
>>>>> Subject: [PATCH 3/4] drm/amdgpu: add reset register trace dump 
>>>>> function for
>>>>>  gfx_v10_0
>>>>>
>>>>> Implementation of register trace dump function on the AMD GPU resets
>>>>>
>>>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>>>>> ---
>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h |  8 ++++
>>>>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c    | 53 
>>>>> ++++++++++++++++++++++-
>>>>>  drivers/gpu/drm/amd/include/amd_shared.h  |  1 +
>>>>>  3 files changed, 60 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>> index d855cb53c7e0..c97b53b54333 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>> @@ -537,6 +537,14 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
>>>>>                __entry->seqno)
>>>>>  );
>>>>>
>>>>> +TRACE_EVENT(gfx_v10_0_reset_reg_dumps,
>>>>> +        TP_PROTO(char *reg_dumps),
>>>>> +        TP_ARGS(reg_dumps),
>>>>> +        TP_STRUCT__entry(__string(dumps, reg_dumps)),
>>>>> +        TP_fast_assign(__assign_str(dumps, reg_dumps);),
>>>>> +        TP_printk("amdgpu register dump {%s}", __get_str(dumps))
>>>>> +);
>>>>> +
>>>>>  #undef AMDGPU_JOB_GET_TIMELINE_NAME
>>>>>  #endif
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>>>> index 16dbe593cba2..05974ed5416d 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>>>> @@ -46,7 +46,7 @@
>>>>>  #include "v10_structs.h"
>>>>>  #include "gfx_v10_0.h"
>>>>>  #include "nbio_v2_3.h"
>>>>> -
>>>>> +#include "amdgpu_trace.h"
>>>>>  /*
>>>>>   * Navi10 has two graphic rings to share each graphic pipe.
>>>>>   * 1. Primary ring
>>>>> @@ -188,6 +188,12 @@
>>>>>  #define RLCG_ERROR_REPORT_ENABLED(adev) \
>>>>>      (amdgpu_sriov_reg_indirect_mmhub(adev) || 
>>>>> amdgpu_sriov_reg_indirect_gc(adev))
>>>>>
>>>>> +#define N_REGS (17)
>>>>> +#define DUMP_REG(addr) do {                    \
>>>>> +        (dump)[i][0] = (addr);                \
>>>>> +        (dump)[i++][1] =  RREG32_SOC15_IP(GC, addr);    \
>>>>> +    } while (0)
>>>>
>>>> Those macros need an AMDGPU_ prefix and maybe some better names.
>>>
>>> Agree, @Amar pls check this out.
>>>>
>>>>  From the design POV I'm really wondering if it wouldn't be better 
>>>> if userspace defines the registers we want to dump in case of a crash.
>>>>
>>>
>>> I am not so sure about it. This means we have to communicate with 
>>> the userspace and get the list of registers it wishes to see, do I 
>>> smell a new IOCTL or DRM property ? I also feel that we have better 
>>> filtering tools for a trace event in userspace than in kernel.
>>>
>>> What do you think ?
>>
>> Just a writeable debugfs file should probably do it. We have a list 
>> of known registers filled in during boot (just reg offsets) and 
>> userspace can read/write the file to update it.
>>
>
> Ok, so in this case, what would be our state machine ?
> 1. gpu_reset happened
> 2. uevent_sent from work queue
> 3. userspace goes and writes into debugfs files
> 4. kernel collects the desired register values
>
> 5 (a). kernel send the trace_event with desired values
> 6 (a). kernel sends another uevent to indicate trace_event
>
> Or
>
> 5 (a+b). kernel loads the reg values via debugfs itself.
>
> In both of these case, wouldn't it be too late to collect register 
> values by the time userspace list arrives at kernel ?
>
> Or do you have a better state machine in mind ?

1. system boots, kernel loads a default register list to dump.
2. Userspace optionally updates this register list.
....
3. GPU reset happens, we dump the registers into the trace log.
4. We send udev event to signal the GPU reset to umr or other tool to 
save and store the trace log.

Regards,
Christian.


More information about the amd-gfx mailing list