[PATCH 3/4] drm/amdgpu: add reset register trace dump function

Sharma, Shashank shashank.sharma at amd.com
Mon Jan 24 17:10:10 UTC 2022



On 1/24/2022 6:07 PM, Christian König wrote:
> 
> 
> Am 24.01.22 um 18:00 schrieb Sharma, Shashank:
>>
>>
>> On 1/24/2022 5:48 PM, Christian König wrote:
>>> Am 24.01.22 um 17:45 schrieb Sharma, Shashank:
>>>> Hello Christian,
>>>> Thank for your comments, please fine mine inline:
>>>>
>>>> On 1/24/2022 8:15 AM, Christian König wrote:
>>>>> Am 21.01.22 um 21:34 schrieb Sharma, Shashank:
>>>>>> From 1c5c552eeddaffd9fb3e7d45ece1b2b28fccc575 Mon Sep 17 00:00:00 
>>>>>> 2001
>>>>>> From: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>>>>>> Date: Fri, 21 Jan 2022 14:19:10 +0530
>>>>>> Subject: [PATCH 3/4] drm/amdgpu: add reset register trace dump 
>>>>>> function for
>>>>>>  gfx_v10_0
>>>>>>
>>>>>> Implementation of register trace dump function on the AMD GPU resets
>>>>>>
>>>>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>>>>>> ---
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h |  8 ++++
>>>>>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c    | 53 
>>>>>> ++++++++++++++++++++++-
>>>>>>  drivers/gpu/drm/amd/include/amd_shared.h  |  1 +
>>>>>>  3 files changed, 60 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>> index d855cb53c7e0..c97b53b54333 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>> @@ -537,6 +537,14 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
>>>>>>                __entry->seqno)
>>>>>>  );
>>>>>>
>>>>>> +TRACE_EVENT(gfx_v10_0_reset_reg_dumps,
>>>>>> +        TP_PROTO(char *reg_dumps),
>>>>>> +        TP_ARGS(reg_dumps),
>>>>>> +        TP_STRUCT__entry(__string(dumps, reg_dumps)),
>>>>>> +        TP_fast_assign(__assign_str(dumps, reg_dumps);),
>>>>>> +        TP_printk("amdgpu register dump {%s}", __get_str(dumps))
>>>>>> +);
>>>>>> +
>>>>>>  #undef AMDGPU_JOB_GET_TIMELINE_NAME
>>>>>>  #endif
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>>>>> index 16dbe593cba2..05974ed5416d 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>>>>> @@ -46,7 +46,7 @@
>>>>>>  #include "v10_structs.h"
>>>>>>  #include "gfx_v10_0.h"
>>>>>>  #include "nbio_v2_3.h"
>>>>>> -
>>>>>> +#include "amdgpu_trace.h"
>>>>>>  /*
>>>>>>   * Navi10 has two graphic rings to share each graphic pipe.
>>>>>>   * 1. Primary ring
>>>>>> @@ -188,6 +188,12 @@
>>>>>>  #define RLCG_ERROR_REPORT_ENABLED(adev) \
>>>>>>      (amdgpu_sriov_reg_indirect_mmhub(adev) || 
>>>>>> amdgpu_sriov_reg_indirect_gc(adev))
>>>>>>
>>>>>> +#define N_REGS (17)
>>>>>> +#define DUMP_REG(addr) do {                    \
>>>>>> +        (dump)[i][0] = (addr);                \
>>>>>> +        (dump)[i++][1] =  RREG32_SOC15_IP(GC, addr);    \
>>>>>> +    } while (0)
>>>>>
>>>>> Those macros need an AMDGPU_ prefix and maybe some better names.
>>>>
>>>> Agree, @Amar pls check this out.
>>>>>
>>>>>  From the design POV I'm really wondering if it wouldn't be better 
>>>>> if userspace defines the registers we want to dump in case of a crash.
>>>>>
>>>>
>>>> I am not so sure about it. This means we have to communicate with 
>>>> the userspace and get the list of registers it wishes to see, do I 
>>>> smell a new IOCTL or DRM property ? I also feel that we have better 
>>>> filtering tools for a trace event in userspace than in kernel.
>>>>
>>>> What do you think ?
>>>
>>> Just a writeable debugfs file should probably do it. We have a list 
>>> of known registers filled in during boot (just reg offsets) and 
>>> userspace can read/write the file to update it.
>>>
>>
>> Ok, so in this case, what would be our state machine ?
>> 1. gpu_reset happened
>> 2. uevent_sent from work queue
>> 3. userspace goes and writes into debugfs files
>> 4. kernel collects the desired register values
>>
>> 5 (a). kernel send the trace_event with desired values
>> 6 (a). kernel sends another uevent to indicate trace_event
>>
>> Or
>>
>> 5 (a+b). kernel loads the reg values via debugfs itself.
>>
>> In both of these case, wouldn't it be too late to collect register 
>> values by the time userspace list arrives at kernel ?
>>
>> Or do you have a better state machine in mind ?
> 
> 1. system boots, kernel loads a default register list to dump.
> 2. Userspace optionally updates this register list.
> ....
> 3. GPU reset happens, we dump the registers into the trace log.
> 4. We send udev event to signal the GPU reset to umr or other tool to 
> save and store the trace log.
> 

Got it, let us rework in those lines.
- Shashank

> Regards,
> Christian.


More information about the amd-gfx mailing list