[PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

Thu Feb 10 11:59:58 UTC 2022

On 2/10/2022 6:29 AM, Somalapuram, Amaranath wrote:
> 
> On 2/9/2022 1:17 PM, Christian König wrote:
>> Am 08.02.22 um 16:28 schrieb Alex Deucher:
>>> On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
>>> <Amaranath.Somalapuram at amd.com> wrote:
>>>> Dump the list of register values to trace event on GPU reset.
>>>>
>>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 ++++++++++++++++++++-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 +++++++++++++++++++
>>>>   2 files changed, 39 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 1e651b959141..057922fb7e37 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
>>>> amdgpu_device *adev,
>>>>          return r;
>>>>   }
>>>>
>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>> +{
>>>> +       int i;
>>>> +       uint32_t reg_value[128];
>>>> +
>>>> +       for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
>>>> +               if (adev->asic_type >= CHIP_NAVI10)
>>> This check should be against CHIP_VEGA10.  Also, this only allows for
>>> GC registers.  If we wanted to dump other registers, we'd need a
>>> different macro.  Might be better to just use RREG32 here for
>>> everything and then encode the full offset using
>>> SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to think
>>> about how to handle gfxoff in this case.  gfxoff needs to be disabled
>>> or we'll hang the chip if we try and read GC or SDMA registers via
>>> MMIO which will adversely affect the hang signature.
>>
>> Well this should execute right before a GPU reset, so I think it 
>> shouldn't matter if we hang the chip or not as long as the read comes 
>> back correctly (I remember a very long UVD debug session because of 
>> this).
>>
>> But in general I agree, we should just use RREG32() here and always 
>> encode the full register offset.
>>
>> Regards,
>> Christian.
>>
> Can I use something like this:
> 
> +                       reg_value[i] = 
> RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]
> + [adev->reset_dump_reg_list[i][1]]
> + [adev->reset_dump_reg_list[i][2]])
> +                                 + adev->reset_dump_reg_list[i][3]);
> 
> ip --> adev->reset_dump_reg_list[i][0]
> 
> inst --> adev->reset_dump_reg_list[i][1]
> 
> BASE_IDX--> adev->reset_dump_reg_list[i][2]
> 
> reg --> adev->reset_dump_reg_list[i][3]
> 
> which requires 4 values in user space for each register.
> 
> using any existing macro like RREG32_SOC15** will not be able to pass 
> proper argument from user space (like ip##_HWIP or reg##_BASE_IDX)
> 

Why cant we use just a simple array
adev->reset_dump_reg_list[10] for both ip and reg offsets ?

Userspace can provide the IP engine enum in first entry of the array,
reset_dump_reg_list[0], and register offsets in other entries starting 
from 1. We can convert that into desirable engine substring using an 
array of char *, something like:

const char *ip_engine_name_substing[] = {
	/* Same order as enum amd_hw_ip_block_type */
	"GC", "HDP", ......
}

engine enum;
u32 ip = adev->reset_dump_reg_list[0];
const char *ip_name = ip_engine_name_subs[ip];

for (i = 0; i < 9; i++) {
	reg_val = RREG_SOC15_IP(ip_name, reset_dump_reg_list[i+1]);
}

- Shashank

>>
>>>
>>> Alex
>>>
>>>> +                       reg_value[i] = RREG32_SOC15_IP(GC, 
>>>> adev->reset_dump_reg_list[i]);
>>>> +               else
>>>> +                       reg_value[i] = 
>>>> RREG32(adev->reset_dump_reg_list[i]);
>>>> +       }
>>>> +
>>>> + trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, reg_value, 
>>>> i);
>>>> +
>>>> +       return 0;
>>>> +}
>>>> +
>>>>   int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>                           struct amdgpu_reset_context *reset_context)
>>>>   {
>>>> @@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head 
>>>> *device_list_handle,
>>>> tmp_adev->gmc.xgmi.pending_reset = false;
>>>>                                  if (!queue_work(system_unbound_wq, 
>>>> &tmp_adev->xgmi_reset_work))
>>>>                                          r = -EALREADY;
>>>> -                       } else
>>>> +                       } else {
>>>> + amdgpu_reset_reg_dumps(tmp_adev);
>>>>                                  r = amdgpu_asic_reset(tmp_adev);
>>>> +                       }
>>>>
>>>>                          if (r) {
>>>>                                  dev_err(tmp_adev->dev, "ASIC reset 
>>>> failed with error, %d for drm dev, %s",
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>> index d855cb53c7e0..3fe33de3564a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>> @@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
>>>>                        __entry->seqno)
>>>>   );
>>>>
>>>> +TRACE_EVENT(amdgpu_reset_reg_dumps,
>>>> +           TP_PROTO(long *address, uint32_t *value, int length),
>>>> +           TP_ARGS(address, value, length),
>>>> +           TP_STRUCT__entry(
>>>> +                            __array(long, address, 128)
>>>> +                            __array(uint32_t, value, 128)
>>>> +                            __field(int, len)
>>>> +                            ),
>>>> +           TP_fast_assign(
>>>> +                          memcpy(__entry->address, address, 128);
>>>> +                          memcpy(__entry->value,  value, 128);
>>>> +                          __entry->len = length;
>>>> +                          ),
>>>> +           TP_printk("amdgpu register dump offset: %s value: %s ",
>>>> +                     __print_array(__entry->address, __entry->len, 8),
>>>> +                     __print_array(__entry->value, __entry->len, 8)
>>>> +                    )
>>>> +);
>>>> +
>>>>   #undef AMDGPU_JOB_GET_TIMELINE_NAME
>>>>   #endif
>>>>
>>>> -- 
>>>> 2.25.1
>>>>
>>