[PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

Thu Feb 10 13:18:48 UTC 2022

On 2/10/2022 8:38 AM, Christian König wrote:
> Am 10.02.22 um 08:34 schrieb Somalapuram, Amaranath:
>>
>> On 2/10/2022 12:39 PM, Christian König wrote:
>>> Am 10.02.22 um 06:29 schrieb Somalapuram, Amaranath:
>>>>
>>>> On 2/9/2022 1:17 PM, Christian König wrote:
>>>>> Am 08.02.22 um 16:28 schrieb Alex Deucher:
>>>>>> On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
>>>>>> <Amaranath.Somalapuram at amd.com> wrote:
>>>>>>> Dump the list of register values to trace event on GPU reset.
>>>>>>>
>>>>>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 
>>>>>>> ++++++++++++++++++++-
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 
>>>>>>> +++++++++++++++++++
>>>>>>>   2 files changed, 39 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 1e651b959141..057922fb7e37 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
>>>>>>> amdgpu_device *adev,
>>>>>>>          return r;
>>>>>>>   }
>>>>>>>
>>>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>>> +{
>>>>>>> +       int i;
>>>>>>> +       uint32_t reg_value[128];
>>>>>>> +
>>>>>>> +       for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
>>>>>>> +               if (adev->asic_type >= CHIP_NAVI10)
>>>>>> This check should be against CHIP_VEGA10.  Also, this only allows for
>>>>>> GC registers.  If we wanted to dump other registers, we'd need a
>>>>>> different macro.  Might be better to just use RREG32 here for
>>>>>> everything and then encode the full offset using
>>>>>> SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to think
>>>>>> about how to handle gfxoff in this case.  gfxoff needs to be disabled
>>>>>> or we'll hang the chip if we try and read GC or SDMA registers via
>>>>>> MMIO which will adversely affect the hang signature.
>>>>>
>>>>> Well this should execute right before a GPU reset, so I think it 
>>>>> shouldn't matter if we hang the chip or not as long as the read 
>>>>> comes back correctly (I remember a very long UVD debug session 
>>>>> because of this).
>>>>>
>>>>> But in general I agree, we should just use RREG32() here and always 
>>>>> encode the full register offset.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>> Can I use something like this:
>>>>
>>>> +                       reg_value[i] = 
>>>> RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]
>>>> + [adev->reset_dump_reg_list[i][1]]
>>>> + [adev->reset_dump_reg_list[i][2]])
>>>> +                                 + adev->reset_dump_reg_list[i][3]);
>>>>
>>>> ip --> adev->reset_dump_reg_list[i][0]
>>>>
>>>> inst --> adev->reset_dump_reg_list[i][1]
>>>>
>>>> BASE_IDX--> adev->reset_dump_reg_list[i][2]
>>>>
>>>> reg --> adev->reset_dump_reg_list[i][3]
>>>
>>> No, that won't work.
>>>
>>> What you need to do is to use the full 32bit address of the register. 
>>> Userspace can worry about figuring out which ip, instance, base and 
>>> reg to resolve into that address.
>>>
>>> Regards,
>>> Christian.
>>>
>> Thanks Christian.
>>
>> should I consider using gfxoff like below code or not required:
>> amdgpu_gfx_off_ctrl(adev, false);
>> amdgpu_gfx_off_ctrl(adev, true);
> 
> That's a really good question I can't fully answer.
> 
> I think we don't want that because the GPU is stuck when the dump is 
> made, but better let Alex comment as well.
> 
> Regards,
> Christian.

I had a quick look at the function amdgpu_gfx_off_ctrl, and it locks 
this mutex internally:
mutex_lock(&adev->gfx.gfx_off_mutex);

and the reference state is tracked in:
adev->gfx.gfx_off_state

Maybe we can do something like this:
if (adev->gfx.gfx_off_state == 0) {
   if (mutex_trylock(&adev->gfx.gfx_off_mutex)) {
      read_regs_now;
      mutex_unlock(&adev->gfx.gfx_off_mutex);
   }
}
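
To make the control flow above concrete, here is a rough user-space sketch, not amdgpu code. pthread_mutex_trylock() stands in for the kernel's mutex_trylock(), and struct fake_gfx, read_regs_now() and dump_regs_if_gfx_on() are made-up names used only for illustration. It also assumes that a nonzero gfx_off_state means the dump should be skipped.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the gfx part of struct amdgpu_device. */
struct fake_gfx {
	pthread_mutex_t gfx_off_mutex;
	bool gfx_off_state;	/* assumption: true means skip the dump */
};

/* Hypothetical register read; a real driver would use RREG32(). */
static uint32_t read_regs_now(uint32_t offset)
{
	return offset ^ 0xdeadbeefu;	/* fake value derived from the offset */
}

/*
 * Dump 'count' registers only if the gfxoff state allows it and the
 * mutex can be taken without blocking. Returns the number of
 * registers read, or 0 if the dump was skipped.
 */
static int dump_regs_if_gfx_on(struct fake_gfx *gfx, const uint32_t *offsets,
			       uint32_t *values, int count)
{
	int i;

	assert(gfx && offsets && values);

	if (gfx->gfx_off_state)
		return 0;

	/* Never block on the reset path: give up if the lock is busy. */
	if (pthread_mutex_trylock(&gfx->gfx_off_mutex) != 0)
		return 0;

	for (i = 0; i < count; i++)
		values[i] = read_regs_now(offsets[i]);

	pthread_mutex_unlock(&gfx->gfx_off_mutex);
	return count;
}
```

Note that the state is checked before the lock is taken, as in the pseudo-code, so it can still change in between; re-checking it once the lock is held would close that window.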

How does that sound?

- Shashank
> 
>>
>>
>> Regards,
>>
>> S.Amarnath
>>>>
>>>> which requires 4 values in user space for each register.
>>>>
>>>> using any existing macro like RREG32_SOC15** will not be able to 
>>>> pass proper argument from user space (like ip##_HWIP or reg##_BASE_IDX)
>>>>
>>>>>
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>>> + reg_value[i] = RREG32_SOC15_IP(GC, adev->reset_dump_reg_list[i]);
>>>>>>> +               else
>>>>>>> +                       reg_value[i] = 
>>>>>>> RREG32(adev->reset_dump_reg_list[i]);
>>>>>>> +       }
>>>>>>> +
>>>>>>> + trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, 
>>>>>>> reg_value, i);
>>>>>>> +
>>>>>>> +       return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>>   int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>>>>                           struct amdgpu_reset_context 
>>>>>>> *reset_context)
>>>>>>>   {
>>>>>>> @@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head 
>>>>>>> *device_list_handle,
>>>>>>> tmp_adev->gmc.xgmi.pending_reset = false;
>>>>>>>                                  if 
>>>>>>> (!queue_work(system_unbound_wq, &tmp_adev->xgmi_reset_work))
>>>>>>>                                          r = -EALREADY;
>>>>>>> -                       } else
>>>>>>> +                       } else {
>>>>>>> + amdgpu_reset_reg_dumps(tmp_adev);
>>>>>>>                                  r = amdgpu_asic_reset(tmp_adev);
>>>>>>> +                       }
>>>>>>>
>>>>>>>                          if (r) {
>>>>>>> dev_err(tmp_adev->dev, "ASIC reset failed with error, %d for drm 
>>>>>>> dev, %s",
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>>> index d855cb53c7e0..3fe33de3564a 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>>> @@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
>>>>>>>                        __entry->seqno)
>>>>>>>   );
>>>>>>>
>>>>>>> +TRACE_EVENT(amdgpu_reset_reg_dumps,
>>>>>>> +           TP_PROTO(long *address, uint32_t *value, int length),
>>>>>>> +           TP_ARGS(address, value, length),
>>>>>>> +           TP_STRUCT__entry(
>>>>>>> +                            __array(long, address, 128)
>>>>>>> +                            __array(uint32_t, value, 128)
>>>>>>> +                            __field(int, len)
>>>>>>> +                            ),
>>>>>>> +           TP_fast_assign(
>>>>>>> +                          memcpy(__entry->address, address, 128);
>>>>>>> +                          memcpy(__entry->value, value, 128);
>>>>>>> +                          __entry->len = length;
>>>>>>> +                          ),
>>>>>>> +           TP_printk("amdgpu register dump offset: %s value: %s ",
>>>>>>> +                     __print_array(__entry->address, 
>>>>>>> __entry->len, 8),
>>>>>>> +                     __print_array(__entry->value, __entry->len, 8)
>>>>>>> +                    )
>>>>>>> +);
>>>>>>> +
>>>>>>>   #undef AMDGPU_JOB_GET_TIMELINE_NAME
>>>>>>>   #endif
>>>>>>>
>>>>>>> -- 
>>>>>>> 2.25.1
>>>>>>>
>>>>>
>>>
>