[PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

Thu Feb 10 14:11:04 UTC 2022

On 2/10/2022 3:05 PM, Christian König wrote:
> Am 10.02.22 um 14:18 schrieb Sharma, Shashank:
>>
>>
>> On 2/10/2022 8:38 AM, Christian König wrote:
>>> Am 10.02.22 um 08:34 schrieb Somalapuram, Amaranath:
>>>>
>>>> On 2/10/2022 12:39 PM, Christian König wrote:
>>>>> Am 10.02.22 um 06:29 schrieb Somalapuram, Amaranath:
>>>>>>
>>>>>> On 2/9/2022 1:17 PM, Christian König wrote:
>>>>>>> Am 08.02.22 um 16:28 schrieb Alex Deucher:
>>>>>>>> On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
>>>>>>>> <Amaranath.Somalapuram at amd.com> wrote:
>>>>>>>>> Dump the list of register values to trace event on GPU reset.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Somalapuram Amaranath 
>>>>>>>>> <Amaranath.Somalapuram at amd.com>
>>>>>>>>> ---
>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 
>>>>>>>>> ++++++++++++++++++++-
>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 
>>>>>>>>> +++++++++++++++++++
>>>>>>>>>   2 files changed, 39 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> index 1e651b959141..057922fb7e37 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> @@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>          return r;
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>>>>> +{
>>>>>>>>> +       int i;
>>>>>>>>> +       uint32_t reg_value[128];
>>>>>>>>> +
>>>>>>>>> +       for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
>>>>>>>>> +               if (adev->asic_type >= CHIP_NAVI10)
>>>>>>>> This check should be against CHIP_VEGA10.  Also, this only 
>>>>>>>> allows for
>>>>>>>> GC registers.  If we wanted to dump other registers, we'd need a
>>>>>>>> different macro.  Might be better to just use RREG32 here for
>>>>>>>> everything and then encode the full offset using
>>>>>>>> SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to 
>>>>>>>> think
>>>>>>>> about how to handle gfxoff in this case.  gfxoff needs to be 
>>>>>>>> disabled
>>>>>>>> or we'll hang the chip if we try and read GC or SDMA registers via
>>>>>>>> MMIO which will adversely affect the hang signature.
>>>>>>>
>>>>>>> Well this should execute right before a GPU reset, so I think it 
>>>>>>> shouldn't matter if we hang the chip or not as long as the read 
>>>>>>> comes back correctly (I remember a very long UVD debug session 
>>>>>>> because of this).
>>>>>>>
>>>>>>> But in general I agree, we should just use RREG32() here and 
>>>>>>> always encode the full register offset.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>> Can I use something like this:
>>>>>>
>>>>>> +                       reg_value[i] = 
>>>>>> RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]
>>>>>> + [adev->reset_dump_reg_list[i][1]]
>>>>>> + [adev->reset_dump_reg_list[i][2]])
>>>>>> +                                 + adev->reset_dump_reg_list[i][3]);
>>>>>>
>>>>>> ip --> adev->reset_dump_reg_list[i][0]
>>>>>>
>>>>>> inst --> adev->reset_dump_reg_list[i][1]
>>>>>>
>>>>>> BASE_IDX--> adev->reset_dump_reg_list[i][2]
>>>>>>
>>>>>> reg --> adev->reset_dump_reg_list[i][3]
>>>>>
>>>>> No, that won't work.
>>>>>
>>>>> What you need to do is to use the full 32bit address of the 
>>>>> register. Userspace can worry about figuring out which ip, 
>>>>> instance, base and reg to resolve into that address.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>> Thanks Christian.
>>>>
>>>> Should I toggle gfxoff around the reads, as in the code below, or is that not required?
>>>> amdgpu_gfx_off_ctrl(adev, false);
>>>> amdgpu_gfx_off_ctrl(adev, true);
>>>
>>> That's a really good question I can't fully answer.
>>>
>>> I think we don't want that because the GPU is stuck when the dump is 
>>> made, but better let Alex comment as well.
>>>
>>> Regards,
>>> Christian.
>>
>>
>> I had a quick look at the function amdgpu_gfx_off_ctrl, and it locks 
>> this mutex internally:
>> mutex_lock(&adev->gfx.gfx_off_mutex);
>>
>> and the reference state is tracked in:
>> adev->gfx.gfx_off_state
>>
>> We can do something like this maybe:
>> if (adev->gfx.gfx_off_state == 0) {
>>   trylock(gfx_off_mutex)
>>   read_regs_now;
>>   unlock_mutex();
>> }
>>
>> How does that sound?
> 
> As far as I know that won't work. GFX_off is only disabled intentionally 
> in very few places.
> 
> So we will probably never get a register trace with that.
> 

OK, I don't know much about this feature, but from the name I was 
under the impression that gfx_off would be disabled most of the time. 
But if it is hardly ever disabled, we need more information about it 
first: when is it disabled, and why?

Alex ?

- Shashank
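
For reference, the guard I proposed could be sketched in user space roughly as below. This is only an illustration, not driver code: struct mock_gfx, mock_rreg32() and mock_dump_if_gfx_on() are hypothetical names, a C11 atomic_flag stands in for mutex_trylock() on gfx_off_mutex, and I am assuming gfx_off_state is true while GFXOFF is enabled. Unlike my pseudocode, the state is checked after the lock is taken so it cannot change during the reads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the GFX bits of struct amdgpu_device. */
struct mock_gfx {
	atomic_flag gfx_off_lock; /* stands in for adev->gfx.gfx_off_mutex */
	bool gfx_off_state;       /* assumed: true while GFXOFF is enabled */
};

/* Hypothetical register read; the driver would use RREG32() here. */
static uint32_t mock_rreg32(uint32_t offset)
{
	return offset ^ 0xdeadbeefu;
}

/*
 * Read `count` registers into `out`, but only if the lock can be taken
 * without blocking and GFXOFF is not enabled. Returns the number of
 * values read; 0 means the dump was skipped.
 */
static size_t mock_dump_if_gfx_on(struct mock_gfx *gfx, const uint32_t *list,
				  size_t count, uint32_t *out)
{
	size_t i;

	assert(gfx && list && out);

	/* test_and_set returns the old value: true means already locked. */
	if (atomic_flag_test_and_set(&gfx->gfx_off_lock))
		return 0;

	if (gfx->gfx_off_state) {
		atomic_flag_clear(&gfx->gfx_off_lock);
		return 0;
	}

	for (i = 0; i < count; i++)
		out[i] = mock_rreg32(list[i]);

	atomic_flag_clear(&gfx->gfx_off_lock);
	return count;
}
```

The sketch also shows Christian's objection: if GFXOFF is enabled almost all the time, the second early return is the common path and the dump is nearly always skipped.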

> Regards,
> Christian.
> 
>>
>> - Shashank
>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> S.Amarnath
>>>>>>
>>>>>> which requires 4 values in user space for each register.
>>>>>>
>>>>>> using any existing macro like RREG32_SOC15** will not be able to 
>>>>>> pass proper argument from user space (like ip##_HWIP or 
>>>>>> reg##_BASE_IDX)
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>>> + reg_value[i] = RREG32_SOC15_IP(GC, 
>>>>>>>>> adev->reset_dump_reg_list[i]);
>>>>>>>>> +               else
>>>>>>>>> +                       reg_value[i] = 
>>>>>>>>> RREG32(adev->reset_dump_reg_list[i]);
>>>>>>>>> +       }
>>>>>>>>> +
>>>>>>>>> + trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, 
>>>>>>>>> reg_value, i);
>>>>>>>>> +
>>>>>>>>> +       return 0;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>   int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>>>>>>                           struct amdgpu_reset_context 
>>>>>>>>> *reset_context)
>>>>>>>>>   {
>>>>>>>>> @@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct 
>>>>>>>>> list_head *device_list_handle,
>>>>>>>>> tmp_adev->gmc.xgmi.pending_reset = false;
>>>>>>>>>                                  if 
>>>>>>>>> (!queue_work(system_unbound_wq, &tmp_adev->xgmi_reset_work))
>>>>>>>>>                                          r = -EALREADY;
>>>>>>>>> -                       } else
>>>>>>>>> +                       } else {
>>>>>>>>> + amdgpu_reset_reg_dumps(tmp_adev);
>>>>>>>>>                                  r = amdgpu_asic_reset(tmp_adev);
>>>>>>>>> +                       }
>>>>>>>>>
>>>>>>>>>                          if (r) {
>>>>>>>>> dev_err(tmp_adev->dev, "ASIC reset failed with error, %d for 
>>>>>>>>> drm dev, %s",
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>>>>> index d855cb53c7e0..3fe33de3564a 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>>>>> @@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
>>>>>>>>>                        __entry->seqno)
>>>>>>>>>   );
>>>>>>>>>
>>>>>>>>> +TRACE_EVENT(amdgpu_reset_reg_dumps,
>>>>>>>>> +           TP_PROTO(long *address, uint32_t *value, int length),
>>>>>>>>> +           TP_ARGS(address, value, length),
>>>>>>>>> +           TP_STRUCT__entry(
>>>>>>>>> +                            __array(long, address, 128)
>>>>>>>>> +                            __array(uint32_t, value, 128)
>>>>>>>>> +                            __field(int, len)
>>>>>>>>> +                            ),
>>>>>>>>> +           TP_fast_assign(
>>>>>>>>> + memcpy(__entry->address, address, 128);
>>>>>>>>> +                          memcpy(__entry->value, value, 128);
>>>>>>>>> +                          __entry->len = length;
>>>>>>>>> +                          ),
>>>>>>>>> +           TP_printk("amdgpu register dump offset: %s value: 
>>>>>>>>> %s ",
>>>>>>>>> + __print_array(__entry->address, __entry->len, 8),
>>>>>>>>> + __print_array(__entry->value, __entry->len, 8)
>>>>>>>>> +                    )
>>>>>>>>> +);
>>>>>>>>> +
>>>>>>>>>   #undef AMDGPU_JOB_GET_TIMELINE_NAME
>>>>>>>>>   #endif
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> 2.25.1
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
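
On the register addressing point earlier in the thread (pass the full 32-bit offset and let userspace resolve ip, instance, base and reg), a rough user-space sketch of that split might look like the following. Everything here is hypothetical: the mock_reg_base table and its values are invented for illustration and do not correspond to any real ASIC, and the helper names are not existing amdgpu or libdrm APIs. The only detail taken from the patch is that the dump list is terminated by a zero entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MOCK_MAX_IP = 2, MOCK_MAX_INST = 2, MOCK_MAX_SEG = 3 };

/* Hypothetical per-IP/instance/segment base offsets (made-up values). */
static const uint32_t mock_reg_base[MOCK_MAX_IP][MOCK_MAX_INST][MOCK_MAX_SEG] = {
	{ { 0x1000, 0x2000, 0x3000 }, { 0x4000, 0x5000, 0x6000 } },
	{ { 0x8000, 0x9000, 0xA000 }, { 0xB000, 0xC000, 0xD000 } },
};

/* The four values userspace knows for each register. */
struct mock_reg_ref {
	uint8_t ip;
	uint8_t inst;
	uint8_t base_idx;
	uint32_t reg;
};

/* Resolve one reference into a single flat 32-bit offset. */
static uint32_t mock_full_offset(const struct mock_reg_ref *r)
{
	assert(r->ip < MOCK_MAX_IP && r->inst < MOCK_MAX_INST &&
	       r->base_idx < MOCK_MAX_SEG);
	return mock_reg_base[r->ip][r->inst][r->base_idx] + r->reg;
}

/*
 * Fill `list` with flat offsets followed by a 0 terminator, the form
 * the patch's dump loop walks. Returns the number of offsets written,
 * or 0 if `cap` has no room for the entries plus the terminator.
 */
static size_t mock_build_dump_list(const struct mock_reg_ref *refs, size_t n,
				   uint32_t *list, size_t cap)
{
	size_t i;

	assert(refs && list);

	if (cap < n + 1)
		return 0;

	for (i = 0; i < n; i++)
		list[i] = mock_full_offset(&refs[i]);
	list[n] = 0;
	return n;
}
```

With this split the kernel side only needs a plain RREG32() per list entry, and the four-value lookup stays in userspace.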