[PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

Thu Feb 10 14:05:46 UTC 2022

Am 10.02.22 um 14:18 schrieb Sharma, Shashank:
>
>
> On 2/10/2022 8:38 AM, Christian König wrote:
>> Am 10.02.22 um 08:34 schrieb Somalapuram, Amaranath:
>>>
>>> On 2/10/2022 12:39 PM, Christian König wrote:
>>>> Am 10.02.22 um 06:29 schrieb Somalapuram, Amaranath:
>>>>>
>>>>> On 2/9/2022 1:17 PM, Christian König wrote:
>>>>>> Am 08.02.22 um 16:28 schrieb Alex Deucher:
>>>>>>> On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
>>>>>>> <Amaranath.Somalapuram at amd.com> wrote:
>>>>>>>> Dump the list of register values to trace event on GPU reset.
>>>>>>>>
>>>>>>>> Signed-off-by: Somalapuram Amaranath 
>>>>>>>> <Amaranath.Somalapuram at amd.com>
>>>>>>>> ---
>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 
>>>>>>>> ++++++++++++++++++++-
>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 
>>>>>>>> +++++++++++++++++++
>>>>>>>>   2 files changed, 39 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 1e651b959141..057922fb7e37 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
>>>>>>>> amdgpu_device *adev,
>>>>>>>>          return r;
>>>>>>>>   }
>>>>>>>>
>>>>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>>>> +{
>>>>>>>> +       int i;
>>>>>>>> +       uint32_t reg_value[128];
>>>>>>>> +
>>>>>>>> +       for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
>>>>>>>> +               if (adev->asic_type >= CHIP_NAVI10)
>>>>>>> This check should be against CHIP_VEGA10.  Also, this only 
>>>>>>> allows for
>>>>>>> GC registers.  If we wanted to dump other registers, we'd need a
>>>>>>> different macro.  Might be better to just use RREG32 here for
>>>>>>> everything and then encode the full offset using
>>>>>>> SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to 
>>>>>>> think
>>>>>>> about how to handle gfxoff in this case.  gfxoff needs to be 
>>>>>>> disabled
>>>>>>> or we'll hang the chip if we try and read GC or SDMA registers via
>>>>>>> MMIO which will adversely affect the hang signature.
>>>>>>
>>>>>> Well this should execute right before a GPU reset, so I think it 
>>>>>> shouldn't matter if we hang the chip or not as long as the read 
>>>>>> comes back correctly (I remember a very long UVD debug session 
>>>>>> because of this).
>>>>>>
>>>>>> But in general I agree, we should just use RREG32() here and 
>>>>>> always encode the full register offset.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>> Can I use something like this:
>>>>>
>>>>> +                       reg_value[i] = 
>>>>> RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]
>>>>> + [adev->reset_dump_reg_list[i][1]]
>>>>> + [adev->reset_dump_reg_list[i][2]])
>>>>> +                                 + adev->reset_dump_reg_list[i][3]);
>>>>>
>>>>> ip --> adev->reset_dump_reg_list[i][0]
>>>>>
>>>>> inst --> adev->reset_dump_reg_list[i][1]
>>>>>
>>>>> BASE_IDX--> adev->reset_dump_reg_list[i][2]
>>>>>
>>>>> reg --> adev->reset_dump_reg_list[i][3]
>>>>
>>>> No, that won't work.
>>>>
>>>> What you need to do is to use the full 32bit address of the 
>>>> register. Userspace can worry about figuring out which ip, 
>>>> instance, base and reg to resolve into that address.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>> Thanks Christian.
>>>
>>> should I consider using gfxoff like below code or not required:
>>> amdgpu_gfx_off_ctrl(adev, false);
>>> amdgpu_gfx_off_ctrl(adev, true);
>>
>> That's a really good question I can't fully answer.
>>
>> I think we don't want that because the GPU is stuck when the dump is 
>> made, but better let Alex comment as well.
>>
>> Regards,
>> Christian.
>
>
> I had a quick look at the function amdgpu_gfx_off_ctrl, and it locks 
> this mutex internally:
> mutex_lock(&adev->gfx.gfx_off_mutex);
>
> and the reference state is tracked in:
> adev->gfx.gfx_off_state
>
> We can do something like this maybe:
> - If (adev->gfx_off_state == 0) {
>   trylock(gfx_off_mutex)
>   read_regs_now;
>   unlock_mutex();
> }
>
> How does it sounds ?

As far as I know that won't work. GFX_off is only disabled intentionally 
in very few places.

So we will probably never get a register trace with that.

Regards,
Christian.

>
> - Shashank
>>
>>>
>>>
>>> Regards,
>>>
>>> S.Amarnath
>>>>>
>>>>> which requires 4 values in user space for each register.
>>>>>
>>>>> using any existing macro like RREG32_SOC15** will not be able to 
>>>>> pass proper argument from user space (like ip##_HWIP or 
>>>>> reg##_BASE_IDX)
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>> + reg_value[i] = RREG32_SOC15_IP(GC, 
>>>>>>>> adev->reset_dump_reg_list[i]);
>>>>>>>> +               else
>>>>>>>> +                       reg_value[i] = 
>>>>>>>> RREG32(adev->reset_dump_reg_list[i]);
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> + trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, 
>>>>>>>> reg_value, i);
>>>>>>>> +
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>   int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>>>>>                           struct amdgpu_reset_context 
>>>>>>>> *reset_context)
>>>>>>>>   {
>>>>>>>> @@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct 
>>>>>>>> list_head *device_list_handle,
>>>>>>>> tmp_adev->gmc.xgmi.pending_reset = false;
>>>>>>>>                                  if 
>>>>>>>> (!queue_work(system_unbound_wq, &tmp_adev->xgmi_reset_work))
>>>>>>>>                                          r = -EALREADY;
>>>>>>>> -                       } else
>>>>>>>> +                       } else {
>>>>>>>> + amdgpu_reset_reg_dumps(tmp_adev);
>>>>>>>>                                  r = amdgpu_asic_reset(tmp_adev);
>>>>>>>> +                       }
>>>>>>>>
>>>>>>>>                          if (r) {
>>>>>>>> dev_err(tmp_adev->dev, "ASIC reset failed with error, %d for 
>>>>>>>> drm dev, %s",
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>>>> index d855cb53c7e0..3fe33de3564a 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>>>> @@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
>>>>>>>>                        __entry->seqno)
>>>>>>>>   );
>>>>>>>>
>>>>>>>> +TRACE_EVENT(amdgpu_reset_reg_dumps,
>>>>>>>> +           TP_PROTO(long *address, uint32_t *value, int length),
>>>>>>>> +           TP_ARGS(address, value, length),
>>>>>>>> +           TP_STRUCT__entry(
>>>>>>>> +                            __array(long, address, 128)
>>>>>>>> +                            __array(uint32_t, value, 128)
>>>>>>>> +                            __field(int, len)
>>>>>>>> +                            ),
>>>>>>>> +           TP_fast_assign(
>>>>>>>> + memcpy(__entry->address, address, 128);
>>>>>>>> +                          memcpy(__entry->value, value, 128);
>>>>>>>> +                          __entry->len = length;
>>>>>>>> +                          ),
>>>>>>>> +           TP_printk("amdgpu register dump offset: %s value: 
>>>>>>>> %s ",
>>>>>>>> + __print_array(__entry->address, __entry->len, 8),
>>>>>>>> + __print_array(__entry->value, __entry->len, 8)
>>>>>>>> +                    )
>>>>>>>> +);
>>>>>>>> +
>>>>>>>>   #undef AMDGPU_JOB_GET_TIMELINE_NAME
>>>>>>>>   #endif
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> 2.25.1
>>>>>>>>
>>>>>>
>>>>
>>