[PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset
Christian König
christian.koenig at amd.com
Thu Feb 10 07:38:21 UTC 2022
On 10.02.22 at 08:34, Somalapuram, Amaranath wrote:
>
> On 2/10/2022 12:39 PM, Christian König wrote:
>> On 10.02.22 at 06:29, Somalapuram, Amaranath wrote:
>>>
>>> On 2/9/2022 1:17 PM, Christian König wrote:
>>>> On 08.02.22 at 16:28, Alex Deucher wrote:
>>>>> On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
>>>>> <Amaranath.Somalapuram at amd.com> wrote:
>>>>>> Dump the list of register values to trace event on GPU reset.
>>>>>>
>>>>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21
>>>>>> ++++++++++++++++++++-
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 19
>>>>>> +++++++++++++++++++
>>>>>> 2 files changed, 39 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 1e651b959141..057922fb7e37 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct
>>>>>> amdgpu_device *adev,
>>>>>> return r;
>>>>>> }
>>>>>>
>>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>> +{
>>>>>> + int i;
>>>>>> + uint32_t reg_value[128];
>>>>>> +
>>>>>> + for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
>>>>>> + if (adev->asic_type >= CHIP_NAVI10)
>>>>> This check should be against CHIP_VEGA10. Also, this only allows for
>>>>> GC registers. If we wanted to dump other registers, we'd need a
>>>>> different macro. Might be better to just use RREG32 here for
>>>>> everything and then encode the full offset using
>>>>> SOC15_REG_ENTRY_OFFSET() or a similar macro. Also, we need to think
>>>>> about how to handle gfxoff in this case. gfxoff needs to be disabled
>>>>> or we'll hang the chip if we try and read GC or SDMA registers via
>>>>> MMIO which will adversely affect the hang signature.
>>>>
>>>> Well this should execute right before a GPU reset, so I think it
>>>> shouldn't matter if we hang the chip or not as long as the read
>>>> comes back correctly (I remember a very long UVD debug session
>>>> because of this).
>>>>
>>>> But in general I agree, we should just use RREG32() here and always
>>>> encode the full register offset.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>> Can I use something like this:
>>>
>>> + reg_value[i] =
>>> RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]
>>> + [adev->reset_dump_reg_list[i][1]]
>>> + [adev->reset_dump_reg_list[i][2]])
>>> + + adev->reset_dump_reg_list[i][3]);
>>>
>>> ip --> adev->reset_dump_reg_list[i][0]
>>>
>>> inst --> adev->reset_dump_reg_list[i][1]
>>>
>>> BASE_IDX--> adev->reset_dump_reg_list[i][2]
>>>
>>> reg --> adev->reset_dump_reg_list[i][3]
>>
>> No, that won't work.
>>
>> What you need to do is to use the full 32bit address of the register.
>> Userspace can worry about figuring out which ip, instance, base and
>> reg to resolve into that address.
>>
>> Regards,
>> Christian.
>>
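To illustrate the split described above: user space resolves ip, instance,
base index and reg into one flat offset, and the kernel only ever sees that
single number. The following is a minimal user-space sketch, not driver
code. The enum, the table layout and every base value in it are made up
for illustration; a real tool would take the bases from the ASIC's
register headers.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_INSTANCES 2
#define MAX_SEGMENTS  3

/* Hypothetical IP block identifiers, for this sketch only. */
enum hw_ip { HWIP_GC, HWIP_SDMA, HWIP_COUNT };

/*
 * Hypothetical base-offset table indexed by [ip][instance][base_idx].
 * The values are invented; they are not real amdgpu register bases.
 */
static const uint32_t reg_base[HWIP_COUNT][MAX_INSTANCES][MAX_SEGMENTS] = {
    [HWIP_GC]   = { { 0x1000, 0x8000 } },
    [HWIP_SDMA] = { { 0x2000 }, { 0x2800 } },
};

/*
 * Resolve (ip, inst, base_idx, reg) into the single flat offset that
 * would be handed to the kernel for a plain RREG32()-style read.
 */
static uint32_t resolve_reg_offset(enum hw_ip ip, unsigned int inst,
                                   unsigned int base_idx, uint32_t reg)
{
    assert((unsigned int)ip < HWIP_COUNT);
    assert(inst < MAX_INSTANCES);
    assert(base_idx < MAX_SEGMENTS);

    return reg_base[ip][inst][base_idx] + reg;
}
```

With this, the list passed to the kernel stays a simple array of offsets,
and all knowledge about IP blocks and instances lives in user space.
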
> Thanks Christian.
>
> Should I disable gfxoff around the register reads, as in the code below, or is that not required?
> amdgpu_gfx_off_ctrl(adev, false);
> amdgpu_gfx_off_ctrl(adev, true);
That's a really good question that I can't fully answer.
I don't think we want that, because the GPU is already stuck when the
dump is made, but it's better to let Alex comment as well.
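
For reference, the bracketing in question would look roughly like the
sketch below. It is a user-space mock, not driver code: struct mock_gpu,
mock_gfx_off_ctrl() and mock_rreg32() are hypothetical stand-ins for
adev, amdgpu_gfx_off_ctrl() and RREG32(), and the counter only models the
assumption that passing false blocks gfxoff and passing true allows it
again. The loop is also bounded by the size of the value buffer, which
the posted patch does not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_REG_COUNT 16

/* Hypothetical stand-in for the device; not the real amdgpu_device. */
struct mock_gpu {
    int gfx_off_block_count;          /* > 0 means gfxoff is not allowed */
    uint32_t regs[MOCK_REG_COUNT];    /* fake register file */
};

/* Mock of the gfxoff control: false blocks gfxoff, true releases it. */
static void mock_gfx_off_ctrl(struct mock_gpu *gpu, bool enable)
{
    if (enable) {
        assert(gpu->gfx_off_block_count > 0);
        gpu->gfx_off_block_count--;
    } else {
        gpu->gfx_off_block_count++;
    }
}

/* Mock register read that insists gfxoff is blocked while reading. */
static uint32_t mock_rreg32(const struct mock_gpu *gpu, uint32_t offset)
{
    assert(gpu->gfx_off_block_count > 0);
    assert(offset < MOCK_REG_COUNT);
    return gpu->regs[offset];
}

/*
 * Read a zero-terminated list of flat offsets into values[], with
 * gfxoff blocked for the duration. Returns the number of values read,
 * never more than max.
 */
static size_t dump_regs_gfxoff_blocked(struct mock_gpu *gpu,
                                       const uint32_t *list,
                                       uint32_t *values, size_t max)
{
    size_t i;

    mock_gfx_off_ctrl(gpu, false);
    for (i = 0; i < max && list[i] != 0; i++)
        values[i] = mock_rreg32(gpu, list[i]);
    mock_gfx_off_ctrl(gpu, true);

    return i;
}
```

Whether this bracketing is useful at all on an already hung GPU is
exactly the open question.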
Regards,
Christian.
>
>
> Regards,
>
> S.Amarnath
>>>
>>> This requires four values from user space for each register.
>>>
>>> With an existing macro like RREG32_SOC15**, user space cannot pass
>>> the proper arguments (such as ip##_HWIP or reg##_BASE_IDX).
>>>
>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>>> + reg_value[i] = RREG32_SOC15_IP(GC, adev->reset_dump_reg_list[i]);
>>>>>> + else
>>>>>> + reg_value[i] =
>>>>>> RREG32(adev->reset_dump_reg_list[i]);
>>>>>> + }
>>>>>> +
>>>>>> + trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list,
>>>>>> reg_value, i);
>>>>>> +
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>>> struct amdgpu_reset_context
>>>>>> *reset_context)
>>>>>> {
>>>>>> @@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head
>>>>>> *device_list_handle,
>>>>>> tmp_adev->gmc.xgmi.pending_reset = false;
>>>>>> if
>>>>>> (!queue_work(system_unbound_wq, &tmp_adev->xgmi_reset_work))
>>>>>> r = -EALREADY;
>>>>>> - } else
>>>>>> + } else {
>>>>>> + amdgpu_reset_reg_dumps(tmp_adev);
>>>>>> r = amdgpu_asic_reset(tmp_adev);
>>>>>> + }
>>>>>>
>>>>>> if (r) {
>>>>>> dev_err(tmp_adev->dev, "ASIC reset failed with error, %d for drm
>>>>>> dev, %s",
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>> index d855cb53c7e0..3fe33de3564a 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>>> @@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
>>>>>> __entry->seqno)
>>>>>> );
>>>>>>
>>>>>> +TRACE_EVENT(amdgpu_reset_reg_dumps,
>>>>>> + TP_PROTO(long *address, uint32_t *value, int length),
>>>>>> + TP_ARGS(address, value, length),
>>>>>> + TP_STRUCT__entry(
>>>>>> + __array(long, address, 128)
>>>>>> + __array(uint32_t, value, 128)
>>>>>> + __field(int, len)
>>>>>> + ),
>>>>>> + TP_fast_assign(
>>>>>> + memcpy(__entry->address, address, 128);
>>>>>> + memcpy(__entry->value, value, 128);
>>>>>> + __entry->len = length;
>>>>>> + ),
>>>>>> + TP_printk("amdgpu register dump offset: %s value: %s ",
>>>>>> + __print_array(__entry->address,
>>>>>> __entry->len, 8),
>>>>>> + __print_array(__entry->value, __entry->len, 8)
>>>>>> + )
>>>>>> +);
>>>>>> +
>>>>>> #undef AMDGPU_JOB_GET_TIMELINE_NAME
>>>>>> #endif
>>>>>>
>>>>>> --
>>>>>> 2.25.1
>>>>>>
>>>>
>>