[PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset
Somalapuram, Amaranath
asomalap at amd.com
Thu Feb 10 07:34:52 UTC 2022
On 2/10/2022 12:39 PM, Christian König wrote:
> Am 10.02.22 um 06:29 schrieb Somalapuram, Amaranath:
>>
>> On 2/9/2022 1:17 PM, Christian König wrote:
>>> Am 08.02.22 um 16:28 schrieb Alex Deucher:
>>>> On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
>>>> <Amaranath.Somalapuram at amd.com> wrote:
>>>>> Dump the list of register values to trace event on GPU reset.
>>>>>
>>>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21
>>>>> ++++++++++++++++++++-
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 19 +++++++++++++++++++
>>>>> 2 files changed, 39 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 1e651b959141..057922fb7e37 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct
>>>>> amdgpu_device *adev,
>>>>> return r;
>>>>> }
>>>>>
>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>> +{
>>>>> + int i;
>>>>> + uint32_t reg_value[128];
>>>>> +
>>>>> + for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
>>>>> + if (adev->asic_type >= CHIP_NAVI10)
>>>> This check should be against CHIP_VEGA10. Also, this only allows for
>>>> GC registers. If we wanted to dump other registers, we'd need a
>>>> different macro. Might be better to just use RREG32 here for
>>>> everything and then encode the full offset using
>>>> SOC15_REG_ENTRY_OFFSET() or a similar macro. Also, we need to think
>>>> about how to handle gfxoff in this case. gfxoff needs to be disabled
>>>> or we'll hang the chip if we try and read GC or SDMA registers via
>>>> MMIO which will adversely affect the hang signature.
>>>
>>> Well this should execute right before a GPU reset, so I think it
>>> shouldn't matter if we hang the chip or not as long as the read
>>> comes back correctly (I remember a very long UVD debug session
>>> because of this).
>>>
>>> But in general I agree, we should just use RREG32() here and always
>>> encode the full register offset.
>>>
>>> Regards,
>>> Christian.
>>>
>> Can I use something like this:
>>
>> + reg_value[i] =
>> RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]
>> + [adev->reset_dump_reg_list[i][1]]
>> + [adev->reset_dump_reg_list[i][2]])
>> + + adev->reset_dump_reg_list[i][3]);
>>
>> ip --> adev->reset_dump_reg_list[i][0]
>>
>> inst --> adev->reset_dump_reg_list[i][1]
>>
>> BASE_IDX--> adev->reset_dump_reg_list[i][2]
>>
>> reg --> adev->reset_dump_reg_list[i][3]
>
> No, that won't work.
>
> What you need to do is to use the full 32bit address of the register.
> Userspace can worry about figuring out which ip, instance, base and
> reg to resolve into that address.
>
> Regards,
> Christian.
>
Thanks Christian.
should I consider using gfxoff like below code or not required:
amdgpu_gfx_off_ctrl(adev, false);
amdgpu_gfx_off_ctrl(adev, true);
Regards,
S.Amarnath
>>
>> which requires 4 values in user space for each register.
>>
>> using any existing macro like RREG32_SOC15** will not be able to pass
>> proper argument from user space (like ip##_HWIP or reg##_BASE_IDX)
>>
>>>
>>>>
>>>> Alex
>>>>
>>>>> + reg_value[i] = RREG32_SOC15_IP(GC,
>>>>> adev->reset_dump_reg_list[i]);
>>>>> + else
>>>>> + reg_value[i] =
>>>>> RREG32(adev->reset_dump_reg_list[i]);
>>>>> + }
>>>>> +
>>>>> + trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list,
>>>>> reg_value, i);
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>> struct amdgpu_reset_context *reset_context)
>>>>> {
>>>>> @@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head
>>>>> *device_list_handle,
>>>>> tmp_adev->gmc.xgmi.pending_reset = false;
>>>>> if
>>>>> (!queue_work(system_unbound_wq, &tmp_adev->xgmi_reset_work))
>>>>> r = -EALREADY;
>>>>> - } else
>>>>> + } else {
>>>>> + amdgpu_reset_reg_dumps(tmp_adev);
>>>>> r = amdgpu_asic_reset(tmp_adev);
>>>>> + }
>>>>>
>>>>> if (r) {
>>>>> dev_err(tmp_adev->dev, "ASIC
>>>>> reset failed with error, %d for drm dev, %s",
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>> index d855cb53c7e0..3fe33de3564a 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>> @@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
>>>>> __entry->seqno)
>>>>> );
>>>>>
>>>>> +TRACE_EVENT(amdgpu_reset_reg_dumps,
>>>>> + TP_PROTO(long *address, uint32_t *value, int length),
>>>>> + TP_ARGS(address, value, length),
>>>>> + TP_STRUCT__entry(
>>>>> + __array(long, address, 128)
>>>>> + __array(uint32_t, value, 128)
>>>>> + __field(int, len)
>>>>> + ),
>>>>> + TP_fast_assign(
>>>>> + memcpy(__entry->address, address, 128);
>>>>> + memcpy(__entry->value, value, 128);
>>>>> + __entry->len = length;
>>>>> + ),
>>>>> + TP_printk("amdgpu register dump offset: %s value: %s ",
>>>>> + __print_array(__entry->address,
>>>>> __entry->len, 8),
>>>>> + __print_array(__entry->value, __entry->len, 8)
>>>>> + )
>>>>> +);
>>>>> +
>>>>> #undef AMDGPU_JOB_GET_TIMELINE_NAME
>>>>> #endif
>>>>>
>>>>> --
>>>>> 2.25.1
>>>>>
>>>
>
More information about the amd-gfx
mailing list