[PATCH 2/2] drm/amdgpu: trigger ip dump before suspend of IP's

Khatri, Sunil sukhatri at amd.com
Fri Jul 26 14:41:41 UTC 2024


On 7/26/2024 7:53 PM, Khatri, Sunil wrote:
>
> On 7/26/2024 7:18 PM, Lazar, Lijo wrote:
>>
>> On 7/26/2024 6:42 PM, Alex Deucher wrote:
>>> On Fri, Jul 26, 2024 at 8:48 AM Sunil Khatri <sunil.khatri at amd.com> 
>>> wrote:
>>>> Problem:
>>>> IP dump right now is done post suspend of
>>>> all IP's which for some IP's could change power
>>>> state and software state too which we do not want
>>>> to reflect in the dump as it might not be same at
>>>> the time of hang.
>>>>
>>>> Solution:
>>>> IP should be dumped as close to the HW state when
>>>> the GPU was in hung state without trying to reinitialize
>>>> any resource.
>>>>
>>>> Signed-off-by: Sunil Khatri <sunil.khatri at amd.com>
>>> Acked-by: Alex Deucher <alexander.deucher at amd.com>
>>>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60 
>>>> +++++++++++-----------
>>>>   1 file changed, 30 insertions(+), 30 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 730dae77570c..74f6f15e73b5 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -5277,11 +5277,29 @@ int amdgpu_device_mode1_reset(struct 
>>>> amdgpu_device *adev)
>>>>          return ret;
>>>>   }
>>>>
>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>> +{
>>>> +       int i;
>>>> +
>>>> + lockdep_assert_held(&adev->reset_domain->sem);
>>>> +
>>>> +       for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>> +               adev->reset_info.reset_dump_reg_value[i] =
>>>> + RREG32(adev->reset_info.reset_dump_reg_list[i]);
>> A suspend also involves power/clock ungate. When reg dump is moved
>> earlier, I'm not sure if this read works for all. If it's left to
>> individual IP call backs, they could just do the same or better to move
>> these up before a dump.
> Suspend also put the status.hw = false and each IP in their respective 
> suspend state which i feel does change the state of the HW.
> To get the correct snapshot of the GPU register we should not be 
> fiddling with the HW IP at least till we capture the dump and that is 
> the intention behind the change.
>
> Do you think there is a problem in this approach?
>>
>>          amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>          amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
> Adding this does sounds better to enable just before we dump the 
> registers but i am not very sure if ungating would be clean here or 
> not. i Could try quickly adding these two calls just before dump.
>
> All i am worried if it does change some register reflecting the 
> original state of registers at dump.
>
> What u suggest ?
> Regards
> Sunil
I quickly validated on Navi22 by adding below call just before we dump 
registers
if(!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
amdgpu_reset_reg_dumps(tmp_adev);
dev_info(tmp_adev->dev, "Dumping IP State\n");
     /* Trigger ip dump before we reset the asic */
for(i=0; i<tmp_adev->num_ip_blocks; i++)
if(tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
                                     (void*)tmp_adev);
dev_info(tmp_adev->dev, "Dumping IP State Completed\n");
}
If this sounds fine with you i am update that. Regards Sunil Khatri
>
>>
>> Thanks,
>> Lijo
>>
>>>> +
>>>> + 
>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
>>>> + adev->reset_info.reset_dump_reg_value[i]);
>>>> +       }
>>>> +
>>>> +       return 0;
>>>> +}
>>>> +
>>>>   int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>>                                   struct amdgpu_reset_context 
>>>> *reset_context)
>>>>   {
>>>>          int i, r = 0;
>>>>          struct amdgpu_job *job = NULL;
>>>> +       struct amdgpu_device *tmp_adev = reset_context->reset_req_dev;
>>>>          bool need_full_reset =
>>>>                  test_bit(AMDGPU_NEED_FULL_RESET, 
>>>> &reset_context->flags);
>>>>
>>>> @@ -5340,6 +5358,18 @@ int amdgpu_device_pre_asic_reset(struct 
>>>> amdgpu_device *adev,
>>>>                          }
>>>>                  }
>>>>
>>>> +               if (!test_bit(AMDGPU_SKIP_COREDUMP, 
>>>> &reset_context->flags)) {
>>>> +                       amdgpu_reset_reg_dumps(tmp_adev);
>>>> +
>>>> +                       dev_info(tmp_adev->dev, "Dumping IP State\n");
>>>> +                       /* Trigger ip dump before we reset the asic */
>>>> +                       for (i = 0; i < tmp_adev->num_ip_blocks; i++)
>>>> +                               if 
>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>> + tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
>>>> + (void *)tmp_adev);
>>>> +                       dev_info(tmp_adev->dev, "Dumping IP State 
>>>> Completed\n");
>>>> +               }
>>>> +
>>>>                  if (need_full_reset)
>>>>                          r = amdgpu_device_ip_suspend(adev);
>>>>                  if (need_full_reset)
>>>> @@ -5352,47 +5382,17 @@ int amdgpu_device_pre_asic_reset(struct 
>>>> amdgpu_device *adev,
>>>>          return r;
>>>>   }
>>>>
>>>> -static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>> -{
>>>> -       int i;
>>>> -
>>>> - lockdep_assert_held(&adev->reset_domain->sem);
>>>> -
>>>> -       for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>> -               adev->reset_info.reset_dump_reg_value[i] =
>>>> - RREG32(adev->reset_info.reset_dump_reg_list[i]);
>>>> -
>>>> - 
>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
>>>> - adev->reset_info.reset_dump_reg_value[i]);
>>>> -       }
>>>> -
>>>> -       return 0;
>>>> -}
>>>> -
>>>>   int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>                           struct amdgpu_reset_context *reset_context)
>>>>   {
>>>>          struct amdgpu_device *tmp_adev = NULL;
>>>>          bool need_full_reset, skip_hw_reset, vram_lost = false;
>>>>          int r = 0;
>>>> -       uint32_t i;
>>>>
>>>>          /* Try reset handler method first */
>>>>          tmp_adev = list_first_entry(device_list_handle, struct 
>>>> amdgpu_device,
>>>>                                      reset_list);
>>>>
>>>> -       if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
>>>> -               amdgpu_reset_reg_dumps(tmp_adev);
>>>> -
>>>> -               dev_info(tmp_adev->dev, "Dumping IP State\n");
>>>> -               /* Trigger ip dump before we reset the asic */
>>>> -               for (i = 0; i < tmp_adev->num_ip_blocks; i++)
>>>> -                       if 
>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>> - tmp_adev->ip_blocks[i].version->funcs
>>>> -                               ->dump_ip_state((void *)tmp_adev);
>>>> -               dev_info(tmp_adev->dev, "Dumping IP State 
>>>> Completed\n");
>>>> -       }
>>>> -
>>>>          reset_context->reset_device_list = device_list_handle;
>>>>          r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
>>>>          /* If reset handler not implemented, continue; otherwise 
>>>> return */
>>>> -- 
>>>> 2.34.1
>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20240726/4976344a/attachment-0001.htm>


More information about the amd-gfx mailing list