[PATCH 2/2] drm/amdgpu: trigger ip dump before suspend of IP's

Fri Jul 26 17:16:00 UTC 2024

On 7/26/2024 8:36 PM, Lazar, Lijo wrote:
>
> On 7/26/2024 8:11 PM, Khatri, Sunil wrote:
>> On 7/26/2024 7:53 PM, Khatri, Sunil wrote:
>>> On 7/26/2024 7:18 PM, Lazar, Lijo wrote:
>>>> On 7/26/2024 6:42 PM, Alex Deucher wrote:
>>>>> On Fri, Jul 26, 2024 at 8:48 AM Sunil Khatri <sunil.khatri at amd.com>
>>>>> wrote:
>>>>>> Problem:
>>>>>> IP dump right now is done post suspend of
>>>>>> all IP's which for some IP's could change power
>>>>>> state and software state too which we do not want
>>>>>> to reflect in the dump as it might not be same at
>>>>>> the time of hang.
>>>>>>
>>>>>> Solution:
>>>>>> IP should be dumped as close to the HW state when
>>>>>> the GPU was in hung state without trying to reinitialize
>>>>>> any resource.
>>>>>>
>>>>>> Signed-off-by: Sunil Khatri <sunil.khatri at amd.com>
>>>>> Acked-by: Alex Deucher <alexander.deucher at amd.com>
>>>>>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60
>>>>>> +++++++++++-----------
>>>>>>    1 file changed, 30 insertions(+), 30 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 730dae77570c..74f6f15e73b5 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -5277,11 +5277,29 @@ int amdgpu_device_mode1_reset(struct
>>>>>> amdgpu_device *adev)
>>>>>>           return ret;
>>>>>>    }
>>>>>>
>>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>> +{
>>>>>> +       int i;
>>>>>> +
>>>>>> +       lockdep_assert_held(&adev->reset_domain->sem);
>>>>>> +
>>>>>> +       for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>>>> +               adev->reset_info.reset_dump_reg_value[i] =
>>>>>> +
>>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]);
>>>> A suspend also involves power/clock ungate. When reg dump is moved
>>>> earlier, I'm not sure if this read works for all. If it's left to
>>>> individual IP call backs, they could just do the same or better to move
>>>> these up before a dump.
>>> Suspend also put the status.hw = false and each IP in their respective
>>> suspend state which i feel does change the state of the HW.
>>> To get the correct snapshot of the GPU register we should not be
>>> fiddling with the HW IP at least till we capture the dump and that is
>>> the intention behind the change.
>>>
>>> Do you think there is a problem in this approach?
>>>>           amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>>>           amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>>> Adding this does sounds better to enable just before we dump the
>>> registers but i am not very sure if ungating would be clean here or
>>> not. i Could try quickly adding these two calls just before dump.
>>>
>>> All i am worried if it does change some register reflecting the
>>> original state of registers at dump.
>>>
> I was thinking that if it includes some GFX regs and the hang happened
> because of some SDMA/VCN jobs which somehow keeps GFXOFF state intact.

For GFX and SDMA hangs we make sure to disable GFXOFF before so for 
those IP's
I am not worried and also based on my testing on NAVI22 for GPU hang 
their registers
seems to be read cleanly.
Another point that i was thinking is after a GPU hang no where till the 
point of dump
any registers are touched and that is what we need considering we are 
able to read the registers.

For VCN there is dynamic gating which is controlled by the FW if i am 
not wrong which makes the VCN
off and registers cant be read. Only in case of VCN hang i am assuming 
due to a Job pending VCN should be in power ON
state and out intent of reading the registers should work fine. Sadly i 
am unable to validate it as there arent any test readily
available to hang VCN.

>
> BTW, since there is already dump_ip state which could capture IP regs
> separately and handle their power/clock gate situations, do you think
> this generic one is still needed?
>
> For whatever we have implemented till now seems to work fine in case of GPU hang withotu fidling the
> power state explicitly. But Calling suspend of each IP surely seems to change some state in IPs and SW state too.
> So i think until we experience a real problem we should go without the generic UNGATE call for all IP's
>
> But i am not an expert of power, so whatever you suggest would make more sense for the driver.
Regards
Sunil Khatri
>
> Thanks,
> Lijo
>
>>> What u suggest ?
>>> Regards
>>> Sunil
>> I quickly validated on Navi22 by adding below call just before we dump
>> registers
>> if(!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
>>                         
>>      amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>      amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>>                         
>>      amdgpu_reset_reg_dumps(tmp_adev);
>>      dev_info(tmp_adev->dev, "Dumping IP State\n");
>>      /* Trigger ip dump before we reset the asic */
>>      for(i=0; i<tmp_adev->num_ip_blocks; i++)
>>          if(tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>              tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
>>                                      (void*)tmp_adev);
>>      dev_info(tmp_adev->dev, "Dumping IP State Completed\n");
>> }
>> If this sounds fine with you i am update that. Regards Sunil Khatri
>>>> Thanks,
>>>> Lijo
>>>>
>>>>>> +
>>>>>> +
>>>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
>>>>>> +
>>>>>> adev->reset_info.reset_dump_reg_value[i]);
>>>>>> +       }
>>>>>> +
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>>    int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>>>>                                    struct amdgpu_reset_context
>>>>>> *reset_context)
>>>>>>    {
>>>>>>           int i, r = 0;
>>>>>>           struct amdgpu_job *job = NULL;
>>>>>> +       struct amdgpu_device *tmp_adev = reset_context->reset_req_dev;
>>>>>>           bool need_full_reset =
>>>>>>                   test_bit(AMDGPU_NEED_FULL_RESET,
>>>>>> &reset_context->flags);
>>>>>>
>>>>>> @@ -5340,6 +5358,18 @@ int amdgpu_device_pre_asic_reset(struct
>>>>>> amdgpu_device *adev,
>>>>>>                           }
>>>>>>                   }
>>>>>>
>>>>>> +               if (!test_bit(AMDGPU_SKIP_COREDUMP,
>>>>>> &reset_context->flags)) {
>>>>>> +                       amdgpu_reset_reg_dumps(tmp_adev);
>>>>>> +
>>>>>> +                       dev_info(tmp_adev->dev, "Dumping IP State\n");
>>>>>> +                       /* Trigger ip dump before we reset the asic */
>>>>>> +                       for (i = 0; i < tmp_adev->num_ip_blocks; i++)
>>>>>> +                               if
>>>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>>>> +
>>>>>> tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
>>>>>> +                                                       (void
>>>>>> *)tmp_adev);
>>>>>> +                       dev_info(tmp_adev->dev, "Dumping IP State
>>>>>> Completed\n");
>>>>>> +               }
>>>>>> +
>>>>>>                   if (need_full_reset)
>>>>>>                           r = amdgpu_device_ip_suspend(adev);
>>>>>>                   if (need_full_reset)
>>>>>> @@ -5352,47 +5382,17 @@ int amdgpu_device_pre_asic_reset(struct
>>>>>> amdgpu_device *adev,
>>>>>>           return r;
>>>>>>    }
>>>>>>
>>>>>> -static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>> -{
>>>>>> -       int i;
>>>>>> -
>>>>>> -       lockdep_assert_held(&adev->reset_domain->sem);
>>>>>> -
>>>>>> -       for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>>>> -               adev->reset_info.reset_dump_reg_value[i] =
>>>>>> -
>>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]);
>>>>>> -
>>>>>> -
>>>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
>>>>>> -
>>>>>> adev->reset_info.reset_dump_reg_value[i]);
>>>>>> -       }
>>>>>> -
>>>>>> -       return 0;
>>>>>> -}
>>>>>> -
>>>>>>    int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>>>                            struct amdgpu_reset_context *reset_context)
>>>>>>    {
>>>>>>           struct amdgpu_device *tmp_adev = NULL;
>>>>>>           bool need_full_reset, skip_hw_reset, vram_lost = false;
>>>>>>           int r = 0;
>>>>>> -       uint32_t i;
>>>>>>
>>>>>>           /* Try reset handler method first */
>>>>>>           tmp_adev = list_first_entry(device_list_handle, struct
>>>>>> amdgpu_device,
>>>>>>                                       reset_list);
>>>>>>
>>>>>> -       if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
>>>>>> -               amdgpu_reset_reg_dumps(tmp_adev);
>>>>>> -
>>>>>> -               dev_info(tmp_adev->dev, "Dumping IP State\n");
>>>>>> -               /* Trigger ip dump before we reset the asic */
>>>>>> -               for (i = 0; i < tmp_adev->num_ip_blocks; i++)
>>>>>> -                       if
>>>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>>>> -                               tmp_adev->ip_blocks[i].version->funcs
>>>>>> -                               ->dump_ip_state((void *)tmp_adev);
>>>>>> -               dev_info(tmp_adev->dev, "Dumping IP State
>>>>>> Completed\n");
>>>>>> -       }
>>>>>> -
>>>>>>           reset_context->reset_device_list = device_list_handle;
>>>>>>           r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
>>>>>>           /* If reset handler not implemented, continue; otherwise
>>>>>> return */
>>>>>> -- 
>>>>>> 2.34.1
>>>>>>