[PATCH 2/2] drm/amdgpu: trigger ip dump before suspend of IP's

Fri Jul 26 19:21:27 UTC 2024

On 7/27/2024 12:13 AM, Alex Deucher wrote:
> On Fri, Jul 26, 2024 at 1:16 PM Khatri, Sunil <sukhatri at amd.com> wrote:
>>
>> On 7/26/2024 8:36 PM, Lazar, Lijo wrote:
>>> On 7/26/2024 8:11 PM, Khatri, Sunil wrote:
>>>> On 7/26/2024 7:53 PM, Khatri, Sunil wrote:
>>>>> On 7/26/2024 7:18 PM, Lazar, Lijo wrote:
>>>>>> On 7/26/2024 6:42 PM, Alex Deucher wrote:
>>>>>>> On Fri, Jul 26, 2024 at 8:48 AM Sunil Khatri <sunil.khatri at amd.com>
>>>>>>> wrote:
>>>>>>>> Problem:
>>>>>>>> IP dump right now is done post suspend of
>>>>>>>> all IP's which for some IP's could change power
>>>>>>>> state and software state too which we do not want
>>>>>>>> to reflect in the dump as it might not be same at
>>>>>>>> the time of hang.
>>>>>>>>
>>>>>>>> Solution:
>>>>>>>> IP should be dumped as close to the HW state when
>>>>>>>> the GPU was in hung state without trying to reinitialize
>>>>>>>> any resource.
>>>>>>>>
>>>>>>>> Signed-off-by: Sunil Khatri <sunil.khatri at amd.com>
>>>>>>> Acked-by: Alex Deucher <alexander.deucher at amd.com>
>>>>>>>
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60
>>>>>>>> +++++++++++-----------
>>>>>>>>     1 file changed, 30 insertions(+), 30 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 730dae77570c..74f6f15e73b5 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -5277,11 +5277,29 @@ int amdgpu_device_mode1_reset(struct
>>>>>>>> amdgpu_device *adev)
>>>>>>>>            return ret;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>>>> +{
>>>>>>>> +       int i;
>>>>>>>> +
>>>>>>>> +       lockdep_assert_held(&adev->reset_domain->sem);
>>>>>>>> +
>>>>>>>> +       for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>>>>>> +               adev->reset_info.reset_dump_reg_value[i] =
>>>>>>>> +
>>>>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]);
>>>>>> A suspend also involves power/clock ungate. When reg dump is moved
>>>>>> earlier, I'm not sure if this read works for all. If it's left to
>>>>>> individual IP call backs, they could just do the same or better to move
>>>>>> these up before a dump.
>>>>> Suspend also put the status.hw = false and each IP in their respective
>>>>> suspend state which i feel does change the state of the HW.
>>>>> To get the correct snapshot of the GPU register we should not be
>>>>> fiddling with the HW IP at least till we capture the dump and that is
>>>>> the intention behind the change.
>>>>>
>>>>> Do you think there is a problem in this approach?
>>>>>>            amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>>>>>            amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>>>>> Adding this does sounds better to enable just before we dump the
>>>>> registers but i am not very sure if ungating would be clean here or
>>>>> not. i Could try quickly adding these two calls just before dump.
>>>>>
>>>>> All i am worried if it does change some register reflecting the
>>>>> original state of registers at dump.
>>>>>
>>> I was thinking that if it includes some GFX regs and the hang happened
>>> because of some SDMA/VCN jobs which somehow keeps GFXOFF state intact.
>> For GFX and SDMA hangs we make sure to disable GFXOFF before so for
>> those IP's
>> I am not worried and also based on my testing on NAVI22 for GPU hang
>> their registers
>> seems to be read cleanly.
>> Another point that i was thinking is after a GPU hang no where till the
>> point of dump
>> any registers are touched and that is what we need considering we are
>> able to read the registers.
>>
>> For VCN there is dynamic gating which is controlled by the FW if i am
>> not wrong which makes the VCN
>> off and registers cant be read. Only in case of VCN hang i am assuming
>> due to a Job pending VCN should be in power ON
>> state and out intent of reading the registers should work fine. Sadly i
>> am unable to validate it as there arent any test readily
>> available to hang VCN.
> We want to take the register dump as early as possible in the reset
> sequence, ideally before any of the hw gets touched as part of the
> reset sequence.  Otherwise, the dump is not going to be useful.
>
> Alex

Sure Alex. I am also of the same view that we dont want to change any 
power status of any IP as it could change the values.

Regards
Sunil Khatri

>
>
>>> BTW, since there is already dump_ip state which could capture IP regs
>>> separately and handle their power/clock gate situations, do you think
>>> this generic one is still needed?
>>>
>>> For whatever we have implemented till now seems to work fine in case of GPU hang withotu fidling the
>>> power state explicitly. But Calling suspend of each IP surely seems to change some state in IPs and SW state too.
>>> So i think until we experience a real problem we should go without the generic UNGATE call for all IP's
>>>
>>> But i am not an expert of power, so whatever you suggest would make more sense for the driver.
>> Regards
>> Sunil Khatri
>>> Thanks,
>>> Lijo
>>>
>>>>> What u suggest ?
>>>>> Regards
>>>>> Sunil
>>>> I quickly validated on Navi22 by adding below call just before we dump
>>>> registers
>>>> if(!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
>>>>
>>>>       amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>>>       amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>>>>
>>>>       amdgpu_reset_reg_dumps(tmp_adev);
>>>>       dev_info(tmp_adev->dev, "Dumping IP State\n");
>>>>       /* Trigger ip dump before we reset the asic */
>>>>       for(i=0; i<tmp_adev->num_ip_blocks; i++)
>>>>           if(tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>>               tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
>>>>                                       (void*)tmp_adev);
>>>>       dev_info(tmp_adev->dev, "Dumping IP State Completed\n");
>>>> }
>>>> If this sounds fine with you i am update that. Regards Sunil Khatri
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>>> +
>>>>>>>> +
>>>>>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
>>>>>>>> +
>>>>>>>> adev->reset_info.reset_dump_reg_value[i]);
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>     int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>>>>>>                                     struct amdgpu_reset_context
>>>>>>>> *reset_context)
>>>>>>>>     {
>>>>>>>>            int i, r = 0;
>>>>>>>>            struct amdgpu_job *job = NULL;
>>>>>>>> +       struct amdgpu_device *tmp_adev = reset_context->reset_req_dev;
>>>>>>>>            bool need_full_reset =
>>>>>>>>                    test_bit(AMDGPU_NEED_FULL_RESET,
>>>>>>>> &reset_context->flags);
>>>>>>>>
>>>>>>>> @@ -5340,6 +5358,18 @@ int amdgpu_device_pre_asic_reset(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>                            }
>>>>>>>>                    }
>>>>>>>>
>>>>>>>> +               if (!test_bit(AMDGPU_SKIP_COREDUMP,
>>>>>>>> &reset_context->flags)) {
>>>>>>>> +                       amdgpu_reset_reg_dumps(tmp_adev);
>>>>>>>> +
>>>>>>>> +                       dev_info(tmp_adev->dev, "Dumping IP State\n");
>>>>>>>> +                       /* Trigger ip dump before we reset the asic */
>>>>>>>> +                       for (i = 0; i < tmp_adev->num_ip_blocks; i++)
>>>>>>>> +                               if
>>>>>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>>>>>> +
>>>>>>>> tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
>>>>>>>> +                                                       (void
>>>>>>>> *)tmp_adev);
>>>>>>>> +                       dev_info(tmp_adev->dev, "Dumping IP State
>>>>>>>> Completed\n");
>>>>>>>> +               }
>>>>>>>> +
>>>>>>>>                    if (need_full_reset)
>>>>>>>>                            r = amdgpu_device_ip_suspend(adev);
>>>>>>>>                    if (need_full_reset)
>>>>>>>> @@ -5352,47 +5382,17 @@ int amdgpu_device_pre_asic_reset(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>            return r;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> -static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>>>> -{
>>>>>>>> -       int i;
>>>>>>>> -
>>>>>>>> -       lockdep_assert_held(&adev->reset_domain->sem);
>>>>>>>> -
>>>>>>>> -       for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>>>>>> -               adev->reset_info.reset_dump_reg_value[i] =
>>>>>>>> -
>>>>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]);
>>>>>>>> -
>>>>>>>> -
>>>>>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
>>>>>>>> -
>>>>>>>> adev->reset_info.reset_dump_reg_value[i]);
>>>>>>>> -       }
>>>>>>>> -
>>>>>>>> -       return 0;
>>>>>>>> -}
>>>>>>>> -
>>>>>>>>     int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>>>>>                             struct amdgpu_reset_context *reset_context)
>>>>>>>>     {
>>>>>>>>            struct amdgpu_device *tmp_adev = NULL;
>>>>>>>>            bool need_full_reset, skip_hw_reset, vram_lost = false;
>>>>>>>>            int r = 0;
>>>>>>>> -       uint32_t i;
>>>>>>>>
>>>>>>>>            /* Try reset handler method first */
>>>>>>>>            tmp_adev = list_first_entry(device_list_handle, struct
>>>>>>>> amdgpu_device,
>>>>>>>>                                        reset_list);
>>>>>>>>
>>>>>>>> -       if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
>>>>>>>> -               amdgpu_reset_reg_dumps(tmp_adev);
>>>>>>>> -
>>>>>>>> -               dev_info(tmp_adev->dev, "Dumping IP State\n");
>>>>>>>> -               /* Trigger ip dump before we reset the asic */
>>>>>>>> -               for (i = 0; i < tmp_adev->num_ip_blocks; i++)
>>>>>>>> -                       if
>>>>>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>>>>>> -                               tmp_adev->ip_blocks[i].version->funcs
>>>>>>>> -                               ->dump_ip_state((void *)tmp_adev);
>>>>>>>> -               dev_info(tmp_adev->dev, "Dumping IP State
>>>>>>>> Completed\n");
>>>>>>>> -       }
>>>>>>>> -
>>>>>>>>            reset_context->reset_device_list = device_list_handle;
>>>>>>>>            r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
>>>>>>>>            /* If reset handler not implemented, continue; otherwise
>>>>>>>> return */
>>>>>>>> --
>>>>>>>> 2.34.1
>>>>>>>>