[PATCH v2] drm/amdgpu: Skip the coredump collection on reset during driver reload

Khatri, Sunil sukhatri at amd.com
Wed Apr 17 08:24:43 UTC 2024


On 4/17/2024 1:19 PM, Lazar, Lijo wrote:
>
> On 4/17/2024 1:14 PM, Khatri, Sunil wrote:
>> On 4/17/2024 1:06 PM, Khatri, Sunil wrote:
>>> devcoredump is used to debug gpu hangs/resets. So in normal process
>>> when there is a hang due to ring timeout or page fault we are doing a
>>> hard reset as soft reset fail in those cases. How are we making sure
>>> that the devcoredump is triggered in those cases and captured?
>>>
>>> Regards
>>> Sunil Khatri
>>>
>>> On 4/17/2024 9:43 AM, Ahmad Rehman wrote:
>>>> In passthrough environment, the driver triggers the mode-1 reset on
>>>> reload. The reset causes the core dump collection which is delayed task
>>>> and prevents driver from unloading until it is completed. Since we do
>>>> not need to collect data on "reset on reload" case, we can skip core
>>>> dump collection.
>>>>
>>>> v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well.
>>>>
>>>> Signed-off-by: Ahmad Rehman <Ahmad.Rehman at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++--
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 1 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 1 +
>>>>    3 files changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 1b2e177bc2d6..c718982cffa8 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -5357,7 +5357,9 @@ int amdgpu_do_asic_reset(struct list_head
>>>> *device_list_handle,
>>>>        /* Try reset handler method first */
>>>>        tmp_adev = list_first_entry(device_list_handle, struct
>>>> amdgpu_device,
>>>>                        reset_list);
>>>> -    amdgpu_reset_reg_dumps(tmp_adev);
>>>> +
>>>> +    if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
>>>> +        amdgpu_reset_reg_dumps(tmp_adev);
>>>>          reset_context->reset_device_list = device_list_handle;
>>>>        r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
>>>> @@ -5430,7 +5432,8 @@ int amdgpu_do_asic_reset(struct list_head
>>>> *device_list_handle,
>>>>                      vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
>>>>    -                amdgpu_coredump(tmp_adev, vram_lost, reset_context);
>>>> +                if (!test_bit(AMDGPU_SKIP_COREDUMP,
>>>> &reset_context->flags))
>>>> +                    amdgpu_coredump(tmp_adev, vram_lost,
>>>> reset_context);
>>>>                      if (vram_lost) {
>>>>                        DRM_INFO("VRAM is lost due to GPU reset!\n");
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 6ea893ad9a36..c512f70b8272 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -2481,6 +2481,7 @@ static void
>>>> amdgpu_drv_delayed_reset_work_handler(struct work_struct *work)
>>>>          /* Use a common context, just need to make sure full reset is
>>>> done */
>>>>        set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>> +    set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
>> If this is used for guests only can we better have a flag like
>> amdgpu_sriov_vf  for setting the skip coredump flag ??
>>
> A reset is not always triggered just because of hang. There are other
> cases like we want to do a reset after a suspend/resume cycle so that
> the device starts from a clean state. Those are intentionally triggered
> by driver. Also, there are case like RAS errors where we reset and that
> also really doesn't need a core dump. In all such cases, this flag is
> required, and this is one such case (this patch only addresses passthrough).


Thanks Lijo
Able to verify that in normal hangs dump is working.

Regards
Sunil

>
> Thanks,
> Lijo
>
>> Regards
>> Sunil khatri
>>
>>>>        r = amdgpu_do_asic_reset(&device_list, &reset_context);
>>>>          if (r) {
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>>> index 66125d43cf21..b11d190ece53 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>>> @@ -32,6 +32,7 @@ enum AMDGPU_RESET_FLAGS {
>>>>          AMDGPU_NEED_FULL_RESET = 0,
>>>>        AMDGPU_SKIP_HW_RESET = 1,
>>>> +    AMDGPU_SKIP_COREDUMP = 2,
>>>>    };
>>>>      struct amdgpu_reset_context {


More information about the amd-gfx mailing list