[PATCH v2] drm/amdgpu: Skip the coredump collection on reset during driver reload

Wed Apr 17 07:44:57 UTC 2024

On 4/17/2024 1:06 PM, Khatri, Sunil wrote:
> devcoredump is used to debug gpu hangs/resets. So in normal process 
> when there is a hang due to ring timeout or page fault we are doing a 
> hard reset as soft reset fail in those cases. How are we making sure 
> that the devcoredump is triggered in those cases and captured?
>
> Regards
> Sunil Khatri
>
> On 4/17/2024 9:43 AM, Ahmad Rehman wrote:
>> In passthrough environment, the driver triggers the mode-1 reset on
>> reload. The reset causes the core dump collection which is delayed task
>> and prevents driver from unloading until it is completed. Since we do
>> not need to collect data on "reset on reload" case, we can skip core
>> dump collection.
>>
>> v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well.
>>
>> Signed-off-by: Ahmad Rehman <Ahmad.Rehman at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 1 +
>>   3 files changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 1b2e177bc2d6..c718982cffa8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5357,7 +5357,9 @@ int amdgpu_do_asic_reset(struct list_head 
>> *device_list_handle,
>>       /* Try reset handler method first */
>>       tmp_adev = list_first_entry(device_list_handle, struct 
>> amdgpu_device,
>>                       reset_list);
>> -    amdgpu_reset_reg_dumps(tmp_adev);
>> +
>> +    if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
>> +        amdgpu_reset_reg_dumps(tmp_adev);
>>         reset_context->reset_device_list = device_list_handle;
>>       r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
>> @@ -5430,7 +5432,8 @@ int amdgpu_do_asic_reset(struct list_head 
>> *device_list_handle,
>>                     vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
>>   -                amdgpu_coredump(tmp_adev, vram_lost, reset_context);
>> +                if (!test_bit(AMDGPU_SKIP_COREDUMP, 
>> &reset_context->flags))
>> +                    amdgpu_coredump(tmp_adev, vram_lost, 
>> reset_context);
>>                     if (vram_lost) {
>>                       DRM_INFO("VRAM is lost due to GPU reset!\n");
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 6ea893ad9a36..c512f70b8272 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -2481,6 +2481,7 @@ static void 
>> amdgpu_drv_delayed_reset_work_handler(struct work_struct *work)
>>         /* Use a common context, just need to make sure full reset is 
>> done */
>>       set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>> +    set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
If this is used for guests only can we better have a flag like 
amdgpu_sriov_vf  for setting the skip coredump flag ??

Regards
Sunil khatri

>>       r = amdgpu_do_asic_reset(&device_list, &reset_context);
>>         if (r) {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>> index 66125d43cf21..b11d190ece53 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>> @@ -32,6 +32,7 @@ enum AMDGPU_RESET_FLAGS {
>>         AMDGPU_NEED_FULL_RESET = 0,
>>       AMDGPU_SKIP_HW_RESET = 1,
>> +    AMDGPU_SKIP_COREDUMP = 2,
>>   };
>>     struct amdgpu_reset_context {