[PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset

Lazar, Lijo lijo.lazar at amd.com
Fri Jan 10 03:38:56 UTC 2025



On 1/9/2025 8:27 PM, Kim, Jonathan wrote:
> [Public]
> 
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>> Sent: Thursday, January 9, 2025 1:14 AM
>> To: Kim, Jonathan <Jonathan.Kim at amd.com>; amd-gfx at lists.freedesktop.org
>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan at amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset
>>
>>
>>
>> On 1/9/2025 1:31 AM, Jonathan Kim wrote:
>>> Per queue reset should be bypassed when gpu recovery is disabled
>>> with module parameter.
>>>
>>> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>> index cc66ebb7bae1..441568163e20 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>> @@ -1131,6 +1131,9 @@ uint64_t kgd_gfx_v9_hqd_get_pq_addr(struct
>> amdgpu_device *adev,
>>>     uint32_t low, high;
>>>     uint64_t queue_addr = 0;
>>>
>>> +   if (!amdgpu_gpu_recovery)
>>> +           return 0;
>>> +
>>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
>>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
>>>
>>> @@ -1179,6 +1182,9 @@ uint64_t kgd_gfx_v9_hqd_reset(struct amdgpu_device
>> *adev,
>>>     uint32_t low, high, pipe_reset_data = 0;
>>>     uint64_t queue_addr = 0;
>>>
>>> +   if (!amdgpu_gpu_recovery)
>>> +           return 0;
>>> +
>>
>> I think the right place for this check is not inside callback, should be
>> from the place where the callback gets called.
> 
> I don't think it really matters.  We're going to have different reset types in the future that my come from different callers.
> It's probably easier to remember to put the bypass where the reset is actually happening.
> 

If I want to define something like amdgpu_gpu_recovery=2 (don't do queue
reset but perform other resets), then it matters.

Since this is a callback, keeping it at the wrapper place may be more
maintainable rather than keeping the check for gfx10/11/12 etc.

Thanks,
Lijo

> Jon
> 
>>
>> Thanks,
>> Lijo
>>
>>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
>>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
>>>
> 



More information about the amd-gfx mailing list