[PATCH] drm/amdgpu: Fix two reset triggered in a row

Wed Apr 24 06:19:07 UTC 2024

On 23.04.24 at 20:05, Felix Kuehling wrote:
>
> On 2024-04-23 01:50, Christian König wrote:
>> On 22.04.24 at 21:45, Yunxiang Li wrote:
>>> The reset request from KFD is missing a check for whether a reset is
>>> already in progress; this causes a second reset to be triggered right
>>> after the previous one finishes. Add the check to align with the other
>>> reset sources.
>>
>> NAK, that isn't how this should be handled.
>>
>> Instead, all reset sources which are handled by a previous reset should
>> be canceled.
>>
>> In other words there should be a cancel_work(&adev->kfd.reset_work); 
>> somewhere in the KFD code. When this doesn't work correctly then that 
>> is somehow missing.
>>
>> If you see the use of amdgpu_in_reset() outside of the low level
>> functions, then that is clearly a bug.
> Do we need to do that for all reset workers in the driver separately? 
> I don't see where this is done for other reset workers.

Yeah, I think so. But we don't have so many reset workers if I'm not 
completely mistaken.

We have the KFD, FLR, the per engine one in the scheduler and IIRC one 
more for the CP (illegal operation and register write).

I'm not sure about the CP one, but all others should be handled 
correctly with the V2 patch as far as I can see.
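
To make the cancel approach concrete, here is a minimal userspace sketch of the idea. It is not the amdgpu code: the toy_* names, the per-source pending flags and the source enum are all hypothetical, and the real driver would go through the workqueue helpers (e.g. cancel_work()) on the actual work items instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reset sources, named after the ones listed in this mail. */
enum reset_source { SRC_KFD, SRC_FLR, SRC_SCHED, SRC_CP, SRC_COUNT };

/* Toy stand-in for a reset domain: one "queued" flag per source. */
struct toy_reset_domain {
	bool pending[SRC_COUNT];
	int resets_done;
};

/* Queue a reset request; returns false if this source is already queued. */
static bool toy_schedule_reset(struct toy_reset_domain *d,
			       enum reset_source src)
{
	assert(src < SRC_COUNT);
	if (d->pending[src])
		return false;
	d->pending[src] = true;
	return true;
}

/* Cancel a queued request; returns true if one was actually pending. */
static bool toy_cancel_reset(struct toy_reset_domain *d,
			     enum reset_source src)
{
	bool was_pending;

	assert(src < SRC_COUNT);
	was_pending = d->pending[src];
	d->pending[src] = false;
	return was_pending;
}

/*
 * Perform one reset on behalf of src, then cancel every other queued
 * request, since this reset already covers them.
 */
static void toy_do_reset(struct toy_reset_domain *d, enum reset_source src)
{
	int i;

	d->pending[src] = false;
	d->resets_done++;
	for (i = 0; i < SRC_COUNT; i++)
		toy_cancel_reset(d, (enum reset_source)i);
}

/* Run queued requests in order; returns the number of resets performed. */
static int toy_run_pending(struct toy_reset_domain *d)
{
	int before = d->resets_done;
	int i;

	for (i = 0; i < SRC_COUNT; i++)
		if (d->pending[i])
			toy_do_reset(d, (enum reset_source)i);
	return d->resets_done - before;
}
```

With requests queued from both the scheduler and KFD, toy_run_pending() performs a single reset and the second request is dropped by the cancel, without the requester ever having to check whether a reset is in progress.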

Regards,
Christian.

>
> Regards,
>   Felix
>
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Signed-off-by: Yunxiang Li <Yunxiang.Li at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index 3b4591f554f1..ce3dbb1cc2da 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -283,7 +283,7 @@ int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev)
>>>   void amdgpu_amdkfd_gpu_reset(struct amdgpu_device *adev)
>>>   {
>>> -    if (amdgpu_device_should_recover_gpu(adev))
>>> +    if (amdgpu_device_should_recover_gpu(adev) && !amdgpu_in_reset(adev))
>>>           amdgpu_reset_domain_schedule(adev->reset_domain,
>>>                            &adev->kfd.reset_work);
>>>   }
>>