[PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in reset

Thu Feb 15 14:46:53 UTC 2024

On 15.02.24 at 15:43, Christian König wrote:
> On 15.02.24 at 15:36, Alex Deucher wrote:
>> On Thu, Feb 15, 2024 at 2:53 AM Christian König
>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>> Well, using this in sysfs is a bug to begin with. This would prevent
>>> new applications from starting and would crash applications which
>>> don't expect to get an -EPERM in return here.
>>>
>>> If we need to make operations mutually exclusive with resets then we
>>> need to take the appropriate locks and *not* work around it by abusing
>>> amdgpu_in_reset().
>>>
>>> The purpose of amdgpu_in_reset() is just to check in lower level
>>> functions whether we are inside the higher level reset thread, and
>>> *not* to protect anybody from concurrent access.
>>>
>>> I think we should probably completely nuke the underlying flag and
>>> use the thread owner of the lock to prevent such an abuse.
>> Can we land some variant of this for now?
>
> I don't think so, it most likely will break existing use cases.
>
> What we might be able to do is to frame this with 
> amdgpu_device_lock_reset_domain() / amdgpu_device_unlock_reset_domain().
>
>> It fixes known issues and it's the same behavior we have in sysfs and 
>> debugfs already.
>
> Yeah, as I said that is broken to begin with. It's just that for sysfs 
> and debugfs nobody notices.

Wait a second, debugfs is actually doing the right thing in some functions:

         /* Avoid accidentally unparking the sched thread during GPU reset */
         r = down_read_killable(&adev->reset_domain->sem);
         if (r)
                 goto pro_end;
...

         up_read(&adev->reset_domain->sem);

This needs to replace amdgpu_in_reset() in pretty much all debugfs and 
sysfs functions.

Probably best to wrap that in some inline amdgpu_reset_* functions and 
document why those need to be used.

Regards,
Christian.

>
> Regards,
> Christian.
>
>>    It's not
>> clear to me how this should best be handled.  We basically want to
>> block any access to the GPU (registers, firmwares, etc.) while the GPU
>> is going through a reset.  Just locking the reset domain doesn't seem
>> like the right solution.
>>
>> Alex
>>
>>> Regards,
>>> Christian.
>>>
>>> On 12.02.24 at 21:56, Deucher, Alexander wrote:
>>>> [AMD Official Use Only - General]
>>>>
>>>> Ping?
>>>>
>>>>> -----Original Message-----
>>>>> From: Deucher, Alexander <Alexander.Deucher at amd.com>
>>>>> Sent: Monday, January 29, 2024 10:56 AM
>>>>> To: amd-gfx at lists.freedesktop.org
>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
>>>>> Subject: [PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in reset
>>>>>
>>>>> This avoids queries to read registers or query the SMU for telemetry
>>>>> data while the GPU is in reset. This mirrors what we already do for
>>>>> sysfs.
>>>>>
>>>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>>>>>    1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> index a2df3025a754..d522e99c6f81 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> @@ -607,6 +607,9 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>>>>         int i, found, ret;
>>>>>         int ui32_size = sizeof(ui32);
>>>>>
>>>>> +     if (amdgpu_in_reset(adev))
>>>>> +             return -EPERM;
>>>>> +
>>>>>         if (!info->return_size || !info->return_pointer)
>>>>>                 return -EINVAL;
>>>>>
>>>>> -- 
>>>>> 2.42.0
>