[PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in reset
Christian König
ckoenig.leichtzumerken at gmail.com
Thu Feb 15 14:46:53 UTC 2024
Am 15.02.24 um 15:43 schrieb Christian König:
> Am 15.02.24 um 15:36 schrieb Alex Deucher:
>> On Thu, Feb 15, 2024 at 2:53 AM Christian König
>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>> Well using this is in sysfs is a bug to begin with. This would prevent
>>> starting new applications and crashing applications which don't expect
>>> to get an -EPERM in return here.
>>>
>>> If we need to make operations mutual exclusive with resets then we need
>>> to take the appropriate locks and *not* work around by abusing
>>> amdgpu_in_reset().
>>>
>>> The functionality of amdgpu_in_reset() is just to check in lower level
>>> functions if we are inside the higher level reset thread and *not*
>>> protect anybody from concurrent access.
>>>
>>> I think we should probably completely nuke the underlying flag and
>>> using
>>> the thread owner of the lock to prevent such an abuse.
>> Can we land some variant of this for now?
>
> I don't think so, it most likely will break existing use cases.
>
> What we might be able to do is to frame this with
> amdgpu_device_lock_reset_domain() / amdgpu_device_unlock_reset_domain().
>
>> It fixes known issues and it's the same behavior we have in sysfs and
>> debugfs already.
>
> Yeah, as I said that is broken to begin with. It's just that for sysfs
> and debugfs nobody notices.
Wait a second, debugfs is actually doing the right thing in some functions:
/* Avoid accidently unparking the sched thread during GPU reset */
r = down_read_killable(&adev->reset_domain->sem);
if (r)
goto pro_end;
...
up_read(&adev->reset_domain->sem);
This needs to replace amdgpu_in_reset() in pretty much all debugfs and
sysfs function.
Probably best to wrap that in some inline amdgpu_reset_* functions and
document why those needs to be used.
Regards,
Christian.
>
> Regards,
> Christian.
>
>> It's not
>> clear to me how this should best be handled. We basically want to
>> block any access to the GPU (registers, firmwares, etc.) while the GPU
>> is going through a reset. Just locking the reset domain doesn't seem
>> like the right solution.
>>
>> Alex
>>
>>> Regards,
>>> Christian.
>>>
>>> Am 12.02.24 um 21:56 schrieb Deucher, Alexander:
>>>> [AMD Official Use Only - General]
>>>>
>>>> Ping?
>>>>
>>>>> -----Original Message-----
>>>>> From: Deucher, Alexander <Alexander.Deucher at amd.com>
>>>>> Sent: Monday, January 29, 2024 10:56 AM
>>>>> To: amd-gfx at lists.freedesktop.org
>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
>>>>> Subject: [PATCH] drm/amdgpu: bail on INFO IOCTL if the GPU is in
>>>>> reset
>>>>>
>>>>> This avoids queries to read registers or query the SMU for
>>>>> telemetry data while
>>>>> the GPU is in reset. This mirrors what we already do for sysfs.
>>>>>
>>>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>>>>> 1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> index a2df3025a754..d522e99c6f81 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> @@ -607,6 +607,9 @@ int amdgpu_info_ioctl(struct drm_device *dev,
>>>>> void
>>>>> *data, struct drm_file *filp)
>>>>> int i, found, ret;
>>>>> int ui32_size = sizeof(ui32);
>>>>>
>>>>> + if (amdgpu_in_reset(adev))
>>>>> + return -EPERM;
>>>>> +
>>>>> if (!info->return_size || !info->return_pointer)
>>>>> return -EINVAL;
>>>>>
>>>>> --
>>>>> 2.42.0
>
More information about the amd-gfx
mailing list