[PATCH] drm/amdkfd: Skip locking KFD when unbinding GPU

Felix Kuehling felix.kuehling at amd.com
Tue Nov 7 22:16:48 UTC 2023


On 2023-11-07 17:03, Alex Deucher wrote:
> On Mon, Nov 6, 2023 at 6:17 PM Felix Kuehling <felix.kuehling at amd.com> wrote:
>> On 2023-11-06 2:14, Lawrence Yiu wrote:
>>> After unbinding a GPU, KFD becomes locked and unusable, resulting in
>>> applications not being able to use ROCm for compute anymore and rocminfo
>>> outputting the following error message:
>>>
>>> ROCk module is loaded
>>> Unable to open /dev/kfd read-write: Invalid argument
>>>
>>> KFD remains locked even after rebinding the same GPU and a system reboot
>>> is required to unlock it. Fix this by not locking KFD during the GPU
>>> unbind process.
>>>
>>> Closes: https://github.com/RadeonOpenCompute/ROCm/issues/629
>>> Signed-off-by: Lawrence Yiu <lawyiu.dev at gmail.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdkfd/kfd_device.c | 4 ++--
>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 0a9cf9dfc224..c9436039e619 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -949,8 +949,8 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>        if (!kfd->init_complete)
>>>                return;
>>>
>>> -     /* for runtime suspend, skip locking kfd */
>>> -     if (!run_pm) {
>>> +     /* for runtime suspend or GPU unbind, skip locking kfd */
>>> +     if (!run_pm && !drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>>>                mutex_lock(&kfd_processes_mutex);
>>>                count = ++kfd_locked;
>> This lock is meant to prevent new KFD processes from starting while a
>> GPU reset or suspend/resume is in progress. Just below it also suspends
>> the user mode queues of all processes to ensure the GPUs are idle before
>> suspending. It sounds like this is not applicable to the hot-unplug use
>> case. In particular, if there is no matching kgd2kfd_resume call, that
>> would lead to the symptom you describe, where KFD just gets stuck forever.
>>
>> What's the semantics of GPU hot unplug? Is it more like a GPU reset or
>> more like runtime-PM? In other words, do we need to notify processes
>> when a GPU goes away, or is there some other mechanism that ensures a
>> GPU is idle before being unplugged?
>>
> It's a separate PCI entry point (remove() in this case).  From a
> driver perspective we quiesce any outstanding DMA and then tear down
> the driver.  It's the same whether you are actually physically
> hotplugging the device or just unbinding the driver from the device.

It sounds like we should treat it like a GPU reset for KFD, where we 
notify user mode that the context is gone. Except that between pre-reset 
and post-reset the topology changes, so we don't bring the removed GPU 
back up. That may require non-trivial changes in a number of places 
if the kfd_process_device data structures still refer to a device that 
no longer exists.

Regards,
   Felix


>
> Alex
>
>> If it's more like runtime PM, then simply call kgd2kfd_suspend with
>> run_pm=true.
>>
>> If it's more like a GPU reset, you can't just remove this lock. User
>> mode won't be aware and will try to continue using the GPU. In the best
>> case applications will just soft hang. Instead you should probably
>> replace the kgd2kfd_suspend call with calls to kgd2kfd_pre_reset and
>> kgd2kfd_post_reset. That would idle the affected GPU, notify user mode
>> processes using the GPU that something is wrong, and resume all the GPUs
>> again. You'd need to be careful about the sequence between actual unplug
>> and post_reset. Not sure if post_reset would need changes to avoid
>> failing on the removed GPU.
>>
>> Regards,
>>     Felix
>>
>>
>>>                mutex_unlock(&kfd_processes_mutex);

